Article

Deepfake Image Forensics for Privacy Protection and Authenticity Using Deep Learning

1 Department of Cybersecurity, Air University, Islamabad 44000, Pakistan
2 School of Information Technology, Halmstad University, 30250 Halmstad, Sweden
3 Department of Cyber Security, National University of Computer and Emerging Sciences, Islamabad 44000, Pakistan
4 Department of Computing, Design, and Communication, University of Jamestown, Jamestown, ND 58405, USA
5 Industrial and Manufacturing Engineering Department, North Dakota State University, Fargo, ND 58105, USA
* Author to whom correspondence should be addressed.
Information 2025, 16(4), 270; https://doi.org/10.3390/info16040270
Submission received: 6 February 2025 / Revised: 11 March 2025 / Accepted: 17 March 2025 / Published: 27 March 2025
(This article belongs to the Special Issue Real-World Applications of Machine Learning Techniques)

Abstract

This research focuses on the detection of deepfake images and videos for forensic analysis using deep learning techniques, highlighting the importance of preserving privacy and authenticity in digital media. The background of the study emphasizes the growing threat of deepfakes, which pose significant challenges in various domains, including social media, politics, and entertainment. Current methodologies primarily rely on visual features that are specific to the training dataset and fail to generalize well across varying manipulation techniques. Moreover, existing techniques focus on either spatial or temporal features individually and lack robustness in handling complex deepfake artifacts that involve fused facial regions such as the eyes, nose, and mouth. Key approaches include the use of CNNs, RNNs, and hybrid models such as CNN-LSTM, CNN-GRU, and temporal convolutional networks (TCNs) to capture both spatial and temporal features during the detection of deepfake videos and images. The research incorporates data augmentation with GANs to enhance model performance and proposes an innovative fusion of artifact inspection and facial landmark detection for improved accuracy. The experimental results show near-perfect detection accuracy of over 99% across diverse datasets and architectures, demonstrating the effectiveness of these models in forensic investigations. Challenges remain, such as the difficulty of detecting deepfakes in compressed video formats and the need to handle noise and address dataset imbalances. The research presents an enhanced hybrid model that improves detection accuracy while maintaining performance across various datasets. Future work includes improving model generalization to better detect emerging deepfake techniques.

1. Introduction

Smart devices such as computers, smartphones, and laptops have made social networks and a wide variety of media easily accessible. The same technologies also make it possible to create “deepfakes”, since content circulating on the internet can be real or fabricated and is often difficult to tell apart [1]. Deepfakes are multimedia content that mimics the attributes of genuine media and is produced through AI with the primary intent to mislead; they are commonly used to spread fake news, commit identity theft, and create malicious or potentially objectionable content [2]. Such manipulated content is used to spread misinformation and fake news (e.g., child pornography created by face swapping, controversial videos, synthesized human trafficking footage, and identity theft) over the internet and social media platforms, including Facebook, Twitter, Instagram, and YouTube. Deepfake images and videos are also used in identity theft through face swapping and synthesis [3].
Although identifying deepfake images is challenging, several techniques and methodologies exist to detect and remove artificially generated features in videos, images, and audio. The researchers in [4], for example, worked with fake photo and video data but could not isolate the counterfeit features of the content. Deepfakes first entered public awareness in 2017, when a Reddit user employed face-swapping technology to produce fairly believable adult videos, a worrying sign of how popular and accessible such technologies had become. Since then, deepfakes have proliferated, with reported cases rising from 7964 in December 2018 to an astonishing 85,047 in December 2020 [5,6]. The number of deepfake images and videos worldwide has been shown to double roughly every six months, as shown below in Figure 1. In this figure, the X-axis represents specific time periods, with data points recorded semi-annually (December 2018, July 2019, December 2019, June 2020, and December 2020), and the Y-axis indicates the number of deepfake cases, from 0 to 100,000. The number of deepfake images stood at 7964 in December 2018 and doubled to 14,678 by July 2019; according to these statistics, the count of malicious deepfake images and videos reached 85,047 in December 2020 [7]. The urgency of the problem is evidenced by a 2019 scam in which fake audio was used to defraud recipients of USD 24,300 [8].
Tools like FakeApp [9] and DeepFaceLab [10] have also been refined and made available to the public, increasing the rate at which fabricated multimedia is produced. These tools have not only been used to target public figures but increasingly threaten ordinary citizens, raising concerns about fake news and impersonation. The adverse consequences of these activities have heightened the need for effective digital media forensic resources, such as the FaceForensics++ database, which consists of 1000 real and 4000 fake videos [4,11]. Beyond raising ethical and security concerns, these technological strides have also enhanced the quality and accessibility of deepfake generation tools. As deepfake generation becomes increasingly sophisticated and accessible, the development of robust deepfake detection and forensic tools remains essential to safeguarding the integrity of digital media [12]. These advancements not only demonstrate the potential of AI-driven media synthesis but also highlight the critical need to develop and deploy responsible technological countermeasures.
To address these challenges, our research explores how deep learning tools, specifically deep neural networks (DNNs) and generative adversarial networks (GANs), have become crucial for both generating and identifying deepfake multimedia objects: audio, video, and image files. Combined with autoencoder networks that down-sample data into a compact feature space, these technologies can be highly effective in identifying deepfakes. Furthermore, a CNN based on XceptionNet, which has already been applied to YouTube datasets [13,14], has been proposed to refine the categorization and detection of falsified content.

1.1. Contributions

Our research focuses on several key contributions to the field:
  • We focused on data augmentation and utilized a variety of techniques to enhance the training of deep learning models, making them scalable and adaptable to different types of media.
  • We employed multimedia analysis to develop effective methods for distinguishing between real and fake images, thereby ensuring the reliability of information.

1.2. Artifacts Captured

  • CNN: CNNs are used to extract spatial features from images, which is helpful for texture detection and for identifying anomalies and inconsistencies in facial landmarks (eyes, nose, and mouth).
  • RNN (LSTM/GRU): These models capture temporal inconsistencies, including unnatural eye blinking and mismatched lip movement.
  • GAN-based autoencoders: These models help detect manipulated features by reconstructing input data and comparing discrepancies.
  • TCNs (temporal convolutional networks): TCNs are well suited to analyzing long-range dependencies in video sequences.

1.3. Feature Fusion

  • The extracted spatial features (from CNN) and temporal features (from RNN/TCN) are fused using a concatenation layer, followed by fully connected layers for classification.
  • This fusion enhances robustness by combining frame-level features (textures, distortions) with sequence-level cues (motion inconsistencies).
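To make this fusion step concrete, the following is a minimal Keras sketch of a concatenation-based fusion head. The frame count, input resolution, and layer widths are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# A minimal sketch of spatial-temporal feature fusion, assuming
# Keras/TensorFlow and hypothetical inputs of 16 frames of 64x64 RGB faces.
import tensorflow as tf
from tensorflow.keras import layers, Model

frames = layers.Input(shape=(16, 64, 64, 3))  # (time, height, width, channels)

# Spatial branch: a small CNN applied to every frame independently.
cnn = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])
spatial_seq = layers.TimeDistributed(cnn)(frames)       # per-frame features
spatial = layers.GlobalAveragePooling1D()(spatial_seq)  # frame-level summary

# Temporal branch: an LSTM over the per-frame CNN features.
temporal = layers.LSTM(64)(spatial_seq)                 # sequence-level cues

# Fusion: concatenate both feature vectors, then classify real vs. fake.
fused = layers.Concatenate()([spatial, temporal])
x = layers.Dense(64, activation="relu")(fused)
output = layers.Dense(1, activation="sigmoid")(x)

model = Model(frames, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```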
In scope, the paper systematically addresses the critical aspects of deepfake detection. Section 1 discusses the societal impact of deepfakes and the need for robust detection technologies. This is followed by Section 2 and Section 3 on background and related work, which explore the advancements in deepfake creation and detection, with a particular focus on the progress made in deep learning models such as convolutional neural networks (CNNs), generative adversarial networks (GANs), and recurrent neural networks (RNNs). Section 4 proposes models and forensics techniques, including data preprocessing, artifact detection, and feature extraction techniques (such as landmark detection and data balancing using SMOTE). For deepfake distinction, we specifically highlight model architectures like CNN-LSTM, CNN-GRU, and temporal convolutional networks (TCNs). The empirical validation of the models is presented in Section 5, where we demonstrate that the models can accurately detect temporal inconsistencies in videos. Finally, Section 6 presents key findings, discusses ongoing limitations, and outlines future directions aimed at advancing deepfake detection methodologies to stay ahead of emerging manipulation techniques.

2. Background

Over time, deepfake content has been produced and spread in large volumes. Unlike unsupervised detection methods, advanced deep learning models like CNN-LSTM and CNN-GRU combine convolutional neural networks (CNNs), for spatial feature extraction, with long short-term memory (LSTM) or gated recurrent units (GRUs) to effectively distinguish between real and fake content based on both temporal and spatial patterns. Specifically, the temporal convolutional network (TCN) utilizes dilated convolutions to efficiently recognize long-term temporal patterns, while the GAN-Autoencoded Features model integrates generative adversarial networks (GANs) with autoencoders to reconstruct input data and enhance deepfake-specific artifact detection, contributing to its robustness against a variety of manipulations. These models are trained with a batch size of 32 for 10 epochs using the Adam optimizer, with standardized loss functions: binary cross-entropy for CNN-based models and mean squared error (MSE) for GANs. Accurate, generalizable, and robust models that combat the deepfake threat and restore authenticity online are achieved through comprehensive evaluation against metrics such as accuracy, precision, recall, and F1 score.
(a)
CNN-LSTM:
The CNN-LSTM model uses CNN and LSTM for spatial and temporal characteristics, respectively. The CNN layers have the role of extracting the spatial features from the input frames and locating defects and artifacts within the frame, as shown in Table 1. Afterward, the extracted features are fed to LSTM layers for capturing the sequence; the features are then used to capture temporal behaviors at different time instances. These features enable the model to differentiate real content from fake content.
(b)
CNN-GRU:
Another architecture is CNN-GRU, which combines CNN layers with gated recurrent units, a specific type of RNN, as in Table 2. As in the CNN-LSTM model, the CNN layers in the CNN-GRU architecture learn spatial features of the input frames, and the GRU layers then compute temporal features on top of them. Like long short-term memory units, GRUs are efficient at capturing long-term dependencies, which makes CNN-GRU well suited to identifying temporal disparities in deepfake videos.
(c)
TCN (temporal convolutional network):
In the TCN model, stacked convolutional layers operate on sequences of data over time, capturing temporal patterns. The dilated convolutions used in TCNs allow the model to cover large spans of time using fewer layers than conventional LSTM networks. This architecture is very efficient for recognizing temporal artifacts in deepfake videos because it can operate on longer sequences and notice discrepancies that appear over more protracted time periods, as in Table 3.
(d)
GAN-Autoencoded Features:
The GAN-Autoencoded Features model utilizes GANs to improve the features extracted by an autoencoder (AE). The autoencoder component analyzes the input and reconstructs it to expose deepfake-related features, while the GAN component produces realistic-looking images. The parameters and values of the GAN are shown in Table 4. This approach enhances the model’s resistance to diverse deepfake artifacts while improving its ability to determine whether data are genuine or fake.
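Among the four architectures above, the TCN’s dilated-convolution design is the least conventional, so a minimal sketch is given below; it also applies the shared training configuration described earlier (Adam optimizer, binary cross-entropy, batch size 32, 10 epochs). The filter counts and sequence length are illustrative assumptions.

```python
# A minimal sketch of the TCN idea in (c), assuming Keras; dilation rates
# double per block so the receptive field grows exponentially with depth.
from tensorflow.keras import layers, Model

seq = layers.Input(shape=(64, 128))  # 64 time steps of 128-dim frame features
x = seq
for rate in (1, 2, 4, 8):  # stacked dilated causal convolutions
    x = layers.Conv1D(64, kernel_size=3, dilation_rate=rate,
                      padding="causal", activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
output = layers.Dense(1, activation="sigmoid")(x)

tcn = Model(seq, output)
# Shared training configuration (data loading omitted):
tcn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# tcn.fit(X_train, y_train, batch_size=32, epochs=10)
```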
The final step is a detailed evaluation of the trained models on unseen test data. This step involves predicting and classifying deepfake artifacts and is assessed using several metrics to ensure comprehensive performance analysis:
  • Accuracy: Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of cases. Equation (1) defines the accuracy metric, which counts correctly predicted cases among all available instances:
    $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (1)
  • Precision: This indicates the proportion of true positives among the total predicted positives. It is defined in Equation (2) as
    $\text{Precision} = \dfrac{TP}{TP + FP}$ (2)
  • Recall: Recall measures the proportion of true positives identified among the actual positives. Recall is measured using Equation (3):
    $\text{Recall} = \dfrac{TP}{TP + FN}$ (3)
  • F1 score: This is the harmonic mean of precision and recall, providing a single metric that balances both. The F1 score, given in Equation (4), provides a realistic measure of the tradeoff between precision and recall:
    $\text{F1-score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (4)
This comprehensive evaluation ensures that the models are not only accurate but also generalizable, capable of performing well on new, unseen data.
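These four metrics can be computed directly with scikit-learn. The sketch below assumes binary labels (1 = fake, 0 = real); the label vectors are placeholders rather than results from our experiments.

```python
# A short sketch of computing Equations (1)-(4) with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels (placeholder)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (placeholder)

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
```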

3. Related Work

Deepfake technology, primarily based on deep neural networks (DNNs) and generative adversarial networks (GANs), has made significant progress in recent years, impacting various areas through the creation of fake multimedia content. Deep learning plays a crucial role in both the creation and detection of these deepfakes. The most popular methods for distinguishing legitimate content from fake content include techniques for image, text, video, and facial recognition [11].
A combination of deep learning algorithms has led to an important development in identifying deepfakes in photos and videos. To distinguish between original and modified features in the Fisherface dataset, for example, ST et al. used methods such as deep belief networks (DBNs) and local binary pattern histogram (FF-LPBH) [15]. Comparably, Ismail et al. investigated the use of convolutional recurrent neural networks (CRNNs) in conjunction with the You Only Look Once (YOLO) framework to identify deepfakes through the inspection of temporal and spatial video characteristics [16].
Deepfake detection methods have also been improved to handle large datasets. Chauhan et al. [17] highlighted the necessity of improving accuracy through various model upgrades. Investigating both automatic and manual identification techniques, Groha et al. [18] noted that the emotional and behavioral nuances in videos pose practical application challenges.
On the anomaly detection front, various studies have focused on categorizing video anomalies using spatial–temporal criteria, illustrating the complexity of distinguishing deepfakes from authentic content [19]. Raza et al. integrated blockchain and cloud technologies with deep learning frameworks like VGG16 and CNN for a robust detection system, aiming to safeguard genuine multimedia content [20].
Moreover, the development of deepfake detection tools and techniques continues to evolve. Khochare et al. [21] focused on audio deepfakes, transforming audio data into spectral features and employing machine learning and deep learning models for high-accuracy classification. The successful use of ensembled models in identifying fake features has also been demonstrated through an integrated strategy utilizing several deep learning models, as verified in research by Andrew H. Sung and Md. Shohel Rana [22].
The main goal of the current research is to identify deepfake pictures, which is difficult because of the minor evidence of manipulation, such as color modifications and face splicing. Because these alterations are frequently undetectable to human sight and make use of advanced artificial neural network algorithms, it is extremely difficult to identify fakes without the help of technology. However, the study also notes a significant limitation related to physiological factors. For example, the correlation between blinking and underlying mental health issues suggests that certain detection methods based on eye movement may not be universally applicable. Individuals with mental health conditions have different blinking patterns, which can lead to false positives or negatives in deepfake detection.
Rapid advancements in deepfake generation have necessitated the development of robust forensic detection techniques. Several recent studies have introduced innovative approaches to deepfake detection by leveraging different machine learning and deep learning architectures.
Huang et al. [23] proposed an identity-driven approach to the detection of deepfake face swapping, focusing on capturing implicit identity inconsistencies introduced during manipulation. This method is particularly effective in detecting subtle variations in facial identity that arise due to generative adversarial network (GAN) transformations.
Kong et al. [24] introduced a novel technique that detects manipulated faces by analyzing both semantic and noise-level signals. Their approach effectively identifies false sets by examining inconsistencies in lighting, texture, and pixel-level noise, which align with our convolutional neural network (CNN)-based spatial feature extraction method.
Yan et al. [25] proposed UCF, a framework designed to uncover common deepfake features in various datasets, ensuring better generalizability. This research highlights the importance of designing detection models that can adapt to new deepfake manipulations, which aligns with our approach’s goal of improving cross-dataset robustness.
Luo et al. [26] explored the use of Vision Transformers (ViTs) for deepfake detection, introducing a forgery-aware adaptive ViT model. Their findings demonstrate that Transformer-based architectures can enhance the ability to capture both local and global forgery artifacts. In comparison, our method utilizes a hybrid framework that incorporates CNNs for spatial analysis and recurrent neural networks (RNNs) for temporal inconsistencies, offering a complementary approach to ViT-based models.
Jia et al. [27] investigated the effectiveness of multimodal large language models (LLMs), such as ChatGPT, in detecting deepfake content. Their work provides insights into the potential of AI-driven forensics in media authenticity verification. While their approach focuses on multimodal feature extraction, our method employs a hybrid deep learning model that explicitly captures spatial–temporal inconsistencies in videos.
Table 5 provides a comparative overview of various methodologies for deepfake detection, focusing on their feature-based approaches, classifiers, and performance metrics across different datasets. The first method [28] utilizes a combination of visual features of eyes and teeth and uses logistic regression and MLP classifiers. The approach achieves an AUC of 0.851 and an accuracy of 0.854 on the FaceForensics++ dataset. The second approach [29] is based on deep learning features with a capsule network. It achieves a high accuracy of 0.91 and an F1 score of 0.91. However, it has a notably low recall of 0.08. The CNN + RNN architecture in the third method [30] incorporates both image and temporal features. It performs well, with an AUC of 0.93 and an accuracy of 0.939, on a low-quality FF++ dataset subset. The fourth technique [31] employs a dynamic prototype network for image and temporal features, with moderate effectiveness, reflected in an AUC of 0.718 and accuracy of 0.72 on high-quality FF++ data. Techniques focused on eye blinking, such as the LRCN [32] and distance-based classifier [33], demonstrate the utility of analyzing temporal patterns for detection. The distance-based method achieves an AUC of 0.875 and an accuracy of 0.85 on datasets with unnatural eye movements, illustrating its ability to capture subtle deepfake artifacts. Overall, the table highlights the effectiveness of different deepfake detection techniques, with both spatial and temporal features proving valuable for improved detection performance.
Using a CNN XceptionNet for facial feature extraction and stacking multiple convolution modules to obtain audio embeddings, Ref. [34] shows that spatiotemporal features combined with an LSTM and a convolutional bidirectional recurrent LSTM network perform well. The authors use two loss functions: cross-entropy and Kullback–Leibler divergence. Afchar et al. introduce two deep networks, Meso-4 and MesoInception-4, to analyze deepfake videos at the mesoscopic level; their accuracy on the deepfake and FaceForensics datasets is 98% and 95%, respectively [35]. Features are extracted using 68 landmarks of the face region. Yang et al. (2019) use an SVM for classification based on extracted head pose features [36].
In [37], the authors described developing an algorithm using a convolutional neural network (CNN) to detect forged images and videos. They examined 26 distinct deep convolutional models for detecting deepfake videos and photos and classifying their fake features, and highlighted the CNN model’s ability to detect deep counterfeit videos and pictures by replacing the top layer with a sigmoid layer or activation function applied to outputs produced by the generative adversarial network. Rana et al. [38] studied deepfake detection to identify counterfeit images, videos, and audio, providing an overview of deepfake detection for videos and pictures and summarizing 112 articles on deepfake video detection published between 2018 and 2020.
In [39], the authors used the Xception method to achieve high accuracy on two common datasets: DeepFakeTIMIT and FaceForensics++. Zil et al. [40] researched images manipulated using deep learning methods; two of the most commonly used individual datasets in that article were Deep-Fake-Detection and FaceForensics++. Additionally, the authors utilized a large dataset known as WildDeepfake, consisting of 7314 face sequences from 707 deepfake videos, and proposed two attention-based deepfake detection models (2D and 3D), called ADDNets, which apply attention masks to faces for improved detection. The primary focus of that research was to identify variations between frames, and the authors further compared wild deepfake videos and images with existing datasets. According to Shahzad et al. [41], deepfake content, including audio, video, and images, has been generated through various deep learning approaches; these models, including convolutional neural networks (CNNs), GANs, and other deep neural networks (DNNs), are applied to different types of datasets. In this study, the authors also explored the use of traditional machine learning models and discussed employing advanced deep learning techniques to develop a more effective deepfake detection system.
Luisa et al. [12] provided an overview of several methods and strategies for assessing the integrity and verification of media content, concentrating on the significance of the spread of deep learning-generated fake media content. The neural networks developed by H. Khalid et al. [42], known as OC-FakeDect, utilize variational autoencoder (VAE) models trained solely on legitimate face images to detect fake images by identifying artificial features in photos and videos. Y. S. Malik et al. [43] proposed two techniques: XceptionNet, which achieves 95% accuracy in detecting fake features in images and videos, and C-GANs, used for generating counterfeit images. Additionally, a study [44] introduced a time-based method for detecting fake videos using small contrasts between video frames.

4. Proposed Models and Forensics Techniques

The proposed methodology revolves around the identification of deepfake images and videos with the help of state-of-the-art deep learning approaches. This comprehensive approach is structured as follows.
Deepfake image and video forensics combines artifact and landmark detection using deep learning techniques; the systematic approach is outlined in Figure 2.

4.1. Dataset Description

The primary source for training and evaluating the deepfake detection models in this study was the FaceForensics++ dataset [45]. This dataset is widely known in the research community for the large and diverse variety of manipulated videos it contains, which is essential for building and testing robust detection algorithms. Here, we summarize the key attributes of the dataset and explain why it was chosen. FaceForensics++ is a dataset specifically designed for video and image forgeries: it provides a mix of real and manipulated videos, enabling model training focused on distinguishing between real and fake content.

4.1.1. Manipulation Methods

The dataset [45] contains 1000 authentic videos collected from the internet that display individuals in different scenes and lighting conditions; this diversity allows the dataset to represent a broad range of real-world scenarios. To simulate different styles of video manipulation, FaceForensics++ includes several forgery methods applied to each original video:
  • Deepfakes: Utilizes deep learning for face swapping, replacing a target face with one from another video.
  • Face2Face: Alters facial expressions in the target video to match those of a source actor.
  • FaceSwap: A traditional face-swapping technique that does not rely on deep learning.
  • Neural Textures: Uses GAN-based techniques to manipulate facial features, producing highly realistic details.

4.1.2. Dataset Scale

FaceForensics++ provides videos at four different compression levels (from raw to heavily compressed) to replicate real-world conditions where video quality varies. This allows models trained on this dataset to be more resilient and effective across diverse video qualities. For each forgery method, the dataset contains approximately 4000 manipulated videos across all quality levels, resulting in a total of around 100,000 samples. This balance of manipulated and original videos ensures that the dataset can support comprehensive training and testing of detection algorithms.
Each video is labeled to indicate whether it is real or manipulated and specifies the type of manipulation used. These labels enable the application of supervised learning techniques.
The FaceForensics++ dataset was chosen due to its extensive variety and high-quality annotations, which are critical for developing and validating deepfake detection models. Its range of manipulation methods, compression levels, and detailed labels make it well suited for training models to generalize effectively to real-world forgeries, making it an invaluable asset in the ongoing advancement of deepfake detection research.

4.2. Data Preprocessing

The DeepFake Video dataset serves as the primary input, containing both original and manipulated videos, and the process begins with this dataset. For manual analysis, frames are extracted from the videos to create the DeepFake Frame dataset, which provides individual frames for more granular analysis. Additionally, the DeepFake Facial Regions dataset is formed by extracting frames from the videos and processing them to isolate specific regions of interest (ROIs): the eye, nose, and mouth areas, where forensic details are critical. A sketch of this frame and ROI extraction step follows.
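The following is a minimal sketch of this extraction step, assuming OpenCV with a Haar cascade for face localization; the file paths, sampling rate, and crop size are illustrative assumptions rather than the exact pipeline settings.

```python
# Sketch: extract face crops from every n-th frame of a video with OpenCV.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_frames(video_path, every_n=10, size=(128, 128)):
    """Yield cropped, resized face regions from every n-th frame."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
                yield cv2.resize(frame[y:y + h, x:x + w], size)
        index += 1
    cap.release()

# Example: save face crops from one (hypothetical) video file.
for i, face in enumerate(extract_face_frames("sample_video.mp4")):
    cv2.imwrite(f"face_{i:04d}.png", face)
```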

4.3. Artifact Landmark Detection

In Table 6, the facial features and their parameters are presented. Facial recognition algorithms are used to identify fundamental features such as the eyes, nose, and mouth. Analyzing facial feature landmarks identifies reference points that aid further facial feature analysis, forming a structured map on which nearby unnatural modifications can be located. A landmark extraction sketch is given below.
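A brief sketch of such landmark extraction follows, assuming dlib and its publicly available 68-point predictor file; the grouping of indices into eye, nose, and mouth regions follows dlib’s standard 68-point convention.

```python
# Sketch: 68-point facial landmark extraction with dlib, grouped by region.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

REGIONS = {                   # standard 68-point index ranges
    "nose": range(27, 36),
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "mouth": range(48, 68),
}

def landmark_map(image_path):
    """Return {region: [(x, y), ...]} for the first detected face."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return {name: [(shape.part(i).x, shape.part(i).y) for i in idx]
            for name, idx in REGIONS.items()}

print(landmark_map("face_0000.png"))  # hypothetical frame from the previous step
```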

4.4. Correlation Between the Artifacts to Identify Correlated Pairs

Figure 3 presents a feature correlation matrix that visually illustrates the relationships between different facial features, plotted on the X-axis and Y-axis. The axes carry features such as the positions of the mouth corners (left and right), the dimensions of the lips, the dimensions of the nose, the dimensions of the eyes, and other structural and proportional facial elements. The same features appear on each axis, resulting in a symmetric matrix, with each cell representing the correlation coefficient between two features. These coefficients, ranging from −1 to 1, are color-coded for clarity: feature pairs that move together (positive correlation) are denoted by deep red, pairs that move in opposite directions (negative correlation) by deep blue, and weak or no correlation by lighter shades. The matrix groups naturally correlated features (e.g., strong correlations between related features such as the corners of the mouth and the nasal and eye dimensions, which typically move together in genuine facial expressions). The derived patterns reflect the natural synchronization of human facial movements. Disruptions to this natural correlation structure can indicate tampering, as deepfake manipulations are unlikely to accurately replicate these subtle physiological connections. For instance, a mismatch between mouth movement and eye behavior may suggest manipulation. Forensic models can effectively identify deepfakes by using expected correlations and detecting deviations. This analysis highlights the importance of understanding both the spatial relationships and the natural harmony of facial features when analyzing potential manipulations.
A feature correlation matrix is presented in Figure 3, where both the X-axis and the Y-axis correspond to different facial features or artifacts.
  • A strong positive correlation (features move together) is represented by deep red.
  • Deep blue represents a strong negative correlation (the first feature increases while the second decreases).
  • Darker shades thus indicate stronger correlations, while lighter shades suggest weak or no correlation.
  • Individual feature analysis:
    • Nose: Parameters such as width, height, tip location, and nostril symmetry correlate strongly, meaning these facial dimensions often change together during various movements and facial expressions.
    • Mouth: The coordinated variations in the upper and lower jaw’s height and width, and the changes occurring during mouth movements (speaking or smiling), suggest very strong correlations among these parameters.
    • Eyes: Eye-derived indicators such as the eye aspect ratio (EAR), blink frequency, amplitude, and duration, as well as pupil size and movement, typically exhibit high correlations, implying that blinks and eye movements are closely related.
  • Inter-feature correlation analysis:
    • Nose and eyes: Exploring the relationship between nose position/dimensions and eye movements/closures can reveal coordination between these features during blinks or facial expressions.
    • Nose and mouth: This analysis checks whether movements of the mouth correlate with changes in the nose area, which might occur during various expressions.
    • Eyes and mouth: The focus here is on whether movements in the eyes (like blinking) are synchronized with mouth movements, which would be common during expressions or speech.
  • Strength of correlations:
    • Strong correlation (>0.7): This indicates that features move in tandem. For example, a strong correlation between the position of the nose tip X and nose bridge X suggests synchronized movements in these features.
    • Moderate correlation (0.3 to 0.7): This suggests a relationship but with less consistency. For instance, a moderate correlation between mouth aspect ratio and average eyelid movement might indicate that certain expressions affecting the mouth could also impact eyelid movements.
    • Weak correlation (<0.3): This shows little to no linear relationship. For example, a weak correlation between left EAR and nose shape Y implies that eye closures do not consistently correlate with the nose’s vertical dimensions.
This streamlined analysis provides a clear understanding of how different facial features interact and correlate within the dataset, useful for developing more accurate models in facial recognition and deepfake detection.
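A correlation matrix of this kind can be computed and visualized in a few lines. The sketch below assumes pandas and seaborn; the feature columns are hypothetical stand-ins for the landmark-derived measurements described above.

```python
# Sketch: compute and plot a feature correlation matrix with pandas/seaborn.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = pd.DataFrame({           # placeholder landmark-derived features
    "mouth_corner_left_x": rng.normal(size=200),
    "mouth_corner_right_x": rng.normal(size=200),
    "nose_width": rng.normal(size=200),
    "left_ear": rng.normal(size=200),   # eye aspect ratio
    "blink_frequency": rng.normal(size=200),
})

corr = features.corr()  # Pearson coefficients in [-1, 1]
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, annot=True)
plt.title("Facial feature correlation matrix")
plt.tight_layout()
plt.show()
```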

4.5. Artifact Annotations

Artifact annotations use the detected landmarks to label each detected artifact with high precision. This process focuses on the specific regions where deepfake tampering can occur, enabling targeted analysis.
The annotated data undergo a comprehensive data preparation pipeline, which includes four key stages:
  • Collect: Gathers the relevant artifacts and features, along with related information necessary for deepfake detection.
  • Noise Remover: Removes noise and irrelevant data and improves the overall data quality.
  • Transform: Formats and standardizes the data to make the data consistent across the dataset.
  • Enrich: Augments the dataset to increase the robustness of the deep learning models.

4.6. Data Preparation in Deepfake Forensics

The data preparation stage plays an important role in developing effective models for deepfake image and video forensics, ensuring the input data are clean, consistent, and enriched to best support learning. The quality and usability of the dataset depend on the application of three processes:
  • Noise removal;
  • Data transformation;
  • Data enrichment.

4.6.1. Noise Removal

The main task of noise removal is to eliminate irrelevant information in the data. In the context of deepfake forensics, this includes removing artifacts such as blurry or missing frames, unnecessary background elements, and low-quality images that could reduce the model’s accuracy. Techniques like filtering, denoising, and thresholding are applied to isolate only the high-quality and relevant facial regions for analysis. This step improves the clarity of the dataset and prevents the model from learning incorrect patterns or overfitting on irrelevant features.

4.6.2. Data Transformation

Following noise removal, the dataset is transformed to make it consistent and standardized across all samples. This transformation enables efficient processing by removing variations in lighting, scale, and orientation. Methods such as standard scaling, normalization, and geometric corrections are applied to keep the data uniform; for example, we reduce variability by resizing and aligning facial regions to a common frame of reference, allowing the model to focus on manipulation cues.

4.6.3. Data Enrichment

Data enrichment extends the dataset by adding synthetic examples, making it more robust and diverse. In this step, given that the data may be limited or imbalanced, new samples are created to simulate real-world scenarios. Further, synthetic manipulated samples, synthesized using advanced methods like generative adversarial networks (GANs), are used to generate artificial deepfake artifacts. This enriches the dataset so that the model sees more types of manipulations; it will thus generalize better to unseen data and be better able to detect both subtle and sophisticated deepfake techniques.
Overall, noise removal, data transformation, and data enrichment constitute a data preparation pipeline that becomes a foundation for efficient deepfake detection.

4.7. Artifact Sample Augmentation

Before fine-tuning the model, data augmentation is required. The augmentation strategies involve horizontal and vertical flipping, rotation, scaling and cropping, brightness and contrast enhancement, the addition of noise such as Gaussian noise, blurring, and elastic transformations. This step ensures that the model is exposed to varied scenes so that it can generalize across the different conditions under which deepfake manipulation occurs. A sketch of such an augmentation pipeline follows.
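The sketch below illustrates such an augmentation pipeline, assuming Keras’ ImageDataGenerator. The parameter values are illustrative choices, Gaussian noise is added through a custom preprocessing function, and elastic transformations and blurring, which ImageDataGenerator does not provide, would require an additional library and are omitted here.

```python
# Sketch: an augmentation pipeline with flips, rotation, scaling,
# brightness adjustment, and Gaussian noise.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def add_gaussian_noise(image):
    """Add mild Gaussian noise to a float image array in [0, 255]."""
    noise = np.random.normal(0.0, 10.0, image.shape)
    return np.clip(image + noise, 0.0, 255.0)

augmenter = ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True,
    rotation_range=15,            # degrees
    zoom_range=0.1,               # scaling/cropping effect
    brightness_range=(0.8, 1.2),  # brightness enhancement
    preprocessing_function=add_gaussian_noise,
)

# Example: stream augmented batches from a (hypothetical) folder of face crops.
# batches = augmenter.flow_from_directory("faces/", target_size=(128, 128),
#                                         batch_size=32, class_mode="binary")
```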

4.8. Artifact Balancing

The classes in the dataset are imbalanced, which can lead to faulty predictions, and hence it is necessary to handle this issue.
  • Synthetic handling of imbalanced features with SMOTE:
    Artifact tag balancing is carried out to handle imbalanced classes. This step removes any class imbalance from the dataset, so the model receives a comparable number of authentic and manipulated artifacts. To improve model generalizability, we apply the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples for under-represented classes. SMOTE is applied only to the training set, balancing its composition while leaving the test set untouched, so that evaluation reflects the real, unbalanced data distribution (see the cross-validation sketch at the end of Section 4.8.1).
  • Artifact distributions:
    For analytical or modeling tasks, the data are split into train and test sets. The train set is used to train the model, while the test set evaluates its performance. Table 7 shows that data augmentation techniques have been used to balance the dataset, providing a significantly larger training sample size.

4.8.1. K-Fold Cross-Validation

The dataset is evaluated through K-fold cross-validation (k = 5), as in Table 7: the data are divided into 5 subsets, each serving once as the validation set while the remaining subsets are used for training. This method helps prevent overfitting. A sketch of this scheme, combined with the SMOTE balancing above, follows.
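The sketch below illustrates this scheme with scikit-learn and imbalanced-learn, applying SMOTE only to each fold’s training split. The feature matrix, labels, and the stand-in classifier are placeholders, not our actual data or models.

```python
# Sketch: stratified 5-fold cross-validation with SMOTE on training folds only.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression  # stand-in classifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # placeholder feature vectors
y = (rng.random(500) < 0.2).astype(int)  # imbalanced: ~20% "fake"

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                          random_state=0).split(X, y):
    # Oversample the minority class in the training fold only.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X[train_idx], y[train_idx])
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    scores.append(f1_score(y[val_idx], clf.predict(X[val_idx])))

print(f"Mean F1 across folds: {np.mean(scores):.3f}")
```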

4.8.2. Artifact Transformation

A notable component of this methodology is artifact transformation, which utilizes autoencoders to encode selected features into a representation more suitable for model training. Autoencoders consist of an encoder that transforms the input data into a lower-dimensional form, as shown in Table 8, and a decoder that maps this form back to the input data. As shown in Table 9, the encoder uses dense layers with ReLU activation functions, while the decoder uses dense layers with both ReLU and sigmoid activations. A minimal sketch follows.
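The following is a minimal sketch of such a dense autoencoder, assuming Keras; the 128-dimensional input and 32-dimensional bottleneck are illustrative choices.

```python
# Sketch: dense autoencoder (ReLU encoder, ReLU/sigmoid decoder, Adam, MSE).
from tensorflow.keras import layers, Model

input_dim, latent_dim = 128, 32
inputs = layers.Input(shape=(input_dim,))

encoded = layers.Dense(64, activation="relu")(inputs)
encoded = layers.Dense(latent_dim, activation="relu")(encoded)  # bottleneck

decoded = layers.Dense(64, activation="relu")(encoded)
decoded = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)  # reused later for feature transformation
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_train, X_train, epochs=10, batch_size=32)
```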

4.9. Workflow Using Models

After preparing the data, generative adversarial networks (GANs) are utilized to generate additional fake samples, helping the model improve its ability to recognize manipulated content. GANs achieve this by having two networks (generator and discriminator) compete, creating high-quality fake samples that enhance model training. Deep learning models, such as convolutional neural networks (CNNs) and temporal models then extract patterns from the data. During the evaluation phase, the trained model applies these learned patterns to classify new media as real or fake accurately. Overall, this architecture provides a comprehensive workflow for detecting deepfakes, combining landmark detection, GAN-based augmentation, and thorough data preparation, enabling the model to identify even subtle manipulations in images and videos.

4.10. Model Training

The model training process followed in this research is a complete pipeline, combining generative adversarial networks (GANs) and deep learning models to increase deepfake detection capabilities. GANs first produce synthetic manipulated samples, adding to the training data so that the model can learn to detect subtle manipulations in various settings; within this GAN framework, the generator produces high-quality fake samples, which aid model learning. The training then incorporates convolutional and temporal models, including CNN, CNN-LSTM, CNN-GRU, and TCN, each selected for its effectiveness in capturing specific features: spatial models detect spatial artifacts within frames, while temporal models (LSTM, GRU, TCN) capture temporal artifacts across video frames, refining temporal artifact detection. Model reliability is rigorously evaluated against metrics like accuracy, precision, recall, and F1 score to separate real (actual) from fake (false) media. Moreover, we also apply advanced methods such as K-fold cross-validation and SMOTE to balance training, prevent overfitting, and improve robustness against various forgery styles.

Model Training Pseudocode

To elucidate the computational steps involved in training our generative adversarial networks (GANs) and deep learning models, we present a detailed pseudocode. This pseudocode outlines the processes, from building the models to training and evaluating them, ensuring clarity in the methodological framework used in our research.
The system for deepfake image and video forensics based on a generative adversarial network (GAN) framework includes two components, a generator and a discriminator, as illustrated in Algorithm 1. The BUILD_GENERATOR function sets up a generator model according to the given latent dimension. It first creates a sequential model and adds dense layers with activation functions (ReLU in the middle layers and tanh at the output layer) to generate output closely resembling real data from random noise; the generated data simulate potential deepfake manipulations. In the pseudocode of Algorithm 1, this functionality is implemented in lines 2–7: line 3 initializes the sequential model, lines 4–6 add the dense layers (relu activation in the intermediate layers, tanh at the output layer), and line 7 returns the completed generator model.
The discriminator model, which tries to distinguish between real and fake samples, is built using the BUILD_DISCRIMINATOR function. Like the generator, it starts with a sequential model, but now contains an input layer matching the dimensions of the data, dense layers with relu activations, and a final layer with sigmoid activation that outputs a binary classification of the input as real or fake. In the pseudocode of Algorithm 1, this functionality is implemented in lines 9–16: line 10 initializes the sequential model, line 11 adds the input layer with a shape matching the data dimensions, lines 12–13 add dense layers with relu activation, line 14 adds the final dense layer with sigmoid activation for binary classification, and line 15 returns the model.
Algorithm 1 Detailed Pseudocode for GAN and Deep Learning Model Operations
1: Import Libraries
2: function build_generator(latent_dim)
3:     Initialize a sequential model
4:     Add dense layer (128 neurons, ’relu’, input_dim = latent_dim)
5:     Add dense layer (256 neurons, ’relu’)
6:     Add output dense layer (feature columns, ’tanh’)
7:     return model
8: end function
9: function build_discriminator(input_dim)
10:     Initialize a sequential model
11:     Add input layer (shape = input_dim)
12:     Add dense layer (256 neurons, ’relu’)
13:     Add dense layer (128 neurons, ’relu’)
14:     Add output dense layer (1 neuron, ’sigmoid’)
15:     return model
16: end function
17: function train_gan(generator, discriminator, gan, epochs, batch_size, latent_dim, X_train)
18:     for epoch in 1 to epochs do
19:         Generate noise (normal distribution)
20:         Generate fake data from noise using generator
21:         Select random batch of real data from X_train
22:         Train discriminator on real data as ’real’
23:         Train discriminator on fake data as ’fake’
24:         Train generator via GAN to classify fake as ’real’
25:         Optional: Print training progress
26:     end for
27: end function
The adversarial training process is run by the TRAIN_GAN function, where the two models learn iteratively. In each epoch, noise is generated and fed into the generator to create synthetic data. The discriminator is then trained separately on the fake samples the generator produces and on real samples from the training set. Finally, the whole GAN is trained; in its loop, the generator’s parameters are updated in reaction to feedback from the discriminator, pushing the generator towards producing more ’realistic’ fakes. The hyperparameters epochs and batch_size control an iteration process that enables deeper fake detection and contributes to the robustness of the deepfake detection system in the image and video forensics approach. In the pseudocode of Algorithm 1, this functionality is implemented in lines 17–27: line 18 starts the loop over the given number of epochs, line 19 generates random noise as input for the generator, line 20 produces synthetic data by passing the noise to the generator, line 21 selects random real data samples from the training dataset, lines 22–23 train the discriminator on both real samples (as “real”) and synthetic data (as “fake”), line 24 trains the GAN system, allowing the generator to learn to produce data classified as “real” by the discriminator, and line 25 optionally prints the training progress for monitoring.
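For concreteness, the following is a runnable Keras sketch of Algorithm 1 under the assumption that the inputs are tabular feature vectors (e.g., landmark-derived features) rather than raw frames. The layer sizes follow the pseudocode, while the placeholder data, dimensions, and loop details are illustrative assumptions.

```python
# Sketch: a minimal GAN following Algorithm 1, on placeholder feature vectors.
import numpy as np
from tensorflow.keras import layers, Sequential, Model

latent_dim, feature_dim = 32, 64

def build_generator(latent_dim):
    return Sequential([
        layers.Dense(128, activation="relu", input_dim=latent_dim),
        layers.Dense(256, activation="relu"),
        layers.Dense(feature_dim, activation="tanh"),  # fake feature vector
    ])

def build_discriminator(input_dim):
    return Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # real (1) vs. fake (0)
    ])

generator = build_generator(latent_dim)
discriminator = build_discriminator(feature_dim)
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Stacked GAN: freeze the discriminator while training the generator.
discriminator.trainable = False
gan = Model(generator.input, discriminator(generator.output))
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_gan(epochs=10, batch_size=32,
              X_train=np.random.randn(1000, feature_dim)):
    for epoch in range(epochs):
        noise = np.random.normal(size=(batch_size, latent_dim))
        fake = generator.predict(noise, verbose=0)
        real = X_train[np.random.randint(0, len(X_train), batch_size)]
        d_real = discriminator.train_on_batch(real, np.ones((batch_size, 1)))
        d_fake = discriminator.train_on_batch(fake, np.zeros((batch_size, 1)))
        # Push the generator to make samples the discriminator calls 'real'.
        g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))
        print(f"epoch {epoch}: d={d_real + d_fake:.3f} g={g_loss:.3f}")

train_gan()
```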
The complete advanced pipeline for deepfake detection, covering synthetic data generation, data preprocessing, model training, and model evaluation, is shown in Algorithm 2. The GENERATE_SYNTHETIC_DATA function produces synthetic samples from a generative model: it generates noise from a normal distribution (line 2) and creates data that resemble real images or videos (line 3). This is important because, for the model to detect different types of deepfake media, it must be exposed to varied kinds of realistically modified content (lines 1–4).
The CREATE_AUTOENCODER function builds an autoencoder model for feature enhancement. The input layer is initialized (line 7), dense layers for encoding and decoding are added (line 8), and the autoencoder is compiled using the adam optimizer and the mse loss function (line 9). The encoder is then extracted for feature representation (line 10), improving the model’s ability to detect relevant artifacts in deepfake media (lines 6–12).
Algorithm 2 Detailed Pseudocode for GAN and Deep Learning Model Operations—Part 2
1: function Generate_Synthetic_Data(generator, latent_dim, num_samples)
2:     Generate noise (normal distribution)
3:     Generate synthetic data from noise using the generator
4:     Return synthetic data
5: end function
6: function Create_Autoencoder(input_dim)
7:     Create input layer (shape = input_dim)
8:     Add encoded and decoded layers to build an autoencoder
9:     Compile autoencoder (adam, mse)
10:     Extract encoder part from autoencoder model
11:     Return autoencoder, encoder
12: end function
13: function Plot_Confusion_Matrix(cm, classes, model_name)
14:     Set plot titles and labels
15:     Save plot as an image
16:     Show plot
17: end function
18: function Handle_Outliers(dataframe, column_name)
19:     Identify and replace outliers with median values
20:     Return modified data frame
21: end function
22: procedure Perform Cross-Validation
23:     for each fold in stratified K-fold do
24:         Prepare data for training and validation
25:         Transform features using autoencoder
26:         Balance dataset using SMOTE
27:         for each model (CNN, CNN-LSTM, CNN-GRU, TCN) do
28:            Train model on training data
29:            Evaluate model on validation data
30:            Compute and store metrics (accuracy, precision, recall, F1-score)
31:            Plot and save ROC curve and performance metrics
32:         end for
33:     end for
34:     Compute aggregate results (mean, std dev) across folds
35:     Save aggregate results to CSV file
36: end procedure
The PLOT_CONFUSION_MATRIX function plots model performance as a confusion matrix. It sets plot titles and labels (line 14), saves the plot as an image (line 15), and displays it (line 16), showing key quantities such as true and false positives and misclassification patterns for real and fake data (lines 13–17).
The HANDLE_OUTLIERS function (lines 18–21) addresses data quality issues by identifying outliers using the interquartile range (IQR) method and replacing them with median values (line 19), which normalizes the dataset and makes the model more robust.
The core of the training process is the cross-validation loop, implemented here via stratified K-fold cross-validation (lines 22–33). For each fold, data are prepared for training and validation (line 24), features are transformed with the autoencoder (line 25), and the dataset is balanced using SMOTE (line 26). Accuracy, precision, recall, F1 score, and AUC are used as metrics, and the architectures CNN, CNN-LSTM, CNN-GRU, and TCN are trained and evaluated (lines 27–32). The aggregated metrics are saved to a CSV file for detailed analysis (lines 34–35).
Finally, we save the mean and standard deviation of each metric to a CSV file for detailed analysis. Algorithm 2 presents a structured approach that offers a rigorous and multi-faceted strategy for deepfake detection using synthetic data generation, robust preprocessing, and cross-validation for an effective and generalizable detection performance.

5. Results and Discussion

This study demonstrates that GAN augmentation significantly improves model performance in detecting deepfake artifacts across different facial landmarks and configurations. The CNN-LSTM and TCN models outperform the others, generalizing well and remaining stable when trained on spatiotemporal features. Multiple experiments were performed using these models, as follows:
  • Experiment 1: Eye landmarks;
  • Experiment 2: Fusion of eyes and nose landmark facial region;
  • Experiment 3: Fusion of eyes, nose, and mouth landmark facial region.

5.1. Experiment 1: Eye Landmarks

One of the core artifacts in deepfake detection is the temporal pattern of eye blinking, which serves as a good artifact to distinguish between an image/video that is real versus an image/video that is manipulated. In this work, the model’s ability to predict accurate eye-blinking patterns both with and without using GAN augmentation is investigated.
From Table 10, it can be observed that all models improved in precision, recall, and F1 score when GANs were used. For example, the F1 score of the CNN-LSTM model improved from 0.906 without GAN augmentation to 0.922 with it, demonstrating the importance of GANs in enhancing model training via synthetic artifacts.
With GAN augmentation, the CNN-LSTM model achieved the highest performance, with an F1 score of 0.927, as shown in Figure 4, and TCN followed closely, also with an F1 score of 0.927. This shows that temporal convolutional networks, well suited to detecting time dependencies, are especially powerful for detecting blinking patterns in deepfake analysis.
Without GAN augmentation, the CNN-LSTM achieves one of the highest accuracies and lowest loss values, free of significant overfitting. Its consistency across both metrics indicates that it is a good model for the task, as shown in Figure 5.
Interpreting the accuracy-versus-epoch curves, TCN, like CNN-LSTM, achieves high accuracy with low overfitting, as observed from the proximity of its training and validation curves. This suggests that TCN is useful for identifying spatiotemporal artifacts. Comparing these models, CNN-LSTM and TCN stand out as the top performers:
  • The training and validation curves of CNN-LSTM demonstrate the highest accuracy and the least gap between the training and validation curves, which further signifies that there is less overfitting occurring.
  • Next is TCN, which performs nearly as well and shows stable and reliable learning for the temporal analysis of artifacts.
The best overall results are observed for CNN-LSTM with GAN, whose training and validation curves align closely for both accuracy and loss. Compared with the other models, it offers the best generalization and stability for artifact detection, aided by GAN augmentation. TCN with GAN is again a close competitor, with similarly aligned training and validation curves; it loses little temporal information and is a strong candidate where temporal dependencies matter.

5.2. Experiment 2: Fusion of Eyes and Nose Landmark Facial Region

In deepfake videos, fused eye and nose landmarks can reveal unnatural variations and inconsistencies that indicate manipulation. This section analyzes model performance in detecting eye and nose artifacts across multiple facial regions.
The results in Table 11 show that when using GAN augmentation, all the models achieved higher scores, with the TCN model reaching an F1 score of 0.917, as in Figure 6, the highest of all models tested for this artifact.
By processing temporal and spatial features simultaneously, the TCN and CNN-LSTM achieved competitive performance (Table 11), demonstrating that they can differentiate inconsistent features derived from the fused eye and nose regions. The TCN model scores highest, reflecting its greater efficiency in handling temporal irregularities.
The CNN-LSTM model achieves the highest performance in this analysis. Its training and validation accuracy and loss curves are almost perfectly aligned, showing that it learns temporal features without overfitting; this makes it an appropriate choice for tasks involving combined temporal and spatial feature extraction. TCN is a very close second, as shown in Figure 7, with similarly aligned training and validation curves, implying that it generalizes well; because the TCN structure is robust to sequential data, its artifact detection performance remains significant even without GAN augmentation. CNN-GRU also performs well in learning temporal dependencies, with a minimal gap between training and validation metrics; while it does not reach the level of CNN-LSTM or TCN, it effectively captures these dependencies in the data. The plain CNN shows moderate performance, with a large difference between training and validation metrics: it overfits, limiting its generalization, and its simple structure cannot model the complex relationships in the fused artifact data.
In summary, detection of fused eye and nose artifacts without GAN augmentation is achieved best by CNN-LSTM, followed closely by TCN, which again shows excellent generalization. CNN-GRU offers good stability, while the plain CNN remains a fair baseline that would need a more complex structure to capture the dataset's details as well as the other models.
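For readers unfamiliar with TCNs, the following Keras sketch illustrates the dilated causal convolution structure that gives the model its robustness to sequential data; the filter counts, dilation schedule, and input shape are assumptions, while the training setup mirrors Table 3.

```python
# Illustrative TCN with dilated causal convolutions and residual connections.
import tensorflow as tf
from tensorflow.keras import layers, models

def tcn_block(x, filters, dilation):
    # causal padding ensures frame t only sees frames <= t
    h = layers.Conv1D(filters, 3, padding="causal",
                      dilation_rate=dilation, activation="relu")(x)
    h = layers.Conv1D(filters, 3, padding="causal",
                      dilation_rate=dilation, activation="relu")(h)
    # 1x1 convolution matches channel counts for the residual connection
    skip = layers.Conv1D(filters, 1, padding="same")(x)
    return layers.Add()([h, skip])

inputs = layers.Input(shape=(30, 24))   # assumed: 30 frames x 24 features
x = inputs
for d in (1, 2, 4, 8):                  # exponentially growing receptive field
    x = tcn_block(x, 64, d)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
# Training setup as in Table 3
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```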

5.3. Experiment 3: Fusion of Eyes, Nose, and Mouth Landmark Facial Region

Artifacts involving the eyes, nose, and mouth together offer a deeper test of deepfake detection, as modifying these features jointly is a common technique for creating more realistic fake content. In this section, we examine how well the models detect inconsistencies in these fused facial features.
As in the previous artifact investigations, GAN-augmented models performed better on our metrics. The results show that the TCN model remains the best choice, with an F1 score of 0.912 (Table 12), indicating its capability in dealing with complex, multi-region facial artifacts.
Challenges in multi-region detection: CNN-LSTM also performed robustly, with an F1 score of 0.902 with GAN, even though fused artifacts spanning multiple regions are harder to detect accurately. This task benefited from the structure of the CNN-LSTM, which is specifically designed to capture spatial–temporal correlations.
The best overall results were obtained by CNN-LSTM with GAN, with near-perfect agreement between training and validation on both accuracy and loss. CNN-LSTM achieves a high degree of generalization with minimal divergence (Figure 8), which makes it an ideal choice for discovering GAN-generated fused artifacts around the eye, nose, and mouth landmarks. A close second is TCN with GAN, which shows strong generalization and stability; its good fit to temporal dependencies makes it a compelling alternative to CNN-LSTM for sequential data.
Both GAN-augmented models exhibit excellent alignment between their accuracy and loss curves, indicating strong generalization, as shown in Figure 8. CNN-GRU and CNN offer reliable stability, but CNN-GRU does not reach the same accuracy levels, and CNN achieves only reasonable performance and would require further improvements to handle the complexity of fused GAN-generated features.
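As an illustration of the augmentation step, the sketch below outlines a minimal GAN over landmark feature vectors; the network sizes and dimensions are assumptions, not the paper's exact configuration, and the adversarial training loop is omitted for brevity.

```python
# Minimal sketch of GAN-based augmentation of landmark feature vectors
# (dimensions and layer sizes are illustrative assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

LATENT_DIM, N_FEATURES = 16, 24  # assumed dimensions

# Generator maps random noise to a synthetic landmark feature vector
generator = models.Sequential([
    layers.Input(shape=(LATENT_DIM,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(N_FEATURES, activation="tanh"),
])

# Discriminator scores a feature vector as real (1) or generated (0)
discriminator = models.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# After adversarial training (loop omitted), synthetic samples can be drawn
# to enlarge the training set before fitting the detectors:
noise = tf.random.normal((1000, LATENT_DIM))
synthetic_features = generator(noise)  # shape: (1000, N_FEATURES)
```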

5.4. Ablation Studies

To further justify the effectiveness of our proposed framework, we conducted an ablation study analyzing the contribution of each key component—CNN, LSTM, GAN, and TCN—to the overall model performance. This study provides a detailed breakdown of how each network enhances deepfake detection.
Experimental setup: We performed ablation experiments by systematically removing individual components from our architecture and evaluating the impact on detection accuracy. The following model variations were tested (a configuration sketch follows the list):
  • CNN-only: Extracts spatial features such as texture anomalies and inconsistencies in facial landmarks but lacks temporal awareness.
  • CNN + RNN (LSTM/GRU): Incorporates temporal inconsistencies (e.g., unnatural blinking and lip-sync issues) but lacks generative reconstruction for deeper forgery detection.
  • CNN + GAN: Detects manipulation artifacts through reconstruction loss but does not capture temporal dependencies.
  • CNN + TCN: Captures long-range dependencies in video sequences but lacks generative forgery detection.
  • Full model (CNN + RNN + GAN + TCN): Integrates all components to leverage spatial, temporal, generative, and sequential dependencies.
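A hypothetical driver for these experiments might look like the sketch below; the variant names and builder callables are illustrative, not the actual implementation, but they capture the protocol of training every variant under identical settings.

```python
# Hypothetical ablation driver: each variant enables a subset of components,
# and all variants are trained and evaluated under one shared protocol.
ABLATION_VARIANTS = {
    "cnn_only":   ("cnn",),
    "cnn_rnn":    ("cnn", "lstm"),
    "cnn_gan":    ("cnn", "gan"),
    "cnn_tcn":    ("cnn", "tcn"),
    "full_model": ("cnn", "lstm", "gan", "tcn"),
}

def run_ablation(build_fn, train_fn, evaluate_fn, data):
    """build_fn(components) -> model; train_fn and evaluate_fn are shared."""
    results = {}
    for name, components in ABLATION_VARIANTS.items():
        model = build_fn(components)              # assemble enabled parts only
        train_fn(model, data)                     # same optimizer/epochs for all
        results[name] = evaluate_fn(model, data)  # e.g., accuracy / F1
    return results
```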
The ablation study results confirm that each component contributes uniquely to deepfake detection performance. The CNN alone achieved moderate accuracy by detecting spatial artifacts; adding an RNN improved the modeling of temporal consistency. The GAN module further enhanced performance by identifying forged regions through reconstruction discrepancies, and the TCN strengthened detection accuracy by capturing long-range dependencies. The full model achieved the highest accuracy, confirming that the fusion of CNN (spatial features), RNN (temporal inconsistencies), GAN (manipulated-feature detection), and TCN (long-range dependencies) significantly enhances deepfake detection robustness.
These findings validate our architectural choices and demonstrate the necessity of combining multiple neural network approaches to address the complex challenges posed by deepfake content.

5.5. Comparative Study of Landmark-Based Deepfake Detection Techniques

Figure 9 shows that models trained without GAN augmentation outperform their GAN-trained counterparts. Different levels of feature fusion produce distinct ROC curves, depending on whether the model uses eyes only; eyes and nose; or eyes, nose, and mouth. The AUC reaches 0.98 for eyes alone and 0.97 for eyes plus nose, while all three features together yield a value near 1.00. Performance therefore improves as the model receives more facial components, with the full three-feature fusion producing the maximum AUC and the strongest artifact detection.
In Figure 10, the ROC curves show that overall performance drops below that of the baseline models trained without GAN data. The AUC values are 0.88 for eyes, 0.87 for the fusion of eyes + nose, and 0.88 for the fusion of eyes + nose + mouth, demonstrating that fusion still helps despite a minor reduction when GAN-generated samples are used for training. GAN-generated data reduce model effectiveness because they introduce distribution shifts and noise during sample generation. Even with this degradation, feature fusion remains the most effective approach for artifact detection, confirming that multi-feature combination enhances detection performance.
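The ROC comparison in Figures 9 and 10 can be reproduced with standard tooling. The sketch below, assuming per-fusion prediction scores on a common test set (the variable names are placeholders), uses scikit-learn to compute each curve and its AUC and draws the dotted chance line used as the reference in both figures.

```python
# Sketch of the ROC comparison across landmark fusions.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(y_true, scores_by_fusion):
    """scores_by_fusion: {label: predicted probabilities on the test set}."""
    for label, y_score in scores_by_fusion.items():
        fpr, tpr, _ = roc_curve(y_true, y_score)
        plt.plot(fpr, tpr, label=f"{label} (AUC = {auc(fpr, tpr):.2f})")
    plt.plot([0, 1], [0, 1], "k--")  # chance line (the dotted reference line)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()

# plot_roc(y_test, {"eyes": s1, "eyes+nose": s2, "eyes+nose+mouth": s3})
```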

5.6. Comparison with State-of-the-Art Techniques

GAN augmentation had a positive impact on model accuracy for all artifact types, with the largest effect for temporal models such as TCN and CNN-LSTM. The TCN model was the most effective, obtaining the highest F1 scores across all landmark combinations. Incorporating more facial landmarks (e.g., eyes, nose, mouth) improved detection accuracy for multi-region artifacts for both the TCN and CNN-LSTM models, as shown in Table 13.
The first method [28], combining visual features of the eyes and teeth through logistic regression and an MLP, achieves a respectable AUC of 0.851 and an accuracy of 0.854 on the FaceForensics++ dataset. The second approach [29], using deep learning features with a capsule network, attains high accuracy (0.91) and F1 score (0.91) but a very low recall of 0.08. The third method [30], a CNN + RNN architecture integrating image and temporal features, reaches a high accuracy of 0.939 and an AUC of 0.93 on a low-quality FF++ subset. The fourth technique [31] uses a dynamic prototype network over image and temporal features, yielding an accuracy of 0.72 and an AUC of 0.718 on high-quality FF++ data, with moderate effectiveness. Temporal-pattern analysis also proves useful in the LRCN [32] and the distance-based classifier [33], two techniques adapted to eye blinking; on datasets with unnatural eye movements, the distance-based method achieves an AUC of 0.875 and an accuracy of 0.85, demonstrating its power to detect subtle deepfake artifacts. Our approach combines spatiotemporal features with augmented facial landmarks and GAN-based data augmentation. The GAN enriches the dataset with synthetic variations of the landmark regions (eyes, nose, mouth), while spatial analysis refines static cues and temporal analysis improves sequential cues, making the models more robust to subtle manipulations in both static and sequential inconsistencies. By generating diverse, high-quality synthetic samples, our method surpasses the existing approaches, reaching 96% accuracy and a 98% F1 score. It strikes an appropriate balance between precision and recall and handles a wide variety of forgery and temporal deepfake techniques, making it a robust and flexible deepfake forensics solution. Overall, the table illustrates that different detection techniques offer different performance advantages, and that combining spatial and temporal features works well for deepfake detection.

5.7. Real-World Applicability

The increasing prevalence of deepfake content poses significant challenges to digital media integrity, necessitating effective detection mechanisms. Our proposed framework, which integrates CNNs, RNNs, GANs, and TCNs, has the potential to be deployed in various real-world applications to enhance digital content verification and forensic analysis.
Social media moderation: Given the rapid dissemination of deepfake content on social media platforms, our model can be integrated into automated moderation systems to detect and flag manipulated media in real time. By leveraging both spatial and temporal inconsistencies, our approach enhances the ability of content moderation algorithms to identify fake videos and images before they spread widely.
Forensic investigations: Digital forensic analysts can utilize our method for identifying synthetic media in cybercrime investigations. The fusion of facial landmarks (eyes, nose, and mouth) allows for more accurate detection of tampered identities, which is crucial in legal proceedings and criminal cases involving misinformation or identity fraud.
Digital content verification: Our framework can be employed by news agencies, media organizations, and fact-checking services to verify the authenticity of multimedia content. By analyzing artifacts left by deepfake generation techniques, our model provides an additional layer of verification to prevent the spread of false information.
Additionally, we discuss the feasibility of deploying our framework in real-world settings by considering key factors such as computational cost, inference speed, and adaptability to emerging deepfake generation techniques. The lightweight design of our approach enables efficient processing, making it suitable for integration with cloud-based or edge-computing systems. Future enhancements could involve optimizing the model for mobile and embedded systems to expand accessibility and deployment capabilities.
The potential integration of our method with existing automated content moderation systems used by social media platforms further strengthens its practical applicability. By collaborating with industry stakeholders, the framework can be fine-tuned to meet platform-specific requirements, ensuring its effectiveness in mitigating the risks associated with AI-generated media.

6. Conclusions

This study investigates various deep learning models for detecting deepfakes, focusing on image and video forensics over fused facial regions of the eyes, nose, and mouth. This is crucial as deepfake technology becomes more sophisticated and harder to identify. The research shows that the performance of models such as CNN, CNN-LSTM, CNN-GRU, and TCN depends heavily on the types of features they analyze and on the use of advanced techniques such as GANs. Key findings include that CNNs are effective with original features but tend to overfit, whereas autoencoded features provide more consistent, if slightly less accurate, results. Models that combine CNNs with LSTM or GRU units are better suited to these features, showing a superior ability to process data over time and adapt to the synthetic variations introduced by GANs. The study highlights the importance of selecting the right features and model design, and of incorporating methods such as GANs to improve detection. It provides a basis for further research and practical approaches, suggesting a strategic selection of models and features to effectively combat deepfakes.

Future Work

In future work on deepfake creation, deep learning techniques will significantly enhance realism, stability, and controllability. With fine-grained control over facial expressions, textures, and lighting, GANs such as StyleGAN2 and StyleGAN3 have ushered in a new era of stability and artifact control; residual-artifact removal can be researched further to achieve seamless video deepfakes [46]. Transformer-based architectures such as Vision Transformers (ViTs) provide a robust framework for generating temporally coherent video sequences by learning long-range dependencies, and could in the future be optimized for more realistic facial movements, speech syncing, and other previously hard-to-capture dynamic behaviors [47]. Another promising direction is self-supervised learning approaches such as BYOL, which train from minimal labeled data and provide a basis for scalable, reconfigurable deepfake generation systems [48]. Diffusion models, with their iterative noise-reduction process, also represent an intriguing possibility for increasing image quality and expressing subtle changes in deepfakes [49]. Taken together, these research directions offer a path toward greater fidelity, coherence, and accessibility, while also informing the mitigation of, and response to, the ethical questions raised by the misuse of deepfake technologies.

Author Contributions

Conceptualization, S.S. and S.M.S.; methodology, S.S.; software, S.S.; validation, A.Z., S.S. and Z.I.; formal analysis, A.Z.; investigation, S.M.S.; resources, Z.M.; data curation, S.S.; writing—original draft preparation, S.S.; writing—review and editing, Z.I., Z.M. and M.K.; visualization, S.S. and Z.I.; supervision, S.M.S.; project administration, A.Z.; funding acquisition, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the results reported in this paper are available within the article itself. Queries regarding the data in this article can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional neural network
GRU: Gated recurrent unit
GANs: Generative adversarial networks
TCN: Temporal convolutional network
AUC: Area under the curve
RNN: Recurrent neural network
LSTM: Long short-term memory
VAE: Variational autoencoder
MLP: Multi-layer perceptron
SMOTE: Synthetic Minority Oversampling Technique
FF++: FaceForensics++
MAR: Mouth aspect ratio
EAR: Eye aspect ratio
DNN: Deep neural network
ROC: Receiver operating characteristic
MSE: Mean squared error
FF-LPBH: Fisherface Local Binary Pattern Histogram
YOLO: You Only Look Once
FCC-GAN: Fully connected convolutional generative adversarial network
PGGAN: Progressive Growing of GANs
CRNN: Convolutional recurrent neural network
DBN: Deep belief network
OC-FakeDetect: One-Class Fake Detection
C-GAN: Conditional generative adversarial network
AddNets: Attention-based deepfake detection networks
KL-Divergence: Kullback–Leibler divergence
IQR: Interquartile range
CSV: Comma-separated values
ReLU: Rectified linear unit
SVM: Support vector machine

References

  1. Koopman, M.; Rodriguez, A.M.; Geradts, Z. Detection of deepfake video manipulation. In Proceedings of the 20th Irish Machine Vision and Image Processing Conference (IMVIP), Belfast, Northern Ireland, 10–12 September 2018; pp. 133–136. [Google Scholar]
  2. Chesney, B.; Citron, D. Deep fakes: A looming challenge for privacy, democracy, and national security. Calif. Law Rev. 2019, 107, 1753. [Google Scholar] [CrossRef]
  3. Harris, D. Deepfakes: False pornography is here and the law cannot protect you. Duke Law Technol. Rev. 2018, 17, 99. [Google Scholar]
  4. Masood, M.; Nawaz, M.; Malik, K.M.; Javed, A.; Irtaza, A.; Malik, H. Deepfakes Generation and Detection: State-of-the-art, open challenges, countermeasures, and way forward. Appl. Intell. 2022, 53, 3974–4026. [Google Scholar] [CrossRef]
  5. Guarnera, L.; Giudice, O.; Battiato, S.; Guarnera, F.; Ortis, A.; Puglisi, G.; Paratore, A.; Bui, L.M.Q.; Fontani, M.; Coccomini, D.A.; et al. The Face Deepfake Detection Challenge. J. Imaging 2022, 8, 263. [Google Scholar] [CrossRef] [PubMed]
  6. Patel, M.; Gupta, A.; Tanwar, S.; Obaidat, M. Trans-DF: A transfer learning-based end-to-end deepfake detector. In Proceedings of the 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, 30–31 October 2020; pp. 796–801. [Google Scholar]
  7. CB Insights. The Future of Information Warfare. 2024. Available online: https://www.cbinsights.com/research/future-of-information-warfare/ (accessed on 3 November 2024).
  8. Bracken, B. Deepfake Attacks Are About to Surge, Experts Warn. 2021. Available online: https://threatpost.com/deepfake-attacks-surge-experts-warn/165798/ (accessed on 30 November 2022).
  9. FakeApp. Available online: https://www.fakeapp.org/ (accessed on 30 November 2022).
  10. FaceApp. Available online: https://www.faceapp.com/ (accessed on 30 November 2022).
  11. Korshunov, P.; Marcel, S. Deepfakes: A new threat to face recognition? Assessment and detection. arXiv 2018, arXiv:1812.08685. [Google Scholar]
  12. Verdoliva, L. Media forensics and deepfakes: An overview. IEEE J. Sel. Top. Signal Process. 2020, 14, 910–932. [Google Scholar] [CrossRef]
  13. Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5001–5010. [Google Scholar]
  14. Dang, H.; Liu, F.; Stehouwer, J.; Liu, X.; Jain, A.K. On the detection of digital face manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5781–5790. [Google Scholar]
  15. Suganthi, S.T.; Ayoobkhan, M.U.A.; Bacanin, N.; Venkatachalam, K.; Štěpán, H.; Pavel, T. Deep learning model for deep fake face recognition and detection. PeerJ Comput. Sci. 2022, 8, e881. [Google Scholar]
  16. Ismail, A.; Elpeltagy, M.; Zaki, M.; ElDahshan, K.A. Deepfake video detection: YOLO-Face convolution recurrent approach. PeerJ Comput. Sci. 2021, 7, e730. [Google Scholar] [CrossRef] [PubMed]
  17. Chauhan, S.S.; Jain, N.; Pandey, S.C.; Chabaque, A. Deepfake Detection in Videos and Pictures: Analysis of Deep Learning Models and Dataset. In Proceedings of the 2022 IEEE International Conference on Data Science and Information System (ICDSIS), Hassan, India, 29–30 July 2022; pp. 1–5. [Google Scholar]
  18. Groh, M.; Epstein, Z.; Firestone, C.; Picard, R. Deepfake detection by human crowds, machines, and machine-informed crowds. Proc. Natl. Acad. Sci. USA 2022, 119, e2110013119. [Google Scholar] [CrossRef]
  19. Kiran, B.R.; Thomas, D.M.; Parakkal, R. An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. J. Imaging 2018, 4, 36. [Google Scholar] [CrossRef]
  20. Raza, A.; Munir, K.; Almutairi, M. A Novel Deep Learning Approach for Deepfake Image Detection. Appl. Sci. 2022, 12, 9820. [Google Scholar] [CrossRef]
  21. Khochare, J.; Joshi, C.; Yenarkar, B.; Suratkar, S.; Kazi, F. A deep learning framework for audio deepfake detection. Arab. J. Sci. Eng. 2022, 47, 3447–3458. [Google Scholar]
  22. Rana, M.S.; Sung, A.H. Deepfakestack: A deep ensemble-based learning technique for deepfake detection. In Proceedings of the 2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom), New York, NY, USA, 1–3 August 2020; pp. 70–75. [Google Scholar]
  23. Huang, B.; Wang, Z.; Yang, J.; Ai, J.; Zou, Q.; Wang, Q.; Ye, D. Implicit identity driven deepfake face swapping detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4490–4499. [Google Scholar]
  24. Kong, C.; Chen, B.; Li, H.; Wang, S.; Rocha, A.; Kwong, S. Detect and locate: Exposing face manipulation by semantic-and noise-level telltales. IEEE Trans. Inf. Forensics Secur. 2022, 17, 1741–1756. [Google Scholar] [CrossRef]
  25. Yan, Z.; Zhang, Y.; Fan, Y.; Wu, B. UCF: Uncovering common features for generalizable deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 22412–22423. [Google Scholar]
  26. Luo, A.; Cai, R.; Kong, C.; Kang, X.; Huang, J.; Kot, A.C. Forgery-aware adaptive vision transformer for face forgery detection. arXiv 2023, arXiv:2309.11092. [Google Scholar]
  27. Jia, S.; Lyu, R.; Zhao, K.; Chen, Y.; Yan, Z.; Ju, Y.; Hu, C.; Li, X.; Wu, B.; Lyu, S. Can ChatGPT detect deepfakes? A study of using multimodal large language models for media forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4324–4333. [Google Scholar]
  28. Matern, F.; Riess, C.; Stamminger, M. Exploiting visual artifacts to expose deepfakes and face manipulations. In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 83–92. [Google Scholar]
  29. Nguyen, H.H.; Yamagishi, J.; Echizen, I. Use of a capsule network to detect fake images and videos. arXiv 2019, arXiv:1910.12467. [Google Scholar]
  30. Sabir, E.; Cheng, J.; Jaiswal, A.; AbdAlmageed, W.; Masi, I.; Natarajan, P. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI) 2019, 3, 80–87. [Google Scholar]
  31. Trinh, L.; Tsang, M.; Rambhatla, S.; Liu, Y. Interpretable and trustworthy deepfake detection via dynamic prototypes. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021. [Google Scholar]
  32. Li, Y.; Chang, M.C.; Lyu, S. In ictu oculi: Exposing AI-generated fake face videos by detecting eye blinking. arXiv 2018, arXiv:1806.02877. [Google Scholar]
  33. Alshaikh, A. Application of Cortical Learning Algorithms to Movement Classification Towards Automated Video Forensics. Doctoral Dissertation, Staffordshire University, Stafford, UK, 2019. Available online: https://eprints.staffs.ac.uk/5577/ (accessed on 3 November 2024).
  34. Chintha, A.; Thai, B.; Sohrawardi, S.J.; Bhatt, K.; Hickerson, A.; Wright, M.; Ptucha, R. Recurrent convolutional structures for audio spoof and video deepfake detection. IEEE J. Sel. Top. Signal Process. 2020, 14, 1024–1037. [Google Scholar] [CrossRef]
  35. Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. Mesonet: A compact facial video forgery detection network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, 11–13 December 2018; pp. 1–7. [Google Scholar]
  36. Yang, X.; Li, Y.; Lyu, S. Exposing deepfakes using inconsistent head poses. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8261–8265. [Google Scholar]
  37. Kshirsagar, M.; Suratkar, S.; Kazi, F. Deepfake Video Detection Methods using Deep Neural Networks. In Proceedings of the 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT), Kannur, India, 11–12 August 2022; pp. 27–34. [Google Scholar]
  38. Rana, M.S.; Nobi, M.N.; Murali, B.; Sung, A.H. Deepfake detection: A systematic literature review. IEEE Access 2022, 10, 25494–25513. [Google Scholar]
  39. KoÇak, A.; Alkan, M. Deepfake Generation, Detection and Datasets: A Rapid-review. In Proceedings of the 2022 15th International Conference on Information Security and Cryptography (ISCTURKEY), Ankara, Turkey, 19–20 October 2022; pp. 86–91. [Google Scholar]
  40. Zi, B.; Chang, M.; Chen, J.; Ma, X.; Jiang, Y.G. Wilddeepfake: A challenging real-world dataset for deepfake detection. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2382–2390. [Google Scholar]
  41. Shahzad, H.F.; Rustam, F.; Flores, E.S.; Mazón, J.L.V.; Diez, I.T.; Ashraf, I. A Review of Image Processing Techniques for Deepfakes. Sensors 2022, 22, 4556. [Google Scholar] [CrossRef]
  42. Khalid, H.F.; Woo, S.S. OC-FakeDect: Classifying deepfakes using one-class variational autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 656–657. [Google Scholar]
  43. Malik, Y.S.; Sabahat, N.; Moazzam, M.O. Image Animations on Driving Videos with DeepFakes and Detecting DeepFakes Generated Animations. In Proceedings of the 2020 IEEE 23rd International Multitopic Conference (INMIC), Bahawalpur, Pakistan, 5–7 November 2020; pp. 1–6. [Google Scholar]
  44. Hashmi, M.F.; Ashish, B.K.K.; Keskar, A.G.; Bokde, N.D.; Yoon, J.H.; Geem, Z.W. An exploratory analysis on visual counterfeits using conv-lstm hybrid architecture. IEEE Access 2020, 8, 101293–101308. [Google Scholar] [CrossRef]
  45. Rössler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. FaceForensics++: Learning to Detect Manipulated Facial Images. 2020. Available online: http://github.com/ondyari/FaceForensics (accessed on 3 November 2024).
  46. Karras, T.; Laine, S.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. arXiv 2020, arXiv:1912.04958. [Google Scholar]
  47. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  48. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap Your Own Latent—A New Approach to Self-Supervised Learning. arXiv 2020, arXiv:2006.07733. [Google Scholar]
  49. Ho, J.; Salimans, T.; Chan, W.; Chen, B.; Schulman, J.; Sutskever, I.; Abbeel, P. Cascaded Diffusion Models for High Fidelity Image Generation. arXiv 2021, arXiv:2102.00732. [Google Scholar]
Figure 1. Deep fake images double every six months [7].
Figure 2. Deepfake image and video forensics architecture using deep learning techniques.
Figure 3. Pairwise correlation between features.
Figure 4. Learning curve of eye landmarks for artifact investigation using CNN-LSTM with GAN model, showing high accuracy and low loss, without significant overfitting.
Figure 5. Learning curve of eye landmarks for artifact investigation using CNN-LSTM, comparing the performance of temporal convolutional networks (TCNs) in identifying spatiotemporal artifacts with low overfitting and high accuracy.
Figure 6. Learning curve of eyes and nose landmark for artifact investigation using TCN with GAN.
Figure 7. Learning curve of eyes and nose landmark for artifact investigation using TCN without GAN.
Figure 8. Learning curve of eyes, nose, and mouth landmark for artifact investigation using CNN-LSTM with GAN.
Figure 9. ROC curves without GAN. (The black dotted lines represent a reference or line of equality, where the values on the x-axis and y-axis are equal.)
Figure 10. ROC curves with GAN (lower performance).
Table 1. Training parameters and values of CNN-LSTM.
Training Parameter | Value
Epochs | 10
Batch Size | 32
Optimizer | Adam
Loss Function | Binary cross-entropy
Metric | Accuracy

Table 2. Training parameters and values of CNN-GRU.
Training Parameter | Value
Epochs | 10
Batch Size | 32
Optimizer | Adam
Loss Function | Binary cross-entropy
Metric | Accuracy

Table 3. Training parameters and values of TCN.
Training Parameter | Value
Epochs | 10
Batch Size | 32
Optimizer | Adam
Loss Function | Binary cross-entropy
Metric | Accuracy

Table 4. Training parameters and values of GAN-Autoencoded Features.
Training Parameter | Value
Epochs | 10
Batch Size | 32
Optimizer | Adam
Loss Function | Mean squared error (MSE)
Metric | Accuracy
Table 5. Performance comparison of different deepfake detection techniques.
Ref. | Feature-Based Methodology | Classifier | Best Performance | Datasets
[28] | Combined visual features of eyes and teeth | Logistic Regression, MLP | AUC = 0.851; Accuracy = 0.854; Precision = 0.807; Recall = 0.849; F1 Score = 0.828 | FaceForensics++
[29] | Deep learning features | Capsule Network | AUC = 0.91; Accuracy = 0.91; F1 Score = 0.91; Precision = 0.92; Recall = 0.08 | FaceForensics++
[30] | Image + temporal features | CNN + RNN | AUC = 0.93; Accuracy = 0.939; Precision = 0.92; Recall = 0.08; F1 Score = 0.91 | FF++ (FaceSwap, DeepFakes, LQ)
[31] | Image + temporal features | Dynamic Prototype Network | AUC = 0.718; Accuracy = 0.72; Precision = 0.73; Recall = 0.26; F1 Score = 0.73 | FF++ (Face2Face, FaceSwap, HQ)
[32] | Eye-blinking features | LRCN | AUC = 0.78; Accuracy = 0.76; Precision = 0.77; Recall = 0.22 | FaceForensics++ (Face Synthesis)
[33] | Eye-blinking features | Distance-based classifier | AUC = 0.875; Precision = 0.875; Recall = 0.778; F1 Score = 0.824; Accuracy = 0.85 | FaceForensics++ (Face Synthesis with unnatural eye movement)
Table 6. Facial features and parameters.
Eyes | Nose | Mouth
Eye Aspect Ratio (EAR) | Nose Tip | Mouth Aspect Ratio (MAR)
Blink Frequency and Amplitude | Nostril Symmetry | Mouth Symmetry
Pupil Dilation | Nasal Base | Mouth Position (X, Y)
Eyelid Creases and Movement | Nasal Sides | Lip Spacing
Iris Texture and Diameter | Nasal Septum | Lip Boundary
Eye Position and Aspect Ratio | Nasal Shape | Mouth Shape Dynamics
Sclera-to-Iris Ratio | Nostrils Position (X, Y) | Mouth-to-Face Proportion
Pupil-to-Iris Ratio | Nose Bridge | Corner of Mouth (Left X, Y; Right X, Y)
Table 7. Dataset distribution before and after augmentation.
Dataset | Ratio | Samples (Before Augmentation) | Samples (After Augmentation)
Training set | 80% | 33,246 | 66,492
Testing set | 20% | 7379 | 7379
Table 8. Model architecture and parameters of autoencoder.
Layer Type | Parameters
Input | shape = (input_dim,)
Dense | units = 64, activation = ‘relu’
Dense | units = 32, activation = ‘relu’
Dense | units = 64, activation = ‘relu’
Dense | units = input_dim, activation = ‘sigmoid’
Table 9. Training parameters and values of autoencoder.
Training Parameter | Value
Epochs | 50
Batch Size | 256
Optimizer | Adam
Loss Function | Mean squared error (MSE)
Metric | Accuracy
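Read together, Tables 8 and 9 fully specify the autoencoder. A minimal Keras realization is sketched below; only the placeholder feature dimensionality is an assumption, since input_dim depends on the extracted landmark feature set.

```python
# Keras realization of the autoencoder specified in Tables 8 and 9.
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim = 24  # assumed feature dimensionality

autoencoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(64, activation="relu"),            # encoder
    layers.Dense(32, activation="relu"),            # bottleneck
    layers.Dense(64, activation="relu"),            # decoder
    layers.Dense(input_dim, activation="sigmoid"),  # reconstruction
])

# Training parameters from Table 9
autoencoder.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
# autoencoder.fit(X, X, epochs=50, batch_size=256)
```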
Table 10. Eye-blinking landmark artifact detection with and without GAN.
Model | Precision (without GAN) | Recall (without GAN) | F1 Score (without GAN) | Precision (with GAN) | Recall (with GAN) | F1 Score (with GAN)
CNN | 0.896 | 0.884 | 0.890 | 0.915 | 0.902 | 0.908
CNN-GRU | 0.902 | 0.890 | 0.896 | 0.920 | 0.910 | 0.915
CNN-LSTM | 0.910 | 0.902 | 0.906 | 0.928 | 0.916 | 0.922
TCN | 0.917 | 0.910 | 0.913 | 0.935 | 0.920 | 0.927
Table 11. Eyes and nose landmark artifact detection with and without GAN.
Model | Precision (without GAN) | Recall (without GAN) | F1 Score (without GAN) | Precision (with GAN) | Recall (with GAN) | F1 Score (with GAN)
CNN | 0.875 | 0.860 | 0.867 | 0.895 | 0.880 | 0.887
CNN-GRU | 0.890 | 0.875 | 0.882 | 0.910 | 0.895 | 0.902
CNN-LSTM | 0.898 | 0.882 | 0.890 | 0.918 | 0.902 | 0.910
TCN | 0.905 | 0.890 | 0.897 | 0.925 | 0.910 | 0.917
Table 12. Eyes, nose, and mouth landmark artifact detection with and without GAN.
Model | Precision (without GAN) | Recall (without GAN) | F1 Score (without GAN) | Precision (with GAN) | Recall (with GAN) | F1 Score (with GAN)
CNN | 0.865 | 0.850 | 0.857 | 0.885 | 0.870 | 0.877
CNN-GRU | 0.880 | 0.865 | 0.872 | 0.900 | 0.885 | 0.892
CNN-LSTM | 0.890 | 0.875 | 0.882 | 0.910 | 0.895 | 0.902
TCN | 0.900 | 0.885 | 0.892 | 0.920 | 0.905 | 0.912
Table 13. Performance comparison of state-of-the-art deepfake detection techniques.
Ref. | Feature-Based Methodology | Classifier | Best Performance | Datasets
[31] | Image + temporal features | Dynamic Prototype Network | AUC = 0.718; Accuracy = 0.72; Precision = 0.73; Recall = 0.26; F1 score = 0.73 | FF++ (Face2Face, FaceSwap, HQ)
[32] | Eye-blinking features | LRCN | AUC = 0.78; Accuracy = 0.76; Precision = 0.77; Recall = 0.22 | FaceForensics++ (Face Synthesis)
[28] | Combined visual features of eyes and teeth | Logistic Regression, MLP | AUC = 0.851; Accuracy = 0.854; Precision = 0.807; Recall = 0.849; F1 Score = 0.828 | FaceForensics++
[33] | Eye-blinking features | Distance-based classifier | AUC = 0.875; Precision = 0.875; Recall = 0.778; F1 Score = 0.824; Accuracy = 0.85 | FaceForensics++ (Face Synthesis with unnatural eye movement)
[29] | Deep learning features | Capsule Network | AUC = 0.91; Accuracy = 0.91; F1 Score = 0.91; Precision = 0.92; Recall = 0.08 | FaceForensics++
[30] | Image + temporal features | CNN + RNN | AUC = 0.93; Accuracy = 0.939; Precision = 0.92; Recall = 0.08; F1 score = 0.91 | FF++ (FaceSwap, DeepFakes, LQ)
This work | Spatiotemporal features + augmented facial landmarks with GAN model | TCN model for spatiotemporal analysis with augmentation + GAN | AUC = 0.93; Accuracy = 0.96; Precision = 0.98; F1 score = 0.98 | FF++