1. Introduction
Currently, we are in the fourth generation of technology, i.e., Industry 4.0, and we are gradually advancing towards the fifth generation, i.e., Industry 5.0. In this technological era, numerous technologies have been developed, including IoT, AI, machine learning, Android, blockchain, cybersecurity, and many more. They have been shown to benefit humanity and have enabled significant strides in human progress. However, even as they grow and improve every day, they are also endangering people. With this advancement in technology, we are also facing many disadvantages, and one of the main threats is artificially generated media, which contains fake or modified material and can harm anyone's life. Deep learning algorithms are used to create these edited images, allowing a real photo of a person to have certain features of their face or body altered. A video including some odd remarks made by the first Black president, Barack Obama, went viral on social media during his administration. However, a great deal of inquiry and analysis revealed that the media had been manipulated, a phenomenon known as "deepfakes". We have employed several machine learning and deep learning algorithms, such as ResNet, MTCNN, and others, in our deepfake recognition system to identify AI- or human-generated modified photos. Because the Multitask Cascaded Convolutional Neural Network (MTCNN) produces more accurate results than the standard CNN algorithm, we chose it for facial recognition, even though it is somewhat more complex than a standard CNN. Next, we employed InceptionResNetV1 to determine whether each image is authentic or not. Additionally, we made extensive use of Python (version 3.13) packages, such as pandas and Facenet_pytorch. The main goal of this paper is to illustrate the need for a deepfake recognition system, as AI becomes more capable of causing harm every day.
2. Literature Review
Many individuals have tried to create a proper deepfake recognition system. Hanqing Zhao et al. [
1] created a multi-attentional deepfake detection framework that can detect subtle and localized artefacts in fake photos. They introduced components like multiple attention maps, textural feature enhancement, and bilinear attention pooling to extract fine-grained details. Bojia Zi et al. [
2] employed 2D and 3D attention-based deepfake detection networks (ADDNets) to improve detection accuracy by leveraging attention masks on facial landmarks. Their work addresses the challenge of detecting real-world deepfakes and proposes ADDNets for enhanced detection performance across existing datasets and their Wild Deepfakes dataset. Brian Dolhansky et al. [
3] applied diverse methods, such as the Deepfake Autoencoder (DFAE), GAN-based models, and StyleGAN, to create realistic face-swapped videos. These methods support training scalable detection models for identifying manipulated videos in real-world scenarios.
Zaynab Almutairi and Hebah Elgibreen [
4] used logistic regression with entropy features, a Quadratic SVM for binary classification, and an SVM with Random Forest using short-term and long-term features for fake audio recognition. They also used DL models such as CNNs, which outperformed traditional ML methods for feature extraction and showed greater robustness. Md Shohel Rana et al. [
5] used two neural networks, a generative network and a discriminative network with a face-swap technique, to generate such counterfeit videos. They compiled a summary of 112 articles on how to fight deepfakes. Arash Heidari et al. [
6] created a technique using a binary classifier trained with a CNN and achieved an accuracy of 97% with an AUC of 97.6%; the main challenge in their research was poor robustness, and the dataset came from the University of Cape Town. Bird, J. J., and Lotfi, A. [
7] evaluated model accuracy across three frameworks, namely a CNN, AlexNet, and VGG-16, and then assessed the accuracy and precision of the three frameworks. Aryaf Al-Adwan et al. [
8] highlighted that their suggested deep learning model, which combines a CNN and an RNN optimized with PSO, considerably enhances the accuracy of deepfake detection. The method's effectiveness was demonstrated through experiments on datasets such as Celeb-DF and the Deepfake Detection Challenge Dataset (DFDC), showing superior performance compared to existing approaches. Aya Ismail et al. [
9] emphasized that the proposed YIX method, which integrates the YOLO face detector, InceptionResNetV2 for feature extraction, and XGBoost for classification, offers a highly accurate solution for deepfake detection. Through comparative analysis, YIX has demonstrated superior performance over existing methods. Another study discusses how deepfake detection is possible using CNNs and transformers [
10]. Sreeraj Ramachandran et al. [
11] used a different dataset and identified loss functions and deepfake generation techniques.
3. Proposed Model
Our approach consists of four phases, described below.
Phase 1: While creating the deepfake recognition system, we used many machine learning and deep learning algorithms. The first challenge was to collect the dataset needed to train the model. We therefore created part of the dataset ourselves by taking pictures and manipulating them using artificial intelligence and editing software. We also used a dataset available online from the Kaggle website, where the data were split into real and manipulated (fake) images. The following dataset was used.
https://www.kaggle.com/datasets/manjilkarki/deepfake-and-real-images (accessed on 25 July 2024).
Phase 2:
The dataset used in this model was a combination of the Kaggle dataset and a self-created dataset generated using a GAN. A total of about 190,000 images were used for the training, testing, and validation of the model. The training set contained 140,000 images, i.e., 70,000 real images and 70,000 fake images. For testing, we used 10,984 images, i.e., 5492 real images and 5492 fake images. Finally, we used 39,200 images for validation, i.e., 19,600 real images and 19,600 fake images.
The GAN-generated images were created using DeepFaceLab and Faceswap, which were chosen because they are freely available open-source tools.
The Kaggle dataset was taken from the website cited in Phase 1.
Once we had collected all the data from the website and created some of our own, we merged them, and then we conducted the preprocessing analysis, where we normalized the pixels of the images. After that, we resized the images to one specific size so that there would not be any kind of lagging in the model during the comparison of the real and manipulated images and to ensure uniform input dimensions for the neural network.
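As an illustrative sketch of this preprocessing step (not the exact code used in our experiments), the resizing and pixel normalization can be expressed with torchvision transforms; the 160 × 160 target size, the normalization constants, and the directory layout are assumptions made here for illustration.

```python
import torch
from torchvision import datasets, transforms

# Assumed preprocessing: resize every image to one fixed size and
# normalize the pixel values, as described above.
preprocess = transforms.Compose([
    transforms.Resize((160, 160)),                 # assumed target size
    transforms.ToTensor(),                         # scales pixels to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],
                         std=[0.5, 0.5, 0.5]),     # maps pixels to [-1, 1]
])

# Assumed directory layout with Real/ and Fake/ subfolders per split,
# mirroring the structure of the Kaggle dataset cited above.
train_set = datasets.ImageFolder("Dataset/Train", transform=preprocess)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```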
After completing these steps, we carried out the facial recognition part, where we checked whether the given image contained a human face by using an advanced version of the Convolutional Neural Network, i.e., the MTCNN (Multi-Task Cascaded Convolutional Neural Network). The MTCNN was also used to detect facial landmarks such as the eyes, nose, and mouth, which were very useful for further processing steps. The detected landmarks were then used to align the faces to a standard pose, which reduces the variability introduced by different head poses and facial positions and makes feature extraction and recognition easier for the model.
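A minimal sketch of this face detection and landmark localization step, based on the MTCNN implementation in facenet_pytorch, is shown below; the file name and the constructor settings are illustrative assumptions.

```python
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=160, margin=0, keep_all=False)   # assumed settings

img = Image.open("sample.jpg").convert("RGB")             # hypothetical input image

# Detect the face bounding box and the five facial landmarks
# (eyes, nose, and mouth corners) used for alignment.
boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)

# Calling the detector directly returns the cropped and aligned face tensor,
# or None if no human face is found in the image.
aligned_face = mtcnn(img)
if aligned_face is None:
    print("No human face detected in the image.")
```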
We also proposed an application for this model. After using the MTCNN, we used InceptionResNetV1 for the extraction of the key features of the face images. InceptionResNetV1 is a combined architecture of Inception and ResNet, which leverages the strengths of both to capture fine details and the most informative features from the facial images.
As it has a deep architecture and residual connections, it helps to obtain a robust representation of faces, which is essential for differentiating between real and manipulated media.
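The feature extraction step can be sketched with the InceptionResnetV1 class provided by facenet_pytorch; the pretrained "vggface2" weights and the random input tensor are assumptions used only so the sketch runs on its own.

```python
import torch
from facenet_pytorch import InceptionResnetV1

# Load the Inception-ResNet feature extractor in evaluation mode
# (the 'vggface2' pretrained weights are an assumption for illustration).
encoder = InceptionResnetV1(pretrained="vggface2").eval()

# In the real pipeline this would be the aligned face tensor from the MTCNN;
# a random 3x160x160 tensor is used here so the sketch is self-contained.
aligned_face = torch.rand(3, 160, 160)

with torch.no_grad():
    embedding = encoder(aligned_face.unsqueeze(0))   # 512-dimensional face embedding
print(embedding.shape)                               # torch.Size([1, 512])
```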
After completing the work with InceptionResNetV1, we moved on to using Facenet, implemented in PyTorch (version 2.5), for face verification and the recognition of deepfake media. It measures the similarity between embeddings to determine whether two faces belong to the same person. Facenet uses distance metrics such as the Euclidean distance or the Manhattan distance to compare the given data; if the distance is below a certain threshold, the faces are considered to belong to the same person. From this, it can differentiate between real and deepfake media based on our data. Overall, the model uses the MTCNN for face detection and alignment, InceptionResNetV1 for feature extraction, and Facenet for the comparison of real and deepfake media. After these processes, the decision is made, i.e., whether the input image is a real image or a manipulated one.
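The distance-based verification described above can be sketched as a simple threshold test on the Euclidean distance between two embeddings; the threshold of 1.0 is an illustrative assumption that would in practice be tuned on the validation set.

```python
import torch

def same_identity(emb_a: torch.Tensor, emb_b: torch.Tensor,
                  threshold: float = 1.0) -> bool:
    """Compare two 512-dimensional face embeddings.

    A smaller distance means the faces are more similar; the threshold
    of 1.0 is an assumption used only for illustration.
    """
    distance = torch.dist(emb_a, emb_b, p=2).item()   # Euclidean (L2) distance
    return distance < threshold
```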
Phase 3: In the third phase, our model was trained using all the deep learning algorithms. Its effectiveness was tested through the calculation of its performance metrics, like accuracy, precision, recall score, etc. By performing all the above operations, we were able to implement the task using the model, and the results were accurate as per our expectations.
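The performance metrics mentioned above can be computed, for example, with scikit-learn; y_true, y_pred, and y_score are hypothetical arrays of ground-truth labels, predicted labels, and predicted probabilities of the fake class.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def report_metrics(y_true, y_pred, y_score):
    """Print the evaluation metrics used to assess the model.

    y_true: ground-truth labels (0 = real, 1 = fake)
    y_pred: predicted labels, y_score: predicted probability of "fake"
    """
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))
    print("ROC-AUC  :", roc_auc_score(y_true, y_score))
```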
Phase 4: Moving on to the fourth phase, in this phase, the model was validated using the performance evaluation parameters for the test datasets. The datasets were originally collected from the Kaggle website, and we also added some of our own datasets, which were created with the help of artificial intelligence, i.e., GAN-generated images.
In Figure 1, we show the complete proposed model that we implemented for effective deepfake recognition. Here, the dataset was processed through a DPA to clean the data, and then the MTCNN algorithm was used for face detection. Afterward, InceptionResNetV1 was used for feature extraction, and finally, Facenet_pytorch was used for face verification and classification. All of these deep learning techniques were used for classifying the real and fake images.
Figure 2 depicts the interface of our proposed model, which uses the Gradio library. Here, one can upload an image to determine whether it is real or fake. The model then begins its work by analyzing the image, i.e., extracting features from it and then checking the main question, i.e., whether it is real or fake.
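A minimal sketch of such a Gradio interface is given below; classify_image is a hypothetical wrapper around the pipeline described above and returns a placeholder score so the sketch runs standalone.

```python
import gradio as gr

def classify_image(image):
    """Hypothetical wrapper around the detection pipeline.

    In the real system this would run MTCNN face detection,
    InceptionResNetV1 feature extraction, and Facenet verification;
    a fixed placeholder score is returned here for illustration.
    """
    real_prob = 0.5   # placeholder; replace with the model's real-image confidence
    return {"Real": real_prob, "Fake": 1.0 - real_prob}

demo = gr.Interface(
    fn=classify_image,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=2),
    title="Deepfake Recognition System",
)

if __name__ == "__main__":
    demo.launch()
```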
Figure 3 presents the outcome of the model’s analysis. After processing the image, the model predicts whether it is real or fake, along with a confidence percentage.
4. Results and Discussion
In this section, we discuss the specific predictions that were made by the deepfake recognition system.
We used three algorithms to build this model, i.e., the MTCNN, InceptionResNetV1, and Facenet_pytorch.
The accuracy of the MTCNN model for face detection is 95%. For landmark localization, it is 92%. In the InceptionResNetV1 model, the accuracy for feature extraction is about 98%. The accuracy of face verification by Facenet_pytorch is 98.65%. And for face detection by Facenet_pytorch, the accuracy is 98%.
By using all the algorithms together, we successfully created this model for the detection of manipulated media.
In Table 1 below, we show the different accuracies that we achieved with these algorithms.
Figure 4 shows a horizontal bar plot of the comparison table. The accuracy of the different components of the models is plotted in the graph. The X-axis represents the components of the models, while the Y-axis indicates the accuracy as a percentage. A bar plot is used to make the results easier to analyze.
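For reference, a plot along the lines of Figure 4 can be reproduced from the component accuracies reported above using matplotlib; the figure size and styling are assumptions.

```python
import matplotlib.pyplot as plt

# Component accuracies reported in Section 4 / Table 1.
components = ["MTCNN face detection",
              "MTCNN landmark localization",
              "InceptionResNetV1 feature extraction",
              "Facenet_pytorch face verification",
              "Facenet_pytorch face detection"]
accuracy = [95.0, 92.0, 98.0, 98.65, 98.0]

plt.figure(figsize=(8, 4))
plt.barh(components, accuracy, color="steelblue")   # horizontal bar plot
plt.xlabel("Accuracy (%)")
plt.xlim(0, 100)
plt.tight_layout()
plt.show()
```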
Figure 5 below represents the variability between precision and recall.
RQ1: What are the performance consequences of utilizing the MTCNN for both face detection and landmark localization in deepfake detection tasks? How can variations in detection accuracy and precision/recall affect the efficacy of the following stages in the pipeline?
In Figure 6, the impact of the MTCNN metrics on the downstream tasks is shown. We found that the face detection accuracy = 95% and the landmark localization precision/recall = 92%. The derived metrics were calculated based on these values. The downstream metrics (feature extraction and verification accuracy) were recalculated under the assumption that each stage has a linear dependency on its input metrics. In the figure, the yellow bar highlights the metrics derived during the pipeline stages. It illustrates the relationship between the MTCNN's performance metrics and their effect on subsequent tasks such as feature extraction and verification.
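One possible reading of this linear-dependency assumption, offered here purely as an illustration and not as a formula stated elsewhere in the paper, is that each stage's effective accuracy is scaled by the accuracy of the stage feeding it:

```python
# Illustrative interpretation (assumption): each stage's effective accuracy
# is the product of its own accuracy and that of the stage feeding it.
face_detection_acc = 0.95       # MTCNN face detection
feature_extraction_acc = 0.98   # InceptionResNetV1 feature extraction

effective_feature_acc = face_detection_acc * feature_extraction_acc
print(f"Effective feature-extraction accuracy: {effective_feature_acc:.2%}")
# -> 93.10%, consistent with the end-to-end accuracy reported below
```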
Table 2 discusses the impact on downstream tasks.
RQ2: What makes InceptionResNetV1 suited for high-precision feature extraction in deepfake detection, and how does its verification accuracy compare with other cutting-edge models?
As shown in Figure 7, InceptionResNetV1 combines Inception modules and Residual Networks (ResNet) to extract complex features. This hybrid architecture allows us to capture both fine-grained and abstract features, which is critical for differentiating real faces from deepfake manipulations.
Table 3 shows that InceptionResNetV1 achieves a high accuracy of 98%, demonstrating its effectiveness in extracting complex features. On the other hand, Facenet_pytorch achieves a higher verification accuracy of 98.65% compared to InceptionResNetV1. In Table 3, we show a verification accuracy comparison between InceptionResNetV1 (original), InceptionResNetV1 (optimized), and Facenet_PyTorch.
For overall system accuracy and to verify the models’ real-world potential, we documented the end-to-end accuracy, F1-score, and ROC-AUC of the models. The end-to-end accuracy was 93.10% (aligned with feature extraction accuracy), and the F1-score was 0.91 (derived from verification performance). The ROC-AUC was 0.96 (adjusted based on component verification accuracy). The end-to-end accuracy is influenced by the feature extraction stage, which has an accuracy of 93.10%. The F1-score reflects both feature extraction (98.00%) and verification (89.24%) performances. The ROC-AUC was estimated based on face verification accuracy (98.65%), showing the system’s discrimination capability.
Table 4 emphasizes the performance metrics of the proposed system, highlighting both end-to-end results and component-specific accuracy. This table shows the overall system performance as well as the contributions of the individual components (e.g., the feature extraction and verification modules) to the final results. Similarly, in Table 5, we present our proposed model and compare it with existing works.
Higher feature extraction accuracy: In our experiments, our model achieved 93.10%, outperforming the deep face recognition model (92.02%) and the CNN and transformer models (88.74%).
Face verification performance: Our verification accuracy (89.24%) is superior to that of previous models, proving its reliability in deepfake detection.
Robustness in real-world scenarios: Unlike other existing models, which focus on CNNs or transformers, our model integrates the MTCNN, InceptionResNetV1, and Facenet_PyTorch, enhancing its robustness.
To keep our model simple and robust, we used the MTCNN to detect whether an image contains a human face, then InceptionResNetV1 for feature extraction, and finally Facenet_pytorch for face verification in the deepfake classification.
RQ3: How do different components (MTCNN, InceptionResNetV1, and Facenet_pytorch) combine to construct an end-to-end pipeline for deepfake detection?
In Figure 8, we use a radar chart to compare the performance of the MTCNN, InceptionResNetV1, and Facenet_PyTorch. It highlights the areas in which each component performs strongly. The title and labels show the components as well as their metrics. The figure depicts the performance metrics of the different components on a circular grid, with the values normalized to a 0–100 scale for comparison.
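A compact sketch of how the three components can be composed into one end-to-end function is shown below; the linear classification head on top of the 512-dimensional embedding is an assumed stand-in for the trained real/fake classifier and is not part of facenet_pytorch itself.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160, margin=0)                     # face detection + alignment
encoder = InceptionResnetV1(pretrained="vggface2").eval()   # feature extraction
classifier = torch.nn.Linear(512, 2)                        # assumed trained real/fake head

def detect_deepfake(path: str):
    """Run the full pipeline on one image and return class probabilities."""
    face = mtcnn(Image.open(path).convert("RGB"))
    if face is None:
        return None                                         # no human face found
    with torch.no_grad():
        embedding = encoder(face.unsqueeze(0))              # (1, 512) feature vector
        probs = torch.softmax(classifier(embedding), dim=1)[0]
    return {"real": probs[0].item(), "fake": probs[1].item()}
```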
RQ4: How do the selected models stack up against emerging deepfake detection approaches in terms of computing efficiency and accuracy?
Figure 9 depicts a direct comparison of different algorithms and summarizes the strengths and weaknesses of the different models with respect to accuracy and efficiency. Emerging techniques may improve accuracy (e.g., 99.2%) but might compromise efficiency (e.g., 88%).
5. Conclusions
The deepfake detection model developed in this study has great potential to detect manipulated images and to help take the necessary steps against crime. Further progress in this field can be achieved by refining the algorithm so that higher accuracy can be reached. The MTCNN achieves the highest accuracy to date (95%), outperforming both InceptionResNetV1 and FaceNet_PyTorch. Our model provides real and fake percentages, helping users understand the likelihood of an image being fake. It achieves high accuracy while maintaining performance efficiency, making it suitable for real-time applications. It does have some limitations, such as overfitting to some GAN architectures, which we plan to overcome by using adversarial training and more powerful algorithms. Our future work will include exploring a hybrid CNN–transformer architecture to increase our model's robustness.