Article

Neural Network Ensemble Method for Deepfake Classification Using Golden Frame Selection

by Khrystyna Lipianina-Honcharenko 1,*, Nazar Melnyk 1, Andriy Ivasechko 1, Mykola Telka 1 and Oleg Illiashenko 2,3
1 Department of Information and Computing Systems and Control, Faculty of Computer Information Technologies, West Ukrainian National University, 46000 Ternopil, Ukraine
2 Department of Computer Systems, Networks and Cybersecurity, Faculty of Radio Electronics, Computer Systems and Infocommunications, National Aerospace University “KhAI”, 61000 Kharkiv, Ukraine
3 The Institute of Informatics and Telematics of the National Research Council (IIT-CNR), Via Giuseppe Moruzzi 1, 56124 Pisa, Italy
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(4), 109; https://doi.org/10.3390/bdcc9040109
Submission received: 5 March 2025 / Revised: 7 April 2025 / Accepted: 11 April 2025 / Published: 21 April 2025

Abstract

Deepfake technology poses significant threats in various domains, including politics, cybersecurity, and social media. This study presents a neural network ensemble method for deepfake classification based on a golden frame selection technique. The proposed approach optimizes computational resources by extracting the most informative video frames, improving detection accuracy. We integrate multiple deep learning models, including ResNet50, EfficientNetB0, Xception, InceptionV3, and Facenet, with an XGBoost meta-model for enhanced classification performance. Experimental results demonstrate a 91% accuracy rate, outperforming traditional deepfake detection models. Additionally, feature importance analysis using Grad-CAM highlights how different architectures focus on distinct facial regions, enhancing overall model interpretability. The findings contribute to the development of robust and efficient deepfake detection techniques, with potential applications in digital forensics, media verification, and cybersecurity.

1. Introduction

The relevance of deepfake detection research in the modern world, especially in the context of hybrid warfare, is exceptionally high. Using deepfakes to manipulate public opinion, political decisions, and international relations has already become a reality, highlighting potential risks to state stability and personal security. Deepfake technologies can be employed to create falsified evidence that distorts real events or the behavior of influential figures, potentially leading to international conflicts or internal misunderstandings within countries [1,2].
At the tactical level, deepfakes can sow confusion among military personnel, undermining morale or even causing unwarranted retreats due to visually verified but falsified enemy incursions. This highlights the serious threats that deepfakes pose to military strategies and the actual security of nations [3]. For instance, Russia employs hybrid warfare tactics, including cyberattacks and disinformation, to destabilize society in Ukraine, demonstrating how technology can be leveraged as a tool of militarized aggression [4,5].
In global politics, deepfakes are used to discredit political leaders, manipulate public opinion, and even interfere in electoral processes, as seen in various cases during elections in the United States and other countries. Raising awareness and advancing detection technologies are crucial for protecting democratic processes and national security from such threats [1,6].
The development and implementation of regulatory measures, such as the European Digital Act, which imposes fines on social media platforms for the inadequate moderation of artificially generated disinformation, highlight the global recognition of the need to control the spread and use of deepfake technologies. These measures underscore the severity of the issue and the necessity of international cooperation to combat emerging forms of cyber threats [1].
To enhance trust in AI-driven deepfake detection in critical applications, it is crucial to consider its impact on functional safety, e.g., through the safety assessment methodologies used in critical systems. The XMECA and EUMECA techniques proposed in [7] offer valuable insights into expert- and tool-based approaches that can be adapted to assess the robustness and trustworthiness of neural network ensembles used in security-sensitive applications.
Given these challenges, this study formulates the following hypotheses:
Hypothesis 1:
The use of an ensemble neural network approach will improve the accuracy of deepfake video classification by combining the strengths of different neural network architectures.
Hypothesis 2:
Selecting “golden frames”, which are the most informative for analysis, will reduce computational costs without compromising classification quality.
Hypothesis 3:
Model activation analysis using Grad-CAM will demonstrate that different architectures focus on different facial features, which can be leveraged to enhance deepfake detection.
Thus, the aim of this study is to develop advanced deepfake detection technologies, specifically through the use of a hybrid neural network ensemble based on the selection of “golden frames”. This approach aims to significantly improve the accuracy of deepfake identification and ensure effective detection amid growing threats related to hybrid warfare and information manipulation. The research focuses on developing methods capable of addressing contemporary challenges, particularly in the context of ongoing international conflicts, such as the war in Ukraine, where deepfake technologies are used for disinformation and societal destabilization. The findings of this study could serve as the foundation for practical solutions that enhance awareness and resilience against information threats across various domains, including media and social networks.
The main contributions of this study are summarized as follows:
  • A golden frame selection technique is introduced, identifying the most informative video frames based on grayscale intensity differences between consecutive frames. This approach improves training efficiency while maintaining good classification performance.
  • An ensemble of deep learning models is constructed, combining ResNet50, EfficientNetB0, Xception, InceptionV3, and FaceNet architectures to capture diverse spatial and textural features relevant to deepfake detection.
  • A meta-classification model based on XGBoost is applied to aggregate predictions from the base models, enhancing overall accuracy and robustness.
  • The application of Grad-CAM enables the interpretation of model behavior by visualizing the regions of interest across different neural networks.
The remainder of this paper is organized as follows. Section 2 provides an overview of related work in the field of deepfake detection. Section 3 presents the proposed methodology, including the golden frame selection technique and ensemble model architecture. Section 4 describes the implementation details and experimental setup, followed by an evaluation of the results, the ablation study, and the comparative analysis. Section 5 discusses findings, limitations, and future research directions. Finally, Section 6 concludes the paper with a summary of key contributions and practical implications.

2. Literature Review

In recent years, deepfake technologies, which enable the manipulation of multimedia content, particularly videos and images, have attracted significant interest from both the scientific community and the general public due to their potential risks to personal security and informational integrity. The development of effective deepfake detection methods has become a key research direction in computer vision and artificial intelligence. High-performance deepfake video detection using CNN focusing on specific target regions and manual feature distillation, proposed by Tran et al. [8], represents one of the latest approaches in this area. This method stands out by utilizing specific video regions to enhance model accuracy while maintaining an optimal model size.
On the other hand, Zhang et al. [9] highlight the need for a heterogeneous ensemble of features to improve the accuracy of deepfake detectors. They proposed a method that integrates various features, such as grayscale gradient, spectral features, and texture features. This approach outperforms several state-of-the-art deepfake detectors, emphasizing the importance of feature ensembles in detecting manipulations.
Methods based on fine-grained classification using subtle, globally relevant features, as demonstrated in the study by Nadimpalli and Rattani [10], achieve high effectiveness across different datasets and manipulation techniques. Their work underscores the advantages of approaches that focus on finer, discriminative features, highlighting the importance of considering local variations in developing deepfake detectors.
Another innovative approach is presented in the work by Guan et al. [11], which uses a multi-functional weighted model based on meta-learning. This model enhances the overall detection capability by optimizing the ability to identify features in different regions.
Finally, it is essential to highlight the work of Khan et al. [12], who conducted a comparative analysis of different deepfake video detection techniques, emphasizing the strengths and weaknesses of each approach and contributing to the improvement of hyperparameter settings to enhance the overall accuracy and efficiency of models.
Chakraborty and Naskar [13] investigate the impact of human physiology and facial biomechanics on the development of reliable deepfake detectors. They analyze how physiological signals can be used to improve the accuracy of detectors, which often face challenges related to demographic and social biases. This approach helps to reduce errors typically encountered in other detection methods that rely on general image or video characteristics.
In their review, Abbas and Taeihagh [14] systematically examine methods for detecting and generating deepfakes using artificial intelligence. The authors explore key algorithms, platforms, and tools, revealing how these technologies can be implemented in various contexts to combat the spread of fake news. They also highlight practical challenges and trends in the implementation of policies aimed at countering deepfake dissemination.
The research by Casu et al. [15] sheds light on a new aspect of the challenges in deepfake detection, focusing on the impact of cognitive biases on decision-making in digital forensics. They introduce the term “Impostor Bias”, which describes the systematic tendency to doubt the authenticity of multimedia content, often assuming it was generated by artificial intelligence. This bias can lead to false judgments and wrongful accusations, undermining the reliability and credibility of forensic evidence. The study emphasizes how the realism of AI-generated multimedia products can amplify the impact of this bias, pointing to the need to develop strategies to prevent and counteract it.
Meanwhile, Firc et al. [16] examine deepfakes as a threat to facial and voice recognition systems, providing a detailed overview of tools and attack vectors in the visual and audio domains. They analyze how deepfakes can affect biometric systems, particularly through spoofing, and categorize deepfakes with corresponding creation tools, datasets, and detection methods. Their primary contribution is the analysis of attack vectors, considering differences between deepfake categories and reports of actual attacks to assess the threats they pose to selected biometric systems.
In their work “ClueCatcher”, Lee et al. [17] use an innovative approach to deepfake detection, focusing on independent features across different domains. They identify and use features such as facial color mismatches, synthesis boundary artifacts, and quality differences between facial and non-facial regions. Using multi-stream convolutional neural networks and inter-patch dissimilarity evaluators allows this model to effectively detect unique deepfake features, significantly enhancing overall detection efficiency and generalizability.
Naitali et al. [18] review a wide range of topics related to deepfakes, including generation methods, detection, available datasets, challenges, and future research directions. Their work provides a foundational understanding of the deepfake problem and highlights its potential risks in areas such as disinformation, political manipulation, reputational damage, and fraud. The authors also focus on current detection methods and identify the need for further research to develop strategies to mitigate the threats posed by deepfakes.
Dincer et al. [19] presented an innovative method for detecting deepfake videos utilizing the golden ratio for frame selection. This approach allows for selecting specific frames rather than random ones, which can potentially enhance detection performance by allowing researchers to focus on significant facial regions. The method uses three different feature extraction techniques (VGG19, EfficientNet B0, and EfficientNet B4) and two capsule network models (CapsuleNet and ArCapsNet), demonstrating the ability for deeper and more accurate data analysis. Performance evaluations on two challenging deepfake detection datasets, Celeb-DF and DFDC-P, showed strong results, with the fusion of the best models yielding high accuracy and area-under-the-curve (AUC) values, namely 93.63% accuracy and 99.14% AUC on Celeb-DF.
Sumanth et al. [20] proposed a hybrid method that combines a Temporal Convolutional Network (TCN) with semantically based content-based frame selection for video summarization. The method’s novelty lies in integrating the TCN to account for temporal information, which improves the quality of golden frame selection. The advantage is a 6.8% improvement in the F1-score compared to the baseline methods, although the algorithm displays high computational complexity for long videos.
In his PhD research, Yousefi [20] developed a method for selecting key frames in egocentric videos with a focus on semantic diversity. The author implemented a network model that minimizes redundancy and highlights the most informative frames. The method achieved frame classification accuracy of over 85% but showed decreased efficiency in videos with low scene contrast.
Gong et al. [21] presented the Diverse Sequential Subset Selection method for the trained generation of video summaries based on human golden annotations. Their model ensures balanced coverage and the diversity of frames in the video, which increases the agreement with human ratings by 12%. However, pre-training is required for each new video, which limits the method’s use in real-time.
Leszczuk and Duplaga [22] proposed a specialized algorithm for the video review of bronchoscopic procedures that automatically identifies golden shots for medical analysis. The uniqueness of the method is its adaptation to the medical domain, with an accuracy of over 92%. However, the algorithm shows lower sensitivity to small details, which is critical for some diagnoses.
Kim and Kim [23] created a conceptual model for the interactive selection of key frames based on the semantics of the object (for example, recognizable landmarks such as the Golden Gate Bridge). The authors achieved 80% coverage of significant events involving users, although the method depends on data pre-processing, which reduces the degree of automation of the process.
Alarfaj and Khan [24] conducted a detailed study on fake news classification using machine and deep learning methods. They employed various machine learning models, such as multinomial, Gaussian, and Bernoulli naive Bayes classifiers, logistic regression, and a passive-aggressive classifier. In addition to traditional models, the authors explored deep learning models such as LSTM and CNN-LSTM, which showed higher accuracy and more stable classification than traditional methods. These results highlight the potential of deep learning models for effectively combating the spread of fake news.
By examining existing solutions in deepfake detection (Table 1), it can be noted that our method, based on the use of a neural network ensemble with golden frame selection, demonstrates significant advantages compared to other approaches. In particular, a similar method developed by Guan et al. [11], which also employs complex model ensembles to improve accuracy, is worth mentioning. Both methods show high effectiveness under various conditions; however, our approach stands out due to the better optimization of computational resources and adaptability, thanks to targeted frame selection, only focusing on the most informative parts of the video.
Despite the significant progress in developing deepfake detection methods, current approaches still have certain limitations.
  • First, most deep learning models, particularly CNN architectures, demonstrate high accuracy on specific datasets but significantly lose effectiveness when the data source or video quality changes.
  • Second, some approaches, such as methods based on spectral analysis or manual feature extraction, are sensitive to lighting changes and variations in video streams.
  • Third, most methods process videos frame by frame, which results in significant computational costs and limits their practical applicability in real-time scenarios.
These issues justify the need to develop more efficient methods that can maintain high accuracy with lower resource consumption, which is the primary goal of this study.
Therefore, the novelty of our method lies in its ability to effectively utilize golden frames to train models, which significantly enhances processing speed and accuracy compared to traditional methods that analyze each video frame. This approach not only improves efficiency but also provides better generalization when detecting deepfakes across various scenarios, making it ideally suited for applications in media and social networks.

3. Method Description

To improve deepfake recognition accuracy, a neural network ensemble method (Figure 1) is proposed that is based on the extraction of “golden frames”—the most informative frames that reflect significant changes in the scene. This approach allows for a substantial reduction in data volume by focusing on key moments, ensuring more accurate deep learning model training, and enhancing the effectiveness of the final prediction. Figure 1 illustrates the process of extracting golden frames from video and subsequent ensemble training, which enables higher classification accuracy for deepfake content. The following steps outline the proposed method:
Step 1: Data collection. The initial video processing stage involves collecting and organizing data from relevant sources. Video data are stored as files, each containing metadata with class information (e.g., “FAKE” or “REAL”). These metadata are used to label the class tags during the preparation of training and validation datasets. M represents the set of metadata containing information about each video file and its class, while V refers to the directory containing video files.
Step 2: The function for the selection of golden frames from video [19]. This function identifies and selects the most informative frames from a video, determined on the basis of significant changes in the scene. The frame selection process is carried out by comparing each frame with the previous one, using the absolute difference, converted into grayscale, to compute the overall intensity of the scene change. Mathematically, this can be represented as calculating the average intensity difference between two consecutive frames; if it exceeds a given threshold, the frame is added to the set of golden frames. This process continues until the specified number of frames is selected or until the end of the video, ensuring that the most significant scenes are chosen. Let $V$ represent the video consisting of $N$ frames. The function selects a set of frames $F$ that differ significantly from each other based on the given scene change intensity threshold $\theta$.
While the grayscale intensity change between frames may be influenced by factors such as object or camera motion, our goal in using this metric is to identify visually dynamic frames likely to contain facial deformations or artifacts. These transitions often coincide with the manifestation of deepfake inconsistencies, thereby increasing the chances of capturing critical evidence in the selected golden frames.
2.1. Initialization. Initialize the set $F$ with the first frame $f_0$, which is read from $V$ and added to $F$. Accordingly, we denote $f_0$ as $f_{prev}$.
2.2. Iterative process. We iterate through the frames of $V$, where the index $i$ runs from $\Delta i$ to $N$ with a step of $\Delta i = \max(1, \lfloor N / num\_frames \rfloor)$, reading the $i$-th frame $f_i$ from $V$.
Compute the absolute difference $D_i$ between $f_{prev}$ and $f_i$:
$$D_i = \mathrm{Gray}\left( \left| f_i - f_{i-1} \right| \right)$$
where $f_i$ is the current frame, $f_{i-1}$ is the previous frame, and Gray denotes the grayscale transformation function. The implementation details (such as the use of OpenCV) are described in the main text and/or code comments.
Calculate the difference metric $S$ as the average pixel value in $D_i$:
$$S = \frac{\sum D_i}{m \times n}$$
where $m \times n$ is the frame size after scaling.
If $S > \theta$, add $f_i$ to $F$ and update $f_{prev} = f_i$.
2.3. Termination condition. The iteration process ends when the number of frames in $F$ reaches $num\_frames$ or when all frames in $V$ have been checked.
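To make Step 2 concrete, the following is a minimal sketch of the golden frame selection routine, assuming OpenCV and NumPy; the function name, resizing size, and default threshold are illustrative rather than the authors' exact implementation.

```python
import cv2
import numpy as np

def select_golden_frames(video_path, num_frames=10, threshold=30.0, size=(224, 224)):
    """Select frames whose mean grayscale difference from the previously
    selected frame exceeds the given threshold (Step 2)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, total // num_frames)          # iteration step Δi
    frames = []
    ok, prev = cap.read()                       # f_0 initializes the set F
    if not ok:
        cap.release()
        return frames
    prev = cv2.resize(prev, size)
    frames.append(prev)
    idx = step
    while len(frames) < num_frames and idx < total:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)
        # absolute difference converted to grayscale, averaged over all pixels (metric S)
        diff = cv2.cvtColor(cv2.absdiff(frame, prev), cv2.COLOR_BGR2GRAY)
        if diff.mean() > threshold:             # S > θ: keep the frame
            frames.append(frame)
            prev = frame
        idx += step
    cap.release()
    return frames
```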
Step 3: Preprocessing using golden frames. This step first checks the presence of each video file listed in the metadata and then extracts a specified number of golden frames that exhibit significant scene changes from each video. Mathematically, the process can be described as iterating through the metadata $M$, which contains pairs of filenames and additional data, followed by calling the function $F(f_i, k)$ to extract $k$ golden frames from each video.
Let $M$ denote the metadata for the video files, where each element $M_i = (f_i, d_i)$, with $f_i$ being the filename and $d_i$ the additional data (such as class labels). The set $V$ denotes the directory where the video files are stored. Let $F(f_i, k)$ represent the function for selecting golden frames from the video file $f_i$ with a maximum of $k$ frames.
3.1. Frame and label extraction. For each tuple $(f_i, d_i) \in M$, we perform the following:
  • Check whether the file $f_i$ exists in directory $V$.
  • If the file exists, apply the golden frame extraction function $F(f_i, k)$, where $k$ is the maximum number of frames, to extract the following:
    $$G_i = F(f_i, k)$$
  • Assign the class label $y_i$ based on the label field in $d_i$:
    $$y_i = \begin{cases} 1, & \text{if } d_i[\mathrm{label}] = \mathrm{FAKE} \\ 0, & \text{otherwise} \end{cases}$$
  • Append all frames in $G_i$ to the list $X$, and append the label $y_i$ to the list $Y$ for each frame.
3.2. Returning the results. After processing all videos, the collected data are converted into NumPy arrays to prepare for model training:
$$X = \mathrm{np.array}(X), \quad Y = \mathrm{np.array}(Y).$$
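A brief sketch of the Step 3 loop, assuming the metadata is available as a dictionary mapping filenames to label records and reusing the select_golden_frames helper sketched above; the function name and the [0, 1] normalization (anticipating Section 4.1) are illustrative assumptions.

```python
import os
import numpy as np

def preprocess(metadata, video_dir, num_frames=10):
    """Iterate over the metadata M, extract golden frames, and build (X, Y)."""
    X, Y = [], []
    for filename, info in metadata.items():         # each (f_i, d_i) pair
        path = os.path.join(video_dir, filename)
        if not os.path.exists(path):                 # skip missing files
            continue
        frames = select_golden_frames(path, num_frames=num_frames)  # G_i
        label = 1 if info["label"] == "FAKE" else 0  # y_i
        for frame in frames:
            X.append(frame / 255.0)                  # normalize pixels to [0, 1]
            Y.append(label)
    return np.array(X), np.array(Y)
```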
Step 4: Loading and preprocessing the training data, as well as splitting them into training and validation sets, can be described using the mathematical form shown below:
4.1. Loading and preprocessing the data. Let $M$ be the metadata (as defined previously) and $V$ be the directory containing the video files. We define a preprocessing function that returns the feature set $X$ and corresponding labels $Y$:
$$(X, Y) = \mathrm{Preprocess}(M, V)$$
where $X \in \mathbb{R}^{n \times h \times w \times c}$ represents the set of processed golden frames, $n$ is the number of frames, and $h$, $w$, and $c$ are the frame height, width, and number of channels, respectively. $Y \in \{0, 1\}^n$ is the binary label vector corresponding to each frame.
4.2. Splitting the data into training and validation sets. We split the data using an 80/20 ratio to train and validate the model. The split is defined as follows:
$$(X_{train}, X_{val}, Y_{train}, Y_{val}) = \mathrm{Split}(X, Y, \mathrm{test\_size} = 0.2, \mathrm{random\_state} = 42)$$
where 20% of the data are allocated to the validation set, while 80% are used for training. The parameter random_state = 42 ensures the reproducibility of the split.
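The split in Step 4.2 maps directly onto scikit-learn's train_test_split; a small illustration with placeholder arrays standing in for the preprocessed data follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the preprocessed golden frames and labels
X = np.random.rand(100, 224, 224, 3).astype("float32")
Y = np.random.randint(0, 2, size=100)

# Step 4.2: 80% training / 20% validation, reproducible via random_state=42
X_train, X_val, Y_train, Y_val = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
```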
Step 5: The creation of base models based on ResNet50 [25], EfficientNetB0 [26], Xception [27], InceptionV3 [28], and Facenet [29], which is based on InceptionResNetV2 [30], can be described as follows: for each model, a pre-trained architecture [31,32] is used to perform initial image processing through convolutional layers for feature extraction (Table 2). Each model [33] has a structure consisting of pre-trained convolutional layers, fully connected layers with Swish activation, dropout layers for regularization, and an output layer with sigmoid activation for binary class prediction. Thus, the following mathematical formulation can be used:
5.1. Definition of input data. Let $X \in \mathbb{R}^{N \times H \times W \times C}$ represent the input tensor, where $N$ is the number of samples and $H = 224$, $W = 224$, and $C = 3$ are the dimensions of the image (height, width, number of channels).
5.2. Using pre-trained base models. For each model architecture, we define a feature extraction [34] function $F_{Model}: \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^d$, which maps a single image to a high-dimensional feature vector. The top classification layers are excluded in all models. The functions are defined as follows:
ResNet50 + LSTM: $F_{ResNet}(X) = \mathrm{ResNet50}(X)$, $d = 2048$
EfficientNetB0: $F_{EffNet}(X) = \mathrm{EfficientNetB0}(X)$, $d = 1280$
Xception: $F_{Xception}(X) = \mathrm{Xception}(X)$, $d = 2048$
InceptionV3: $F_{Inception}(X) = \mathrm{InceptionV3}(X)$, $d = 2048$
Facenet: $F_{Facenet}(X) = \mathrm{InceptionResNetV2}(X)$, $d = 512$
5.3. Flattening the output of the base model. The feature tensor $F_{Model}(X)$ obtained from the base model is converted into a flat vector:
$$Z = \mathrm{Flatten}(F_{Model}(X)) \in \mathbb{R}^d$$
where $d$ is the dimensionality of the flattened feature vector, which depends on the architecture of the base model.
5.4. Fully connected layer with Swish activation. A fully connected layer with 512 neurons and a Swish activation function is applied to the flattened vector:
$$Z' = \mathrm{Swish}(W_1 Z + b_1)$$
where $W_1 \in \mathbb{R}^{512 \times d}$ is the weight matrix and $b_1 \in \mathbb{R}^{512}$ is the bias vector.
The Swish function is defined as $\mathrm{Swish}(x) = x \cdot \sigma(x)$, where $\sigma(x)$ is the sigmoid function.
5.5. Dropout layer for regularization. To prevent overfitting, a dropout layer with a neuron dropout probability of $p = 0.5$ is applied:
$$Z'' = \mathrm{Dropout}(Z', p)$$
5.6. Additional layers for ResNet50 with LSTM. The following operations are only performed for the ResNet50 model with LSTM:
Dimensionality expansion:
$$Z''' = \mathrm{ExpandDims}(Z'', \mathrm{axis} = 1)$$
This converts the vector $Z''$ into a tensor compatible with the LSTM layer.
LSTM layer:
$$Z_{LSTM} = \mathrm{LSTM}_{256}(Z''')$$
where the LSTM has 256 memory units.
Distinct properties: the use of LSTM allows the model to account for sequential or temporal dependencies in the input data.
5.7. Output layer with sigmoid activation. A fully connected output layer with one neuron and sigmoid activation is applied for binary classification.
For models without LSTM, the following holds:
$$\hat{y} = \sigma(W_2 Z'' + b_2)$$
For ResNet50 with LSTM, the following holds:
$$\hat{y} = \sigma(W_2 Z_{LSTM} + b_2)$$
where $W_2 \in \mathbb{R}^{1 \times n}$, with $n = 512$ for models without LSTM and $n = 256$ for the model with LSTM; $b_2 \in \mathbb{R}$ is the bias term; and $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.
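A hedged Keras sketch of one base model from Step 5, using a ResNet50 backbone with the Dense(512, Swish), Dropout(0.5), and sigmoid output layers described above, plus the optional LSTM(256) branch of Step 5.6; apart from the layer sizes given in the text, the details (pre-trained weights, builder name) are assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_base_model(input_shape=(224, 224, 3), use_lstm=False):
    """Base model of Step 5: pre-trained backbone, Dense(512, Swish),
    Dropout(0.5), optional LSTM(256) branch, and a sigmoid output."""
    backbone = ResNet50(include_top=False, weights="imagenet",
                        input_shape=input_shape)
    inputs = layers.Input(shape=input_shape)
    z = layers.Flatten()(backbone(inputs))           # Z = Flatten(F_Model(X))
    z = layers.Dense(512, activation="swish")(z)     # Z' = Swish(W1 Z + b1)
    z = layers.Dropout(0.5)(z)                       # Z'' = Dropout(Z', p=0.5)
    if use_lstm:
        z = layers.Reshape((1, 512))(z)              # expand dims for the LSTM input
        z = layers.LSTM(256)(z)                      # Z_LSTM = LSTM_256(Z''')
    outputs = layers.Dense(1, activation="sigmoid")(z)  # sigmoid output layer
    return models.Model(inputs, outputs)
```

The same builder can be repeated with the other backbones (EfficientNetB0, Xception, InceptionV3, InceptionResNetV2) by swapping the imported application model.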
Step 6: The training of the base models based on ResNet50, EfficientNetB0, Xception, InceptionV3, and Facenet can be mathematically represented as follows:
6.1. Model compilation. Let $M_{ResNet}$, $M_{EfficientNet}$, $M_{Xception}$, $M_{InceptionV3}$, and $M_{Facenet}$ represent the models based on ResNet50, EfficientNetB0, Xception, InceptionV3, and Facenet, respectively. For each of these models, optimization is performed using the Adam algorithm, and the loss function is defined as binary cross-entropy:
$$L_{binary}(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$
where $y$ is the true label and $\hat{y}$ is the model's prediction. The objective is to minimize $L_{binary}$ using the Adam optimizer:
$$\theta_{new} = \theta_{old} - \eta \nabla_\theta L_{binary}$$
where $\theta$ represents the model parameters and $\eta$ is the learning rate.
6.2. Model training. For each model $M$ with training data $(X_{train}, Y_{train})$ and validation data $(X_{val}, Y_{val})$, the training process can be described as the minimization of the loss function $L_{binary}$ over 10 epochs using a batch size of 8. This can be written as follows:
$$\min_\theta \frac{1}{N} \sum_{i=1}^{N} L_{binary}\left( y_i, M(X_i; \theta) \right)$$
where $N$ is the number of samples in the training set, $X_i$ is the input data, $y_i$ is the label for the $i$-th sample, and $\theta$ represents the model's parameters.
The training process is performed for each model, with the parameters updated after each epoch based on the gradient of the loss function. The accuracy is evaluated on the validation data after every epoch.
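A sketch of Step 6's compilation and training, assuming the builder and data split sketched earlier; the learning rate is an assumption, while the loss, optimizer, epoch count, and batch size follow the text.

```python
from tensorflow.keras.optimizers import Adam

model = build_base_model(use_lstm=False)            # any of the five base models
model.compile(optimizer=Adam(learning_rate=1e-4),   # learning rate is an assumption
              loss="binary_crossentropy",           # L_binary from Step 6.1
              metrics=["accuracy"])

# Step 6.2: minimize the loss over 10 epochs with batch size 8,
# evaluating on the validation split after every epoch
history = model.fit(X_train, Y_train,
                    validation_data=(X_val, Y_val),
                    epochs=10, batch_size=8)
```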
Step 7: The formation of meta-features (predictions of base models) is carried out as follows:
7.1. Predictions for the training dataset. For each base model $M_{ResNet}$, $M_{EfficientNet}$, $M_{Xception}$, $M_{InceptionV3}$, $M_{Facenet}$, predictions for the training data $X_{train}$ are computed. Let $\hat{y}_{ResNet}(X_{train})$, $\hat{y}_{EfficientNet}(X_{train})$, $\hat{y}_{Xception}(X_{train})$, $\hat{y}_{InceptionV3}(X_{train})$, and $\hat{y}_{Facenet}(X_{train})$ represent the predicted values derived using the base models for the training data:
$$\hat{y}_{ResNet} = M_{ResNet}(X_{train})$$
$$\hat{y}_{EfficientNet} = M_{EfficientNet}(X_{train})$$
$$\hat{y}_{Xception} = M_{Xception}(X_{train})$$
$$\hat{y}_{InceptionV3} = M_{InceptionV3}(X_{train})$$
$$\hat{y}_{Facenet} = M_{Facenet}(X_{train})$$
7.2. Predictions for the validation dataset. Similarly, predictions are obtained for the validation data $X_{val}$:
$$\hat{y}_{ResNet}(X_{val}), \ \hat{y}_{EfficientNet}(X_{val}), \ \hat{y}_{Xception}(X_{val}), \ \hat{y}_{InceptionV3}(X_{val}), \ \hat{y}_{Facenet}(X_{val})$$
These are the predicted values derived from the base models for the validation data.
7.3. The production of input data for the meta-model. The input data for the meta-model (stacking) are produced by horizontally concatenating the predictions of the base models for the training and validation datasets:
$$X_{meta}^{train} = \mathrm{hstack}\left( \hat{y}_{ResNet}(X_{train}), \ \hat{y}_{EfficientNet}(X_{train}), \ \hat{y}_{Xception}(X_{train}), \ \hat{y}_{InceptionV3}(X_{train}), \ \hat{y}_{Facenet}(X_{train}) \right)$$
$$X_{meta}^{val} = \mathrm{hstack}\left( \hat{y}_{ResNet}(X_{val}), \ \hat{y}_{EfficientNet}(X_{val}), \ \hat{y}_{Xception}(X_{val}), \ \hat{y}_{InceptionV3}(X_{val}), \ \hat{y}_{Facenet}(X_{val}) \right)$$
The new feature matrices $X_{meta}^{train}$ and $X_{meta}^{val}$ are used as input data for the meta-model, which is trained on the predictions of the base models to produce the final predictions.
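Step 7 reduces to stacking the base models' per-frame predictions column-wise; a minimal sketch, assuming the trained Keras models are collected in a list:

```python
import numpy as np

def build_meta_features(base_models, X):
    """Horizontally stack base-model predictions to form meta-features (Step 7)."""
    preds = [m.predict(X, verbose=0) for m in base_models]  # each of shape (N, 1)
    return np.hstack(preds)                                  # shape (N, number_of_models)

# X_meta_train = build_meta_features(base_models, X_train)
# X_meta_val   = build_meta_features(base_models, X_val)
```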
Step 8: Training the meta-model involves training it based on the predictions of the base models, using stacked features formed from the predictions for each sample. Mathematically, this process is described below.
8.1. Creating the meta-model. Let $M_{meta}$ be the meta-model that is trained on the features derived from the predictions of the base models. These features are formed by stacking the predictions $\hat{y}_{ResNet}$, $\hat{y}_{EfficientNet}$, $\hat{y}_{Xception}$, $\hat{y}_{InceptionV3}$, $\hat{y}_{Facenet}$ for each sample:
$$X_{meta} = \mathrm{hstack}\left( \hat{y}_{ResNet}, \ \hat{y}_{EfficientNet}, \ \hat{y}_{Xception}, \ \hat{y}_{InceptionV3}, \ \hat{y}_{Facenet} \right)$$
8.2. Training the meta-model. The meta-model is trained on the training meta-features $X_{meta}^{train}$ with the corresponding labels $y_{train}$, minimizing the loss function $L(y, \hat{y})$, where $\hat{y}$ represents the predictions of the meta-model:
$$\min_{\theta_{meta}} \frac{1}{N} \sum_{i=1}^{N} L\left( y_i, \ M_{meta}(X_{meta}^{train}; \theta_{meta}) \right)$$
8.3. Predictions for the validation set. The meta-model makes predictions on the validation set of features $X_{meta}^{val}$, yielding $\hat{y}_{meta}(X_{meta}^{val})$.
8.4. Accuracy evaluation of the meta-model. The accuracy of the meta-model is evaluated by comparing $\hat{y}_{meta}$ with the validation labels $y_{val}$:
$$\mathrm{Accuracy} = \frac{1}{N_{val}} \sum_{i=1}^{N_{val}} \mathbb{I}\left( y_i = \hat{y}_i \right)$$
8.5. Confusion matrix formation and visualization. The confusion matrix is defined as follows:
$$C = \begin{pmatrix} TP & FP \\ FN & TN \end{pmatrix}$$
where $TP$, $FP$, $FN$, and $TN$ represent the counts of true positives, false positives, false negatives, and true negatives, respectively. This matrix is visualized using a heatmap.
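A sketch of Step 8 with an XGBoost meta-model, accuracy evaluation, and confusion matrix heatmap, using placeholder meta-features in place of Step 7's output; the XGBoost hyperparameters shown are library defaults, not the authors' settings.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder meta-features and labels standing in for Step 7's output
X_meta_train = np.random.rand(80, 5)
X_meta_val = np.random.rand(20, 5)
Y_train = np.random.randint(0, 2, 80)
Y_val = np.random.randint(0, 2, 20)

meta_model = XGBClassifier(eval_metric="logloss")    # Steps 8.1-8.2
meta_model.fit(X_meta_train, Y_train)

y_pred = meta_model.predict(X_meta_val)              # Step 8.3
print("Accuracy:", accuracy_score(Y_val, y_pred))    # Step 8.4

# Step 8.5; note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
cm = confusion_matrix(Y_val, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
```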

4. Implementation

4.1. Case Study

For the implementation of the proposed method, the dataset from the Deepfake Detection Challenge on the Kaggle platform (Kaggle, 2020) [35] was used. It is currently the largest dataset for deepfake detection, containing over 100,000 video fragments created with the participation of 3426 actors using various deepfake generation methods, including GAN-based techniques. All the videos were generated with the full consent of the participants, making this dataset unique of its kind.
The training dataset consists of 400 videos generated from approximately 209 original videos. Among these, 323 videos are labeled as “FAKE” and 77 as “REAL”, meaning about 80% of the videos are fake (Figure 2). This confirms that the dataset contains a large proportion of modified videos compared to real ones. Several original videos were used multiple times to generate fake videos, with some originals producing up to six fake videos.
Next, the golden frame selection method is used to reduce the volume of processed video data and highlight the most informative moments in the videos. The method is based on analyzing changes between consecutive frames: frames are extracted from each video at fixed intervals, and then the average brightness difference between the current and previous frames is calculated using the absolute pixel difference in grayscale. A frame is considered significantly different if the value of this difference exceeds a set threshold (30.0), indicating a significant change in the scene. For example (Figure 3), frames from fake videos, such as eekozbeafq.mp4 or dcuiiorugd.mp4, may show more artifacts or lighting changes compared to real videos, like avmjormvsx.mp4.
Data preprocessing includes resizing the selected frames to a standard size of 224 × 224 pixels to meet the input layer requirements of deep learning models, such as ResNet, EfficientNet, and Xception. After that, pixel values are normalized to the range of [0,1], which helps with the stability and efficiency of model training. The frames and their corresponding labels (0 for real videos, 1 for fake ones) are organized into 4D tensors with the shape (num_videos, num_frames, 224, 224, 3) for input data X, and (num_videos) for labels y, ensuring the correct representation of video data for model training. After data preparation, data are split into training and validation sets in an 80% training/20% validation ratio.
A model architecture was developed to analyze video frame sequences, combining pre-trained models ResNet50, EfficientNetB0, and Xception with LSTM to process temporal dependencies. The input consists of 10 frames, each of 224 × 224 pixels, allowing the models to capture both spatial and temporal features. Each frame is processed separately using the TimeDistributed layer, followed by the use of an LSTM with 256 neurons to model dependencies between frames. Training was performed for up to 15 epochs, with early stopping to prevent overfitting, using a batch size of 4 and the binary_crossentropy loss function to classify videos as fake or real. This approach helps the model better understand spatiotemporal patterns to detect manipulations.
The accuracy graph (Figure 4) shows the performance of different models during training and validation. ResNet50 achieved the highest validation accuracy of 68.1%, showing good generalization. FaceNet reached 94.7% training accuracy but only 61.1% validation accuracy, indicating some overfitting. EfficientNet demonstrated stable training progress, finishing with 59.7% validation accuracy. Xception and Inception followed a similar trend, with Xception reaching 56.9% validation accuracy.
The loss graph (Figure 4) shows that the training loss decreased for all models, but validation loss varied. Some models, like FaceNet and Inception, displayed a more significant gap between training and validation loss, suggesting overfitting. However, early stopping was applied to reduce this effect. EfficientNet showed the most stable loss trend, while ResNet50 balanced training and validation loss best, indicating good generalization. Future improvements could include additional regularization techniques or ensemble methods to further enhance performance.

4.2. Feature Impact

To analyze essential features that affect the classification of deepfake videos, the Grad-CAM method was used, allowing the visualization of the activation of neural networks (Figure 5). The resulting heatmaps showed that different neural network architectures focus on different areas of the face (Table 3). ResNet50 predominantly focuses on the textural features of the skin, particularly the forehead and cheeks, indicating its sensitivity to global distortions. EfficientNetB0 places more emphasis on the contours of the face, especially the lines around the mouth and eyes, making it effective at detecting structural anomalies. The Xception model focuses on the regions around the eyes and lips, which are key areas for deepfake identification, as they often contain unnatural artifacts related to movement and texture. The analysis of the obtained activations suggests that combining different architectures can enhance the effectiveness of deepfake detection by utilizing various approaches to processing visual information.
The obtained results confirm that different neural network architectures employ distinct approaches to identifying features characteristic of deepfakes. The use of Grad-CAM not only facilitates the interpretation of model decisions but also helps to identify key areas influencing classification, such as texture features, facial contours, and motion artifacts. This opens up possibilities for further improving deepfake detection methods by combining models with different sensitivity zones to achieve greater accuracy and reliability.
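For reference, the following is a compact Grad-CAM sketch in Keras following the standard formulation; the backbone, layer name, and random input are placeholders for illustration, not the exact setup used in this study.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import ResNet50

def grad_cam(model, image, conv_layer_name):
    """Standard Grad-CAM: weight the chosen conv layer's feature maps by the
    gradient of the target score and average over channels."""
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]                      # score of the class of interest
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))          # per-channel importance
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam)                                 # keep positive influence
    cam = cam / (tf.reduce_max(cam) + 1e-8)               # normalize to [0, 1]
    return cam.numpy()

# Example with a plain ImageNet ResNet50; "conv5_block3_out" is the name of
# its last convolutional block's output layer.
model = ResNet50(weights="imagenet")
heatmap = grad_cam(model, np.random.rand(224, 224, 3).astype("float32"),
                   "conv5_block3_out")
```

The resulting low-resolution heatmap is upsampled and overlaid on the input frame to obtain visualizations of the kind shown in Figure 5.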

4.3. Accuracy Evaluation

The analysis of feature importance in the meta-model (Figure 6) demonstrates that the most significant factor in deepfake classification is the average prediction value of the Xception model, which has the highest weight (5.1123). The second most important feature is the median value of the ResNet output predictions (4.8745), confirming the stability and relevance of this architecture in the ensemble approach. Slightly lower but still substantial importance is observed for the mean values of ResNet (1.5623) and the standard deviation of EfficientNet (1.1345), indicating their role in detecting local variations in videos. Features associated with the Inception model (std = 0.2678, median = 0.1452), as well as the mean value of FaceNet (0.0817), had significantly lower weights, suggesting their less significant informativeness in the classification process compared to other factors.
The accuracy evaluation of different meta-models confirms the superiority of XGBoost [36], which achieved a maximum accuracy of 0.91, significantly outperforming Gradient Boosting (0.85) and Random Forest (0.825). This highlights the effectiveness of XGBoost in combining base model predictions, leading to more precise generalization. Interestingly, some features, such as EfficientNet_mean (−0.5352), Xception_std (−0.4789), and ResNet_std (−0.4156), had negative importance values, which may indicate their potential impact in reducing classification efficiency.
The obtained results confirm the effectiveness of ensemble models for deepfake detection and highlight the feasibility of further optimizing the feature weight coefficients in the meta-model.

4.4. Experimental Results

To understand how different models and training settings affect deepfake detection, we tested five models: ResNet50, FaceNet, Xception, Inception, and EfficientNet. We analyzed their accuracy and loss under different conditions:
  • With golden frames: using selected keyframes from videos that provide the most important visual features.
  • Without golden frames: training without selected keyframes.
  • Without batch optimization: turning off batch-level optimizations during training.
  • Without dropout: removing dropout layers to see how it affects regularization.
  • Without Mish activation: replacing Mish activation functions with ReLU to check their impact.
The results for each setting are summarized in Table 4.
The results (Table 4) show that using golden frames improved accuracy for all models, with EfficientNet performing best (66.67%). Removing golden frames reduced accuracy, especially for ResNet50 (59.72%) and Xception (62.50%). Turning off batch optimization lowered performance, while dropout and Mish activation helped maintain stable accuracy. We also tested an XGBoost meta-model using features from ResNet50, EfficientNet, and Xception. With golden frames, XGBoost reached 91.14% accuracy, but accuracy dropped slightly to 90.54% without them. Removing batch optimization and dropout also impacted performance, with accuracy decreasing to 90.27% and 88.54%, respectively. Switching Mish to ReLU led to the lowest accuracy for XGBoost at 87.41%. These results highlight the importance of selecting the proper preprocessing steps and model settings for deepfake detection. Choosing the best frames and using batch optimization, dropout, and Mish activation can significantly improve accuracy and stability.

4.5. Module Description and Operation

The TruScanAI module [37] is designed to automatically analyze video files to detect deepfakes. The user interface allows for video uploads, after which the system processes the content using the proposed ensemble neural network approach, incorporating the golden frame selection method. The module automatically extracts key frames during the analysis and examines textural, contour, and kinematic facial features. Upon completion, the system provides the user with a classification result, labeling the video as REAL or FAKE, supplemented with a visualization of the analysis and key indicators that influenced the model’s decision.
This approach ensures an effective and accessible way to verify the authenticity of video content, catering to both regular users and experts in the field of digital security.

5. Discussion

Analysis of the results: The results confirm the effectiveness of the proposed ensemble approach for DeepFake classification, particularly due to integrating predictions from ResNet50, EfficientNetB0, and Xception into the XGBoost meta-model. The best accuracy among base models was achieved by EfficientNet (66.67%) when using golden frames, while XGBoost with golden frames reached an accuracy of 91.14%, confirming the effectiveness of this technique. In contrast, removing golden frames led to a drop in accuracy for all models, particularly for Xception (48.61%) and ResNet50 (48.61%). Batch optimization and dropout proved essential for stable performance—disabling them caused a noticeable decline in accuracy across all models.
Interpretation of neural network activations: The analysis of model activations using Grad-CAM confirmed that different neural network architectures have specific attention areas when classifying deepfakes. ResNet50 primarily focuses on the textural features of the skin, particularly the forehead and cheeks, which explains its effectiveness in detecting global artifacts. EfficientNetB0 concentrates on the contours of the mouth and eyes, providing more accurate detection of structural anomalies. Xception pays the most attention to the regions of the lips and eyes, which are key areas for detecting anomalies in movement and texture. Combining these models in an ensemble approach helps compensate for their shortcomings and improves deepfake detection accuracy.
Limitations and scalability challenges: Despite the high accuracy rates, the proposed method has certain limitations. One of the challenges is the models’ sensitivity to the quality of input videos: with low resolution or significant compression of video files, classification accuracy may decrease due to the loss of key textural features. Additionally, while the ensemble model demonstrates high overall accuracy, false positives and false negatives remain an issue, particularly in videos that use the latest deepfake generation techniques with enhanced realism. Furthermore, the computational complexity of the ensemble approach may pose a problem when integrating it into real-world applications that require real-time video processing.
Given the parallels between cybersecurity challenges in critical infrastructure and deepfake detection, the GA-and-IMECA-based cybersecurity assessment techniques for FPGA-based I&C systems presented by [38] provide a structured framework that could inform the future risk assessment of AI-based media authentication systems. Applying such rigorous security lifecycle approaches helps align deepfake detection mechanisms with best practices from high-assurance domains.
Opportunities for future research: Future research could focus on improving the model’s resilience to new generative methods, such as videos created using StyleGAN3 or Diffusion Models. A promising direction is the integration of additional features, particularly audio analysis, which would allow for the detection of inconsistencies between the video and audio tracks. Another essential area of development is optimizing computational costs, which could be achieved through the use of transfer learning or model compression without significant loss of accuracy. Additionally, extended testing on independent datasets from real-world sources would help assess the model’s generalizability in practical scenarios, such as fact-checking and digital forensic analysis.
Moreover, extending the current system to support multimodal detection—combining visual and auditory cues—could improve accuracy and robustness, particularly against deepfakes with well-synchronized but artificially generated audio tracks.
Ethical considerations. As deepfake detection systems become more widespread, the ethical balance between detection and censorship must be considered. While the technology aims to protect individuals and institutions from misinformation and fraud, there is a risk of misuse, for example, the automated filtering of legitimate satirical content or the over-blocking of dissenting voices in political contexts. Transparency, explainability, and human-in-the-loop validation should be prioritized in deploying such systems to mitigate potential misuse and maintain trust.

6. Conclusions

The proposed neural network ensemble method for deepfake classification using the golden frame selection technique demonstrated high efficiency in detecting fake videos. The combination of the base models ResNet50, EfficientNetB0, and Xception in an ensemble with the XGBoost meta-model achieved the highest accuracy among the tested methods, reaching 91% accuracy on the validation set and surpassing the results of Random Forest (82.5%) and Gradient Boosting (85%). Feature importance analysis showed that the most significant contributions to classification came from the average predictions of Xception (5.1123) and ResNet (4.8745), while other features had lower significance. Additional activation analysis using Grad-CAM confirmed that different neural network architectures focus on various aspects of the face, making their combination effective at detecting anomalies typical of deepfakes. The golden frame selection method significantly reduced computational costs, which is a critical factor for real-world applications, while maintaining high levels of accuracy and model generalization.
The practical application of the developed system is possible in several key areas. First, it can be integrated into fact-checking and media monitoring platforms to verify the authenticity of video materials used in news and social media. This would enable journalists and independent researchers to quickly assess the credibility of content, preventing the spread of misinformation. Furthermore, the system could be applied in digital forensic analysis, helping to detect fake video evidence in criminal cases. In the future, the integration of this approach into automatic video stream monitoring systems could allow for real-time manipulation detection, enhancing information security.
Despite the results obtained, the research revealed certain challenges, including the model’s sensitivity to the quality of input videos and the difficulty of processing cutting-edge generative methods. Further research should improve model generalization, integrate audio analysis, and optimize computational resources to ensure effective real-time operation. The results of this work may be valuable for developing digital forensic tools, media fact-checking, and ensuring information security in the face of increasing threats related to deepfake technologies.

Author Contributions

Conceptualization, K.L.-H. and N.M.; methodology, K.L.-H.; software, N.M.; validation, N.M., A.I. and M.T.; formal analysis, K.L.-H.; investigation, K.L.-H.; resources, A.I.; data curation, M.T.; writing—original draft preparation, K.L.-H.; writing—review and editing, O.I.; visualization, A.I.; supervision, K.L.-H.; project administration, K.L.-H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study is publicly available from the Deepfake Detection Challenge on Kaggle (2020) at: https://www.kaggle.com/c/deepfake-detection-challenge/data (accessed on 24 February 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Teneo. Deepfakes in 2024 Are Suddenly Deeply Real: An Executive Briefing on the Threat and Trends. 2024. Available online: https://www.teneo.com/insights/articles/deepfakes-in-2024-are-suddenly-deeply-real-an-executive-briefing-on-the-threat-and-trends/ (accessed on 24 February 2025).
  2. Centre for Strategic & International Studies (CSIS). The Future of Hybrid Warfare. 2024. Available online: https://www.csis.org/analysis/future-hybrid-warfare (accessed on 24 February 2025).
  3. Brookings. Deepfakes and International Conflict. 2023. Available online: https://www.brookings.edu/wp-content/uploads/2023/01/FP_20230105_deepfakes_international_conflict.pdf (accessed on 24 February 2025).
  4. Geneva Centre for Security Policy (GCSP). The War in Ukraine: Reality Check for Emerging Technologies and the Future of Warfare. 2024. Available online: https://www.gcsp.ch/publications/war-ukraine-reality-check-emerging-technologies-and-future-warfare (accessed on 24 February 2025).
  5. RAND Corporation. Ukraine’s Lessons for the Future of Hybrid Warfare. 2024. Available online: https://www.rand.org/pubs/commentary/2022/11/ukraines-lessons-for-the-future-of-hybrid-warfare.html (accessed on 24 February 2025).
  6. Lipianina-Honcharenko, K.; Maika, N.; Sachenko, S.; Kopania, L.; Soia, M. A Cyclical Approach to Legal Document Analysis: Leveraging AI for Strategic Policy Evaluation. CEUR-WS 2024, 3736, 201–211. [Google Scholar]
  7. Babeshko, I.; Illiashenko, O.; Kharchenko, V.; Leontiev, K. Towards Trustworthy Safety Assessment by Providing Expert and Tool-Based XMECA Techniques. Mathematics 2022, 10, 2297. [Google Scholar] [CrossRef]
  8. Tran, V.-N.; Lee, S.-H.; Le, H.-S.; Kwon, K.-R. High Performance DeepFake Video Detection on CNN-Based with Attention Target-Specific Regions and Manual Distillation Extraction. Appl. Sci. 2021, 11, 7678. [Google Scholar] [CrossRef]
  9. Zhang, J.; Cheng, K.; Sovernigo, G.; Lin, X. A Heterogeneous Feature Ensemble Learning Based Deepfake Detection Method. In Proceedings of the ICC 2022—IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; pp. 2084–2089. [Google Scholar] [CrossRef]
  10. Nadimpalli, A.V.; Rattani, A. Facial Forgery-Based Deepfake Detection Using Fine-Grained Features. In Proceedings of the 2023 International Conference on Machine Learning and Applications, Jacksonville, FL, USA, 15–17 December 2023; pp. 2174–2181. [Google Scholar] [CrossRef]
  11. Guan, L.; Liu, F.; Zhang, R.; Liu, J.; Tang, Y. MCW: A Generalizable Deepfake Detection Method for Few-Shot Learning. Sensors 2023, 23, 8763. [Google Scholar] [CrossRef] [PubMed]
  12. Khan, R.; Sohail, M.; Usman, I.; Sandhu, M.; Raza, M.; Yaqub, M.A.; Liotta, A. Comparative study of deep learning techniques for DeepFake video detection. ICT Express 2024, 10, 1226–1239. [Google Scholar] [CrossRef]
  13. Chakraborty, R.; Naskar, R. Role of human physiology and facial biomechanics towards building robust deepfake detectors: A comprehensive survey and analysis. Comput. Sci. Rev. 2024, 54, 100677. [Google Scholar] [CrossRef]
  14. Abbas, F.; Taeihagh, A. Unmasking deepfakes: A systematic review of deepfake detection and generation techniques using artificial intelligence. Expert Syst. Appl. 2024, 252, 124260. [Google Scholar] [CrossRef]
  15. Casu, M.; Guarnera, L.; Caponnetto, P.; Battiato, S. GenAI mirage: The impostor bias and the deepfake detection challenge in the era of artificial illusions. Forensic Sci. Int. Digit. Investig. 2024, 50, 301795. [Google Scholar] [CrossRef]
  16. Firc, A.; Malinka, K.; Hanáček, P. Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors. Heliyon 2023, 9, e15090. [Google Scholar] [CrossRef] [PubMed]
  17. Lee, E.-G.; Lee, I.; Yoo, S.-B. ClueCatcher: Catching Domain-Wise Independent Clues for Deepfake Detection. Mathematics 2023, 11, 3952. [Google Scholar] [CrossRef]
  18. Naitali, A.; Ridouani, M.; Salahdine, F.; Kaabouch, N. Deepfake Attacks: Generation, Detection, Datasets, Challenges, and Research Directions. Computers 2023, 12, 216. [Google Scholar] [CrossRef]
  19. Dincer, S.; Ulutas, G.; Ustubioglu, B.; Tahaoglu, G.; Sklavos, N. Golden ratio based deep fake video detection system with fusion of capsule networks. Comput. Electr. Eng. 2024, 117, 109234. [Google Scholar] [CrossRef]
  20. Sumanth, S.; Durga, T.C.; Sai, C.Y.; Manne, S. Temporal Convolutional Network & Content-Based Frame Sampling Fusion for Semantically Enriched Video Summarization. Research Square. 2023. Available online: https://www.researchsquare.com/article/rs-3010938/latest (accessed on 24 February 2025).
  21. Gong, B.; Chao, W.-L.; Grauman, K. Diverse Sequential Subset Selection for Supervised Video Summarization. NeurIPS. 2014. Available online: https://proceedings.neurips.cc/paper_files/paper/2014/file/5d3b9e06117de70a7e5076cc3ed89e18-Paper.pdf (accessed on 24 February 2025).
  22. Leszczuk, M.I.; Duplaga, M. Algorithm for video summarization of bronchoscopy procedures. BioMed. Eng. OnLine 2011, 10, 110. [Google Scholar] [CrossRef] [PubMed]
  23. Kim, H.H.; Kim, Y.H. Toward a conceptual framework of key-frame extraction and storyboard display for video summarization. J. Am. Soc. Inf. Sci. Technol. 2010, 61, 1130–1142. [Google Scholar] [CrossRef]
  24. Alarfaj, F.K.; Khan, J.A. Deep Dive into Fake News Detection: Feature-Centric Classification with Ensemble and Deep Learning Methods. Algorithms 2023, 16, 507. [Google Scholar] [CrossRef]
  25. Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 63–72. [Google Scholar]
  26. Kansal, K.; Chandra, T.B.; Singh, A. ResNet-50 vs. EfficientNet-B0: Multi-Centric Classification of Various Lung Abnormalities Using Deep Learning “Session id: ICMLDsE. 004”. Procedia Comput. Sci. 2024, 235, 70–80. [Google Scholar] [CrossRef]
  27. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  28. Xia, X.; Xu, C.; Nan, B. Inception-v3 for flower classification. In Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017; pp. 783–787. [Google Scholar]
  29. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  30. Peng, C.; Liu, Y.; Yuan, X.; Chen, Q. Research of image recognition method based on enhanced inception-ResNet-V2. Multimed. Tools Appl. 2022, 81, 34345–34365. [Google Scholar] [CrossRef]
  31. Komar, M.; Dorosh, V.; Hladiy, G.; Sachenko, A. Deep neural network for detection of cyber attacks. In Proceedings of the 2018 IEEE 1st International Conference on System Analysis and Intelligent Computing, SAIC 2018—Proceedings, Kyiv, Ukraine, 8–12 October 2018; p. 8516753. [Google Scholar]
  32. Lipianina-Honcharenko, K.; Yarych, V.; Ivasechko, A.; Filinyuk, A.; Yurkiv, K.; Lebid, T.; Soia, M. Evaluating the Effectiveness of Attention-Gated-CNN-BGRU Models for Historical Manuscript Recognition in Ukraine. In Proceedings of the First International Workshop of Young Scientists on Artificial Intelligence for Sustainable Development, Ternopil, Ukraine, 10–11 May 2024; pp. 99–108. [Google Scholar]
  33. Lipianina-Honcharenko, K.; Telka, M.; Melnyk, N. Comparison of ResNet, EfficientNet, and Xception architectures for deepfake detection. In Proceedings of the 1st International Workshop on Advanced Applied Information Technologies CEUR-WS, Khmelnytskyi, Ukraine, Zilina, Slovakia, 5 December 2024; pp. 26–34. Available online: https://ceur-ws.org/Vol-3899/paper3.pdf (accessed on 24 February 2025).
  34. Ni, W.; Wang, T.; Wu, Y.; Liu, X.; Li, Z.; Yang, R.; Zhang, K.; Yang, J.; Zeng, M.; Hu, N.; et al. Multi-task deep learning model for quantitative volatile organic compounds analysis by feature fusion of electronic nose sensing. Sens. Actuators B Chem. 2024, 417, 136206. [Google Scholar] [CrossRef]
  35. Kaggle. Deepfake Detection Challenge. 2020. Available online: https://www.kaggle.com/competitions/deepfake-detection-challenge/data (accessed on 24 February 2025).
  36. Ni, W.; Wang, T.; Wu, Y.; Chen, X.; Cai, W.; Zeng, M.; Yang, J.; Hu, N.; Yang, Z. Classification and concentration predictions of volatile organic compounds using an electronic nose based on XGBoost-random forest algorithms. IEEE Sens. J. 2023, 24, 671–678. [Google Scholar] [CrossRef]
  37. TruScanAI. Available online: https://sci-proj.wunu.edu.ua/truscanai/ (accessed on 24 February 2025).
  38. Illiashenko, O.; Kharchenko, V.; Kovalenko, A. Cyber security lifecycle and assessment technique for FPGA-based I&C systems. In Proceedings of the East-West Design & Test Symposium (EWDTS 2013), Rostov-on-Don, Russia, 27–30 September 2013; pp. 1–5. [Google Scholar] [CrossRef]
Figure 1. The structure of the neural network ensemble method for deepfake classification using golden frame selection.
Figure 2. The distribution of labels in the training dataset.
Figure 3. The golden frames selected from fake and real videos.
Figure 4. The change in accuracy and loss on the training and validation sets for different models.
Figure 5. Visualization of the impact of features on deepfake classification using Grad-CAM.
Figure 6. Feature importance in the meta-model for deepfake classification.
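The golden frame selection illustrated in Figures 1 and 3 reduces each video to its most informative frames before classification. The exact selection criterion is defined in the main text rather than restated here; purely as an illustration, the sketch below ranks frames by a simple sharpness score (variance of the Laplacian) and keeps the top k, which is one plausible way such a selector could be implemented.

```python
# Illustrative sketch only: selects the k sharpest frames of a video as
# candidate "golden frames", ranked by variance of the Laplacian.
# The selection criterion used in the paper may differ.
import cv2


def select_golden_frames(video_path: str, k: int = 5):
    """Return the k frames with the highest sharpness score (BGR arrays)."""
    cap = cv2.VideoCapture(video_path)
    scored = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Variance of the Laplacian is a cheap proxy for frame sharpness.
        score = cv2.Laplacian(gray, cv2.CV_64F).var()
        scored.append((score, frame))
    cap.release()
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [frame for _, frame in scored[:k]]
```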
Table 1. Overview of closest analogs.

| Author(s) | Year | Method Description | Method Features | Method Novelty |
| --- | --- | --- | --- | --- |
| Tran et al. [8] | 2021 | The use of CNN with a focus on specific target regions. | High performance achieved through manual feature distillation. | Model size optimization for improved performance. |
| Zhang et al. [9] | 2022 | Heterogeneous feature ensemble for detection. | Integration of various features (gray gradient, spectral, and texture features) to improve accuracy. | Improving accuracy through feature ensemble. |
| Guan et al. [11] | 2023 | Multifunctional weighted model based on meta-learning. | Utilizing RGB and frequency domains to enhance generalization. | High generalization and adaptation to diverse data. |
| Lee et al. [17] | 2023 | “ClueCatcher” with domain-independent feature selection. | Focus on facial color mismatches, synthesis boundary artifacts, and quality differences between facial and non-facial regions. | Effectiveness in real-world scenarios. |
Table 2. Comparison of features extracted using ResNet50, EfficientNetB0, Xception, InceptionV3, and Facenet.

| Feature Type | ResNet50 | EfficientNetB0 | Xception | InceptionV3 | Facenet |
| --- | --- | --- | --- | --- | --- |
| Edges and contours of objects | Object edge detection. | Edge detection with computational resource optimization. | Contour detection with deep detailing. | Edge detection through multichannel operations for general objects. | Precise edge detection of faces for feature recognition. |
| Textural patterns | Surface texture detection. | Textural details at different scales. | Complex texture detection. | Comprehensive texture analysis, adapted for general objects. | Texture extraction tailored for facial features. |
| Shapes and sizes | Recognition of basic shapes and sizes. | Efficient shape recognition at multiple levels. | Detection of complex shapes and geometric features. | Generalized shape detection for diverse objects. | Detailed face shape recognition for accurate identification. |
| Lighting and shadows | Lighting variation analysis. | Balancing local and global lighting variations. | High sensitivity to shadows and lighting. | Robustness to lighting changes through multiscaling. | Analysis of subtle lighting variations specific to faces. |
| Conceptual features | Face, vehicle, and other object recognition. | Object recognition based on simplified features. | High efficiency in recognizing complex objects. | Generalized object recognition across different categories. | Optimized for face recognition and identification. |
| Depth-related features | Complex patterns (shapes, textures, contours). | Complex patterns at different scales. | Structural and textural features. | Generalized features at deep levels, effective for a wide range of objects. | Deep facial traits for differentiating even similar faces. |
| Spatial features | Geometric details (straight lines, angles). | Local details and global structure. | Complex interactions between elements. | Geometric patterns adapted for different object scales. | Spatial features for precise facial detail detection. |
| Color features | - | Color channel correlations. | Separate extraction of spatial and channel features. | Sensitivity to color channels for general objects. | Consideration of subtle color nuances important for faces. |
| Contextual features | Interaction of objects in the scene. | Local and global relationships between objects. | Interactions between objects in the scene. | Detection of relationships between objects in complex scenes. | Determining the context of faces in the scene to improve accuracy. |
| Features across different levels | - | Balance between details and general patterns. | Local details and general characteristics. | Different feature levels for different object categories. | Local and global features for face identification. |
| Detailing features | General characteristics of objects. | High detailing of small objects. | Precise extraction of details and textures. | Different levels of detailing for a wide range of objects. | Super-detailing of facial features for accurate recognition. |
| Local structure features | Focus on general shapes and textures. | Balance between local and global features. | Focus on local details. | Local and global features for accurate scene analysis. | Focus on local facial features for reliable identification. |
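Table 2 contrasts what each backbone contributes to the ensemble. As a minimal sketch of how such per-frame descriptors can be obtained, the code below loads four of the backbones from tf.keras.applications with global average pooling and embeds a single RGB frame. Facenet is not bundled with Keras and is therefore omitted, and the input sizes and preprocessing shown are the stock Keras defaults, not necessarily the training configuration used in the paper.

```python
# Sketch: frame-level feature vectors from pretrained Keras backbones.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import ResNet50, EfficientNetB0, Xception, InceptionV3

BACKBONES = {
    "resnet50": (ResNet50, tf.keras.applications.resnet50.preprocess_input, (224, 224)),
    "efficientnetb0": (EfficientNetB0, tf.keras.applications.efficientnet.preprocess_input, (224, 224)),
    "xception": (Xception, tf.keras.applications.xception.preprocess_input, (299, 299)),
    "inceptionv3": (InceptionV3, tf.keras.applications.inception_v3.preprocess_input, (299, 299)),
}


def build_extractors():
    """Return {name: (model, preprocess_fn, input_size)} with pooled embeddings as output."""
    return {
        name: (ctor(weights="imagenet", include_top=False, pooling="avg"), preprocess, size)
        for name, (ctor, preprocess, size) in BACKBONES.items()
    }


def extract_features(frame_rgb: np.ndarray, extractors: dict) -> dict:
    """Embed one RGB frame (H, W, 3) with every backbone and return a dict of vectors."""
    feats = {}
    for name, (model, preprocess, size) in extractors.items():
        img = tf.image.resize(frame_rgb, size)                      # resize to the backbone's input size
        batch = preprocess(tf.cast(img, tf.float32)[tf.newaxis, ...])  # add batch dim and preprocess
        feats[name] = model(batch, training=False).numpy().squeeze()
    return feats
```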
Table 3. Analysis of Grad-CAM activations in the deepfake classification process.

| Model | Main Activation Zones | Main Type of Anomalies |
| --- | --- | --- |
| ResNet50 | Forehead, cheeks | Textural distortions, uneven lighting |
| EfficientNetB0 | Contours of the mouth, eyes | Facial contour deformations, unnatural light transitions |
| Xception | Eye area, lips | Motion anomalies, unnatural texture details |
| InceptionV3 | Central part of the face | Color instability, blurring of details |
| Facenet | Entire face structure | Global distortions of shape and proportions |
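The activation zones in Table 3 come from Grad-CAM. A minimal Grad-CAM sketch for a tf.keras classifier is given below; the last convolutional layer name is an assumption (for example, "conv5_block3_out" in the stock Keras ResNet50) and may differ in the models actually trained for this study.

```python
# Minimal Grad-CAM sketch for a tf.keras classifier.
import numpy as np
import tensorflow as tf


def grad_cam(model: tf.keras.Model, image: np.ndarray, last_conv_layer: str) -> np.ndarray:
    """Return a heatmap in [0, 1] highlighting the regions driving the prediction."""
    # Model that exposes both the last conv feature map and the final prediction.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_channel = preds[:, tf.argmax(preds[0])]
    grads = tape.gradient(class_channel, conv_out)
    # Channel-wise importance weights: global average of the gradients.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```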
Table 4. Model performance under different configurations (accuracy/loss).

| Model | With Golden Frames | Without Golden Frames | Without Batch Optimization | Without Dropout | Without Mish Activation |
| --- | --- | --- | --- | --- | --- |
| ResNet50 | 0.597/0.673 | 0.486/0.701 | 0.431/0.700 | 0.528/0.686 | 0.528/0.692 |
| EfficientNetB0 | 0.486/0.700 | 0.597/0.679 | 0.569/0.662 | 0.528/0.684 | 0.514/0.725 |
| Xception | 0.625/0.654 | 0.486/0.714 | 0.528/0.691 | 0.556/0.689 | 0.583/0.675 |
| InceptionV3 | 0.500/0.722 | 0.486/0.767 | 0.583/0.704 | 0.431/0.746 | 0.514/1.088 |
| Facenet | 0.597/0.676 | 0.569/0.714 | 0.583/0.678 | 0.486/0.688 | 0.528/0.778 |
| Meta-model (XGBoost) | 0.911/0.224 | 0.905/0.334 | 0.903/0.354 | 0.885/0.386 | 0.874/0.456 |
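Table 4 shows that the XGBoost meta-model lifts accuracy well above any single backbone. The sketch below illustrates the stacking idea only: per-frame fake probabilities from the base networks are aggregated per video and passed to an XGBoost classifier. The feature layout, hyperparameters, and random toy data are placeholders, not the configuration used to produce the reported 0.911 accuracy.

```python
# Sketch of the stacking stage: aggregate base-model probabilities per video,
# then train an XGBoost meta-classifier on the aggregated features.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def stack_predictions(base_probs: np.ndarray) -> np.ndarray:
    """base_probs: (n_videos, n_models, n_golden_frames) fake probabilities.
    Returns one feature row per video (per-model mean and max over frames)."""
    mean_feat = base_probs.mean(axis=2)
    max_feat = base_probs.max(axis=2)
    return np.concatenate([mean_feat, max_feat], axis=1)


# Toy example with random numbers standing in for real base-model outputs.
rng = np.random.default_rng(0)
base_probs = rng.random((200, 5, 5))       # 200 videos, 5 base models, 5 golden frames
labels = rng.integers(0, 2, size=200)      # 0 = real, 1 = fake

X = stack_predictions(base_probs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

meta = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, eval_metric="logloss")
meta.fit(X_train, y_train)
print("meta-model accuracy:", accuracy_score(y_test, meta.predict(X_test)))
```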