Article

Two–Stage Detection and Localization of Inter–Frame Tampering in Surveillance Videos Using Texture and Optical Flow

1 Department of Computer Science, University of Education, Lahore 54510, Pakistan
2 Department of Computer Science, King Saud University, Riyadh 11543, Saudi Arabia
3 Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Islamabad 45550, Pakistan
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(22), 3482; https://doi.org/10.3390/math12223482
Submission received: 27 September 2024 / Revised: 24 October 2024 / Accepted: 5 November 2024 / Published: 7 November 2024

Abstract

Surveillance cameras provide security and protection through real-time monitoring or through the investigation of recorded videos. The authenticity of surveillance videos cannot be taken for granted, but tampering detection is challenging. Existing techniques face significant limitations, including restricted applicability, poor generalizability, and high computational complexity. This paper presents a robust detection system to meet the challenges of frame duplication (FD) and frame insertion (FI) detection in surveillance videos. The system leverages the alterations in texture patterns and optical flow between consecutive frames and works in two stages; first, suspicious tampered videos are detected using motion residual–based local binary patterns (MR–LBPs) and SVM; second, by eliminating false positives, the precise tampering location is determined using the consistency in the aggregation of optical flow and the variance in MR–LBPs. The system is extensively evaluated on a large COMSATS Structured Video Tampering Evaluation Dataset (CSVTED) comprising challenging videos with varying quality of tampering and complexity levels and cross–validated on benchmark public domain datasets. The system exhibits outstanding performance, achieving 99.5% accuracy in detecting and pinpointing tampered regions. It ensures the generalization and wide applicability of the system while maintaining computational efficiency.

1. Introduction

Nowadays, validating the authenticity and integrity of multimedia content, including audio, video, and graphics on social media, has become a big challenge for investigating agencies, scientists, and researchers. Low–cost and easily accessible tools such as Adobe Premiere Pro, Video Edit Magic, Adobe After Effects, and Movie Maker can manipulate surveillance footage for malicious purposes. Sometimes, video editing aims to improve the quality of video content. However, the alterations are sometimes not so innocent, and the video cannot be used as a primary source of evidence in critical scenarios like crime investigation. In such matters, where videos captured by CCTV, smartphones, or digital cameras constitute potential evidence, it becomes crucial to ensure the content’s authenticity. This problem requires a robust tampering detection system to determine whether the video has been altered. In addition, it is imperative to identify the precise location of tampering. Video forensic techniques are categorized as active or passive. Passive tampering detection can be subdivided into intra–frame and inter–frame tampering. Both manipulate video content, but differ in their targeted domains. Intra–frame tampering is applied in the spatial or spatiotemporal domain and can be identified by image forensics algorithms. Common intra–frame tampering methods are copy–move and region splicing [1]. Temporal domain tampering affects the sequence of frames and includes frame insertion (FI), frame deletion/elimination (FE), and frame duplication (FD). Among them, frame duplication is commonly used to extend or hide a specific scene. For example, frame duplication can be performed on a surveillance video to hide an individual’s departure from a building at a specific time. FD is relatively simple to execute, yet its detection remains challenging due to the seamless integration of copied frames at different temporal positions within the same video. In frame insertion, a foreign sequence of frames is introduced to create a false notion or add a fake event. Tampering detection techniques are useful in various domains, including identifying criminal activities, preventing crimes, verifying digital document authenticity, and influencing social impressions, as well as in the investigation and religious faith sectors [2,3]. Incorrect information can be generated to show that the criminal was at a different location than the crime scene [4]. If such a video is part of a criminal investigation, it can mislead the investigators [5,6]. Frame duplication and insertion are illustrated in Figure 1.
Various methods have been proposed for the detection of inter–frame tampering, including those based on statistical features [7,8,9], pixel and texture features [10,11,12,13], motion residual and optical flow (OF) [14,15], and deep learning [5,16,17,18,19]. A thorough analysis of the literature reveals that OF [15,20,21] and prediction residual (PR) [22,23,24] are the two most commonly utilized features for inter–frame tampering detection. These features are easy to compute and produce good results. Although various solutions for frame duplication and insertion detection have been proposed, they still face four main challenges. First, their applicability is limited by various factors, including the video format, number of tampered frames, and frame rate [25,26]. For example, some techniques exhibit limited capabilities in detecting inter–frame tampering, such as the technique described in Ref. [27], which is unable to identify duplication involving more than 20 frames. Similarly, the method mentioned in Ref. [28] requires tampered frames to be in multiples of 10 and at least 25 in number for effective detection. Second, these methods suffer from poor generalizability due to the unavailability of standard datasets for the performance evaluation of tampering detection algorithms [29]. Although researchers have developed datasets for the detection of inter–frame tampering [14,15,23], these resources are not publicly available and are often limited in size, creating barriers to comprehensive experimentation and benchmarking. Cross–validation to ensure the generalization capabilities of existing methods is mostly not conducted due to the lack of available datasets [11,25]. Third, there is the challenge of tampering localization; few techniques can pinpoint tampering at a specific position in a video. For example, the approach in Ref. [30] identifies frame deletion in the center (between frames 8 and 9) of a single video shot, but deletion is not necessarily performed in the middle of a video. Similarly, the passive approach presented in Ref. [13] is unable to locate frame duplication tampering. Fourth, high computational complexity is a challenge; most of the earlier techniques are computationally intensive due to pixel–based [31,32] or spatial/temporal correlation–based methods [33,34,35], and processing becomes time–consuming when a video has high resolution and/or a large number of frames.
Additionally, existing image tampering detection systems may not be reliable for identifying tampering in videos: treating the time domain as a third dimension adds complexity and significantly affects video compression, making it challenging to apply these techniques effectively [36]. Because of these challenges, there is a need for a detection scheme for frame–based tampering that fulfils three basic requirements: broad applicability, strong generalization capability, and high accuracy with good robustness.
In this study, we develop a robust system exploiting the inherent motion inconsistencies arising from frame duplication (FD) and frame insertion (FI) operations at the beginning and end of tampered frames. The system comprises two stages. In the first stage, the contextual and textural information of each frame is utilized by extracting features based on the motion–residual–based local binary pattern (MR–LBP) to detect suspected tampered videos. In the second stage, consistency in the aggregation of optical flow and standard deviation of MR–LBP features of consecutive frames are used to pinpoint the tampered region. The presence of spikes in MR–LBP features indicates the presence of frame insertion or duplication attack in surveillance video, and the peaks in the OF aggregate graph specify the precise location of tampering.
The major contributions of this work are as follows:
  • We propose an inter–frame video tampering detection technique (stage 1) to identify suspicious tampered videos by tracing out alterations in texture patterns. To detect these alterations, we propose the extraction of texture features by employing the MR–LBP. The proposed method simultaneously detects both frame duplication and insertion tampering, unlike current state–of–the–art techniques [12,15,37,38,39], which are limited to the detection of only one type of tampering. Additionally, the proposed method imposes no constraints on video formats, the type of capturing device, frame rate, or the minimum number of duplicated or inserted frames required to detect tampering; it can detect the duplication and insertion of as few as ten frames. In contrast, the deep learning–based method described in Ref. [28] only detects and locates tampering regions when tampered frames occur in multiples of 10 and is unable to detect tampering involving fewer than 25 frames. It also assumes that tampered frames are only inserted in the static section of the video when frame duplication is performed. In the proposed method, each gray–level frame is encoded with the local binary pattern (LBP), resulting in a feature vector of dimension 256, which makes the method computationally efficient compared to deep learning–based methods with high–dimensional features.
  • For the localization of video inter–frame tampering (stage 2), we suggest independently employing the OF aggregation and standard deviation of MR–LBP features of consecutive frames; this removes false positives. The inconsistency in OF aggregation pinpoints the exact start and end of the tampered region in the surveillance video.
  • Due to a lack of benchmark datasets, the method is trained and tested on our developed large dataset, the COMSATS Structured Video Tampering Evaluation Dataset (CSVTED), which comprises challenging videos with different complexity levels and tampering quality, and is cross–validated on publicly available datasets. The benchmark public domain datasets contain a variety of videos, including Event–Object–Person (EOP)–based tampering. High detection accuracy on the CSVTED and on the public domain datasets strongly validates the method’s generalization capability; in previous studies, no such cross–dataset validation was carried out.
  • The performance of the proposed method is compared with state–of–the–art methods in terms of accuracy. Comparison results show that MR–LBP features make a major contribution to detecting frame duplication and insertion tampering, with accuracies of 99.71% and 99.87%, respectively.
  • The method is computationally efficient, with a per–pixel processing time of a few microseconds.
The organization of this paper is as follows: Section 2 reviews recent literature on frame duplication and insertion tampering detection. The proposed technique, evaluation protocols, and experimental results are presented in Section 3, Section 4, and Section 5, respectively. Section 6 discusses the results, and concluding remarks are presented in Section 7.

2. Literature Review

In the field of digital forensics, video tampering detection is still in its primitive stages, and it suffers from a dearth of robust techniques for detecting and localizing video tampering [40,41]. When frames are inserted or duplicated in a video, the consistency of object motion is disturbed at the start and end of the tampered frames. Texture features [10,12,42,43,44], optical flow [15,20,21], prediction residual [22,23,24], the standard deviation of residual frames [37], the bag–of–words (BoW) model [38], correlation [7,10,45], motion residual [46], and noise residue [33]–based features have been used in the literature to detect these inconsistencies in videos. The approach based on the consistency of correlation coefficients of gray values (CCCoGVs), presented by Wang et al. [8], detects inter–frame tampering; the method extracts CCCoGVs to differentiate between authentic and tampered videos. Similarly, Huang et al. [47] introduced a framework based on triangular polarity feature classification (TPFC) for the detection of video tampering (insertion and deletion). Kumar and Gaur [48] proposed a statistical approach to detect and localize frame insertion and deletion by exploiting Haralick features to compute correlation coefficients between adjacent frames. The tampering location is determined at the minimum value of correlation. The method demonstrates an accuracy of 97% at the frame level and 83% at the video level. It has some limitations, particularly when fewer than five frames are inserted or removed, as it fails to detect such subtle tampering, and it is susceptible to false positives. SIFT features are used by Kharat et al. [12] to develop an approach for detecting frame duplication only; this approach is evaluated on a small dataset of 20 videos. Jia et al. [15] propose a coarse–to–fine approach, where OF sum consistency is analyzed to identify suspected tampered frames and fine detection is then carried out by OF correlation. This method only handles the single tampering type of frame duplication. Huang et al. [49] expose video tampering by fusing the audio channel with the video frame sequence channel. Audio features are extracted using discrete wavelet packet decomposition to locate the tampered point of the audio channel. Then, the inter–frame similarity is determined using a perceptual hash to locate the tampered points in a frame sequence. By fusing these results, frame insertion and deletion are detected. This method requires audio data along with video, but CCTV footage typically does not include audio.
Fayyaz et al. [33] found that the noise residue between consecutive frames remains consistent in the case of authentic video and fluctuates in tampered video. Based on this fact, they estimated the sensor pattern noise (SPN) using locally adaptive DCT to detect and localize tampered regions. A significant drawback of this method is its high computational complexity. The tampering process inherently alters the texture of the video frames, providing clues to detect tampering. The camera and background are static in surveillance videos, which makes tampering easy to perform and difficult to detect. Therefore, it is essential to develop a robust method to detect temporal tampering in videos. Raskar and Shah [50] developed a system to detect copy–move video tampering by exploiting histograms of the second–order gradient (HSOG). This approach employs contrast–limited adaptive histogram equalization (CLAHE) to identify suspicious frames through correlation coefficient analysis. To detect tampering, HSOG is computed based on distance and similarity thresholds. Motivated by the success of deep learning, Johnston et al. [17] introduced a tampering detection technique that extracts features from authentic content to identify key frames and tampered regions in three publicly available tampered datasets. Convolutional neural networks (CNNs) were employed to estimate the quantization parameters, deblock settings, and intra/inter modes of pixel patches from H.264/AVC sequences. The method was specifically designed to handle a single type of tampering in videos with static backgrounds.
Fadl et al. [28] utilized a pre–trained 2D–CNN model to extract spatiotemporal features. Subsequently, the structural similarity index measure (SSIM) is applied to obtain the deep learning–based features of the entire video. This method struggles to detect tampering when the tampered region comprises fewer than 25 frames and assumes that frames are only inserted in the static parts of the video in frame duplication. However, it effectively detects tampering when the selected frames for tampering are in multiples of 10. Moreover, its localization is imprecise, and the method lacks cross–dataset validation, limiting its generalization capability. Shelke et al. [45] proposed a passive algorithm for video tampering detection based on the correlation consistency between entropy–coded frames. This algorithm effectively identifies various types of tampering, including frame insertion, splicing, duplication, and deletion. The method was evaluated on a limited dataset comprising 30 tampered videos. Bozkurt et al. [39] proposed an effective approach for detecting frame duplication attacks in both uncompressed and compressed videos. The method relies on the visualization of feature vectors, where a binary image is generated from a feature matrix. A template of the tampered frame group is formed from the binary image, which is then used to search for duplications within the video. The algorithm can detect tampering regardless of whether the forgery occurs at the beginning or the end of the video. This method was evaluated on 13 videos recorded by stationary cameras. Recently, Akhtar et al. [19] introduced an innovative deep learning–based method for detecting various inter–frame tampering approaches such as frame insertion (FI) and frame elimination (FE) in surveillance videos. Their method employs a 2D–CNN to extract spatiotemporal features, followed by dimensionality reduction of the features to lower computational complexity. Temporal dependencies among video frames are then analyzed using LSTM/GRU. Despite these sophisticated techniques, the overall detection accuracy achieved is 90.73%. The reason for the relatively low accuracy is that the dataset is small, comprising only 2555 videos, and deep learning models typically require large benchmark datasets for optimal performance. The system proposed in [18] operates through a four–stage framework to detect insertion and deletion tampering: video frames are extracted and resized, a fine–tuned pre–trained VGG–16 model extracts visual features, the dimensionality of the extracted features is reduced by applying kernel principal component analysis, and then tampering is detected by correlation analysis among the extracted features. This method is unable to detect frame duplication tampering. A summary of all of the reviewed methods on frame duplication and insertion tampering detection, along with their advantages, limitations, and outcomes, is presented in Table 1.
Most deep learning–based methods [5,30,51] are efficient at detecting specific types of tampering under some conditions, but are unable to capture tampering traces left by various types of inter–frame tampering. These data–driven approaches rely on extensive training datasets to automatically learn complex features for tampering detection. Recently, Shehnaz and Kaur [13] employed texture features like histograms of oriented gradients (HoG) and variants of LBP to detect and localize multiple types of tampering, but their method cannot localize frame duplication tampering. Many researchers have conducted experiments on videos that have been synthetically manipulated. Similarly, many temporal tampering detection methods perform effectively on specific video datasets, but struggle to deliver comparable results on unfamiliar video datasets. It is also important to localize the tampered region to gain the trust of the end user. From Table 1, it is evident that many techniques did not precisely localize the tampered frames.
Statistical features are widely used to detect/localize inter-frame tampering, but suffer from computational overhead due to the complex correlation calculations. Motion features such as OF are considered the most suitable for tampering detection, but the performance of detection techniques may be affected by the speed of objects and the background of the video. Less human interaction is needed in machine learning–based techniques, but these methods need huge datasets along with more computational power. The use of only one type of feature is not enough to determine all types of tampering. Furthermore, existing algorithms perform effectively on custom datasets. To assess the performance of the developed methods, it is necessary to evaluate them on unknown datasets that are freely accessible to the public. Unfortunately, this is challenging, because these datasets are not available in the public domain. Tampered video datasets remain underdeveloped compared to tampered image datasets.
Table 1. Summary of temporal tampering detection techniques (precision: P, recall: R, detection accuracy: DA, localization accuracy: LA).
1. Ulutas et al. (2017) [52]. Method/features: binary feature extraction and PSNR. Tampering identified: duplication, mirroring. Dataset: 10 videos. Results: P 99.98/100, R 99.30/97.34, DA 99.35/98.20.
  • Efficient with high accuracy and lower computational complexity
  • Dataset is very small
  • No cross-dataset validation
2. Kingra, Aggarwal et al. (2017) [21]. Method/features: prediction residual (PR) and optical flow (OF). Tampering identified: insertion, deletion, duplication. Dataset: personally developed tampered dataset. Results: DA 80, 83, 75 / 92, 83, 88 / 100, 96, 100; LA 80%.
  • Technique is effective when more than 30 frames are tampered
  • Performance degradation for videos with high illumination
  • Accuracy of localization is poor
  • No cross-dataset validation
3. Ulutas et al. (2018) [38]. Method/features: bag-of-words (BOW) model with SIFT. Tampering identified: duplication (moving and static views). Dataset: 31 videos. Results: P 97.94/98.57, R 97.65/99.13, DA 96.73/98.17.
  • Computationally efficient
  • Dataset is small
  • Detects one type of tampering
4. Zhao, Wang et al. (2018) [43]. Method/features: similarity between H-S-V histograms, SURF feature extraction along with FLANN matching. Tampering identified: insertion, deletion, duplication, and localization. Dataset: 10 test shots. Results: P 98.07, R 100, DA 99.01.
  • Capable of detecting inter-frame tampering
  • Does not work for videos with scene changes in a shot
  • Poor localization
  • Dataset is very small
5. Huang, Zhang et al. (2018) [49]. Method/features: wavelet packet decomposition, quaternion DCT features. Tampering identified: deletion, insertion. Dataset: 115 videos from OV and SULFA, 124 personally recorded. Results: P 0.9876, R 0.9847.
  • By fusing the audio channel, the technique can detect and localize frame deletion and insertion effectively
  • Audio file is needed with videos
  • No cross-dataset validation
6. Jia et al. (2018) [15]. Method/features: optical flow (OF) sum consistency and correlation between frames. Tampering identified: duplication. Dataset: VTL: 55, SULFA: 36, DERF: 24 videos. Results: P 0.985, R 0.985; computation time 1.623 µs/pixel.
  • High detection accuracy with small computation time
  • Detects only one type of tampering
  • No cross-dataset validation
7. Fadl et al. (2018) [53]. Method/features: energy difference between frames, SNR, and spatiotemporal energy. Tampering identified: duplication, insertion, deletion. Dataset: 120 videos from SULFA, 28 from and 3 from IVY Lab. Results: P 0.97/0.99/0.97, R 0.99/0.99/0.95, F1 0.98/0.99/0.96.
  • Can detect tampering in videos if they have gone through multiple compression stages
  • Tampering localization has not been made
8. Bakas et al. (2018) [27]. Method/features: 3D-CNN. Tampering identified: insertion, deletion, duplication. Dataset: UCF101: 9000 videos. Results: DA 97% (average).
  • Precisely detects insertion and deletion tampering
  • Cannot detect tampering if more than 20 frames are duplicated
  • Unable to localize tampered frames
9. Long, Basharat et al. (2019) [5]. Method/features: I3D along with ResNet152. Tampering identified: duplication. Dataset: MFC-18, VIRAT: 12, iPhone-4: 17 videos. Results: AUC 84.05/81.46.
  • The method can distinguish a duplicated frame range from its corresponding original frame range
  • Poor localization
10. Fadl, Sondos, et al. (2020) [11]. Method/features: temporal average (TP), edge change ratio (ECR), and GLCM. Tampering identified: duplication, duplication with shuffling. Dataset: 51 videos from SULFA, LASIESTA, and IVY Lab. Results: P 0.99/0.95, R 0.98/0.98.
  • Method is computationally efficient
  • Localization of tampering has not been made
  • No cross-dataset validation
11. Kharat et al. (2020) [12]. Method/features: motion vectors and SIFT features used with the random sample consensus algorithm to locate tampering. Tampering identified: duplication. Dataset: 20 videos from YouTube Movies. Results: P 99.9, R 99.7, DA 99.8.
  • Method performs better for compressed and uncompressed videos
  • A small dataset is used to test the performance of the model
  • No cross-dataset validation
12. Fadl et al. (2021) [28]. Method/features: 2D-CNN with multi-class support vector machine (MSVM). Tampering identified: insertion, deletion, duplication. Dataset: 13,135 videos from SULFA, VIRAT, LASIESTA, and IVY. Results: DA 99.9/98.7/98.5.
  • High DA when frames are manipulated in multiples of 10
  • Localization is not precise
13. Alsakar et al. (2021) [25]. Method/features: correlation with an arbitrary number of core tensors. Tampering identified: insertion, deletion. Dataset: 18 videos taken from the TRACE library. Results: P 96/92, R 94/90, F1 95/91.
  • Can detect insertion and deletion tampering for static as well as dynamic single-shot videos
  • No cross-dataset validation
14. Panchal et al. (2023) [54]. Method/features: sets of video quality assessment attributes are selected and multiple linear regression is applied. Tampering identified: deletion. Dataset: developed using 80 videos of TDTVD, SULFA, UCF-101, and VTD. Results: DA 96.25%.
  • Effectively detects single and multiple deletions in a video
  • Identifies only one type of tampering
  • No cross-dataset validation
15. Shehnaz and Kaur (2024) [13]. Method/features: HoG with LBP. Tampering identified: duplication, deletion, insertion. Dataset: developed tampered dataset using VTD and SULFA. Results: P 99.4, R 99.2, DA 99.6, F1 99.5.
  • Detects inter-frame tampering with high accuracy
  • Cannot identify the location of frame duplication tampering
16. Akhtar et al. (2024) [19]. Method/features: 2D-CNN with autoencoder and LSTM/GRU. Tampering identified: insertion, deletion. Dataset: developed CSVTED dataset of 2555 videos. Results: P 98.77/84.59, R 98.99/94.20, DA 98.98/94.18, F1 98.87/89.05.
  • Effective in detecting and locating tampered frames
  • Cannot detect tampering in the presence of a scene change in a clip

3. Proposed Method

This research focuses on developing a technique to identify and localize different types of video inter–frame tampering. Initially, we begin by defining the problem, followed by an overview of the proposed method.

3.1. Problem Formulation

The problem is to detect whether a surveillance video has been tampered with or not. If a video is found to be tampered with, then the region that has been altered by duplicating or inserting frames must be located. Formally, let $x$ be a surveillance video consisting of $t$ frames, each of resolution $r \times c$, i.e., $x \in \mathbb{R}^{r \times c \times t}$. The task is to determine whether $x$ has been tampered with by duplicating or inserting frames. If tampering is detected, the next step is to locate the region in $x$ where the duplication or insertion occurred. Let $Y = \{authentic, tampered\}$. We formulate the detection problem as a 2–class classification problem and design a classifier $f: \mathbb{R}^{r \times c \times t} \to Y$ such that $f$ predicts the label $y \in Y$ for any $x \in \mathbb{R}^{r \times c \times t}$, i.e., $f(x; \theta) = y$, where $\theta$ represents the learnable parameters of the model $f$.
Once it is established that $x \in \mathbb{R}^{r \times c \times t}$ has been tampered with by duplication or insertion, it is necessary to pinpoint the precise location of the tampered frames. Let $T$ denote the localization mapping for locating the inter–frame tampering. The tampered video $x$ is taken as input and the tampered region is identified, i.e., $T(x) = I$, where $I$ represents the indices of the frames locating the duplicated or inserted frames.

3.2. Stage 1: Proposed Method for Detection

We design the mapping f for detection as a composition of three mappings for stage 1, as follows:
$f(x; \theta) = \phi_3 \circ \phi_2 \circ \phi_1(x),$
where the mapping $\phi_1$ preprocesses $x$ to yield $z_1 \in \mathbb{R}^{r \times c \times t}$, $\phi_2$ takes $z_1$ as input and extracts the feature matrix $z_2 \in \mathbb{R}^{m \times t}$, and $\phi_3$ analyzes the features $z_2$ to predict the label $y$ of $x$.
A visual representation of the proposed method is depicted in Figure 2; it comprises two stages. Stage 1 designs $f$ using three blocks: the first block, which models $\phi_1$, focuses on preprocessing; a video is split into frames, each frame is converted to gray levels, and smoothing is applied. The second block specifies $\phi_2$, which extracts features based on the texture of the motion residual component, where the MR–LBP of all frames is calculated. The final block models the mapping $\phi_3$, which identifies suspicious frames (suspicious frames are the frames that may have been duplicated or inserted). Further details are presented in the following sections.

3.2.1. Preprocessing

When a video undergoes frame duplication or frame insertion, it gives rise to motion inconsistency at the start and end points of the duplicated/inserted region in a video. The key idea is to determine these inconsistencies by analyzing the difference between consecutive frames. For this purpose, a video is split into frames, i.e., $x = (f_1, f_2, \ldots, f_t) \in \mathbb{R}^{r \times c \times t}$, and each frame is converted to gray levels, i.e.,
$x' = \psi_1(x),$
where $\psi_1$ converts each frame $f_i$ of $x$ to gray levels, $x' = (f'_1, f'_2, \ldots, f'_t)$. Next, we de-noise each frame. Gaussian filtering is known for its noise–reducing and smoothing capabilities; it also meets real–time video processing requirements [55,56,57]. It was observed through experiments that a 2D–Gaussian filter of size 3 × 3 does not suppress the noise well, and 7 × 7 causes over–smoothing, so a 2D–Gaussian filter of size 5 × 5 is applied for smoothing, i.e.,
$\tilde{x} = \psi_2(x'),$
where $\psi_2$ filters each gray–level frame $f'_i$ of $x'$ to generate the smoothed frames $\tilde{x} = (\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_t)$.
The operations mentioned above define the mapping $\phi_1$, which takes the video $x$ as input and produces $z_1 \in \mathbb{R}^{r \times c \times t}$, i.e.,
$z_1 = [\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_t] = \phi_1(x) = \psi_3 \circ \psi_2 \circ \psi_1(x),$
where the mapping $\psi_1$ divides the input video into $t$ gray–level frames and $\psi_3$ represents the concatenation operation.
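As a concrete illustration, the preprocessing mapping $\phi_1$ could be sketched as follows. This is a minimal sketch assuming OpenCV is available; the function and variable names are illustrative, not the authors' code.

```python
import cv2

def preprocess_video(path):
    """phi_1: split a video into frames, convert each frame to gray levels,
    and smooth it with a 5x5 2D Gaussian filter."""
    cap = cv2.VideoCapture(path)
    smoothed = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # psi_1: gray levels
        blur = cv2.GaussianBlur(gray, (5, 5), 0)         # psi_2: 5x5 Gaussian
        smoothed.append(blur)
    cap.release()
    return smoothed  # z_1 = [f~_1, ..., f~_t]
```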

3.2.2. Feature Extraction

Original videos exhibit strong consistency in the temporal domain. Temporally adjacent video frames of an original video have almost similar visual and semantic contents [58]; as such, the difference in adjacent video frames does not exhibit significant changes in texture patterns. To leverage this idea, first, the motion residual (MR) is computed between consecutive frames, i.e.,
$P_i = \mathcal{P}_1^i(z_1) = \tilde{f}_i(x, y) - \tilde{f}_{i-1}(x, y), \quad i = 2, 3, \ldots, t,$
where the mapping $\mathcal{P}_1$ extracts the MR image $P_i$ from each pair of consecutive smoothed gray–level images of $z_1$; each image is of size $r \times c$. Then, the local binary pattern of the MR is computed, and the distribution of LBPs is estimated using a histogram to estimate the changes in texture patterns, as follows:
$P'_i = \mathcal{P}_2^i(P_i),$
where $\mathcal{P}_2$ generates the LBP–encoded frame of $P_i$ and computes its histogram $P'_i \in \mathbb{R}^m$; each histogram is of dimension $m$. The LBP is simple to compute and has low computational complexity and rotational invariance [59,60,61]. These characteristics of the LBP motivated us to use LBP features to detect frame–based tampering. MR–LBP features show high peaks at the start and end points of the duplicated or inserted frames, leading to anomalies, as shown in Figure 3. The operations mentioned above define the mapping $\phi_2$, i.e.,
$z_2 = \phi_2(z_1) = \mathcal{P}_2 \circ \mathcal{P}_1(z_1),$
where $\phi_2$ takes $z_1$ as input and produces the feature matrix $z_2 = [P'_2, P'_3, \ldots, P'_t] \in \mathbb{R}^{m \times (t-1)}$ of consecutive pairs.
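The feature extraction mapping $\phi_2$ can be sketched in the same spirit. The text specifies only a 256–dimensional MR–LBP histogram, so the sketch below assumes the standard 8–neighbour LBP code from scikit-image and an absolute frame difference; names and choices beyond that are illustrative assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def mr_lbp_features(smoothed_frames):
    """phi_2: compute the motion residual (MR) of each consecutive frame pair
    and encode it with LBP; the 256-bin histogram is the per-frame feature."""
    feats = []
    for i in range(1, len(smoothed_frames)):
        prev = smoothed_frames[i - 1].astype(np.int16)
        curr = smoothed_frames[i].astype(np.int16)
        mr = np.abs(curr - prev).astype(np.uint8)                  # P_i
        lbp = local_binary_pattern(mr, P=8, R=1, method="default") # LBP codes 0..255
        hist, _ = np.histogram(lbp, bins=256, range=(0, 256))      # P'_i, m = 256
        feats.append(hist.astype(np.float32))
    return np.stack(feats)  # z_2 of shape (t-1, 256)
```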

3.2.3. Classification

The MR–LBP features $z_2 = [P'_2, P'_3, \ldots, P'_t] \in \mathbb{R}^{m \times (t-1)}$ of the input video $x$ computed in the feature extraction phase are then passed one by one to a support vector machine (SVM) with a linear kernel to detect suspected tampered frames. The SVM predicts each MR–LBP feature $P'_i$ as authentic or tampered. If at least two MR–LBP features are predicted as tampered, then video $x$ is suspected to be tampered with. Formally, it is accomplished by the mapping $\chi_1$ as follows:
$(y_2, y_3, \ldots, y_t) = \chi_1(P'_2, P'_3, \ldots, P'_t; \theta),$
where $\chi_1$ represents the SVM with linear kernel, $\theta$ represents the learnable parameters of the SVM, and $y_i \in \{authentic, tampered\}$. The vector of predictions $(y_2, y_3, \ldots, y_t)$ is passed to the mapping $\chi_2$, which counts the number of $tampered$ labels and declares $x$ as suspected tampering if the number is at least 2, i.e.,
$y = \chi_2(y_2, y_3, \ldots, y_t),$
where $y \in \{authentic, tampered\}$. Finally, these two mappings define the classifier $\phi_3$ as follows:
$F_1 = \phi_3(z_2) = \chi_2 \circ \chi_1(z_2).$
For the training of $\chi_1$, i.e., the SVM with linear kernel, we used 450 tampered videos. In the testing phase, the trained model $\chi_1$ takes the MR–LBP features of every two adjacent frames $\tilde{f}_{i-1}$ and $\tilde{f}_i$ and determines whether $f_i$ is a suspected tampered frame that leads to sudden spikes. In both types of tampering, duplication and insertion, the suspected tampered frames are either start or end points of the duplicated/inserted region; there may be some false positives, which need further processing.
The features $z_2$ extracted with the previous mapping $\phi_2$, as shown in Figure 2, are fed to $\chi_1$ (SVM with linear kernel), which gives the inference corresponding to each frame of the video. The suspected tampered frames corresponding to the entire video are generated. Thus, classification is performed at the frame level.
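A minimal sketch of the frame-level classification ($\chi_1$) and the video-level decision rule ($\chi_2$) using scikit-learn is given below; the label encoding and function names are illustrative assumptions rather than the authors' implementation.

```python
from sklearn.svm import SVC

def train_frame_classifier(train_feats, train_labels):
    """chi_1: linear-kernel SVM trained on MR-LBP features of frame pairs
    (label 1 = tampered transition, 0 = authentic)."""
    clf = SVC(kernel="linear")
    clf.fit(train_feats, train_labels)
    return clf

def classify_video(clf, video_feats):
    """chi_2: a video is suspected tampered if at least two of its
    MR-LBP features are predicted as tampered."""
    preds = clf.predict(video_feats)                            # y_2, ..., y_t
    suspected = [i + 2 for i, p in enumerate(preds) if p == 1]  # frame numbers (row 0 -> frame 2)
    label = "tampered" if len(suspected) >= 2 else "authentic"
    return label, suspected                                     # (y, F_1)
```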

3.3. Stage 2: Proposed Method for Localization

MR–LBP features help to detect suspected tampered videos, but removing false positives and precisely localizing the tampering require more detailed analysis.
For stage 2, we define the mapping $T$, which is composed of the following three mappings:
$T(x) = L_3 \circ L_2 \circ \phi_1(x),$
where the mapping $\phi_1$ is the same as in stage 1 and yields $z_1 = [\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_t] \in \mathbb{R}^{r \times c \times t}$; $L_2$ takes $z_1$ as input and extracts $t$ features $u_i \in \mathbb{R}^d$, each of dimension $d$; and $L_3$ analyses the features $u_i$ to determine the precise tampered positions.
Stage 2 has two additional mappings, represented as two blocks in Figure 2. The mapping $L_2$ extracts features based on optical flow. The mapping $L_3$ detects the potential frames bounding the tampered region. We remove the false positives by taking the intersection of the frames detected by $L_3$ and $\chi_1$ in the two stages and determining the frames bounding the tampered region. Further details are presented in the following sections.

3.3.1. Optical Flow Aggregation Calculation

Videos can differ widely in various aspects such as resolution, quality, camera movement, and lighting. This variability makes it challenging to extract robust features that remain unaffected by these variations, requiring the careful selection of features and preprocessing techniques. To tackle this problem, we developed an effective method for extracting discriminative features. OF serves as an appropriate feature since brightness variations are consistent in slow– and fast–motion videos. Thus, motivated by the method described in Ref. [15], we propose two steps: OF extraction and OF aggregation. Jia et al. [15] used the OF sum and OF correlation to detect frame duplication only, by setting some threshold values. Correlation–based approaches are computationally intensive and time–consuming [62,63]. To determine the precise location of tampering and to reduce false positives, we employ the OF sum consistency. Let $x$ be a tampered video with $t$ frames. For the frame $\tilde{f}_i$, the absolute values of the OF components $OX_i$ and $OY_i$ at each pixel $(m, n)$ are added to yield the OF sum $s_i$. A sequence of OF sums $S = (s_2, s_3, \ldots, s_t)$ corresponding to a suspected tampered video is obtained, i.e.,
$s_i = G_1(\tilde{f}_i) = \sum_{m=1}^{width} \sum_{n=1}^{height} \left( \left| OX_i(m, n) \right| + \left| OY_i(m, n) \right| \right), \quad i = 2, 3, \ldots, t,$
where $G_1$ takes the preprocessed frames of a video and computes the absolute OF sums $s_i$, $i = 2, 3, \ldots, t$.
Due to the regularity and continuity of motion in authentic videos, the sequence $S$ exhibits consistency, showing no prominent spikes, but this consistency is disturbed when frames are inserted/duplicated in the video. Tampering causes larger differences in $s_i$ between adjacent frames, therefore leading to anomalies in $S$. A frame $\tilde{f}_i$ is in a tampered position if $s_i$ exhibits a sudden spike. To identify a sudden spike, we consider the $s_i$ values in a small window of $2T$ around $s_i$ and compute the mean $\bar{s}_i$ as follows:
$\bar{s}_i = G_2(\tilde{f}_i) = \frac{1}{2T} \sum_{k=1}^{T} (s_{i-k} + s_{i+k}),$
where $T$ is the window size determining the number of adjacent frames, and $G_2$ takes the preprocessed frame $\tilde{f}_i$ and computes $\bar{s}_i$, the mean of the OF sums of the neighboring frames of $\tilde{f}_i$ within the specified window. To keep the complexity low, four (with $T = 2$) adjacent neighboring frames of $\tilde{f}_i$ are taken to compute the mean value. Then, the rate of change of $s_i$ of the frame $\tilde{f}_i$ with respect to its neighboring frames is computed as follows:
$G_3(s_i, \bar{s}_i) = \beta_i = s_i / \bar{s}_i.$
The mapping $G_3$ takes the $s_i$ and $\bar{s}_i$ of the $i$th frame $\tilde{f}_i$ and calculates the fluctuation extent $\beta_i$. The above operations define the mapping $L_2$, i.e.,
$L_2 = G_3 \circ G_2 \circ G_1(\tilde{f}_i), \quad i = 2, 3, \ldots, t,$
where the mapping $L_2$ takes the video frames $(\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_t)$ and generates a sequence of fluctuation extents $\beta_i \in \mathbb{R}$, one for each frame of the video. As an illustration, the plots of the sequences of fluctuation extents $\beta_i$ for frame duplication and insertion tampering in a video are shown in Figure 4. In an original video, small fluctuations in $\beta_i$ are due to the movement of objects. However, spikes are dominant at the beginning and end points of the duplicated and inserted regions because the consistency of the OF is destroyed by the duplication/insertion. These spikes can be detected to pinpoint the tampered frames.
A larger value of $\beta_i$ represents an abnormal spike in the OF aggregation sequence. The $N$ frames with the highest values of $\beta_i$ are selected, i.e.,
$L_3 = F_0(B),$
where $F_0$ takes the sequence $B = \{\beta_2, \beta_3, \ldots, \beta_t\}$ of extents and selects the $N$ frames with the highest $\beta_i$ values to locate tampering, which are stored in the set $F_2$. The intersection of the set $F_2$ of frames extracted in stage 2 and the set $F_1$ of suspected tampered frames identified in stage 1 is then determined, i.e., $F = F_1 \cap F_2$. The set $F$ contains the frames that mark the boundaries of the duplicated or inserted regions, specifically the start and end frames of the tampered frame sequences. This detection/localization process was performed with varying values of $N$, and the best results were achieved with $N = 4$; the value of the parameter $N$ was selected empirically.
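The following sketch illustrates the OF aggregation and spike selection described above. The paper extracts OF with the Lucas–Kanade algorithm; to obtain a compact per-pixel sum, this sketch substitutes OpenCV's dense Farneback flow, so it is an illustrative approximation under stated assumptions rather than the authors' implementation.

```python
import numpy as np
import cv2

def of_aggregation(smoothed_frames, T=2, N=4):
    """Stage 2 (L_2, L_3): OF sums s_i, windowed means, fluctuation extents
    beta_i = s_i / mean(s), and the N frames with the largest beta_i."""
    s = [None]  # placeholder so s[i] is the OF sum between frames i-1 and i
    for i in range(1, len(smoothed_frames)):
        flow = cv2.calcOpticalFlowFarneback(
            smoothed_frames[i - 1], smoothed_frames[i], None,
            0.5, 3, 15, 3, 5, 1.2, 0)
        s.append(np.abs(flow[..., 0]).sum() + np.abs(flow[..., 1]).sum())  # s_i
    beta = {}
    for i in range(1 + T, len(s) - T):
        s_bar = sum(s[i - k] + s[i + k] for k in range(1, T + 1)) / (2 * T)
        beta[i] = s[i] / s_bar if s_bar > 0 else 0.0                       # beta_i
    F2 = set(sorted(beta, key=beta.get, reverse=True)[:N])  # top-N spikes (frame indices)
    return F2, beta

# False positives are removed by intersecting with the stage-1 output:
# F = F1 & F2 marks the start/end frames of the duplicated/inserted region.
```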

3.3.2. Standard Deviation

We also tested another technique for the localization of video tampering. The key idea is to exploit the variation in statistical properties of the MR–LBP features of a frame, which is accomplished by computing the standard deviation. Standard deviation is used to quantify the amount of variation in a set of values. N–frames with the highest values of standard deviation are extracted, i.e.,
$\sigma_i = F_1(P'_i), \quad i = 2, 3, \ldots, t,$
where $F_1$ takes the MR–LBP features of each frame of a video one by one, computes the set $S_0 = \{\sigma_2, \sigma_3, \ldots, \sigma_t\}$ of standard deviations, and extracts the set $F_3$ of the $N$ frames with the highest values of $\sigma_i$. The intersection of the frames extracted in this stage with the set $F_1$ of suspected tampered frames identified in stage 1 is taken, i.e., $F = F_1 \cap F_3$. The set $F$ contains the frames bounding the duplicated/inserted region. The detection of the tampered position was carried out with different values of $N$, and the best results were obtained with $N = 4$; the value of the parameter $N$ was selected empirically.
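A minimal sketch of this alternative localization step, assuming the MR–LBP feature matrix from stage 1 is available as a NumPy array; names are illustrative.

```python
import numpy as np

def std_localization(mr_lbp_feats, F1, N=4):
    """Alternative stage 2: per-frame standard deviation of the MR-LBP feature;
    the N frames with the largest sigma_i are intersected with the stage-1
    suspects F1 to bound the tampered region."""
    sigma = mr_lbp_feats.std(axis=1)            # sigma_i per frame pair
    top = np.argsort(sigma)[::-1][:N]           # rows with the N largest values
    F3 = {int(i) + 2 for i in top}              # shift row index to frame number (i = 2..t)
    return set(F1) & F3                         # F = F1 ∩ F3
```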

4. Evaluation Protocols

4.1. Experimental Setup

Numerous experiments were performed on videos that had undergone frame duplication and frame insertion tampering. The results of the experiments are summarized in this section. The experiments were conducted on a notebook computer with an NVIDIA RTX 2070 GPU, 32 GB of RAM, and a Core i7 processor. The code was implemented in Python 3.8.12.

4.2. Dataset Description and Preparation

Since no benchmark datasets are available for the extensive evaluation of frame duplication and insertion detection in surveillance videos [64], in this study, a dataset is developed, named the CSVTED, which includes a variety of challenging videos with different qualities of tampering and varying levels of complexity. The videos include simple to complex backgrounds and single/multiple objects in slow and fast motion, and the videos encompass various lighting conditions. Multiple camera models are used to capture videos in indoor and outdoor environments. The CSVTED encompasses all types of inter-frame tampering including frame insertion and duplication. The frame rate of the videos is between 12 and 30 fps with durations ranging from 5.648 to 75 s. Post tampering, the videos depict natural scenes and are provided in standard formats such as mov, mp4, or avi. Details of the dataset are presented in [19]. Figure 5 illustrates a few frames taken from the CSVTED. Figure 6 provides an example of frame insertion tampering obtained from the CSVTED. We conducted extensive experiments on surveillance videos with variable duplicated/inserted frames ranging from 10 to 50 with increments of 5. A total of 450 tampered videos were used for these experiments, out of which 225 videos had gone through frame duplication tampering and 225 through insertion tampering. The details are shown in Table 2.

4.3. Evaluation Procedure

To assess the performance of the proposed method, we divided our dataset into 80% for training and 20% for testing. The method was then compared with similar tampering detection techniques reported in the literature based on their detection and localization capabilities.
In the next phase, we conducted cross-dataset validation to assess the generalization capability of the proposed model across different datasets. Unlike traditional validation, which uses the same dataset for training and testing, cross–dataset validation tests the model on a separate, independent dataset that is not part of the training.
Various parameters are computed for comparison: precision rate (PR), recall rate (RR), F1 score, and detection accuracy (DA) [12,38,52], as follows:
$PR = \frac{TP}{TP + FP}, \quad RR = \frac{TP}{TP + FN}, \quad DA = \frac{TP + TN}{TP + TN + FP + FN},$
where TP, TN, FP, and FN represent “tampered is detected as tampered”, “authentic is detected as authentic”, “authentic is detected as tampered”, and “tampered is detected as authentic”, respectively. The higher value of DA demonstrates a good detection rate by the proposed method.
$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}.$
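As an illustrative helper (not from the paper), these four measures can be computed directly from the frame-level confusion counts:

```python
def detection_metrics(tp, tn, fp, fn):
    """Precision (PR), recall (RR), detection accuracy (DA), and F1
    from TP/TN/FP/FN counts, as defined above."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    rr = tp / (tp + fn) if tp + fn else 0.0
    da = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * pr * rr / (pr + rr) if pr + rr else 0.0
    return pr, rr, da, f1
```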

5. Experimental Results

We performed extensive experiments for frame duplication and insertion detection. The results are presented in the following sections.

5.1. Frame Duplication Detection

5.1.1. Impact of Using OF Aggregation Only

The impact of using OF–based features only for tampering detection is shown in Figure 7. It represents the values of β for a video titled “CCTV Bikes in Fog” taken from the CSVTED. Due to high peaks in the β values, frames 31, 126, and 129 (shown by red circles) may wrongly be declared as tampered frames, whereas frame 110 (shown by a green circle) is the actual tampered point with a low value of β. This indicates that OF aggregation features are not enough to detect and precisely localize tampering. Table 3 illustrates the results when only one type of feature is utilized for the detection and localization of frame duplication.

5.1.2. Impact of Using MR–LBP Only

If a video has fast–moving objects, there is obviously a large displacement in the position of objects between successive frames, causing the MR to fluctuate rapidly even in the absence of tampering. This rapid change in the MR gives rise to a change in the MR–LBP. Such videos may be mistakenly declared as tampered, increasing false positives, as shown in Figure 8a and explained in Section 5.1.3. Table 3 presents the results when only MR–LBP features are used for detecting inter–frame tampering.

5.1.3. Impact of Using MR–LBP and OF Aggregation

To resolve the issue of false positives and for the precise localization of tampering, we employed the OF feature in stage 2 of the proposed method. OF computes the brightness variation among consecutive frames of a video, which tends to be more uniform in fast–motion videos. The Lucas–Kanade Optical Flow algorithm, introduced by Lucas and Kanade [65], is widely used for extracting OF vectors due to its simple application, rapid computation, and robustness under noise [66]. When a video undergoes frame duplication or insertion, the OF patterns show remarkable abnormalities that may not be visible to the naked eye, but can be detected using suitable techniques.
An example is presented in Figure 8, which represents the result when MR–LBP features are used to detect frame duplication tampering. The MR–LBP features show high peaks in Figure 8a for frames 61, 92, and 102 of a video titled “CCTV Moving Vehicles” from the CSVTED.
The example indicates that frame 61 can be wrongly declared as tampered due to high peaks in the MR–LBP features, but these high peaks are the result of fast–moving objects, while the actual tampering positions are frames 92 and 102 (shown by green circles). These false positives are reduced by OF–based features; Figure 8b represents the OF aggregation β –values (explained in Section 3.3.1) of the same video; the low value of β at frame 61 (shown by the red circle) declared it as an authentic frame, while the high peaks of β at frames 92 and 102 represent the correct tampering positions. In such a situation, OF–based features play a significant role in reducing false positives. Thus, the proposed forensic system, utilizing different features in the detection and localization stages, is more suitable for the detection of tampering in slow– and fast–motion videos.
A comparative analysis of the proposed techniques is presented in Table 3. Our methods were evaluated on a test set of 45 videos with different levels of complexity and tampering quality. The results demonstrate that the RR and DA of the OF–based features are good, but the proposed model 2 using MR–LBP with OF achieved an impressive 99.9% recall rate (RR) and 99.71% detection accuracy (DA), surpassing the other methods.

5.1.4. Cross–Dataset Evaluation

We conducted a cross–dataset evaluation to test the robustness of the proposed video tampering detection methods. The evaluation was performed on datasets acquired from various sources that were not part of the training and testing (see Section 4.3). Both of our techniques were evaluated on publicly available test sets from Ulutas et al. 2018 [38] and an event–object–person (EOP)–based dataset from Panchal et al. 2020 [67], ensuring a fair comparison. The results for the frame duplication tampering, as presented in Table 4, show that both models achieved high precision (PR), recall (RR), and detection accuracy (DA) on unknown datasets. The results demonstrate that our proposed method can not only detect and localize frame duplication tampering from videos captured by static/moving cameras with zoom–in and zoom–out settings, but also from event–object–person–based duplicated sequences.

5.2. Frame Insertion Detection

Table 5 presents and compares the PR, RR, DA, and F1 scores of our proposed techniques for detecting and localizing frame insertion tampering. Additionally, the effectiveness of using OF features for reducing false positives and for the precise localization of tampered regions is also demonstrated. When MR–LBP–based features are employed independently for the detection of suspicious tampered frames, the precision rates are 82.22% and 25.35%, respectively, indicating a high false positive rate. These false positives are reduced by the proposed OF–based method. With model 1, we achieved precision, recall, and detection accuracy of 73.78%, 100%, and 99.67%, respectively. Model 2 significantly enhanced the tampering detection, resulting in an F1 measure of up to 92.88% and a DA of up to 99.87% for frame insertion tampering, reflecting the effectiveness of OF aggregation for the reduction of false positives and the precise localization of tampering. The results also highlight the superiority of the proposed technique in detecting and locating the tampered region, showing the best results in terms of precision, recall, and F1 score.

Cross–Dataset Evaluation

To evaluate the robustness of the proposed methods, we tested them on a publicly available event–object–person (EOP)–based tampering dataset developed by Panchal et al. 2020 [67]. Table 6 shows that the results are promising: The PR, RR, F1, and DA of proposed model 1 using the EOP dataset were 93.75%, 100%, 96.77%, and 99.97%, respectively. Model 2 yielded even better performance with a precision of 100%, recall of 94.64%, F1 score of 97.25%, and detection accuracy of 99.98%. Achieving over 99.5% detection accuracy using our dataset as well as publicly available datasets featuring a variety of videos validates its strong generalization capability. Interestingly, the cross–dataset validation yielded even better results than those using our test dataset. This improvement may be due to the presence of a diversity of videos of different tampering quality and complexity levels in the CSVTED. This represents a significant achievement, as prior studies did not conduct cross–dataset validation.

6. Discussion

The proposed method, a two–stage technique that combines MR–LBP with optical flow (OF), can not only detect and pinpoint the tampering region by reducing false positives, but also exhibits higher sensitivity, particularly in detecting smaller numbers of tampered frames. To ensure the system’s flexibility regarding the number of duplicated and inserted frames, it was tested on videos with variable duplicated/inserted frames ranging from 10 to 50 with increments of 5. Its ability to detect subtle manipulations is presented in Figure 9. We selected a video x (Customer Dealing.mp4) taken from the CSVTED, which was tampered with via the insertion of 10 frames, i.e., from frame 38 to frame 47. Figure 9a shows the SSIM curve exploited in Ref. [28], in which there is no sharp peak to detect and localize the frame insertion. However, Figure 9b shows high peaks in the values of OF aggregation, pinpointing the start and end point of the frame insertion. The proposed method does not enforce any restriction on the minimum number of frames that must be duplicated or inserted. It is capable of detecting the tampering of as few as 10 frames. Unlike existing methods [12,15,37,38,39] that only focus on detecting one form of video tampering, the proposed approach can detect and localize both types of tampering, i.e., duplication and insertion.
To assess the execution time, we measured the time per frame and per pixel across 90 test videos for frame duplication and insertion detection; the results are provided in Table 7. A comparative analysis with the existing methods [15,38,52] revealed that the proposed methods achieved an execution time of 3.3 microseconds (µs) per pixel for model 1 and 3.6 µs for model 2. Although the running time of the proposed methods is not shorter than those in Refs. [15,38,52] for frame duplication detection, the higher detection accuracy underscores the wide applicability, strong generalization, and good robustness of the proposed method.
In terms of frame insertion detection, the proposed model 1 performed about three times faster than the state–of–the–art method used in Ref. [25], reducing the execution time to roughly 32% of that method’s, as presented in Table 7. Specifically, model 1 requires 3.39 s per frame, while the method in Ref. [25] takes 10.75 s per frame to detect frame insertion. Although the deep learning–based model proposed by Akhtar et al. [19] demonstrates a shorter execution time, the accuracy of the proposed model is much higher in comparison. Moreover, deep learning–based methods are data–driven, and the lack of large, publicly available benchmark datasets limits their applicability. The combination of high detection accuracy and efficiency in the proposed methods highlights their effectiveness in detecting and localizing instances of video tampering.
The proposed methods were compared with other state–of–the–art approaches that utilize hand–crafted features [15,25,38,52,68] and deep learning features [19,28]. Fadl’s deep learning–based method [28] achieved high detection accuracy on its own test dataset of 12 videos. However, it has some limitations. First, the dataset was developed under the assumption that frame duplication is performed by inserting frames in the static parts of the video. Second, it is unable to detect tampering when the tampered region contains fewer than 25 frames. Third, the method performs well only when the tampered frames are multiples of 10 and are inserted at positions that are also multiples of 10. Last, its localization is not precise, and no validation was conducted across different datasets. In contrast, the proposed method was evaluated on a set of 45 challenging tampered videos featuring varying levels of complexity and, as mentioned in Table 2, the proposed method outperformed existing techniques. Ulutas et al. [52], Bozkurt et al. [39], Alsakar et al. [25], and Kharat et al. [12] used much smaller datasets of only 10, 13, 18, and 20 videos, respectively. Achieving very high accuracy on a particularly small dataset portrays video tampering detection as a solved problem, which may discourage other researchers from publishing better results on larger datasets. Testing on a larger dataset suggests that the proposed method is better suited for handling a wider range of scenarios or more complex data.
Table 8 summarizes and compares the performance of the proposed algorithms with the state–of–the–art techniques in terms of detection accuracy, F1 measure, cross–dataset validation, and size of the training/testing dataset. The proposed method benefited from the false positive reduction strategy and achieved higher detection and localization accuracy and stronger robustness than other techniques.
To assess the performance of the developed methods, it was necessary to evaluate them on unknown datasets that are freely accessible to the public. A key distinction of the proposed method is its cross–validation performance, where it achieved over 99.5% detection accuracy on unknown datasets, ensuring its robustness. None of the other methods report cross–validation results, indicating a limitation in the robustness of these techniques.
Figure 10 illustrates the comparison of detection accuracy for inter–frame tampering. The proposed model 2 exhibits detection accuracies of 99.71% and 99.87% for frame duplication and insertion, respectively, even in the presence of a smaller number of duplicated/inserted frames. This highlights the method’s ability to detect both types of tampering with excellent accuracy. It also shows a notable improvement over existing methods [12,15,25,38,39,52,69], which typically target a single type of inter–frame tampering.
The proposed system has several advantages and limitations. First, it addresses multiple types of inter–frame tampering within a single model, allowing each video frame to be processed independently. This minimizes the need for extensive computational resources and avoids the complexity of processing the entire video sequence at once. Second, the MR–LBP features coupled with OF aggregation to reduce false positives enhance the robustness of the tampering detection system. Third, the proposed technique is computationally efficient, with a per–pixel processing time of a few microseconds. Last, by leveraging the alterations in texture patterns and optical flow between consecutive frames, the system achieves superior performance compared to state–of–the–art techniques in detecting and localizing inter–frame tampering. It imposes no restrictions on the video format, frame rate, capturing devices, or number of tampered frames, and can detect tampering in as few as ten frames. However, the system has certain limitations. First, as it is primarily designed for surveillance videos, it cannot detect tampering if a scene change occurs in the video. Second, the proposed method is unable to detect tampering in the presence of large static scenes. Additional work is required to address these limitations.

7. Conclusions

In the field of multimedia security, detecting and locating frame duplication (FD) and frame insertion (FI) tampering in surveillance videos poses significant challenges, particularly in legal contexts where tampered videos can mislead investigations. This paper introduced a robust two–stage detection system designed to address these challenges. The first stage uses motion–residual–based local binary pattern (MR–LBP) features to train a support vector machine that identifies suspicious tampered videos. In the second stage, either the aggregation of optical flow (OF) or the standard deviation of the MR–LBP features is used to reduce false positives and localize tampered regions more precisely. The proposed methods were evaluated on the CSVTED and cross–validated on two unknown datasets not included in the training or initial testing phases. The experimental results demonstrate the efficacy of the proposed method: the PR, RR, and DA of MR–LBP coupled with the OF–based method for frame duplication detection and localization were 99.81%, 99.90%, and 99.71%, respectively, outperforming state–of–the–art techniques [15,38,39,52,68]. To verify the system's flexibility with respect to the number of duplicated and inserted frames, it was tested on videos in which the number of duplicated/inserted frames ranges from 10 to 50 in increments of 5; the method imposes no restriction on the minimum number of frames that must be duplicated or inserted. Compared with hand–crafted and deep–feature–based methods, the proposed method has a distinct advantage in detecting and locating inter–frame tampering (frame duplication and insertion), attaining over 99.5% accuracy on unknown datasets, even with as few as ten tampered frames. It detects and localizes FD and FI tampering not only in videos captured by static or moving cameras with zoom–in and zoom–out settings, but also in event–, object–, and person–based tampered sequences, achieving high PR, RR, and DA values compared with other state–of–the–art methods. The execution time for frame insertion detection is reduced by 32% compared with recent state–of–the–art methods, highlighting the system's efficiency. Although the results are promising, the proposed method fails to detect anomalies when frames are removed from a static scene, and additional work is required to address this limitation. In the future, we will extend the framework to detect other types of inter–frame tampering, such as frame deletion and frame shuffling, and will focus on detecting tampering in videos that have undergone multiple types of tampering attacks.
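The second–stage idea summarized above can be illustrated with the following sketch: dense optical flow is computed for each consecutive frame pair, its magnitude is aggregated into a single value β per pair, and abrupt spikes in β mark candidate tampering boundaries used to confirm or reject the first–stage detections. The Farneback flow, the sum–of–magnitudes aggregation, and the mean + k·σ spike test are assumptions made for illustration, not the exact formulation of the paper.

```python
# Hypothetical sketch of stage 2: optical-flow aggregation for localization
# (Farneback flow, magnitude-sum aggregation and the spike threshold are assumptions).
import cv2
import numpy as np

def of_aggregation(frames):
    """Aggregate dense optical-flow magnitude for every consecutive frame pair."""
    beta = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        beta.append(float(mag.sum()))        # aggregation value for this frame pair
        prev = curr
    return np.array(beta)

def localize_spikes(beta, k=3.0):
    """Flag frame-pair positions whose aggregation deviates sharply from the video's norm."""
    mu, sigma = beta.mean(), beta.std()
    return np.where(beta > mu + k * sigma)[0]   # candidate tampering boundaries

# Frame pairs flagged by the stage-1 SVM would be retained only if they coincide with
# these spikes, which is how false positives are filtered before reporting a location.
```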

Author Contributions

Conceptualization, M.H. and Z.H.; data curation and formal analysis, N.A.; investigation, N.A.; methodology, N.A.; supervision, M.H. and Z.H.; writing—original draft, N.A.; funding acquisition, M.H. and Z.H.; project administration, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Researchers Supporting Project, number (RSP2024R109), King Saud University, Riyadh, Saudi Arabia.

Data Availability Statement

The datasets generated and/or analyzed in the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest or competing interests.

References

  1. Akhtar, N.; Saddique, M.; Asghar, K.; Bajwa, U.I.; Hussain, M.; Habib, Z. Digital Video Tampering Detection and Localization: Review, Representations, Challenges and Algorithm. Mathematics 2022, 10, 168. [Google Scholar] [CrossRef]
  2. Nabi, S.T.; Kumar, M.; Singh, P.; Aggarwal, N.; Kumar, K. A comprehensive survey of image and video forgery techniques: Variants, challenges, and future directions. Multimed. Syst. 2022, 28, 939–992. [Google Scholar] [CrossRef]
  3. Mohiuddin, S.; Malakar, S.; Kumar, M.; Sarkar, R. A comprehensive survey on state-of-the-art video forgery detection techniques. Multimed. Tools Appl. 2023, 82, 33499–33539. [Google Scholar] [CrossRef]
  4. Huang, C.C.; Zhang, Y.; Thing, V.L.L. Inter-frame video forgery detection based on multi-level subtraction approach for realistic video forensic applications. In Proceedings of the IEEE 2nd International Conference on Signal and Image Processing (ICSIP), Singapore, 4–6 August 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar]
  5. Long, C.; Basharat, A.; Hoogs, A.; Singh, P.; Farid, H. A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Forged Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 16–20 June 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  6. Johnston, P.; Elyan, E. A review of digital video tampering: From simple editing to full synthesis. Digit. Investig. 2019, 29, 67–81. [Google Scholar] [CrossRef]
  7. Wang, W.; Farid, H. Exposing digital forgeries in video by detecting duplication. In Proceedings of the 9th Workshop on Multimedia & Security, Dallas, TX, USA, 20–21 September 2007. [Google Scholar]
  8. Wang, Q.; Li, Z.; Zhang, Z.; Ma, Q. Video Inter-Frame Forgery Identification Based on Consistency of Correlation Coefficients of Gray Values. J. Comput. Commun. 2014, 2, 51–57. [Google Scholar] [CrossRef]
  9. Singh, G.; Singh, K. Video frame and region duplication forgery detection based on correlation coefficient and coefficient of variation. Multimed. Tools Appl. 2019, 78, 11527–11562. [Google Scholar] [CrossRef]
  10. Zhang, Z.; Hou, J.; Ma, Q.; Li, Z. Efficient video frame insertion and deletion detection based on inconsistency of correlations between local binary pattern coded frames. Secur. Commun. Networks 2015, 8, 311–320. [Google Scholar] [CrossRef]
  11. Fadl, S.; Megahed, A.; Han, Q.; Qiong, L. Frame duplication and shuffling forgery detection technique in surveillance videos based on temporal average and gray level co-occurrence matrix. Multimed. Tools Appl. 2020, 79, 17619–17643. [Google Scholar] [CrossRef]
  12. Kharat, J.; Chougule, S. A passive blind forgery detection technique to identify frame duplication attack. Multimed. Tools Appl. 2020, 79, 8107–8123. [Google Scholar] [CrossRef]
  13. Shehnaz; Kaur, M. Detection and localization of multiple inter-frame forgeries in digital videos. Multimed. Tools Appl. 2024, 83, 71973–72005. [Google Scholar] [CrossRef]
  14. Feng, C.; Xu, Z.; Jia, S.; Zhang, W.; Xu, Y. Motion-adaptive frame deletion detection for digital video forensics. IEEE Trans. Circuits Syst. Video Technol. 2016, 27, 2543–2554. [Google Scholar] [CrossRef]
  15. Jia, S.; Xu, Z.; Wang, H.; Feng, C.; Wang, T. Coarse-to-Fine Copy-Move Forgery Detection for Video Forensics. IEEE Access 2018, 6, 25323–25335. [Google Scholar] [CrossRef]
  16. Zampoglou, M.; Markatopoulou, F.; Mercier, G.; Touska, D.; Apostolidis, E.; Papadopoulos, S.; Cozien, R.; Patras, I.; Mezaris, V.; Kompatsiaris, I. Detecting Tampered Videos with Multimedia Forensics and Deep Learning. In Proceedings of the International Conference on Multimedia Modeling, Thessaloniki, Greece, 8–11 January 2019; Springer: Cham, Switzerland, 2019. [Google Scholar]
  17. Johnston, P.; Elyan, E.; Jayne, C. Video tampering localisation using features learned from authentic content. Neural Comput. Appl. 2020, 32, 12243–12257. [Google Scholar] [CrossRef]
  18. Shelke, N.A.; Kasana, S.S. Multiple forgery detection in digital video with VGG-16-based deep neural network and KPCA. Multimed. Tools Appl. 2024, 83, 5415–5435. [Google Scholar] [CrossRef]
  19. Akhtar, N.; Hussain, M.; Habib, Z. DEEP-STA: Deep Learning-Based Detection and Localization of Various Types of Inter-Frame Video Tampering Using Spatiotemporal Analysis. Mathematics 2024, 12, 1778. [Google Scholar] [CrossRef]
  20. Wang, Q.; Li, Z.; Zhang, Z.; Ma, Q. Video inter-frame forgery identification based on optical flow consistency. Sens. Transducers 2014, 166, 229. [Google Scholar]
  21. Kingra, S.; Aggarwal, N.; Singh, R.D. Inter-frame forgery detection in H.264 videos using motion and brightness gradients. Multimed. Tools Appl. 2017, 76, 25767–25786. [Google Scholar] [CrossRef]
  22. Singh, R.D.; Aggarwal, N. Optical flow and prediction residual based hybrid forensic system for inter-frame tampering detection. J. Circuits, Syst. Comput. 2017, 26, 1750107. [Google Scholar] [CrossRef]
  23. Yu, L.; Wang, H.; Han, Q.; Niu, X.; Yiu, S.; Fang, J.; Wang, Z. Exposing frame deletion by detecting abrupt changes in video streams. Neurocomputing 2016, 205, 84–91. [Google Scholar] [CrossRef]
  24. Stamm, M.C.; Lin, W.S.; Liu, K.J.R. Temporal forensics and anti-forensics for motion compensated video. IEEE Trans. Inf. Forensics Secur. 2012, 7, 1315–1329. [Google Scholar] [CrossRef]
  25. Alsakar, Y.M.; Mekky, N.E.; Hikal, N.A. Detecting and Locating Passive Video Forgery Based on Low Computational Complexity Third-Order Tensor Representation. J. Imaging 2021, 7, 47. [Google Scholar] [CrossRef] [PubMed]
  26. Sitara, K.; Mehtre, B. A comprehensive approach for exposing inter-frame video forgeries. In Proceedings of the 2017 IEEE 13th International Colloquium on Signal Processing & its Applications (CSPA), Penang, Malaysia, 10–12 March 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar]
  27. Bakas, J.; Naskar, R. A Digital Forensic Technique for Inter-Frame Video Forgery Detection Based on 3D CNN. In Proceedings of the International Conference on Information Systems Security, Funchal, Portugal, 22–24 January 2018; Springer: Cham, Switzerland, 2018. [Google Scholar]
  28. Fadl, S.; Han, Q.; Li, Q. CNN spatiotemporal features and fusion for surveillance video forgery detection. Signal Process. Image Commun. 2021, 90, 116066. [Google Scholar] [CrossRef]
  29. Tyagi, S.; Yadav, D. A detailed analysis of image and video forgery detection techniques. Vis. Comput. 2023, 39, 813–833. [Google Scholar] [CrossRef]
  30. Long, C.; Smith, E.; Basharat, A.; Hoogs, A. A c3d-based convolutional neural network for frame dropping detection in a single video shot. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar]
  31. Subramanyam, A.V.; Emmanuel, S. Pixel estimation based video forgery detection. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; IEEE: Piscataway, NJ, USA, 2013. [Google Scholar]
  32. Qureshi, M.A.; Deriche, M. A bibliography of pixel-based blind image forgery detection techniques. Signal Process. Image Commun. 2015, 39, 46–74. [Google Scholar] [CrossRef]
  33. Fayyaz, M.A.; Anjum, A.; Ziauddin, S.; Khan, A.; Sarfaraz, A. An improved surveillance video forgery detection technique using sensor pattern noise and correlation of noise residues. Multimed. Tools Appl. 2020, 79, 5767–5788. [Google Scholar] [CrossRef]
  34. Kaur, H.; Jindal, N. Deep convolutional neural network for graphics forgery detection in video. Wirel. Pers. Commun. 2020, 112, 1763–1781. [Google Scholar] [CrossRef]
  35. Lin, G.-S.; Chang, J.-F.; Chuang, C.-H. Detecting frame duplication based on spatial and temporal analyses. In Proceedings of the 2011 6th International Conference on Computer Science & Education (ICCSE), Singapore, 3–5 August 2011; IEEE: Piscataway, NJ, USA, 2011. [Google Scholar]
  36. El-Shafai, W.; Fouda, M.A.; El-Rabaie, E.-S.M.; El-Salam, N.A. A comprehensive taxonomy on multimedia video forgery detection techniques: Challenges and novel trends. Multimed. Tools Appl. 2023, 83, 4241–4307. [Google Scholar] [CrossRef] [PubMed]
  37. Fadl, S.M.; Han, Q.; Li, Q. Authentication of surveillance videos: Detecting frame duplication based on residual frame. J. Forensic Sci. 2018, 63, 1099–1109. [Google Scholar] [CrossRef]
  38. Ulutas, G.; Ustubioglu, B.; Ulutas, M.; Nabiyev, V.V. Frame duplication detection based on BoW model. Multimed. Syst. 2017, 24, 549–567. [Google Scholar] [CrossRef]
  39. Bozkurt, I.; Ulutaş, G. Detection and localization of frame duplication using binary image template. Multimed. Tools Appl. 2023, 82, 31001–31034. [Google Scholar] [CrossRef]
  40. Singh, R.D.; Aggarwal, N. Video content authentication techniques: A comprehensive survey. Multimed. Syst. 2018, 24, 211–240. [Google Scholar] [CrossRef]
  41. Al-Sanjary, O.I.; Sulong, G. Detection of video forgery: A review of literature. J. Theor. Appl. Inf. Technol. 2015, 74, 208–220. [Google Scholar]
  42. Liao, S.-Y.; Huang, T.-Q. Video copy-move forgery detection and localization based on Tamura texture features. In Proceedings of the 6th International Congress on Image and Signal Processing (CISP), Hangzhou, China, 16–18 December 2013; IEEE: Piscataway, NJ, USA, 2013. [Google Scholar]
  43. Zhao, D.-N.; Wang, R.-K.; Lu, Z.-M. Inter-frame passive-blind forgery detection for video shot based on similarity analysis. Multimed. Tools Appl. 2018, 77, 25389–25408. [Google Scholar] [CrossRef]
  44. Bakas, J.; Naskar, R.; Dixit, R. Detection and localization of inter-frame video forgeries based on inconsistency in correlation distribution between Haralick coded frames. Multimed. Tools Appl. 2019, 78, 4905–4935. [Google Scholar] [CrossRef]
  45. Shelke, N.A.; Kasana, S.S. Multiple forgeries identification in digital video based on correlation consistency between entropy coded frames. Multimed. Syst. 2022, 28, 267–280. [Google Scholar] [CrossRef]
  46. Chen, S.; Tan, S.; Li, B.; Huang, J. Automatic detection of object-based forgery in advanced video. IEEE Trans. Circuits Syst. Video Technol. 2015, 26, 2138–2151. [Google Scholar] [CrossRef]
  47. Huang, C.C.; Lee, C.E.; Thing, V.L.L. A Novel Video Forgery Detection Model Based on Triangular Polarity Feature Classification. Int. J. Digit. Crime Forensics 2020, 12, 14–34. [Google Scholar] [CrossRef]
  48. Kumar, V.; Gaur, M. Multiple forgery detection in video using inter-frame correlation distance with dual-threshold. Multimed. Tools Appl. 2022, 81, 43979–43998. [Google Scholar] [CrossRef]
  49. Huang, T.; Zhang, X.; Huang, W.; Lin, L.; Su, W. A multi-channel approach through fusion of audio for detecting video inter-frame forgery. Comput. Secur. 2018, 77, 412–426. [Google Scholar] [CrossRef]
  50. Raskar, P.S.; Shah, S.K. VFDHSOG: Copy-move video forgery detection using histogram of second order gradients. Wirel. Pers. Commun. 2022, 122, 1617–1654. [Google Scholar] [CrossRef]
  51. Voronin, V.; Sizyakin, R.; Zelensky, A.; Nadykto, A.; Svirin, I. Detection of deleted frames on videos using a 3D convolutional neural network. In Proceedings of the Counterterrorism, Crime Fighting, Forensics, and Surveillance Technologies II, Berlin, Germany, 10–11 September 2018; SPIE: Bellingham, WA, USA, 2018. [Google Scholar]
  52. Ulutas, G.; Ustubioglu, B.; Ulutas, M.; Nabiyev, V. Frame duplication/mirroring detection method with binary features. IET Image Process. 2017, 11, 333–342. [Google Scholar] [CrossRef]
  53. Fadl, S.M.; Han, Q.; Li, Q. Inter-frame forgery detection based on differential energy of residue. IET Image Process. 2019, 13, 522–528. [Google Scholar] [CrossRef]
  54. Panchal, H.D.; Shah, H.B. Multiple forgery detection in digital video based on inconsistency in video quality assessment attributes. Multimed. Syst. 2023, 29, 2439–2454. [Google Scholar] [CrossRef]
  55. Hsiao, P.-Y.; Chou, S.-S.; Huang, F.-C. Generic 2-D gaussian smoothing filter for noisy image processing. In Proceedings of the TENCON 2007—2007 IEEE Region 10 Conference, Taipei, Taiwan, 30 October–2 November 2007; IEEE: Piscataway, NJ, USA, 2007. [Google Scholar]
  56. Hsiao, P.-Y.; Chen, C.-H.; Chou, S.-S.; Li, L.-T.; Chen, S.-J. A parameterizable digital-approximated 2D Gaussian smoothing filter for edge detection in noisy image. In Proceedings of the 2006 IEEE International Symposium on Circuits and Systems, Kos, Greece, 21–24 May 2006; IEEE: Piscataway, NJ, USA, 2006. [Google Scholar]
  57. Deng, G.; Cahill, L. An adaptive Gaussian filter for noise reduction and edge detection. In Proceedings of the 1993 IEEE Conference Record Nuclear Science Symposium and Medical Imaging Conference, San Francisco, CA, USA, 31 October–6 November 1993; IEEE: Piscataway, NJ, USA, 1993. [Google Scholar]
  58. Yang, J.; Hauptmann, A.G. Exploring temporal consistency for video analysis and retrieval. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, Santa Barbara, CA, USA, 26–27 October 2006. [Google Scholar]
  59. Song, K.-C.; Yan, Y.-H.; Chen, W.-H.; Zhang, X. Research and perspective on local binary pattern. Acta Autom. Sin. 2013, 39, 730–744. [Google Scholar] [CrossRef]
  60. Mahale, V.H.; Ali, M.M.; Yannawar, P.L.; Gaikwad, A.T. Image inconsistency detection using local binary pattern (LBP). Procedia Comput. Sci. 2017, 115, 501–508. [Google Scholar] [CrossRef]
  61. Gaikwad, A.; Mahale, V.; Ali, M.M.; Yannawar, P.L. Detection and Analysis of Video Inconsistency Based on Local Binary Pattern (LBP). In Proceedings of the International Conference on Recent Trends in Image Processing and Pattern Recognition, Solapur, India, 21–22 December 2018; Springer: Cham, Switzerland, 2018. [Google Scholar]
  62. Bourouis, S.; Alroobaea, R.; Alharbi, A.M.; Andejany, M.; Rubaiee, S. Recent advances in digital multimedia tampering detection for forensics analysis. Symmetry 2020, 12, 1811. [Google Scholar] [CrossRef]
  63. Joy, S.; Kurian, L. Video Forgery Detection Using Invariance of Color Correlation. Int. J. Comput. Sci. Mob. Comput. 2014, 3, 99–105. [Google Scholar]
  64. Fadl, S.; Han, Q.; Qiong, L. Exposing video inter-frame forgery via histogram of oriented gradients and motion energy image. Multidimens. Syst. Signal Process. 2020, 31, 1365–1384. [Google Scholar] [CrossRef]
  65. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the IJCAI’81: 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981. [Google Scholar]
  66. Bruhn, A.; Weickert, J.; Schnörr, C. Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. Int. J. Comput. Vis. 2005, 61, 211–231. [Google Scholar] [CrossRef]
  67. Panchal, H.D.; Shah, H.B. Video tampering dataset development in temporal domain for video forgery authentication. Multimed. Tools Appl. 2020, 79, 24553–24577. [Google Scholar] [CrossRef]
  68. Shelke, N.A.; Kasana, S.S. Multiple forgery detection and localization technique for digital video using PCT and NBAP. Multimed. Tools Appl. 2022, 81, 22731–22759. [Google Scholar] [CrossRef]
  69. Sitara, K.; Mehtre, B. Detection of inter-frame forgeries in digital videos. Forensic Sci. Int. 2018, 289, 186–206. [Google Scholar] [CrossRef]
Figure 1. Illustration of video tampering; frame numbers in red font represent tampered frames: (a) frame duplication, (b) frame insertion.
Figure 2. Workflow of the proposed method for inter–frame tampering detection in surveillance video.
Figure 3. Representation of MR–LBP features of a tampered video: (a,b,n,o) are authentic frames, (c–m) are duplicated frames, and (c,m) represent the start and end points of the duplicated frames, respectively.
Figure 4. Illustration of OF aggregation β: (a) original video, (b) frame duplication, (c) frame insertion.
Figure 5. Sample frames from the CSVTED.
Figure 6. Example of a video from the CSVTED; original frames in the first and second rows; tampered frames (frame 113 to frame 116) in the third and fourth rows.
Figure 7. Illustration of false positives when only OF–based features are used for tampering detection; red circles show false positives and the green circle shows the tampered position.
Figure 8. (a) Illustration of false positives during the tampering detection stage; the red circle represents false positives and the green circles represent the tampered position; (b) effectiveness of OF–based features in reducing false positives.
Figure 9. Effectiveness of the proposed approach in detecting subtle tampering: (a) the SSIM curve utilized in Ref. [28]; (b) OF aggregation β of the proposed method, where high peaks mark the start and end of frame insertion (Customer Dealing.mp4).
Figure 10. Comparison with state–of–the–art methods for inter–frame tampering detection [12,15,19,28,38,39,52,68,69].
Table 2. Details of the datasets used in the literature for frame duplication/insertion detection.

| Reference | # Videos | Test Videos | # Frames Duplicated/Inserted | Resolution | Frame Rate (fps) | Scenario |
| --- | --- | --- | --- | --- | --- | --- |
| Ulutas et al. [38] | 31 | 5 | 20, 30, 40, 50, 55, 60, 70, 80 | 320 × 240 | 29.97, 30 | – |
| Jia et al. [15] | 115 | – | 10, 20, 40 | 320 × 240 | 29.97, 30 | – |
| Ulutas et al. [52] | 10 | 10 | 20, 30, 40, 50, 55, 60, 70, 80 | 320 × 240 | 29.97, 30 | – |
| Fadl et al. [28] | FD: 62 + FI: 287 | FD: 12 + FI: 57 | 10 to 600 | 720 × 1280, 240 × 320, 288 × 352, 576 × 704 | 23.98 to 30 | – |
| CSVTED | FD: 225 + FI: 225 | FD: 45 + FI: 45 | 10, 15, 20, 25, 30, 35, 40, 45, 50 | 640 × 360, 640 × 480, 1920 × 1080, 1280 × 720 | 12.50, 15, 25, 29.97, 30 | Morning, Evening, Night, and Fog |
Table 3. Comparison of performance parameters of the proposed methods for frame duplication detection.

| Method | PR (%) | RR (%) | DA (%) |
| --- | --- | --- | --- |
| Proposed method with MR–LBP features only | 95.92 | 99.90 | 95.93 |
| Proposed method with OF–based features only | 99.84 | 99.84 | 99.69 |
| Proposed Model 1 (MR–LBP with standard deviation) | 99.23 | 99.89 | 99.10 |
| Proposed Model 2 (MR–LBP with OF aggregation) | 99.81 | 99.90 | 99.71 |

The highest values of PR, RR, and DA are represented in bold.
Table 4. Performance comparison of the proposed methods on cross–datasets for frame duplication detection.

| Method | Dataset | PR (%) | RR (%) | DA (%) |
| --- | --- | --- | --- | --- |
| Proposed Model 1 (MR–LBP with standard deviation) | Ulutas Dataset [38] | 99.67 | 99.92 | 99.59 |
| Proposed Model 1 (MR–LBP with standard deviation) | Panchal Dataset [67] | 99.61 | 99.95 | 99.57 |
| Proposed Model 2 (MR–LBP with OF aggregation) | Ulutas Dataset [38] | 99.83 | 99.92 | 99.75 |
| Proposed Model 2 (MR–LBP with OF aggregation) | Panchal Dataset [67] | 99.75 | 99.89 | 99.64 |

The highest values of PR, RR, and DA for each model are represented in bold.
Table 5. Comparison of performance parameters of the proposed methods for frame insertion detection.

| Method | PR (%) | RR (%) | F1 (%) | DA (%) |
| --- | --- | --- | --- | --- |
| Proposed method with MR–LBP features only | 25.35 | 100 | 40.45 | 97.29 |
| Proposed method with OF–based features only | 82.22 | 82.22 | 82.22 | 99.67 |
| Proposed Model 1 (MR–LBP with standard deviation) | 73.78 | 100 | 84.91 | 99.67 |
| Proposed Model 2 (MR–LBP with OF aggregation) | 91.40 | 94.40 | 92.88 | 99.87 |

The highest values of PR, RR, F1, and DA are represented in bold.
Table 6. Performance comparison of the proposed methods on a cross–dataset for frame insertion detection.

| Method | Dataset | PR (%) | RR (%) | F1 (%) | DA (%) |
| --- | --- | --- | --- | --- | --- |
| Proposed Model 1 (MR–LBP with standard deviation) | Panchal Dataset [67] | 93.75 | 100 | 96.77 | 99.97 |
| Proposed Model 2 (MR–LBP with OF aggregation) | Panchal Dataset [67] | 100 | 94.64 | 97.25 | 99.98 |

The highest values of PR, RR, F1, and DA are represented in bold.
Table 7. Comparison of computational efficiency for inter–frame tampering detection.

| Method | Frame Duplication: Time per Frame (s) | Frame Duplication: Time per Pixel (µs) | Frame Insertion: Time per Frame (s) | Frame Insertion: Time per Pixel (µs) |
| --- | --- | --- | --- | --- |
| Ulutas et al. [38] | 0.2 | 2.6 | × | × |
| Ulutas et al. [52] | 0.01 | × | × | × |
| Alsakar et al. [25] | × | × | 10.75 | × |
| Jia et al. [15] | × | 1.623 | × | × |
| Akhtar et al. [19] | × | × | 0.227 | × |
| Proposed Model 1 (MR–LBP with standard deviation) | 3.6 | 3.3 | 3.39 | 3.1 |
| Proposed Model 2 (MR–LBP with OF aggregation) | 3.8 | 3.6 | 3.7 | 3.4 |

× indicates that the corresponding time is not reported.
Table 8. Overall comparison with the state–of–the–art methods.

| Method | Size of Dataset | Evaluation FD: DA (%) | Evaluation FD: F1 (%) | Evaluation FI: DA (%) | Evaluation FI: F1 (%) | Cross–Validation FD: DA (%) | Cross–Validation FI: DA (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ulutas et al. [52] | 10 | 99.35 | 99.64 | × | × | × | × |
| Ulutas et al. [38] | 31 | 96.73 | 97.79 | × | × | × | × |
| Sitara et al. [69] | 90 | 94.5 | × | × | × | × | × |
| Jia et al. [15] | 115 | 98 | 98.5 | × | × | × | × |
| Kharat et al. [12] | 20 | 99.7 | 99.82 | × | × | × | × |
| Bozkurt et al. [39] | 13 | 98.59 | × | × | × | × | × |
| Alsakar et al. [25] (HARRIS features) | 18 | × | × | × | 63 | × | × |
| Alsakar et al. [25] (GLCM features) | 18 | × | × | × | 67 | × | × |
| Alsakar et al. [25] (SVD features) | 18 | × | × | × | 95 | × | × |
| Fadl et al. [28] | 349 | 98.5 | × | 99.9 | – | × | × |
| Shelke and Kasana [68] | 100 | 98.56 | 96 | 98.28 | 97.05 | × | × |
| Akhtar et al. [19] | 2555 | × | × | 98.98 | 98.87 | × | × |
| Proposed | 450 | 99.71 | 99.85 | 99.87 | 92.88 | 99.75 | 99.98 |

× indicates that the corresponding result is not reported; FD: frame duplication, FI: frame insertion, DA: detection accuracy.