Article

Generalizing Source Camera Identification Based on Integral Image Optimization and Constrained Neural Network

1 Shaanxi Key Laboratory of Network Computing and Security, Xi’an University of Technology, Xi’an 710048, China
2 School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3630; https://doi.org/10.3390/electronics13183630
Submission received: 5 August 2024 / Revised: 7 September 2024 / Accepted: 10 September 2024 / Published: 12 September 2024

Abstract

Source camera identification can verify whether two videos were shot by the same device, which is of great significance in multimedia forensics. Most existing identification methods use convolutional neural networks to learn sensor noise patterns and identify the source camera in closed forensic scenarios. While these methods have achieved remarkable results, they are constrained by two primary challenges: (1) interference from semantic information and (2) mismatched feature distributions across different datasets. The former interferes with the extraction of effective features, while the latter causes the model to overfit the feature distribution of the training data and become sensitive to unseen data. To address these challenges, we propose a novel source camera identification framework that determines whether two videos were shot by the same device by measuring the similarity between their source camera features. Firstly, we extract video keyframes and use the integral image to optimize the inter-pixel-variance-based smooth block selection algorithm, removing the interference of video semantic information. Secondly, we design a residual neural network fused with a constraint layer to adaptively learn video source features. Thirdly, we introduce a triplet loss metric learning strategy to optimize the network model and improve its discriminability. Finally, we design a multi-dimensional feature vector similarity fusion strategy to achieve highly generalized source camera recognition. Extensive experiments show that our method achieves an AUC value of up to 0.9714 in closed-set forensic scenarios and an AUC value of 0.882 in open-set scenarios, an improvement of 5% over the best baseline method. Furthermore, our method demonstrates effectiveness in the task of deepfake detection.

1. Introduction

In recent years, owing to the ubiquity and ease of use of multimedia content editing tools, the reliability of multimedia content has attracted increasing attention. The problem is further exacerbated by the development of artificial intelligence, which can generate high-quality multimedia content without requiring users to have specific domain knowledge or skills [1,2,3]. If such content is disseminated through news media and social networks, it can mislead public opinion, disrupt social order, and even endanger national public security, triggering a global crisis of trust. Especially in the field of forensics, when law enforcement officers present multimedia evidence in court, they need to confirm the source of the evidence to ensure its validity. Consequently, the traceability of multimedia content holds significant research importance.
Multimedia content traceability reverse engineers the acquisition process of digital multimedia content to trace its source, and mainly involves identification at the camera-model level and at the device level [4]. Camera-model identification determines the brand and model of the camera that captured the content, whereas device identification links the content to a specific physical device; model-level identification alone cannot map multimedia content to a particular device. In forensic applications, investigators primarily need to link multimedia content to the device that produced it to ensure the reliability of multimedia evidence. Therefore, this paper focuses on device-level camera forensics. In addition, owing to the portability and popularity of mobile devices and the high resolution of their integrated cameras, people usually capture multimedia content with smartphones [5,6]. Compared with images, videos provide a richer record of people's lives and experiences [7]. If the reliability of such content can be ensured, it can serve as more convincing evidence. We therefore focus on forensic methods for videos captured by smartphones.
Most existing camera source forensics methods adopt machine learning models to design and mine heuristic features and have made excellent progress on source identification in closed-set scenarios. However, in practical applications, the video data obtained by forensic investigators at the scene may not have been captured by any device in the known training device set; that is, the feature distributions of the training and test sets are inconsistent. Because of these variations in feature distributions, the performance of existing forensic methods for identifying device sources degrades in open-set scenarios. In addition, existing source identification algorithms do not fully account for the interference of semantic information with the extraction of discriminative features, so the performance of the detection model still needs to be improved.
To solve the above problems, we propose a novel method for general video source forensics covering both open and closed sets. The main contributions of this paper are as follows: (1) We design a video frame smooth patch extraction algorithm, and the smooth patches serve as the objects of feature extraction. This algorithm reduces the sample space and enables the feature extraction network to focus on forensic features without interference from content information. (2) We design a residual neural network fused with a constraint layer to adaptively learn video source features. Additionally, we introduce a triplet loss metric learning strategy to optimize the network, reducing intra-class variations and enlarging inter-class differences among samples, thereby enhancing the discriminability of the model. (3) We fuse block-level similarity scores into a video-level similarity to determine whether two videos were taken by the same camera. (4) Our experimental results show that the proposed video source forensics system can effectively verify video sources, and its performance is better than that of other methods on two large-scale public datasets and on social network platforms.

2. Related Works

Digital video source recognition is an important research direction in digital video forensics. In this section, we first summarize existing digital video source forensics algorithms from two perspectives: closed-set and open-set video source forensics. We then summarize the problems that the algorithm proposed in this paper addresses.
Digital video source forensics determines the data source from the traces left by the camera during the acquisition of digital video content and maps the digital video to the capturing equipment to confirm the reliability of the evidence [4,8]. Closed-set image or video source recognition assigns an image or video to one of a group of known camera models or devices; that is, the camera model or device to which the image or video under investigation belongs is assumed to be in the candidate set. On the contrary, in open-set image or video source recognition, the image or video being analyzed can belong to a known or unknown camera model or device.
Digital video is composed of a sequence of frames and therefore shares techniques with image source forensics. Video data from the same source share common characteristics, which are distributed across the hardware and software of the camera and are referred to as the camera's fingerprint [9,10].
In terms of hardware fingerprints, internal defects introduced by the manufacturing process and the materials of a device leave special pixels or image noise in the generated image or video. These special pixels and image noise can be matched against the reference pattern noise of a specific camera for image source forensics [11]. Hardware fingerprints mainly include sensor dust [12], lens radial distortion [13,14], pattern noise [15,16], etc. Researchers mainly use statistical methods with manually designed features [17,18] and convolutional neural networks [19,20,21,22,23,24,25,26] to extract such features for image source forensics.
In terms of software fingerprints, considering that each digital camera uses a proprietary interpolation (demosaicing) algorithm to reconstruct missing color values, Bayram et al. [27] identified the source camera of an image from the traces of this proprietary interpolation algorithm. Compared with images, video files are much more heavily compressed, and CFA traces are strongly affected by compression; therefore, this technique is rarely used in video forensics.
Digital video source forensics is more challenging than image forensics for two main reasons: (1) the amount of data in digital video is much larger than in images, which raises the computational complexity of the forensics algorithm, and (2) video is more heavily compressed. Accordingly, video source forensics has received relatively little attention compared with image source forensics. Amerini et al. [28] proposed using PRNU noise to generate composite fingerprints to identify the sources of videos uploaded to social networks. In [29,30,31,32], researchers applied PRNU-based camera source recognition to stabilized videos. Kouokam et al. [33] proposed a video frame-based PRNU fingerprint estimation method for identifying YouTube video sources, taking into account the impact of video compression on PRNU noise in video frames. Chen et al. [34] proposed using wireless signals and sensor signals for video source recognition; this method performs well not only on conventional video but also on wireless streaming video affected by blocking and blur, and it can resist wireless camera spoofing attacks. Kuzin et al. [35] proposed an end-to-end camera source identification network based on convolutional neural networks; the network adopted a 161-layer DenseNet architecture initialized with weights pre-trained on ImageNet and achieved good recognition results. In [36], the authors proposed combining images and videos to identify the devices from which they originate: a reference fingerprint is computed from still images captured by the original device, a query fingerprint is estimated from the video under investigation, and the two are compared to determine the video source. Yang et al. [37] analyzed video integrity from the perspective of the video container structure. Firstly, the video container was used to construct domain symbols and value symbols so as to build a better feature representation of the container. Secondly, irrelevant feature vectors were removed automatically by likelihood comparison. Finally, a decision tree classifier was constructed to achieve effective recognition of video sources.
However, the above forensics methods target closed-set source identification, which falls far short of the requirements of practical forensic analysis. For open-set source forensics, researchers have in recent years mainly used data-driven methods based on machine learning. Kharrazi et al. [38] proposed multiple features for blind source camera identification, providing discriminative features that can be used for this task. Mayer et al. [39] proposed a video source forensics system based on deep learning: the system first extracts general deep features from blocks of video frames, then uses a similarity neural network to compare the extracted feature pairs, and finally obtains the source camera model of the video by fusing the frame-level comparison results.
The existing video source forensics methods have the following shortcomings: (1) There is relatively little research on camera models or devices unknown to the forensic investigator. (2) Open-set video source forensics methods mainly target camera model identification, but model identification cannot accurately locate the source device, which requires further analysis by a forensic investigator. Motivated by these shortcomings, this paper proposes a new video source forensics system. Different from existing methods, our method determines whether two videos come from the same camera by measuring the similarity of the source attributes of any two video block sequences, without any prior knowledge.

3. Background

This section provides a detailed introduction to the relevant terminologies in the proposed source identification algorithm, which lays a solid theoretical foundation for the construction of the algorithm. Explanations of the main terminologies are as follows:
(1)
I frame: I frame, also known as Intra-coded Frame, along with predictive-coded frame and bi-directional-coded frame, collectively form the encoding structure of a video sequence. Among them, the I frame, also referred to as an intra-frame or keyframe, is characterized by its independence and completeness. This means that it does not rely on other frames for encoding and contains all the pixel information of the current frame. Therefore, in this paper, smooth blocks extracted from the I frames of the video are used as inputs to the model.
(2)
Smooth patches: Smooth patches in video frames typically refer to regions with minimal pixel variations and less texture information. The main characteristics of smooth patches are as follows: (1) Low texture complexity, meaning that the variations between pixel values are small, lacking distinct edges and details, resulting in a visually uniform and monochromatic appearance with consistent brightness. (2) High stability, as the pixel values in these regions do not change significantly between different frames. Based on these characteristics of smooth patches, combined with existing research on smooth blocks in the literature, it is concluded that forensic features are generally located in smooth patches with low texture complexity.
(3)
Integral image: The integral image algorithm is a method used for rapidly computing regional features in images, primarily applied in the field of computer vision. Leveraging the integral image allows for the swift calculation of the sum of pixels within any rectangular region of an image. The detailed process of calculating the sum of pixels in a rectangular area by integral image is as follows:
Step 1: Construction of the integral image
Given an image I(x, y), where x and y index the columns and rows of the image, respectively, the integral image J(x, y) can be computed using the following formula:
J(x, y) = I(x, y) + J(x - 1, y) + J(x, y - 1) - J(x - 1, y - 1)    (1)
where J(x - 1, y) is the integral image value of the column immediately to the left of the current point, J(x, y - 1) is the integral image value of the row immediately above the current point, and J(x - 1, y - 1) is the integral image value of the pixel diagonally up and to the left of the current point.
Step 2: Calculation of the sum of pixels within a rectangular region
To compute the sum of pixels within a rectangular region from (x_1, y_1) to (x_2, y_2), where x_1 < x_2 and y_1 < y_2, the following formula can be used:
S = J(x_2 + 1, y_2 + 1) - J(x_2 + 1, y_1) - J(x_1, y_2 + 1) + J(x_1, y_1)    (2)
This formula leverages the properties of the integral image to efficiently calculate the sum of pixels within any rectangular region by simple arithmetic operations, significantly enhancing computational efficiency.
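For readers who prefer code, the following minimal NumPy sketch (our own illustration, not the authors' released implementation) builds a zero-padded integral image as in Equation (1) and evaluates the rectangle sum of Equation (2) with four lookups; the array is indexed as [row, column], i.e., [y, x], and the frame is a synthetic example.

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """Integral image J with a one-pixel zero border, so that
    J[y + 1, x + 1] equals the sum of img[:y + 1, :x + 1] (Equation (1))."""
    J = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    J[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return J

def region_sum(J: np.ndarray, x1: int, y1: int, x2: int, y2: int) -> int:
    """Sum of pixels over columns x1..x2 and rows y1..y2 (inclusive),
    obtained with four lookups as in Equation (2)."""
    return int(J[y2 + 1, x2 + 1] - J[y1, x2 + 1] - J[y2 + 1, x1] + J[y1, x1])

# Example on a synthetic 8-bit frame: the four-lookup sum matches the direct sum.
frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
J = integral_image(frame)
assert region_sum(J, 10, 20, 73, 83) == frame[20:84, 10:74].sum()
```

Because the integral image is built once per frame, every subsequent block sum costs only four array lookups regardless of the block size.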
Based on the detailed analysis and understanding of the abovementioned relevant theories, we design a framework for generalizing source camera identification based on integral image optimization and constrained neural network. This approach aims to achieve effective video source identification.

4. Problem Formulation

Digital video source forensics analysis is an important research direction in multimedia forensics. Previous work defines the video source forensics problem as the classification of forensic traces. However, such methods assign forensic fingerprints that are not present in the training set to a wrong device. In addition, for forensic investigators, verifying whether the obtained videos come from the same source is generally sufficient to ensure the reliability of the collected evidence and to meet their analysis needs.
To address this, we propose a model that is capable of operating on unknown devices. The main idea is to use the similarity between the source attribute features of two videos to determine whether they come from the same device. Even if the source device of a test video is not in the forensics database, the proposed model can still determine whether two test videos come from the same device. Given two videos V_1 and V_2, the similarity score of their device source attribute features is
S(V_1, V_2) = \begin{cases} 0, & \text{if } V_1 \text{ and } V_2 \text{ carry different traces} \\ 1, & \text{if } V_1 \text{ and } V_2 \text{ carry the same trace} \end{cases}    (3)
where S(·, ·) is the similarity function between video V_1 and video V_2. Video source attribute features are obtained by a video frame smooth patch feature extractor f(X) ∈ R^N, which maps a smooth patch X of a video frame to an N-dimensional feature space. This feature space represents the device source attribute features (forensic traces) of the smooth patch X. We specify how this is achieved in Section 5, where we describe our proposed implementation of the forensic similarity model.

5. Overview

The goal of our model is to learn the source attribute characteristics of videos and calculate their similarity to verify whether the two videos are taken by the same device for multimedia video forensics. As shown in Figure 1, the model primarily consists of the following parts:
(1)
Smooth patches selection: To mitigate the interference of semantic information on source identification features, we designed a variance-based patches selection algorithm leveraging the high similarity between adjacent pixels within a smooth block. This is combined with the integral image algorithm to accelerate the variance calculation, and patches are discriminated and selected based on a prior threshold.
(2)
Video source feature extraction: To further eliminate the interference of semantic information on video source features, we designed a constrained convolutional neural network to extract video source features. By introducing specific constraints, the network’s discriminative ability for source identification features is improved.
(3)
Training Strategy: Given the fine-grained differences in the distribution of videos from the same device versus those from different devices within the sample space, as well as the uneven distribution of samples, we employ triplet metric loss to train the designed network model. This enables the separation of videos from different device sources and the clustering of videos from the same device source.
(4)
Block-level feature fusion: To address the limited training sample data for block-level and frame-level source identification, as well as the lack of contextual information, we fuse the block-level similarity measurement results to achieve video-level source identification.

5.1. Smooth Patches Selection

In this section, we propose a smooth patch selection algorithm that uses the integral image [40,41] to accelerate the computation of inter-pixel variance and extract the smooth patches of video frames, which serve as the input of the network model in Section 5.2.1, so that the forensic features of these flat patches can be fully extracted. Because the I frames of a video carry the main content information, we extract the smooth patches from the I frames for feature extraction.
The reasons for adopting the proposed smooth patch extraction algorithm are as follows: (1) The non-smooth patches of an I frame contain a large amount of content information, which seriously interferes with the extraction of forensic features by the neural network [42]. (2) If the center patch of a frame is directly used as the network input, feature extraction is insufficient because the video dataset is limited. (3) If all patches of a video frame are used as network input, the computational overhead increases and the performance of forensic feature extraction suffers. Therefore, how to extract sufficient forensic features from a limited number of patches is the primary problem to be solved. The proposed integral image-based algorithm for selecting smooth patches via inter-pixel variance proceeds as follows:
Step 1: We use a variance-based patch extraction algorithm to extract disjoint smooth patches from the frame. To accelerate the evaluation of candidate patches, we use the integral image [40] to compute the sum of pixels within each patch region.
D = \frac{1}{N^2}\sum_{i=0}^{N}\sum_{j=0}^{N} P(i,j)^2 - \left[ \frac{1}{N^2}\sum_{i=0}^{N}\sum_{j=0}^{N} P(i,j) \right]^2    (4)
From the definition of integral image, it can be deduced that
S_{AT} = S(u+N, v+N) - S(u+N, v) - S(u, v+N) + S(u, v)    (5)
where S_{AT} = \sum_{i=0}^{N}\sum_{j=0}^{N} P(i+u, j+v) is the pixel sum over the candidate patch; S(u, v) = \sum_{i=0}^{u}\sum_{j=0}^{v} P(i, j) is the integral image; N is the patch size; (u, v) are the coordinates of the upper-left corner of the candidate patch; and P(i, j) is the pixel value of the video I frame at (i, j).
Step 2: We set the variance threshold to lie between 10 and 100. The selected smooth patches are used as input blocks to train the network model in Section 5.2.1, from which the noise characteristics of the smooth patches are extracted. The pseudo code of the algorithm is described in Algorithm 1.
Algorithm 1 Video smooth patch extraction
  • Input: A frame P of size w × h, maximum variance ϵ, maximum block number Num, patch size N
  • Output: A set of patches G
1: Sign ← zeros(w, h), count ← 0
2: while count < Num do
3:   d_min ← ∞  (reset the lowest-variance candidate for this pass)
4:   for all patch in P do
5:     x ← GetX(patch), y ← GetY(patch)
6:     if Sign[x][y] is 1 then
7:       continue
8:     end if
9:     Calculate the variance d_i of patch by Equation (4), accelerated with Equation (5)
10:    if d_i < d_min then
11:      d_min ← d_i, patch_min ← patch, (x_min, y_min) ← (x, y)
12:    end if
13:   end for
14:   if d_min > ϵ then
15:     break
16:   end if
17:   G ← G ∪ {patch_min}, count ← count + 1
18:   Sign[x_min − N : x_min + N, y_min − N : y_min + N] ← 1
19: end while
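To make the procedure concrete, the following NumPy sketch mirrors Algorithm 1 under assumed settings (256 × 256 patches, at most 8 patches per frame, the variance window [10, 100] from Step 2); it is an illustrative re-implementation rather than the authors' code. Integral images of P and P² give every block variance in constant time per block, after which the lowest-variance admissible blocks are picked greedily while masking their neighborhoods.

```python
import numpy as np

def patch_variances(gray: np.ndarray, N: int) -> np.ndarray:
    """Variance of every N x N block with top-left corner (row, col), computed in
    O(1) per block from integral images of P and P**2 (Equations (4) and (5))."""
    P = gray.astype(np.float64)
    J1 = np.zeros((P.shape[0] + 1, P.shape[1] + 1))
    J2 = np.zeros_like(J1)
    J1[1:, 1:] = P.cumsum(0).cumsum(1)
    J2[1:, 1:] = (P * P).cumsum(0).cumsum(1)

    def block_sum(J):
        # Four-lookup rectangle sum, vectorized over all candidate top-left corners.
        return J[N:, N:] - J[:-N, N:] - J[N:, :-N] + J[:-N, :-N]

    mean = block_sum(J1) / N ** 2
    return block_sum(J2) / N ** 2 - mean ** 2          # E[P^2] - (E[P])^2

def select_smooth_patches(gray, N=256, min_var=10.0, max_var=100.0, num=8):
    """Greedy variant of Algorithm 1: repeatedly take the lowest-variance block whose
    variance lies in [min_var, max_var] and mask out its neighbourhood."""
    var = patch_variances(gray, N)
    var[(var < min_var) | (var > max_var)] = np.inf    # prior variance threshold (Step 2)
    patches = []
    while len(patches) < num and np.isfinite(var).any():
        r, c = np.unravel_index(np.argmin(var), var.shape)
        patches.append(gray[r:r + N, c:c + N].copy())
        var[max(0, r - N):r + N, max(0, c - N):c + N] = np.inf   # analogous to Sign[...] = 1
    return patches
```

The masking step plays the role of the Sign matrix in Algorithm 1: once a block is chosen, no overlapping candidate can be chosen again, so the returned patches are disjoint.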

5.2. Deep Features Extractor

In this section, we describe the network structure and training strategy of the feature extractor. The feature extractor takes the video frame image smooth patches obtained by the patches selection method as the input and extracts the forensics feature vector through the optimized MISLnet [43] CNN.

5.2.1. Source Identification Feature Extraction Network Architecture

In this section, we optimize the MISLnet CNN structure. MISLnet introduces a constrained convolution layer to suppress image content and adaptively learn traces of image processing, and it is also effective at extracting image features not seen during training. It has been widely used in digital image source camera model forensics tasks [44,45]. Therefore, we optimize it and apply it to the extraction of forensic features of video source camera devices.
Our optimization of the MISLnet CNN mainly includes the following parts: (1) We use RGB frame patches instead of grayscale patches as the input of the neural network. Because different color channels carry different forensic features, using RGB patches as input makes forensic feature extraction more complete. (2) Each 5 × 5 convolution kernel is replaced by two 3 × 3 kernels, which greatly reduces the number of parameters and the computational complexity. In addition, two 3 × 3 kernels introduce an additional activation function, increasing the nonlinearity of the network and making the extracted forensic features more expressive. (3) We adopt the idea of residual learning [46] to counteract the degradation caused by the increase in the number of layers that results from replacing a 5 × 5 kernel with two 3 × 3 kernels.
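The following PyTorch sketch illustrates these three points. It is a simplified stand-in for the optimized MISLnet, with assumed channel widths, block counts, and a 128-dimensional embedding (none of these are specified here): a constrained first convolution whose filters are re-projected before every forward pass, followed by residual blocks built from pairs of 3 × 3 convolutions operating on RGB patches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedConv(nn.Conv2d):
    """Constrained convolution in the spirit of MISLnet [43]: before each forward pass
    the filters are projected so the centre weight is -1 and the remaining weights sum
    to 1, suppressing image content and exposing residual noise."""
    def forward(self, x):
        with torch.no_grad():
            w = self.weight                       # (out, in, k, k)
            k = w.shape[-1] // 2
            w[:, :, k, k] = 0.0
            w /= w.sum(dim=(2, 3), keepdim=True).clamp_min(1e-8)
            w[:, :, k, k] = -1.0
        return F.conv2d(x, self.weight, self.bias, self.stride, self.padding)

class ResBlock3x3(nn.Module):
    """Two 3x3 convolutions with a skip connection, standing in for the original 5x5
    convolution (points (2) and (3) above)."""
    def __init__(self, ch):
        super().__init__()
        self.conv1, self.conv2 = nn.Conv2d(ch, ch, 3, padding=1), nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(ch), nn.BatchNorm2d(ch)
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                    # residual learning [46]

class SourceFeatureNet(nn.Module):
    """Skeleton feature extractor: constrained layer on RGB patches, residual blocks,
    and a fully connected embedding used with the triplet loss of Section 5.2.2."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.constrained = ConstrainedConv(3, 3, kernel_size=5, padding=2, bias=False)
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.blocks = nn.Sequential(ResBlock3x3(64), ResBlock3x3(64))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim))
    def forward(self, x):                         # x: (B, 3, 256, 256) smooth patches
        f = self.head(self.blocks(self.stem(self.constrained(x))))
        return F.normalize(f, dim=1)              # unit-norm source embedding
```

The projection inside ConstrainedConv is not differentiated; as in the usual constrained-convolution practice, the weights are simply re-normalized each time they are used.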

5.2.2. Training Strategy

In this paper, the triplet loss function [47,48] is used to train the network described in Section 5.2.1. Compared with the contrastive loss function [49,50] commonly used to train forensic feature extraction networks, the triplet loss fully considers the distance relationships among positive samples, negative samples, and anchors.
L = \sum_{k=1}^{N} \left[ \left\| F_{ap} \right\|_2^2 - \left\| F_{an} \right\|_2^2 + \mathrm{margin} \right]_+    (6)
where [δ]_+ = max(0, δ), N is the total number of triplet samples, F_{ap} = f(x_k^a) − f(x_k^p), F_{an} = f(x_k^a) − f(x_k^n), f(x_k^a) is the anchor feature vector, f(x_k^p) is the positive-example feature vector, and f(x_k^n) is the negative-example feature vector. The margin is a constant that forces the network to learn a stricter separation: it pushes the anchor feature vector f(x_k^a) farther from the negative-example feature vector f(x_k^n) and pulls it closer to the positive-example feature vector f(x_k^p).
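A minimal PyTorch rendering of Equation (6) (our own sketch, not the paper's released code) is given below; it uses squared Euclidean distances, matching the formula, and averages the hinge term over the batch.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Triplet loss of Equation (6): pull each anchor towards its positive
    (same device) and push it away from its negative (different device)."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # ||f(x_k^a) - f(x_k^p)||_2^2
    d_an = (anchor - negative).pow(2).sum(dim=1)   # ||f(x_k^a) - f(x_k^n)||_2^2
    return F.relu(d_ap - d_an + margin).mean()     # [ . ]_+ hinge, averaged over the batch

# torch.nn.TripletMarginLoss offers an equivalent built-in criterion
# (using non-squared Euclidean distances by default).
```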
The samples used for device source attribute identification have the following characteristics: because devices of the same model share a similar manufacturing process, smooth patch samples from different devices of the same model are close in the sample space, whereas smooth patch samples from devices of different models are far apart. The distribution of samples in the sample space is therefore uneven. The triplet loss fully considers the distance relationships among positive samples, negative samples, and anchor points: when the anchor-negative distance exceeds the anchor-positive distance by more than the margin, the loss is 0, the current model is considered to have separated these samples well, and it is not updated by them. Therefore, we select the triplet loss function to fully train on the samples.
When using the triplet loss to train the network, the selection of triplet samples is crucial. If the selected triplets are too easy, the model is not updated; if hard mining is used to select only the most difficult examples, the model becomes very sensitive to noise and does not converge well. Therefore, we define smooth patches from different devices of the same model as hard samples and patches from devices of different models as easy samples, and we adopt an online semi-hard triplet training method. Hard and easy samples are initially trained at a 1:1 ratio, and the proportion of hard samples is gradually increased during training; when the ratio of hard to easy samples reaches 3:1, training ends. The pseudo code of the algorithm is described in Algorithm 2, and an illustrative sketch of this sampling schedule is given after it.
Algorithm 2 The procedure of training the feature extraction network
  • Input: Triplet dataset dataset, number of epochs epochs_num, learning rate α, margin margin
  • Output: Trained model parameters θ
1: Initialize model parameters θ
2: for epoch = 1 to epochs_num do
3:   for each batch in dataset do
4:     for each triplet (a_i, p_i, n_i) in the batch do
5:       A_embedding ← model(a_i, θ)
6:       P_embedding ← model(p_i, θ)
7:       N_embedding ← model(n_i, θ)
8:       Compute the triplet loss using Equation (6)
9:     end for
10:    Backpropagate and update parameters θ using learning rate α
11:   end for
12: end for
13: return Final model parameters θ
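The following sketch combines Algorithm 2 with the hard/easy sampling schedule described above. It reuses the triplet_loss function sketched earlier; the optimizer choice (Adam) and the dataset.sample_triplet(hard=...) helper, which would return an (anchor, positive, negative) patch triplet with the negative drawn from a same-model device when hard and from a different model otherwise, are our own assumptions for illustration.

```python
import random
import torch

def hard_ratio(epoch: int, total_epochs: int) -> float:
    """Fraction of hard triplets per batch, ramped linearly from 1:1 (0.5)
    at the start of training to 3:1 (0.75) at the end."""
    return 0.5 + 0.25 * epoch / max(total_epochs - 1, 1)

def train(model, dataset, epochs=200, batch_size=64, margin=1.0, lr=1e-3, device="cuda"):
    """Minimal training loop in the spirit of Algorithm 2 (illustrative only)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer choice is assumed
    model.to(device).train()
    for epoch in range(epochs):
        p_hard = hard_ratio(epoch, epochs)
        for _ in range(len(dataset) // batch_size):
            # Assemble a batch of triplets at the current hard:easy ratio.
            triplets = [dataset.sample_triplet(hard=random.random() < p_hard)
                        for _ in range(batch_size)]
            a, p, n = (torch.stack(t).to(device) for t in zip(*triplets))
            loss = triplet_loss(model(a), model(p), model(n), margin)   # Equation (6)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```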

5.3. Similarity Calculation and Fusion of Patch Feature Vectors

For video source forensics, the accuracy of using a single patch decision is not enough to identify the video source reliably. In order to solve this problem, we improve the recognition accuracy of the camera source device of the system by fusing the video smooth patch features extracted in Section 5.2.
We use the fusion of patch feature vector similarities to determine the similarity of the source attributes of two videos. To determine whether videos V_i and V_j are from the same device, the two videos can be formally expressed as follows: V_i = {F^i_1, F^i_2, ..., F^i_m}, meaning that video V_i consists of m frames, and F^i_m = {x^i_{1,m}, x^i_{2,m}, ..., x^i_{u,m}}, meaning that there are u smooth patches in frame F^i_m, where x^i_{u,m} is the feature vector of a smooth patch extracted by the algorithm in Section 5.2. Similarly, V_j = {F^j_1, F^j_2, ..., F^j_n}, meaning that video V_j consists of n frames, and F^j_n = {x^j_{1,n}, x^j_{2,n}, ..., x^j_{v,n}}, meaning that there are v smooth patches in frame F^j_n. The specific algorithm is as follows:
Step 1: We calculate the Euclidean distance between two patch feature vectors and then obtain the similarity of the two patches.
\langle x^i_{u,m}, x^j_{v,n} \rangle = \begin{cases} 0, & \text{if } \left\| x^i_{u,m} - x^j_{v,n} \right\|_2 > \tau \\ 1, & \text{otherwise} \end{cases}    (7)
where τ = 1 and we compare \| x^i_{u,m} - x^j_{v,n} \|_2 with the threshold τ. When \| x^i_{u,m} - x^j_{v,n} \|_2 > τ, the similarity score between the patches is 0, indicating that the two test patches have different source attributes. When \| x^i_{u,m} - x^j_{v,n} \|_2 ≤ τ, the similarity score is 1, indicating that the two test patches share the same source attributes.
Step 2: We calculate the source attribute features similarity of frames.
\langle F^i_m, F^j_n \rangle = \begin{cases} 0, & \text{if } \left\| x^i \times x^j \right\|_1 < \frac{uv}{2} \\ 1, & \text{otherwise} \end{cases}    (8)
where x^i = {x^i_{1,m}, x^i_{2,m}, ..., x^i_{u,m}} denotes the feature vectors of all smooth patches in frame F^i_m, x^j = {x^j_{1,n}, x^j_{2,n}, ..., x^j_{v,n}} denotes the feature vectors of all smooth patches in frame F^j_n, and × denotes the Cartesian product over pairs of smooth patch feature vectors; that is, \| x^i \times x^j \|_1 sums the patch-level similarity scores of Equation (7) over all patch pairs.
Step 3: We calculate the similarity of test videos V_i and V_j from the frame similarities to determine whether the two videos come from the same source.
\langle V_i, V_j \rangle = \begin{cases} 0, & \text{if } \left\| F^i \times F^j \right\|_1 < \frac{mn}{2} \\ 1, & \text{otherwise} \end{cases}    (9)
where F^i = {F^i_1, F^i_2, ..., F^i_m} denotes the set of m frames of video V_i and F^j = {F^j_1, F^j_2, ..., F^j_n} denotes the set of n frames of video V_j. Whether video V_i and video V_j come from the same source is thus obtained by fusing the similarities between their frames.
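The three fusion steps can be written compactly as nested majority votes. The sketch below is an illustrative NumPy rendering of Equations (7)-(9), taking each frame as a list of patch feature vectors and each video as a list of frames; it is not the authors' released code.

```python
import numpy as np

def patch_score(f1: np.ndarray, f2: np.ndarray, tau: float = 1.0) -> int:
    """Equation (7): 1 if two patch embeddings are within distance tau (same source)."""
    return int(np.sum((f1 - f2) ** 2) <= tau ** 2)

def frame_score(frame_i, frame_j, tau: float = 1.0) -> int:
    """Equation (8): majority vote over all patch pairs of two frames."""
    votes = [patch_score(p, q, tau) for p in frame_i for q in frame_j]
    return int(sum(votes) >= len(frame_i) * len(frame_j) / 2)

def video_score(video_i, video_j, tau: float = 1.0) -> int:
    """Equation (9): majority vote over all frame pairs of two videos.
    Each video is a list of frames; each frame is a list of patch feature vectors."""
    votes = [frame_score(fi, fj, tau) for fi in video_i for fj in video_j]
    return int(sum(votes) >= len(video_i) * len(video_j) / 2)
```

A returned video_score of 1 corresponds to the "same trace" case of Equation (3), i.e., the two videos are judged to come from the same device.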

6. Experiment and Analysis

In this section, we evaluate the effectiveness of our proposed method for video source identification in both open-set and closed-set forensic scenarios. Firstly, we compare the impact of block-level, frame-level, and video-level approaches on video source identification in both open-set and closed-set forensic scenarios. Subsequently, we analyze the importance of each component within our model. Furthermore, we employ our model and a baseline method to perform source identification on compressed datasets. Lastly, we extend our model to the task of deepfake detection and compare it against a baseline method. We aim to answer the following questions:
  • How do the various components in the proposed model and related parameters affect the detection performance of the model?
  • How can we evaluate the video source recognition performance of the proposed method on compressed datasets compared to the baselines?
  • How can we evaluate the effectiveness of extending the model to a deepfake detection task?

6.1. Experimental Setup

To carry out the above experiments and evaluate the effectiveness of the proposed method, we mainly use the Daxing [51] and Vision [52] datasets. The Daxing dataset includes 43,400 images and 1400 videos captured by 90 smartphones of 22 models from 5 brands. The shooting scenes mainly include sky, grass, stone, tree, staircase, indoor vertical printer, hall wall, classroom white wall, etc. At least three videos were captured for each scene, each lasting more than 10 s. The Vision dataset includes both local videos and videos shared on social network platforms. It contains 34,427 images and 1914 videos collected from 35 smartphones of 11 major brands; in this paper, we only use the video data. It covers three kinds of scenes: flat scenes (sky or plain walls), indoor scenes (classroom, office, hall, store, etc.), and outdoor scenes (nature, garden, city, etc.).
We use 60% of the Daxing dataset as the training set and 40% as the test set, and all videos in the Vision dataset are used as test data. We use the patch selection algorithm to extract 256 × 256 patches from each video, which yields 106,306 smooth patches for the training set and 47,302 smooth patches for the test set.
When training the network model, we use the following parameters: the network is trained for 200 epochs on batches of 64 smooth patch pairs, the initial learning rate is 0.001, and the learning rate is halved every 30 epochs. We train and test on an NVIDIA GeForce RTX 3090 using PyTorch 1.9.
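In PyTorch, the stated schedule (initial learning rate 0.001, halved every 30 epochs over 200 epochs) corresponds to a StepLR scheduler, as in the minimal sketch below; the choice of Adam and the train_one_epoch helper are our own illustrative assumptions, and model stands for the feature extraction network defined earlier.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                 # optimizer choice assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(200):
    train_one_epoch(model, optimizer)   # hypothetical helper: one pass over the triplet batches
    scheduler.step()                    # halves the learning rate every 30 epochs
```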

6.2. Verification Results and Analysis of Source Device in Forensics Scene

We evaluate the performance of our proposed system in open-set, closed-set, and mixed open/closed-set forensic scenarios. We use the test split of the Daxing dataset as the closed-set data and the Vision dataset as the open-set data for experimental verification.
Table 1 reports the verification accuracy of different fusion levels in different forensic scenarios. The open-set scenario means that neither camera device appears in the training set; here, the accuracy of our proposed system is 84.5%, which exceeds the 73.89% of the patch level and the 80.88% of the frame level, indicating that fusing patch-level results into video-level results improves the verification rate. The closed-set forensic scenario means that both devices are present in the training set; here, the accuracy is 93.17%, indicating that prior knowledge of the camera devices acquired from the training data improves verification accuracy. Overall, the video-level forensics method outperforms the block-level and frame-level methods.
In addition, in order to observe more clearly the performance of the proposed method in the forensic task, we also conducted confusion matrix experiments on the classification task for closed-set forensic scenarios. As shown in Figure 2, it is evident that the normalized True Positives (TP) and True Negatives (TN) are significantly greater than the False Positives (FP) and False Negatives (FN), indicating that the proposed method is highly effective in classifying positive instances as positive and negative instances as negative. This demonstrates that the proposed method is capable of accurately capturing the target categories and the video-level results outperform those at the block level and frame level.

6.3. Analysis of Forensics Model Components

Since the proposed forensics model contains several main components, we compare variants of the model from the following perspectives to verify its effectiveness: (1) the effect of smooth patch preprocessing and (2) the effect of the forensic feature extraction network design. The following variants of the forensics system are designed for comparison.
Setting A: The pre-processing operation of selecting smooth patches by patch algorithm is removed and the patches of video frame are directly used as the input of the optimized MISLnet CNN to extract forensics features.
Setting B: Patches selection algorithm is used to select smooth patches for pre-processing; then, they are input into the original MISLnet CNN structure to extract forensics features.
Setting C: Patches selection algorithm is used to select smooth patches for pre-processing; then, they are input into the optimized MISLnet CNN structure to extract forensics features. The results are shown in Table 2.
Influence of smooth patches: We compare the forensic results obtained with and without smooth patches on the Daxing and Vision datasets to study the effectiveness of smooth patches as feature extraction objects. From the results, we observe that our forensics system outperforms Setting A, which shows that smooth patch extraction is necessary. Taking smooth patches as the feature extraction object allows the feature extraction network to focus on forensic features without interference from content information, thereby yielding better forensic results.
Influence of the forensic feature extraction network model: We compare the performance of our forensics system and Setting B on the Daxing and Vision datasets to study the effectiveness of the network model. The results show that the performance of the forensics system used in this paper is better than that of Setting B, which demonstrates that the optimizations of the network structure and training method proposed in this paper are effective.

6.4. Smooth Patch Threshold Analysis

The selection of smooth patches plays an important role in the extraction of forensic features. In this experiment, we analyze the influence of the threshold of extracting the smooth patches on the forensics results. In order to achieve this, we use different threshold intervals to verify their impact on the forensics results.
For convenience of representation, we use standard deviation instead of variance value. Figure 3 shows the accuracy of extracting smooth patches using different standard deviation threshold space for video similarity forensics. When the standard deviation of the whole block is between 0 and 3, the accuracy of similarity forensics is 90.60%. This is because the smooth patches extracted within the standard deviation threshold have high saturation, so it is difficult to extract useful noise forensics features. When the standard deviation of the whole block is between 3 and 10, the accuracy of forensics is 98.31%. The patches selected in this interval are smooth and unsaturated. The system can select the smooth patches from this region and then extract sufficient and effective forensics features. The smooth patches extracted in this interval have less scene content, which reduces the impact of scene content on forensic results. When the standard deviation of the patches is greater than 10, the accuracy of evidence collection is less than 87.15%. This is because the smooth blocks extracted in this range contain more scene information, which affects the extraction of forensic features.

6.5. Verification Results and Analysis of Video on Social Network Platform

Due to the wide use of social network platforms, forensic investigators often obtain videos that have been re-encoded by these platforms. We therefore use the FFmpeg library to simulate social network video coding, re-encoding each test video with coding parameters corresponding to YouTube and Facebook, in order to assess the verification of highly compressed video source devices from social networks in practical application scenarios.
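As an illustration of the re-encoding step, the snippet below shells out to FFmpeg from Python to transcode a test video to H.264; the file names, CRF, and GOP values are placeholders of our own choosing, since the exact YouTube- and Facebook-like parameters used in the experiments are not restated here.

```python
import subprocess

def reencode(src: str, dst: str, crf: int = 28, gop: int = 30) -> None:
    """Re-encode a video with FFmpeg to approximate social network transcoding.
    The codec, CRF, and GOP values are illustrative, not the paper's exact settings."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-c:v", "libx264", "-crf", str(crf), "-g", str(gop),
         "-c:a", "copy", dst],
        check=True)

reencode("test_video.mp4", "test_video_facebook.mp4", crf=30)  # hypothetical file names
reencode("test_video.mp4", "test_video_youtube.mp4", crf=26)
```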
The verification results for recompressed video are shown in Table 3. We re-encode the test videos with YouTube and Facebook coding parameters. In the open-set scenario, the accuracy of our proposed system is 74.10% and 68.55%, respectively; in the closed-set scenario, the accuracy rates are 77.60% and 75.60%; and on the mixed open- and closed-set test set, the accuracy rates are 75.20% and 71.48%. These results show that the video-level forensics method proposed in this paper is better than the block-level and frame-level methods. In addition, videos recompressed by social network platforms lose some forensic feature information during compression, making forensics on compressed video more challenging; nevertheless, the proposed system remains effective on recompressed social network videos.

6.6. Comparison with Existing Algorithms

In the final experiment, we compare our forensics system with the methods of [39,53] on the Daxing, Vision, and mixed datasets to demonstrate the competitiveness of the proposed forensic method.
Figure 4a shows the receiver operating characteristic (ROC) [54] curves of our algorithm and of the PRNU [53] and MISL [39] algorithms on the Daxing dataset, which constitutes a closed set. Our algorithm is clearly superior to the other methods. The PRNU algorithm performs very poorly on the Daxing dataset because its videos are short and the number of key frames available for computing the PRNU value is very small. Figure 4b shows the ROC curves of our algorithm and of the PRNU and MISL algorithms on the mixed Daxing and Vision datasets; again, our algorithm is clearly superior. Compared with the MISL algorithm, our algorithm performs better because our system trains the optimized network on smooth patches, which minimizes the impact of scene content on forensic feature extraction and allows noise forensic features to be extracted more fully for camera source recognition. Figure 4c shows the ROC curves on the Vision dataset, which constitutes an open set; our algorithm is again clearly superior, showing that it also performs well on untrained datasets. In addition, in Figure 4d, to further demonstrate the effectiveness of the algorithm, we conduct experiments on the SOCRatES [55] video dataset; the results show that the proposed algorithm is also effective on this dataset. Therefore, our algorithm generalizes well.
Additionally, we conduct experiments using parameters for re-encoding videos on the Facebook platform. As shown in Figure 5, after undergoing compression encoding, the performance of both our method and the baseline methods declined. Our method also demonstrates a certain level of effectiveness in source device forensics for compressed videos. However, the results of our proposed method are poorer on the Vision dataset in open-set forensic scenarios. This may be due to the fact that the Vision dataset was captured under three modes: refined, mobile, and pan-tilt-zoom, which might introduce corresponding noise during the filming process, interfering with the video source features. Furthermore, due to the interference of compression noise, the performance of both our method and the baseline methods decreased. This insight leads us to analyze the relationship between compression noise and video source features in future work, aiming to design targeted models to mitigate the interference caused by compression noise, enhance video source features, and achieve compression-robust video source identification models.

6.7. Discussion and Limitation

According to all experimental results, several insights are worth analyzing and discussing as follows:
(1)
Applicability of the model: The ultimate goal of identification is to serve practical applications. For compressed video datasets on social networks, our proposed source recognition method performs better than existing baselines, as shown in Figure 5. Compression operations can cause a loss of contextual information within videos, leading to a decline in performance for existing algorithms when handling video source recognition tasks. However, our proposed source recognition method exhibits superior performance, mainly due to two key factors: On one hand, according to Reference 1, source recognition features are primarily located in the high-frequency components of images. Our method employs a smoothing block preprocessing algorithm to minimize the interference of low-frequency semantic information with source recognition features. On the other hand, our model adopts constrained convolutional layers, further enhancing the representation of video source recognition features. By incorporating triplet loss from metric learning, we minimize the distances between samples of the same class while maximizing the distances between different classes, enabling the model to learn more discriminative feature representations.
(2)
Generalization of the model: The experiments on video source devices in forensic tasks and the comparison with existing algorithms demonstrate the generalization of the model. Table 1 shows our proposed method's superior performance in video-level forensics, and Figure 4 shows that the proposed algorithm generalizes better than baseline methods, in particular performing well on the Daxing dataset of shorter videos. This is mainly because our method employs an integral image-based optimization algorithm to extract smooth blocks, combined with constrained convolutional layers, to jointly reduce the impact of scene content on forensic features and thereby enhance source recognition features. Additionally, the proposed algorithm uses the similarity between features from different video source devices to determine whether they originate from the same camera, rather than learning a mapping between input images and output labels. Our method does not require prior knowledge of forensic features but instead exploits the rich features inherent in the video data itself, which helps the network learn the intrinsic features of the video source and thus enhances the model's generalization.
(3)
Transferability of the model: To explore the transferability of the proposed algorithm, we conduct experiments on the deepfake detection task using the proposed method. We conducted experiments using the DeepFakes [23] and Face2Face [56] datasets from the FaceForensics++ [57] dataset. Each dataset is randomly divided into training, validation, and test sets consisting of 720, 140, and 140 videos, respectively. We compare the proposed method with Xception [58] and CNNDetection [59]. The experimental results are illustrated in Table 4. Our method demonstrated a certain degree of effectiveness in detecting deepfake videos. Nevertheless, when compared to existing baseline methods, our approach falls short in terms of performance. This is primarily because our algorithm focuses mainly on the features of the video source device, whereas deep fake detection tasks emphasize the identification of generated content. To some extent, we can use the features of the video source device as one of the discriminative features for deepfakes. However, these features may be compromised by the introduction of other noises during the deep fake generation process.
In addition, Kong et al. [60] proposed an end-to-end camera source identification network based on convolutional neural networks. They also designed a dual-stream image manipulation localization framework, which can effectively extract pixel-inconsistent forged fingerprints, achieving more general and robust manipulation localization performance [60]. These studies also inspire us to consider that in-depth analysis of the relationship between video source device features and deepfake characteristics is of great significance for improving the transferability of the model. Consequently, in our future work, we will conduct a thorough analysis of the relationship between video source device features and deepfake characteristics. Additionally, we will explore the additional discriminative features that arise from the generative algorithms used in deepfake detection, distinguishing between authentic and falsified videos. This endeavor aims to further enhance the effectiveness of our method in the task of deepfake detection.

7. Conclusions

In this work, we study the problem of video source forensics using camera source attributes. The major challenge of video source forensics is the accurate identification of video source equipment in open-set forensics scenes, and the performance of existing methods in this case is not satisfactory. In order to address this issue, we propose a novel video source forensics system that can verify whether the video is taken by the same device by obtaining the similarity of original device attribute features between videos without any prior knowledge. Firstly, our system uses the smooth patches of video frame input to the optimized MISLnet CNN to extract the attribute features of video source. Then, the similarity between video blocks is calculated. Finally, block-level similarity is fused into video-level similarity to determine whether the two videos are taken by the same camera. Experiments and comparative results demonstrate that the video source forensics system proposed is effective not only on the original video but also on the compressed video downloaded from YouTube and Facebook social network platforms and can outperform the state-of-the-art methods. In the future, we will explore a more effective method to extract the attribute features of video sources on social networks that can accurately identify fake videos forged by artificial intelligence methods and, finally, realize a general video source forensics system.
In future work, we will analyze the relationship between video source forensic features and deep fake detection features, exploring the common mechanisms between the two. This will help further enhance the transferability of the proposed forensic model.

Author Contributions

Conceptualization, Y.W. and Q.S.; methodology, Y.W. and D.R.; software, D.R.; validation, Y.W. and Q.S.; formal analysis, Y.W. and D.R.; investigation, Q.S.; writing—original draft preparation, Y.W. and D.R.; visualization, Y.W. and D.R.; supervision, Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

The research presented in this paper is supported in part by the National Natural Science Foundation of China (No. 62272378); the Key Research and Development Projects of Shaanxi Province, China (No. 2022ZDLSF07-07); and the Youth Innovation Team of Shaanxi Universities, China (No. 2019-38).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pasquini, C.; Amerini, I.; Boato, G. Media forensics on social media platforms: A survey. EURASIP J. Inf. Secur. 2021, 2021, 4. [Google Scholar] [CrossRef]
  2. Diwan, A.; Sonkar, U. Visualizing the truth: A survey of multimedia forensic analysis. Multimed. Tools Appl. 2024, 83, 47979–48006. [Google Scholar] [CrossRef]
  3. Masood, M.; Nawaz, M.; Malik, K.M.; Javed, A.; Irtaza, A.; Malik, H. Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward. Appl. Intell. 2023, 53, 3974–4026. [Google Scholar] [CrossRef]
  4. Anmol, T.; Sitara, K. Video source camera identification using fusion of texture features and noise fingerprint. Forensic Sci. Int. Digit. Investig. 2024, 49, 301746. [Google Scholar] [CrossRef]
  5. Li, Y.; Ye, J.; Zeng, L.; Liang, R.; Zheng, X.; Sun, W.; Wang, N. Learning Hierarchical Fingerprints via Multi-Level Fusion for Video Integrity and Source Analysis. IEEE Trans. Consum. Electron. 2024, 70, 3414–3424. [Google Scholar] [CrossRef]
  6. Liu, Y.y.; Chen, C.; Lin, H.w.; Li, Z. A new camera model identification method based on color correction features. Multimed. Tools Appl. 2024, 83, 29179–29195. [Google Scholar] [CrossRef]
  7. Villalba, L.J.G.; Orozco, A.L.S.; López, R.R.; Castro, J.H. Identification of smartphone brand and model via forensic video analysis. Expert Syst. Appl. 2016, 55, 59–69. [Google Scholar] [CrossRef]
  8. Verdoliva, D.C.G.P.L. Extracting camera-based fingerprints for video forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 17–21 June 2019. [Google Scholar]
  9. Huang, Y.; Pan, L.; Luo, W.; Han, Y.; Zhang, J. Machine Learning-Based Online Source Identification for Image Forensics. In Cyber Security Meets Machine Learning; Springer: Berlin/Heidelberg, Germany, 2021; pp. 27–56. [Google Scholar]
  10. Zhang, K.; Liu, Z.; Hu, J.; Wang, S. An Auto-Encoder Based Method for Camera Fingerprint Compression. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  11. Zheng, H.; You, C.; Wang, T.; Ju, J.; Li, X. Source camera identification based on an adaptive dual-branch fusion residual network. Multimed. Tools Appl. 2024, 83, 18479–18495. [Google Scholar] [CrossRef]
  12. Dirik, A.E.; Sencar, H.T.; Memon, N. Source camera identification based on sensor dust characteristics. In Proceedings of the 2007 IEEE Workshop on Signal Processing Applications for Public Security and Forensics, Washington, DC, USA, 11–13 April 2007; pp. 1–6. [Google Scholar]
  13. Choi, K.S.; Lam, E.Y.; Wong, K.K. Source Camera Identification Using Footprints from Lens Aberration; SPIE: Bellingham, WA, USA, 2006; Volume 6069, pp. 172–179. [Google Scholar]
  14. San Choi, K.; Lam, E.Y.; Wong, K.K. Automatic source camera identification using the intrinsic lens radial distortion. Opt. Express 2006, 14, 11551–11565. [Google Scholar] [CrossRef]
  15. Lawgaly, A.; Khelifi, F.; Bouridane, A.; Al-Maaddeed, S. Sensor pattern noise estimation using non-textured video frames for efficient source smartphone identification and verification. In Proceedings of the 2021 International Conference on Computing, Electronics & Communications Engineering (iCCECE), Virtual, 16–17 August 2021; pp. 19–24. [Google Scholar]
  16. Cozzolino, D.; Marra, F.; Gragnaniello, D.; Poggi, G.; Verdoliva, L. Combining PRNU and noiseprint for robust and efficient device source identification. EURASIP J. Inf. Secur. 2020, 2020, 1–12. [Google Scholar] [CrossRef]
  17. Marra, F.; Poggi, G.; Sansone, C.; Verdoliva, L. A study of co-occurrence based local features for camera model identification. Multimed. Tools Appl. 2017, 76, 4765–4781. [Google Scholar] [CrossRef]
  18. Bernacki, J. Digital camera identification by fingerprint’s compact representation. Multimed. Tools Appl. 2022, 81, 21641–21674. [Google Scholar] [CrossRef]
  19. Bayar, B.; Stamm, M.C. Design principles of convolutional neural networks for multimedia forensics. Electron. Imaging 2017, 2017, 77–86. [Google Scholar] [CrossRef]
  20. Caldelli, R.; Becarelli, R.; Amerini, I. Image origin classification based on social network provenance. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1299–1308. [Google Scholar] [CrossRef]
  21. Cozzolino, D.; Verdoliva, L. Noiseprint: A CNN-based camera model fingerprint. IEEE Trans. Inf. Forensics Secur. 2019, 15, 144–159. [Google Scholar] [CrossRef]
  22. Kirchner, M.; Johnson, C. Spn-cnn: Boosting sensor-based source camera attribution with deep learning. In Proceedings of the 2019 IEEE International Workshop on Information Forensics and Security (WIFS), Delft, The Netherlands, 9–12 December 2019; pp. 1–6. [Google Scholar]
  23. Mayer, O.; Stamm, M.C. Exposing fake images with forensic similarity graphs. IEEE J. Sel. Top. Signal Process. 2020, 14, 1049–1064. [Google Scholar] [CrossRef]
  24. Mandelli, S.; Cozzolino, D.; Bestagini, P.; Verdoliva, L.; Tubaro, S. CNN-based fast source device identification. IEEE Signal Process. Lett. 2020, 27, 1285–1289. [Google Scholar] [CrossRef]
  25. Fanfani, M.; Piva, A.; Colombo, C. PRNU registration under scale and rotation transform based on convolutional neural networks. Pattern Recognit. 2022, 124, 108413. [Google Scholar] [CrossRef]
  26. Wu, H.; Zhou, J.; Zhang, X.; Tian, J.; Sun, W. Robust Camera Model Identification over Online Social Network Shared Images via Multi-Scenario Learning. IEEE Trans. Inf. Forensics Secur. 2023, 19, 148–162. [Google Scholar] [CrossRef]
  27. Bayram, S.; Sencar, H.; Memon, N.; Avcibas, I. Source camera identification based on CFA interpolation. In Proceedings of the IEEE International Conference on Image Processing 2005, Genoa, Italy, 11–14 September 2005; Volume 3, p. III–69. [Google Scholar]
  28. Amerini, I.; Caldelli, R.; Del Mastio, A.; Di Fuccia, A.; Molinari, C.; Rizzo, A.P. Dealing with video source identification in social networks. Signal Process. Image Commun. 2017, 57, 1–7. [Google Scholar] [CrossRef]
  29. Altinisik, E.; Sencar, H.T. Source camera verification for strongly stabilized videos. IEEE Trans. Inf. Forensics Secur. 2020, 16, 643–657. [Google Scholar] [CrossRef]
  30. Yang, W.C.; Jiang, J.; Chen, C.H. A fast source camera identification and verification method based on PRNU analysis for use in video forensic investigations. Multimed. Tools Appl. 2021, 80, 6617–6638. [Google Scholar] [CrossRef]
  31. Flor, E.; Aygun, R.; Mercan, S.; Akkaya, K. PRNU-based source camera identification for multimedia forensics. In Proceedings of the 2021 IEEE 22nd International Conference on Information Reuse and Integration for Data Science (IRI), Las Vegas, NV, USA, 10–12 August 2021; pp. 168–175. [Google Scholar]
  32. Bruni, V.; Tartaglione, M.; Vitulano, D. Coherence of PRNU weighted estimations for improved source camera identification. Multimed. Tools Appl. 2022, 81, 22653–22676. [Google Scholar] [CrossRef]
  33. Kouokam, E.K.; Dirik, A.E. PRNU-based source device attribution for YouTube videos. Digit. Investig. 2019, 29, 91–100. [Google Scholar] [CrossRef]
  34. Chen, S.; Pande, A.; Zeng, K.; Mohapatra, P. Live video forensics: Source identification in lossy wireless networks. IEEE Trans. Inf. Forensics Secur. 2014, 10, 28–39. [Google Scholar] [CrossRef]
  35. Kuzin, A.; Fattakhov, A.; Kibardin, I.; Iglovikov, V.I.; Dautov, R. Camera model identification using convolutional neural networks. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 3107–3110. [Google Scholar]
  36. Iuliani, M.; Fontani, M.; Shullani, D.; Piva, A. Hybrid reference-based video source identification. Sensors 2019, 19, 649. [Google Scholar] [CrossRef]
  37. Yang, P.; Baracchi, D.; Iuliani, M.; Shullani, D.; Ni, R.; Zhao, Y.; Piva, A. Efficient video integrity analysis through container characterization. IEEE J. Sel. Top. Signal Process. 2020, 14, 947–954. [Google Scholar] [CrossRef]
38. Kharrazi, M.; Sencar, H.T.; Memon, N. Blind source camera identification. In Proceedings of the 2004 International Conference on Image Processing (ICIP ’04), Singapore, 24–27 October 2004; Volume 1, pp. 709–712. [Google Scholar]
  39. Mayer, O.; Hosler, B.; Stamm, M.C. Open set video camera model verification. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–9 May 2020; pp. 2962–2966. [Google Scholar]
40. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8–14 December 2001; Volume 1, p. I. [Google Scholar]
  41. Vs, V.; Gupta, V.; Oza, P.; Sindagi, V.A.; Patel, V.M. Mega-cda: Memory guided attention for category-aware unsupervised domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4516–4526. [Google Scholar]
  42. Güera, D.; Zhu, F.; Yarlagadda, S.K.; Tubaro, S.; Bestagini, P.; Delp, E.J. Reliability map estimation for CNN-based camera model attribution. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 964–973. [Google Scholar]
  43. Bayar, B.; Stamm, M.C. Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2691–2706. [Google Scholar] [CrossRef]
  44. Mayer, O.; Stamm, M.C. Forensic similarity for digital images. IEEE Trans. Inf. Forensics Secur. 2019, 15, 1331–1346. [Google Scholar] [CrossRef]
  45. Akbari, Y.; Al Maadeed, S.; Elharrouss, O.; Ottakath, N.; Khelifi, F. Hierarchical deep learning approach using fusion layer for Source Camera Model Identification based on video taken by smartphone. Expert Syst. Appl. 2024, 238, 121603. [Google Scholar] [CrossRef]
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  47. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  48. Wang, S.; Zhang, L.; Wang, P.; Wang, M.; Zhang, X. BP-triplet net for unsupervised domain adaptation: A Bayesian perspective. Pattern Recognit. 2023, 133, 108993. [Google Scholar] [CrossRef]
  49. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742. [Google Scholar]
  50. Uher, D.; Drenthen, G.S.; Poser, B.A.; Hofman, P.A.; Wagner, L.G.; van Lanen, R.H.; Hoeberigs, C.M.; Colon, A.J.; Schijns, O.E.; Jansen, J.F.; et al. DeepFLAIR: A neural network approach to mitigate signal and contrast loss in temporal lobes at 7 Tesla FLAIR images. Magn. Reson. Imaging 2024, 110, 57–68. [Google Scholar] [CrossRef] [PubMed]
  51. Tian, H.; Xiao, Y.; Cao, G.; Zhang, Y.; Xu, Z.; Zhao, Y. Daxing smartphone identification dataset. IEEE Access 2019, 7, 101046–101053. [Google Scholar] [CrossRef]
  52. Shullani, D.; Fontani, M.; Iuliani, M.; Al Shaya, O.; Piva, A. VISION: A video and image dataset for source identification. EURASIP J. Inf. Secur. 2017, 2017, 1–16. [Google Scholar] [CrossRef]
53. Goljan, M.; Fridrich, J.; Filler, T. Large scale test of sensor fingerprint camera identification. In Proceedings of Media Forensics and Security, International Society for Optics and Photonics, San Jose, CA, USA, 19–21 January 2009; Volume 7254, p. 72540I. [Google Scholar]
  54. Mason, S.J.; Graham, N.E. Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Q. J. R. Meteorol. Soc. A J. Atmos. Sci. Appl. Meteorol. Phys. Oceanogr. 2002, 128, 2145–2166. [Google Scholar] [CrossRef]
  55. Galdi, C.; Hartung, F.; Dugelay, J.L. SOCRatES: A Database of Realistic Data for SOurce Camera REcognition on Smartphones. In Proceedings of the ICPRAM, Prague, Czech Republic, 19–21 February 2019; pp. 648–655. [Google Scholar]
  56. Thies, J.; Zollhofer, M.; Stamminger, M.; Theobalt, C.; Nießner, M. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2387–2395. [Google Scholar]
  57. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
  58. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
59. Wang, S.Y.; Wang, O.; Zhang, R.; Owens, A.; Efros, A.A. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8695–8704. [Google Scholar]
  60. Kong, C.; Luo, A.; Wang, S.; Li, H.; Rocha, A.; Kot, A.C. Pixel-inconsistency modeling for image manipulation localization. arXiv 2023, arXiv:2310.00234. [Google Scholar]
Figure 1. The overall structure of the proposed forensics system.
Figure 2. The confusion matrix of the proposed method in closed-set forensic scenarios. (a) Patch. (b) Frame. (c) Video.
Figure 3. Forensic accuracy of the video source device at different block standard deviation thresholds.
Figure 4. ROC curves of our algorithm compared with the PRNU and MISL algorithms on closed, mixture, and open datasets. (a) Closed set. (b) Mixture. (c) Open set (Vision). (d) Open set (SOCRatES).
Figure 5. ROC curves of our algorithm compared with the PRNU and MISL algorithms on videos recompressed by Facebook. (a) Closed set. (b) Mixture. (c) Open set (Vision). (d) Open set (SOCRatES).
Table 1. Verification accuracy of different fusion levels in different forensics scenarios.

Methods   Open Set   Mixture   Closed Set
Patch     73.89%     82.76%    85.33%
Frame     80.88%     88.61%    89.93%
Video     84.50%     90.70%    93.17%
Table 2. The ablation study results (AUC).

Methods     Open Set   Mixture   Closed Set
Setting A   0.849      0.890     0.903
Setting B   0.838      0.914     0.897
Setting C   0.882      0.946     0.971
Table 3. Verification accuracy of recompressed videos on social network platforms.

SNs        Methods   Open Set   Mixture   Closed Set
YouTube    Patch     67.75%     70.68%    70.04%
           Frame     72.34%     73.88%    71.93%
           Video     74.10%     75.20%    77.60%
Facebook   Patch     64.13%     67.28%    68.58%
           Frame     66.00%     68.63%    71.25%
           Video     68.55%     71.48%    75.60%
Table 4. ACC and AUC scores comparison of Deepfake detection.

Methods        Deepfakes           Face2Face
               ACC       AUC       ACC       AUC
Xception       93.73%    0.9584    90.57%    0.9372
CNNDetection   96.47%    0.9739    94.92%    0.9373
Ours           90.04%    0.9104    86.95%    0.8823