1. Introduction
Creating multimedia content such as digital images and videos for a range of platforms is now comparatively easy. Multimedia security challenges have multiplied as a result of the general availability of computing hardware and software and the ease with which digital information can be created and altered. Determining the authenticity of digital content that originates from an illegitimate or unknown source can be challenging, and such content must have its validity confirmed before consumption. Digital images are an important form of digital content [1]. A digital image is an electronic representation of a two-dimensional visual scene, object, or subject. It is a collection of individual picture elements, or “pixels”, organized in a grid; each pixel is a tiny square or dot that holds a gray value or color information and can be displayed on a screen or printed on paper. Even though digital images are a crucial type of digital content, it is quite challenging to confirm their legitimacy [2,3].
The authenticity of digital photos is increasingly questioned as modern image processing tools make it easy to alter and share images. As a result, there is a growing preference for blind image forensic techniques.
Digital image forensics is a specialized field within digital forensics that focuses on the analysis and authentication of digital images to determine their origin and integrity, as well as the presence of any alterations or forgeries. It involves using various techniques and tools to examine digital images for signs of manipulation, tampering, or other forms of digital deception. Digital image forensics experts employ methods such as metadata analysis, image compression analysis, noise pattern analysis, and error-level analysis to uncover inconsistencies and anomalies in images. This discipline is crucial in a world where digital images play a significant role in both legal and non-legal contexts, ensuring the credibility and trustworthiness of visual information in domains like criminal investigations, journalism, and the verification of digital evidence. Forensic examiners benefit from being able to determine how much processing a digital image has undergone.
Digital image forensics (DIF) aims to restore trust in digital images. DIF confirms an image’s legitimacy using image-editing fingerprints and, unlike techniques such as watermarking, requires no prior knowledge about the image [4].
General-purpose image manipulation operations are operations that do not change the semantics or meaning of an image. Rather, they are used to remove traces left by other operations, making those operations difficult to detect. Median filtering, resampling, JPEG compression, and contrast enhancement are examples of general-purpose image manipulations that must be detected as part of the digital image forensics process [5,6].
In digital image forensics, the characteristic signatures left behind by image-editing methods are used to detect image changes, and the same principle applies to the detection of general-purpose image manipulation operations. Detecting general-purpose image manipulation operators is a reasonable first step in investigating the processing history of an image that may have gone through several transformations. Image forgers use fundamental operators such as median filters, which are intrinsically nonlinear, to remove traces left behind by linear operations carried out on the images. Furthermore, the image’s history is also important in the fields of watermarking and steganography [4,7]. The research literature offers a variety of techniques for detecting fundamental operators applied to digital images. In most cases, these methods detect basic operators individually. In contrast, comparatively little effort has been put into designing procedures that detect multiple operators effectively.
The main contributions of our work are summarized as follows:
We present a method for image classification in image forensics that exploits existing domain knowledge through feature engineering.
We developed a 36-dimension feature vector based on texture features for general-purpose image manipulation detection.
We designed a system in which we replaced a CNN-based solution with an MLP-based solution. The MLP-based solution was found to perform better than the state-of-the-art methods.
Furthermore, we propose GIMP-FCN, a multilayer perceptron (MLP) consisting of fully connected layers followed by activation layers, which accepts texture-based features, learns further from them, and ultimately performs general-purpose image manipulation detection.
The performance of our approach is superior to that of the most recent and cutting-edge method.
Our work shows that a multilayer perceptron in combination with feature engineering can be employed for digital image forensics.
2. Related Work
Recent research efforts have been directed toward the development of image forensic tools that are suitable for use in a variety of contexts and can determine whether an image has been processed and, if so, in what way. Tools that were first developed for steganalysis have been repurposed to perform general-purpose image forensics.
Researchers have employed deep learning methods to solve a large number of problems in various fields. Deep learning has also found use in the field of digital image forensics, where researchers are working to solve challenges connected to the detection of image tampering. In particular, the field of basic operator forensics makes use of convolutional neural networks (CNNs).
A universal forensics technique based on steganalytic models was introduced in [8]. With universal steganalytic features, a variety of image processing operations can be treated as steganography and detected using these features. The experimental findings show that all of the examined steganalyzers perform well and that certain steganalytic approaches, such as the spatial rich model (SRM) [9] and methods based on local binary patterns (LBP) [10], perform significantly better than specialized forensic procedures. Fan et al. [11] developed a general-purpose image manipulation detector. This detector uses the Gaussian mixture model (GMM) properties of small image patches to learn image-altering fingerprints. From these fingerprints, one can determine whether or not an image has been altered.
Bayar and Stamm [12] provided a general approach for detecting basic operator manipulation. The authors trained a CNN to extract features from images automatically after suppressing image content; to do so, they introduced a new convolutional layer, the constrained convolutional layer, which is restricted to learning prediction error filters. This method allowed the authors to accurately identify four types of image-editing operations: median filtering, resampling, additive white Gaussian noise (AWGN) corruption, and Gaussian filtering.
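A prediction error filter predicts the center pixel from its neighbors and subtracts the true value. The following minimal sketch shows one way to project a filter onto that form, assuming the commonly described constraint in which the center weight is fixed at −1 and the remaining weights are normalized to sum to 1. The function names `project_to_prediction_error` and `filter_response` are hypothetical and are not taken from [12].

```python
def project_to_prediction_error(kernel):
    """Project a square, odd-sized kernel onto the assumed constraint:
    center weight = -1, all other weights sum to 1."""
    size = len(kernel)
    c = size // 2
    # Sum of every weight except the center one.
    off_center_sum = sum(
        kernel[i][j] for i in range(size) for j in range(size) if (i, j) != (c, c)
    )
    if off_center_sum == 0:
        raise ValueError("off-center weights must not sum to zero")
    projected = [[w / off_center_sum for w in row] for row in kernel]
    projected[c][c] = -1.0
    return projected


def filter_response(patch, kernel):
    """Response of the kernel at the center of an equally sized patch."""
    return sum(p * w for prow, krow in zip(patch, kernel) for p, w in zip(prow, krow))


# Arbitrary starting weights, used only for illustration.
kernel = project_to_prediction_error([[1, 2, 1], [2, 5, 2], [1, 2, 1]])
flat_patch = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]
residual = filter_response(flat_patch, kernel)  # numerically zero on a constant patch
```

Because the projected weights sum to zero, a constant image region produces a zero response, which illustrates how such a layer suppresses image content and keeps only the prediction residual.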
Mazumdar et al. [13] provided a general-purpose forensic technique based on a Siamese CNN, a network that evaluates the similarity of images. An intriguing finding is that the model recognized AWGN and gamma correction even though it had not been trained to detect these two operations.
By studying image modification traces, researchers have developed algorithms that detect targeted editing operations. This strategy has led to successful forensic algorithms, yet an issue persists: creating an individual forensic detector for each operation is difficult and time-consuming. Forensic analysts need general-purpose forensic algorithms that can recognize a wide variety of image manipulations. Bayar and Stamm [14] proposed a general-purpose approach for forensic investigations that uses convolutional neural networks (CNNs) as its primary tool. The developed model is also available for transfer-learning-based image forensics.
The authors of [15] proposed a densely connected CNN for general-purpose image forensics that is based on isotropic constraints and takes antiforensic attacks into account. By reducing the image content information, the isotropic convolutional layer functions as a high-pass filter that highlights the artifacts of image processing operations.
The CNN proposed by Yang et al. [16] includes a magnified layer as part of preprocessing. The network uses an adaptive average pooling function, derived from global average pooling, that can accommodate input images of any size; in the magnified layer, the input images are enlarged using the nearest-neighbor interpolation algorithm and then passed to the CNN model for classification. This strategy was tested using six widely used image processing operations.
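Nearest-neighbor enlargement simply repeats each pixel, so it introduces no new gray values. A minimal sketch follows, assuming an integer magnification factor; the factor used by Yang et al. is not stated here, and `magnify_nearest` is a hypothetical helper name.

```python
def magnify_nearest(image, factor):
    """Enlarge a 2-D image (list of rows) by an integer factor
    using nearest-neighbor interpolation."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    enlarged = []
    for row in image:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        enlarged.extend(list(wide_row) for _ in range(factor))
    return enlarged


small = [[1, 2], [3, 4]]
large = magnify_nearest(small, 2)
```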
Rana et al. [17] designed a multiscale residual deep convolutional neural network (MSRD-CNN) for image manipulation detection. In the preprocessing stage, an adaptive multiscale residual module first extracts the prediction error or noise features. Then, high-level image-tampering features are extracted from these noise features using a feature extraction network with several feature extraction blocks (FEBs). Finally, the resulting feature map is passed to a fully connected dense layer for classification. Although the MSRD-CNN achieves good results, it is very complex, consisting of around 76 layers, of which 26 are convolutional layers, 19 are batch normalization layers, and 17 are ReLU activation layers.
Ensemble learning is a machine learning technique that exploits model diversity to enhance predictive accuracy and robustness. It combines the outputs of multiple individual models to create a more reliable, higher-performing meta model. The key idea is that aggregating the predictions of several models reduces the risk of overfitting and helps capture complex patterns in the data; this approach was previously used in [18,19,20,21] for image forensics.
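As an illustration of combining model outputs, the sketch below uses majority voting over class labels. This is only one of several possible combination rules; the works cited above may use different ones, and the label names and the `majority_vote` helper are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    `predictions` holds one list of labels per model, all of equal length.
    Ties go to the label that appears first among the models.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for sample_labels in zip(*predictions):
        label, _count = Counter(sample_labels).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of three detectors on four image patches.
model_outputs = [
    ["original", "median", "jpeg", "awgn"],
    ["original", "median", "median", "awgn"],
    ["mean", "median", "jpeg", "awgn"],
]
ensemble_labels = majority_vote(model_outputs)
```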
Table 1 summarizes the state-of-the-art methods for general-purpose image manipulation detection, and Table 2 provides a summary of the operators studied in some of the important studies in the literature.
Deep learning methods that extract features directly from images require decisions about network topology, training methods, and other factors, and there is no universally accepted theory for selecting appropriate deep learning tools. Training can also be quite expensive because of the complexity of deep learning models, both in the time and effort required to explore and select optimal model parameters and in the amount of processing required [22]. With approaches such as feature engineering, domain specialists have a significant edge, because these approaches require considerably less effort from the researcher. Furthermore, the relationship between inputs and outputs is significantly easier to understand than in deep learning systems. The primary benefit of deep-learning-based solutions, on the other hand, is that no particular feature needs to be designed for the problem: the deep neural network extracts the desirable features without human involvement [23,24,25].
A multilayer perceptron (MLP) is characterized by an input layer, an output layer, and one or more hidden layers between them. An MLP is a feed-forward artificial neural network made up of many perceptrons. The first layer, the input layer, feeds input data into the network. The last layer, the output layer, makes predictions based on the information provided. Any number of hidden layers may lie between these two layers. Every node in the hidden and output layers is referred to as a neuron and applies a nonlinear activation function. During the forward pass, the signal travels from the input layer to the output layer by way of the hidden layers. With the universal approximation theorem, George Cybenko [26] showed that a feed-forward network with a single hidden layer, a finite number of neurons, and a nonlinear (sigmoidal) activation function can approximate continuous functions with low error.
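The forward pass described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the GIMP-FCN architecture: the layer sizes (36 inputs, 16 hidden neurons, 5 outputs), the ReLU and softmax activations, and the random weights are all assumptions chosen for the example.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, layers):
    """Propagate the signal from the input layer through the hidden layers to the output."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


rng = random.Random(0)
layer_sizes = [36, 16, 5]  # hypothetical: 36 features in, 5 classes out
layers = [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]
probabilities = forward([0.1] * 36, layers)
```

The output is a vector of class probabilities that sums to 1; in a trained network, the weights would be learned by backpropagation rather than drawn at random.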
Recently, the use of popular neural network structures has been questioned, and multilayer perceptron (MLP)-based solutions with performance similar to that of deep-neural-network-based solutions have been proposed [24]. In [27], the authors investigated the performance achievable by a neural network devoid of convolutions and offered suggestions on how the performance of fully connected networks might be improved. For image classification, the authors of [28] introduced ResMLP, an architecture built entirely from multilayer perceptrons. In [29], researchers replaced the attention layer with a simple feed-forward network. In [30], gMLP, i.e., an MLP with gating, was designed to challenge vision transformer (ViT)-based solutions; the authors also reported that gMLP compares favorably with bidirectional encoder representations from transformers, popularly known as BERT models.
Shi et al. [31] benchmarked deep learning development tools on two fully connected networks, the five-layer FCN-5 and the eight-layer FCN-8, and on the CNN-based solutions AlexNet and ResNet-50 [32] for a variety of hardware and software tool combinations. The results indicate that the FCN-based solutions performed comparably to the CNN-based solutions. Zhao et al. [33] compared CNN-, transformer-, and MLP-based solutions and found that these three network architectures are comparable in terms of the accuracy–complexity tradeoff. We drew inspiration from the works cited above to perform image forensics tasks using an MLP.
The proposed work blends image processing domain expertise with a deep learning methodology. We developed a solution that detects image modification by basic operators by integrating existing domain knowledge in digital image forensics and image steganalysis with an MLP that has nonlinearity in the form of an activation layer after each fully connected layer. In this work, a feature vector based on texture characteristics is extracted from an image. The texture characteristics are derived from the gray-level co-occurrence matrix (GLCM), the gray-level run-length matrix (GLRLM), and a normalized streak area (nsa) feature inspired by the percentage streak area (psa) developed in [34]. The GLCM provides the first 22 features, the GLRLM provides the next 11 features, and the normalized streak area provides the last 3 features, for a total of 36.
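To make the two texture matrices concrete, the sketch below builds a horizontal GLCM and a horizontal GLRLM for a tiny four-level image and computes three standard descriptors. The specific 22 GLCM and 11 GLRLM features used in this work are not listed in this section, so contrast, energy, and short-run emphasis are illustrative choices only; the nsa feature is not sketched, and all function names are hypothetical.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for pixel pairs offset by (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                counts[image[y][x]][image[ny][nx]] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def glcm_contrast(p):
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))


def glcm_energy(p):
    return sum(v * v for row in p for v in row)


def horizontal_run_lengths(image, levels):
    """GLRLM for horizontal runs: entry [g][r - 1] counts runs of gray level g with length r."""
    width = len(image[0])
    runs = [[0] * width for _ in range(levels)]
    for row in image:
        start = 0
        for x in range(1, width + 1):
            # A run ends at the row boundary or when the gray level changes.
            if x == width or row[x] != row[start]:
                runs[row[start]][x - start - 1] += 1
                start = x
    return runs


def short_run_emphasis(runs):
    total = sum(map(sum, runs))
    return sum(c / (r + 1) ** 2 for row in runs for r, c in enumerate(row)) / total


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, levels=4)
runs = horizontal_run_lengths(image, levels=4)
features = [glcm_contrast(p), glcm_energy(p), short_run_emphasis(runs)]
```

In a full pipeline, such descriptors would typically be computed for several offsets and directions and concatenated into the feature vector that is fed to the classifier.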
Next, we developed a deep neural network that can discern between original images and images produced by a range of image-editing operations. This network consists of fully connected layers with activation layers placed at strategic positions throughout its structure. Finally, we optimized the network’s performance by comparing variants of it across a very large number of optimization parameter settings. We performed many experiments to obtain a solid understanding of how well the designed system performs. The results demonstrate that, compared with the benchmark research [14] and the state-of-the-art method [17], the proposed method can effectively differentiate between unfiltered images and images processed with basic editing operators such as median filters, mean filters, additive white Gaussian noise, and JPEG compression.
The remainder of this paper is laid out as follows. In Section 3, we provide details about the proposed features and the neural network; in Section 4, we provide details about the experimental setup. Results are reported and discussed in Section 5; finally, Section 6 presents conclusions and a discussion of future work. The proposed work combines domain knowledge of image processing with a deep learning approach. First, we designed a feature vector; then, we trained a deep neural network for classification.
4. Experimental Setup
To test the performance of the proposed method in identifying various image processing operations, we performed an extensive set of experiments. The standard image datasets UCID [50], BOSSbase [51], RAISE [52], and the Dresden Image Dataset (DID) [53] were used to generate the training and testing sets for the experiments.
UCID [50] (Uncompressed Colour Image Database) is a frequently used dataset that contains 1338 color images with resolutions of 512 × 384 and 384 × 512. Its main feature is that the images are uncompressed, so they can serve as a base from which processed datasets are generated for training, testing, and benchmarking detectors. The UCID dataset, which consists of images in the TIFF format, was initially created for content-based image retrieval (CBIR). It is now used by a very wide range of image-based algorithms and is one of the primary datasets on which researchers test operator detectors.
Released in May 2011, the BOSSbase 1.1 [51] dataset (BOSS stands for Break Our Steganographic System) consists of 10,000 uncompressed images with a resolution of 512 × 512 from the BOSS competition, taken by seven different cameras. The images in the dataset were produced from full-resolution color RAW images. The BOSS dataset has been updated since its release. The BOSSbase dataset is particularly widely used with CNN-based techniques.
The Dresden Image Dataset was initially created for camera-based digital forensic methods. It is made up of over 14,000 photos taken with roughly 73 different cameras. Images from many different scenarios can be found in the dataset.
A total of 1338 images with dimensions of 512 × 384 from UCID, images with dimensions of 512 × 512 from BOSSbase, 1448 images of varying dimensions from DID, and 4000 images from the RAISE dataset were selected to construct the original image set. The original image set was then used as a base for the generation of the training and testing datasets. All images in the original set were cropped to extract image patches with dimensions of 256 × 256. Large images, such as those in DID, were cropped from the center, and multiple non-overlapping 256 × 256 patches were extracted from each; small images, such as those in UCID and BOSSbase, contributed one or two patches each.
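The cropping step can be sketched as a center crop followed by tiling into non-overlapping patches. This is a minimal sketch, assuming images are stored as lists of rows; the helper names and the synthetic 384 × 512 stand-in image are hypothetical, and the exact crop region used for large images is not specified in the text.

```python
def center_crop(image, crop_h, crop_w):
    """Return the central crop_h x crop_w region of a 2-D image (list of rows)."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def non_overlapping_patches(image, size):
    """Tile the image into size x size patches, discarding incomplete border tiles."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append([row[left:left + size] for row in image[top:top + size]])
    return patches


# Synthetic stand-in for a 384 x 512 (rows x columns) image.
image = [[(x + 3 * y) % 256 for x in range(512)] for y in range(384)]
patches = non_overlapping_patches(center_crop(image, 256, 512), 256)
```

With these assumptions, an image of this size yields two 256 × 256 patches, in line with small images contributing one or two patches each.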
All color images were converted to gray scale as per Rec. ITU-R BT.601-7 [54], which forms a weighted average of the red (R), green (G), and blue (B) components as follows:
$$\mathrm{Gray} = 0.299\,R + 0.587\,G + 0.114\,B$$
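The weighted average can be sketched as below, assuming the standard BT.601 luma weights of 0.299, 0.587, and 0.114 and 8-bit input values. Rounding the result to the nearest integer is an assumption made for this example, and `rgb_to_gray` is a hypothetical helper name.

```python
# Standard BT.601 luma weights for R, G, and B (assumed here).
BT601_WEIGHTS = (0.299, 0.587, 0.114)


def rgb_to_gray(rgb_image):
    """Convert a 2-D image of (R, G, B) tuples to 8-bit gray values."""
    wr, wg, wb = BT601_WEIGHTS
    return [
        [int(round(wr * r + wg * g + wb * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


pixels = [
    [(255, 255, 255), (255, 0, 0)],
    [(0, 255, 0), (0, 0, 255)],
]
gray = rgb_to_gray(pixels)
```

Green contributes most to the gray value and blue least, which reflects the relative sensitivity of human vision to the three primaries.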
Dataset Generation
The original gray-scale image set was used to generate the datasets for this study. To construct the median-filtered image datasets, three filter window sizes w were employed to filter the original images, generating three different datasets. Similarly, the additive white Gaussian noise (AWGN) dataset was created by adding AWGN to each original image. The JPEG-compressed dataset was created by compressing the original images with a fixed JPEG quality factor. The mean-filtered dataset was created by mean filtering each original image with a fixed filter window.
Figure 3 shows images from the various datasets generated for the study. Figure 3a shows an image from the original gray-scale image dataset. Figure 3b shows the same image median filtered with a filter window size of 3 × 3. Figure 3c shows the image after JPEG compression. Figure 3d shows the image with added AWGN. Figure 3e shows the mean-filtered image.
Table 4 summarizes the parameters used for dataset generation.
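Three of the operators used for dataset generation can be sketched in plain Python as follows. The window size, noise standard deviation, random seed, and edge-replication border handling are assumptions made for this example and are not the values used in the experiments; JPEG compression is omitted because it requires an image codec.

```python
import random
import statistics


def _window(image, y, x, size):
    """Pixels in a size x size window centered at (y, x), with edge replication."""
    half = size // 2
    h, w = len(image), len(image[0])
    return [
        image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
        for dy in range(-half, half + 1)
        for dx in range(-half, half + 1)
    ]


def median_filter(image, size=3):
    return [
        [statistics.median(_window(image, y, x, size)) for x in range(len(image[0]))]
        for y in range(len(image))
    ]


def mean_filter(image, size=3):
    return [
        [round(sum(_window(image, y, x, size)) / (size * size)) for x in range(len(image[0]))]
        for y in range(len(image))
    ]


def add_awgn(image, sigma, seed=0):
    """Add zero-mean Gaussian noise and clip the result to the 8-bit range."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


# Constant 5 x 5 image with a single bright impulse in the center.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 255

median_filtered = median_filter(image, size=3)   # impulse is removed
mean_filtered = mean_filter(image, size=3)       # impulse is smeared over its neighbors
noisy = add_awgn(image, sigma=2.0, seed=1)       # hypothetical sigma
```

The toy image highlights the difference between the two filters: the nonlinear median filter removes the impulse entirely, whereas the linear mean filter spreads it across the neighborhood.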
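Finally, results are reported in terms of the parameters defined in Equations (42)–(49) as follows:
Specificity is defined as
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
The false-positive rate (FPR) is defined as
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
The F1 score is defined as
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$
The misclassification error (Error) is defined as
$$\mathrm{Error} = \frac{FP + FN}{TP + TN + FP + FN}$$
The Matthews correlation coefficient (MCC) is defined as
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$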
The MCC takes values between −1 and 1. A value of −1 means that the predicted classes and the actual classes disagree completely, a value of 0 means that the prediction is no better than random guessing, and a value of 1 means that the predicted classes and the actual classes are exactly the same. The MCC is a more reliable statistical rate because it yields a high score only if the prediction performs well in all four confusion matrix categories [55].
In the equations above, TP denotes the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
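The named metrics can be computed directly from the four confusion matrix counts. The sketch below uses the standard textbook definitions of specificity, FPR, F1 score, misclassification error, and MCC; the counts in the example are hypothetical, and returning 0 for the MCC when its denominator is zero is a common convention assumed here.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion matrix counts."""
    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "error": (fp + fn) / total,
        # Assumed convention: MCC is reported as 0 when its denominator vanishes.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical counts for one binary detection task.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
perfect = classification_metrics(tp=50, tn=50, fp=0, fn=0)
```

Note that specificity and FPR always sum to 1, and that a perfect classifier reaches an MCC of exactly 1.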
We compared our work with two significant state-of-the-art works: those reported by Bayar and Stamm [14] and Rana et al. [17]. The network of Bayar and Stamm was implemented and trained as described by the authors. The work of Rana et al. [17] was also implemented and simulated; for a better comparison, we used the same dataset described in Equation (40).
We implemented the experiments using MATLAB 2021 [56] on a system with an Intel Core i7 processor, 16 GB of RAM, and an NVIDIA GeForce GTX 1080 graphics processing unit (GPU) with 8 GB of dedicated memory. The Deep Learning Toolbox [57] was employed to design the networks, and the Experiment Manager app [58] was used to manage the experiments and to test the models thoroughly.
6. Conclusions and Future Work
We developed a method for general-purpose image manipulation detection that combines feature engineering with a deep neural network design strategy. First, we established a set of features for image modification detection. Next, we designed a deep neural network that uses fully connected layers with activation layers at appropriate points to differentiate between original images and images processed with a variety of image alteration operations. Finally, we optimized the performance of the deep neural network by comparing variants of it across a very large number of optimization parameter settings.
To assess how well the proposed system performs, we conducted a number of experiments. The findings show that the proposed system, which employs a multilayer perceptron (MLP) trained with a 36-feature set, attained an accuracy of 99.46%. This method outperforms current deep-learning-based solutions, which achieved an accuracy of 97.89%.
In the future, research on operation detection will need to cover a larger number of operators and to adopt a more thorough approach to feature engineering that considers the extraction of larger feature sets from images. This work can also be extended by experimenting with recent innovations in deep learning models.
In comparison to state-of-the-art methods, the real-world implementation of this work is anticipated to be faster because it requires less computation and has fewer parameters. In the future, we will perform a detailed time and space complexity analysis of the system proposed above.