1. Introduction
With the rapid development of digital image processing techniques, increasingly advanced image editing software brings more convenience and enjoyment to modern life. Nevertheless, a large number of forged digital images are generated by malicious use of these techniques, which has led to a serious security and trust crisis for digital multimedia. Image forensics has therefore attracted increasing attention in the digital multimedia era, with topics such as JPEG compression forensics [1,2], median filtering detection [3], copy-move and splicing localization [4,5], universal image manipulation detection [6,7], and so on.
Image inpainting is an effective image editing technique that aims to repair damaged or removed image regions based on known image information in a visually plausible manner, as shown in Figure 1. A variety of image inpainting methods have been proposed in recent years; they can be roughly classified into three categories: diffusion-based approaches [8,9], exemplar-based approaches [10,11], and deep learning (DL)-based approaches [12,13]. Due to its effective and efficient editing ability, image inpainting has been widely applied in many image processing fields [11], such as image restoration, image coding and transmission, photo editing, and the virtual restoration of digitized paintings. However, this powerful editing tool can also be conveniently used, even by non-professional forgers, to maliciously modify an image while leaving few visible traces, which poses a serious threat to multimedia information security.
The major forensic task for image inpainting is to locate the inpainted regions of an input image, so inpainting forensics requires pixel-wise binary classification at the manipulation level, i.e., binary semantic segmentation. The goal of binary semantic segmentation is to classify the pixels in an image into two categories: foreground and background. Specifically, for the inpainting forensics task, the pixels are classified into inpainted and uninpainted pixels. This is generally more difficult than common manipulation detection, which only decides whether a certain manipulation took place.
There has been limited research on inpainting forensics until now. Some traditional forensic methods employ hand-crafted features to identify inpainted pixels. For instance, features depending on image patch similarity were extracted to detect exemplar-based inpainting [14,15], and features based on the image Laplacian were designed to identify diffusion-based inpainting [16,17]. However, the manipulation traces left by image inpainting are so weak that they are hard to reveal with manually designed features. In addition, the emerging DL-based inpainting methods can not only achieve more realistic inpainting results than traditional methods but also generate new objects, which brings greater challenges to inpainting forensics. Recently, deep convolutional neural networks (DCNNs) have achieved great success in many fields [18,19,20] thanks to their powerful learning capabilities. Inspired by these works, researchers have developed CNN-based forensics methods for tasks such as median filtering forensics [3], camera model identification [21], copy-move and splicing localization [4], and JPEG compression forensics [2]. A few research efforts have also been devoted to deep learning-based forensics for image inpainting [22,23].
DL-based methods learn discriminant features and make decisions for target tasks in a data-driven way, and thus bring a significant performance advantage on large-scale datasets. In this paper, we propose a new end-to-end network for image inpainting forensics, designed with the following factors in mind.
First, manipulation feature extraction is a key problem for DL-based inpainting forensics. Although a DCNN can directly extract features from the inpainted image through end-to-end training [22,24], it tends to learn image content rather than manipulation features [25]. To this end, preprocessing modules are constructed to enhance image manipulation traces in many deep forensic approaches [7,23,26]. However, most of these modules rely on prior knowledge and cannot be effectively applied to all inpainting methods.
In addition, down-sampling operations are inevitably used in a DCNN to extract high-level features, causing a loss of detail in image regions [27]. For dense-prediction forensics, strategies for compensating this spatial information loss must therefore be studied.
Last, cross-entropy and weighting strategies are generally selected for most deep inpainting forensics methods [23,24,26]. In practice, the loss function is extremely important for training DL-based methods, and its selection should be closely tied to the network structure and task objectives. Thus, the design of the loss function needs to be carefully considered for our method.
Based on the above considerations, this paper proposes a novel dual-stream network for image inpainting forensics, called the frequency attention-based dual-stream network (FADS-Net). The main contributions of this work are three-fold:
We develop a DCNN for image inpainting forensics following the encoder–decoder structure [18,28] to directly regress the ground-truth binary mask, which represents the pixel-wise class labels (inpainted and uninpainted). In order to capture richer inpainting clues, the encoder is designed as a dual-stream structure consisting of a raw input stream (RIS), a frequency recalibration stream (FRS), and a fusion module. The RIS, like most forensics methods, takes the original inpainted image as input, while the FRS takes image information adaptively recalibrated in the frequency domain. The extracted dual-stream features are then fused into more effective and comprehensive feature responses through a well-designed fusion module. Finally, the fused features are gradually enlarged to full resolution by the decoder, which generates the final prediction mask.
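To make the dual-stream idea concrete, the following PyTorch sketch shows one possible realization: a learnable per-frequency weighting stands in for the adaptive frequency recalibration, and the stream depths, channel widths, and 1×1 fusion layer are illustrative assumptions rather than the paper's exact modules.

```python
import torch
import torch.nn as nn

class FrequencyRecalibration(nn.Module):
    """Stand-in for adaptive frequency recalibration: reweight the image
    spectrum with a learnable mask (the 256x256 size is a placeholder)."""
    def __init__(self, channels=3, size=256):
        super().__init__()
        # one learnable weight per channel and rFFT frequency bin
        self.weight = nn.Parameter(torch.ones(channels, size, size // 2 + 1))

    def forward(self, x):
        spec = torch.fft.rfft2(x, norm="ortho")      # to the frequency domain
        spec = spec * self.weight                    # recalibrate components
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class DualStreamEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.recal = FrequencyRecalibration()
        self.ris = nn.Sequential(conv_block(3, 32), conv_block(32, 64))  # raw input stream
        self.frs = nn.Sequential(conv_block(3, 32), conv_block(32, 64))  # frequency recalibration stream
        self.fuse = nn.Conv2d(128, 128, 1)           # fusion module placeholder

    def forward(self, x):
        f_ris = self.ris(x)                          # features from the raw image
        f_frs = self.frs(self.recal(x))              # features from the recalibrated input
        return self.fuse(torch.cat([f_ris, f_frs], dim=1))
```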
We design a comprehensive up-sampling module combining transpose convolution [29], skip connections [30], and an attention mechanism [31]. Through transpose convolution, the input features are refined to enhance the feature representations for forensics while their resolution is increased. The combination of skip connections and the attention mechanism effectively fuses coarse high-level features with fine low-level features, compensating for the spatial information loss caused by the down-sampling operations in the encoder.
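A minimal sketch of such an up-sampling block is given below, assuming an attention gate in the spirit of attention U-Net; the channel sizes and gate design are illustrative, not the paper's exact module.

```python
import torch
import torch.nn as nn

class AttentionUpBlock(nn.Module):
    """Transpose convolution doubles the resolution; an attention gate
    re-weights the skip-connected low-level features before fusion."""
    def __init__(self, c_high, c_skip, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_high, c_out, kernel_size=2, stride=2)
        self.gate = nn.Sequential(
            nn.Conv2d(c_out + c_skip, c_skip, 1),
            nn.Sigmoid())                        # attention weights in [0, 1]
        self.refine = nn.Sequential(
            nn.Conv2d(c_out + c_skip, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    def forward(self, high, skip):
        up = self.up(high)                       # refined coarse features at 2x resolution
        attn = self.gate(torch.cat([up, skip], dim=1))
        return self.refine(torch.cat([up, skip * attn], dim=1))
```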
We design a new joint loss function based on the intersection-over-union (IoU) metric to train the proposed forensics network. The loss combines a proposed IoU loss with the cross-entropy (CE) loss: the IoU loss directly guides FADS-Net to optimize the IoU performance metric, while the CE loss supervises the training of the entire network and of the two feature extraction streams, ensuring stable training of the network and efficient operation of each stream.
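The sketch below illustrates one plausible form of this joint loss, assuming sigmoid outputs and equal weighting of the terms (the paper's exact weights may differ); `ris_logits` and `frs_logits` denote hypothetical auxiliary outputs of the two streams.

```python
import torch
import torch.nn.functional as F

def soft_iou_loss(logits, target, eps=1e-6):
    """Differentiable IoU loss: 1 - soft intersection over soft union.
    `target` is a float binary mask with 1 marking inpainted pixels."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def joint_loss(main_logits, ris_logits, frs_logits, target):
    """CE supervises the main output and both stream branches; the IoU
    term directly optimizes the evaluation metric on the main output."""
    ce = F.binary_cross_entropy_with_logits
    return (ce(main_logits, target) + ce(ris_logits, target)
            + ce(frs_logits, target) + soft_iou_loss(main_logits, target))
```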
The structure of this paper is as follows. Section 2 briefly reviews related work on inpainting forensics. Section 3 presents the proposed frequency attention-based dual-stream forensics network in detail. Section 4 reports a series of experiments evaluating the proposed network. Finally, Section 5 concludes this paper.
4. Experimental Results
In order to validate the proposed inpainting forensics method, we train and test the established network on image datasets built with representative inpainting techniques. Experiments are conducted on these datasets to compare our FADS-Net with state-of-the-art forensic networks in terms of localization accuracy and robustness. An ablation study is also performed to verify the major components of our network.
4.1. Training and Testing Datasets
From the MIT Places dataset [61], we randomly select 79,200 color images to set up the experimental datasets. These images are divided evenly into four datasets, each of which is tampered with by one of four inpainting methods: diffusion-based inpainting [62], exemplar-based inpainting [10], and two state-of-the-art deep learning-based methods [12,13]. The region to be inpainted is produced by a mask with random shape and location: circular, rectangular, and irregular masks are randomly generated and placed on a given image. The mask size is indicated by the tampering ratio, i.e., the ratio of tampered pixels to all pixels in an image. It is randomly set to 0.1%, 0.4%, 1.56%, or 6.25% for diffusion-based inpainting and to 1.0%, 5.0%, or 10.0% for the others; these settings reflect the fact that diffusion-based inpainting is better suited to smaller missing regions. The inpainted images with the associated binary masks form four datasets, called, for convenience, the diffusion, exemplar, ICT, and DeepfillV2 datasets after the inpainting method used. Several sample images are shown in Figure 6, with the masked regions in green.
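A possible implementation of this random mask generation is sketched below; the 256×256 default is a placeholder (the source image size is not restated here), and the random-walk scheme for irregular masks is an illustrative choice.

```python
import numpy as np

def random_mask(h=256, w=256, ratio=0.05, shape="rectangle", rng=None):
    """Binary mask covering roughly `ratio` of the image area."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=np.uint8)
    area = ratio * h * w
    if shape == "circle":
        r = int(np.sqrt(area / np.pi))                  # radius for the target area
        cy, cx = rng.integers(r, h - r), rng.integers(r, w - r)
        yy, xx = np.ogrid[:h, :w]
        mask[(yy - cy) ** 2 + (xx - cx) ** 2 <= r * r] = 1
    elif shape == "rectangle":
        side = int(np.sqrt(area))                       # square side for the target area
        y0, x0 = rng.integers(0, h - side), rng.integers(0, w - side)
        mask[y0:y0 + side, x0:x0 + side] = 1
    else:  # "irregular": a random walk of thick strokes until the area is reached
        y, x = int(rng.integers(0, h)), int(rng.integers(0, w))
        while mask.sum() < area:
            y = int(np.clip(y + rng.integers(-8, 9), 0, h - 1))
            x = int(np.clip(x + rng.integers(-8, 9), 0, w - 1))
            mask[max(0, y - 4):y + 4, max(0, x - 4):x + 4] = 1
    return mask
```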
4.2. Training Details on Synthetic Datasets
The proposed FADS-Net is implemented using PyTorch and is trained and tested on a single Nvidia GeForce RTX 3080 Ti GPU. For training, the ADAM optimizer with a batch size of 32 is adopted. The ADAM parameters β1 and β2 take their default values of 0.9 and 0.999, respectively. The learning rate is initialized to 0.001 for all layers and remains unchanged during training. Training is carried out on the constructed synthetic datasets for 100 epochs to ensure convergence.
Moreover, data augmentation is used to prevent our model from overfitting and to improve robustness. Specifically, each training image is JPEG-compressed under quality factors (QFs) of 95, 85, and 75, and corrupted by additive white Gaussian noise (AWGN) at a 30 dB signal-to-noise ratio (SNR). The processed images, together with the original, are further randomly flipped horizontally and vertically and rotated by 90 degrees before being fed into the network.
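A rough sketch of these degradations is shown below, assuming Pillow for the JPEG round-trip; the parameter values follow the text, and the same two helpers can also simulate the robustness attacks of Section 4.4.

```python
import io
import numpy as np
from PIL import Image

def jpeg_compress(img, qf):
    """Round-trip a uint8 RGB array through JPEG at quality factor qf."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=qf)
    buf.seek(0)
    return np.array(Image.open(buf))

def add_awgn(img, snr_db):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio."""
    x = img.astype(np.float64)
    noise_power = np.mean(x ** 2) / (10 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), x.shape)
    return np.clip(x + noise, 0, 255).astype(np.uint8)

def augment(img):
    """One original plus JPEG (QF 95/85/75) and 30 dB AWGN variants,
    each randomly flipped and possibly rotated by 90 degrees."""
    variants = [img] + [jpeg_compress(img, qf) for qf in (95, 85, 75)]
    variants.append(add_awgn(img, snr_db=30))
    out = []
    for v in variants:
        if np.random.rand() < 0.5:
            v = v[:, ::-1]        # horizontal flip
        if np.random.rand() < 0.5:
            v = v[::-1, :]        # vertical flip
        if np.random.rand() < 0.5:
            v = np.rot90(v)       # 90-degree rotation
        out.append(np.ascontiguousarray(v))
    return out
```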
For comparison, we choose state-of-the-art deep learning-based forensics methods: an FCN with a high-pass pre-filtering module, called HP-FCN [23], IID-Net [26], PSCC-Net [24], and our earlier work VGG-FNet [22]. These methods were retrained on our datasets, strictly following the training procedures and parameter settings introduced in their papers.
4.3. Forensic Performance on Inpainted Images without Any Distortion
Firstly, we evaluate the forensic performance of FADS-Net qualitatively and quantitatively on each testing dataset without postprocessing. Figure 7 visually displays the forensic results obtained on several sample images generated by the four typical inpainting methods. Figure 7a,b exhibit the results for the two traditional inpainting methods, i.e., diffusion-based and exemplar-based inpainting [10,62], respectively. Apparently, our method (the 7th column) achieves quite accurate predictions with little error, while the compared methods (columns 3 to 6) yield significant false positive and false negative regions, especially in the 3rd row of Figure 7b. The results illustrate that VGG-FNet [22] (the third column) produces more false negatives, while HP-FCN [23] (the fourth column) suffers more false positives. For these sample images, the predictions of IID-Net [26] and PSCC-Net [24] (the fifth and sixth columns) are only slightly better than those of the previous two methods and worse than ours.
For the two deep learning-based inpainting methods, i.e., ICT [13] and DeepfillV2 [12], the forensic results are shown in Figure 7c,d, respectively. It is noticeable that VGG-FNet [22], HP-FCN [23], and IID-Net [26] perform similarly, with significant false positive or false negative errors, e.g., in the third row of Figure 7c and the first row of Figure 7d. PSCC-Net [24] performs better but is still inferior to our model.
In principle, the forensic results of all tested methods become much worse for inpainted regions with lower tampering ratios (e.g., the first row of Figure 7a) or for uniform regions (e.g., the third row of Figure 7d). Generally, the inpainting of small or uniform regions is more realistic and leaves fewer traces, making forensics harder. In addition, the edges of inpainted regions, especially irregular ones (e.g., the third rows of Figure 7a–d), are more prone to prediction errors. Impressively, our FADS-Net obtains the forensic results (in the penultimate columns of Figure 7a–d) that best fit the ground-truth masks (in the last columns of Figure 7a–d) for inpainted regions of different shapes and scales.
Next, the forensic performance is measured by two objective metrics: IoU and F1-score. The average IoU and F1-score values obtained by the tested methods on the four testing datasets are summarized in Table 1, with the best results marked in bold.
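For reference, both metrics can be computed per image from the predicted and ground-truth binary masks as sketched below (predictions are assumed to be thresholded at 0.5 before scoring):

```python
import numpy as np

def iou_and_f1(pred, gt, eps=1e-8):
    """pred, gt: binary masks where 1 marks inpainted pixels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # true positives
    fp = np.logical_and(pred, ~gt).sum()     # false positives
    fn = np.logical_and(~pred, gt).sum()     # false negatives
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return iou, f1
```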
All compared methods achieve relatively low performance on the diffusion-based inpainting dataset, mainly because of the smaller inpainting ratios and the weak inpainting traces left. For example, PSCC-Net achieves an IoU of approximately 61.5% and an F1-score of approximately 67.2%, close to the performance of VGG-FNet. IID-Net performs similarly to HP-FCN, both reaching slightly above 70.0% IoU and 75.0% F1-score. Remarkably, FADS-Net reaches an IoU of approximately 88.07% and an F1-score of approximately 91.56%, which is clearly superior to the other methods. This performance gain likely stems from its design for mining inpainting traces and refining spatial information.
On the exemplar-based inpainting dataset, each tested method yields larger IoU and F1 values than on the diffusion dataset; for instance, the IoU of PSCC-Net increases from approximately 61.5% to 86.2%. This indicates that larger inpainted regions are easier to locate. IID-Net performs close to PSCC-Net and outperforms VGG-FNet and HP-FCN by approximately 10% in IoU and more than 8% in F1-score. Our FADS-Net is approximately 6.7% and 4.0% higher in IoU and F1-score, respectively, than the second-best IID-Net, consistent with the qualitative results above.
On the ICT dataset, PSCC-Net reaches approximately 85.7% IoU and 91.8% F1-score, and our method (91.34% IoU and 95.27% F1-score) clearly exceeds it; the other compared methods reach IoU values from 69% to 82% and F1-scores from 75% to 88%. On the DeepfillV2 dataset, all tested methods except VGG-FNet and our FADS-Net show significant performance degradation. FADS-Net once again exhibits the best performance, with a large margin of approximately 10.6% in IoU over the second-best method.
Figure 8a–c show the influence of inpainted regions of various shapes and scales on forensic performance. As Figure 8a shows, the performance of all methods drops dramatically as the scale of the inpainted region decreases; the effect is most prominent on the diffusion dataset, which again confirms the difficulty of detecting small inpainted regions. In addition, comparing the results on the diffusion dataset across Figure 8a–c, all tested methods generally yield lower IoU for irregular masks and higher IoU for circular or rectangular masks; a similar trend holds on the other three datasets. Owing to the AFRM, the dual-stream feature fusion encoder, and attention-based feature fusion, the proposed method provides the best and most stable forensic performance for masks of various shapes and scales.
4.4. Quantitative Evaluation under Typical Attacks
In practice, some postprocessing operations might be performed by forgers after inpainting to hide tampering traces and evade forensic detection. Thus, we investigate the robustness of the proposed method against JPEG compression and AWGN, two operations commonly employed in many applications.
Specifically, forensics is performed on images distorted by JPEG compression with QFs of 95, 85, and 75, and by AWGN with SNRs of 50 dB, 45 dB, 40 dB, and 35 dB. The average IoU and F1-score values obtained by the tested methods on the established datasets are reported in Table 2, Table 3, Table 4 and Table 5. Note that none of these postprocessing operations with the above parameter settings are used to create our training datasets. The best results are marked in bold.
As shown in Table 2, all tested methods experience significant performance degradation as QF or SNR decreases. For instance, FADS-Net achieves an IoU of 86.03% and an F1-score of 89.76% at QF = 95, slightly lower than without attacks, but only approximately 73% IoU and 77% F1-score at QF = 75. Note that some forensic networks, e.g., VGG-FNet [22], are relatively insensitive to JPEG compression, since they already obtain somewhat lower IoU and F1-score on undistorted images. For AWGN with SNR decreasing from 50 dB to 35 dB, the IoU and F1-score of FADS-Net drop from 94.54% to 92.45% and from 97.10% to 95.51%, respectively. This degradation of approximately 2% is matched by a similarly slight drop for the compared methods. These results reveal that all tested methods are more robust against AWGN than against JPEG compression. The main reason is that JPEG compression removes content-irrelevant information while preserving image quality, so some inpainting traces are further masked in the process. Similar observations can be made from Table 3, Table 4 and Table 5 on the other three datasets. Impressively, our model outperforms the other methods significantly across datasets and attack parameters; for example, FADS-Net outperforms the second-best PSCC-Net [24] by nearly 10.0% in IoU and 8.0% in F1-score at QF = 75 in Table 4. This indicates that our forensic method captures inpainting traces more effectively.
4.5. Ablation Analysis
We perform ablation experiments to investigate the effects of the two feature extraction streams (RIS and FRS), the dense-scale feature fusion module (DFFM), the locality-sensitive attention module (LSAM), and the IoU-aware joint loss (IJL). For this purpose, we construct the following variants of our full model (FADS-Net).
RISS-Net: This variant is a single-stream encoder network that takes the original inpainted image as input. Because the encoder becomes single-stream, the DFFM is removed, but the LSAM in the decoder is retained. The network is trained with a hybrid loss combining CE loss and IoU loss.
FRSS-Net: This network is configured identically to RISS-Net, except that its single-stream encoder employs the FRS.
DSCF-Net: The architecture is the same as that of our full model, but it discards the DFFM and fuses features by simple concatenation.
DSAF-Net: This variant uses the full model's encoder and loss function, but its decoder recovers spatial information by element-wise addition instead of the LSAM.
FADS-Net (MCL): This network has the same structure as the full model but removes the two branch decoders and uses only the CE loss between the main decoder output and the ground-truth label.
FADS-Net (JCL): CE loss is applied to the outputs of both the main decoder and the branch decoders of the full model during training.
FADS-Net (MHL): This network removes the two branch decoders from the full model and is trained with a hybrid loss combining CE loss and IoU loss.
All variants are trained on the DeepfillV2 dataset with the same training options as the full model. The average quantitative results, under no extra distortion, JPEG compression with QF = 75, and AWGN with SNR = 35 dB, are listed in Table 6. The best results are marked in bold.
As shown in Table 6, with one feature extraction stream removed, the single-stream networks achieve only about 89% IoU and approximately 94% F1-score without distortion, averaged over the whole testing dataset, and obtain lower scores under attacks, particularly JPEG compression. Their performance is much worse than that of our full model, but still competitive with or superior to the state-of-the-art models compared above. This shows that the high-resolution structure of the encoder and the efficient feature extraction module MSDM benefit forensic performance. In addition, FRSS-Net improves significantly over RISS-Net under JPEG compression, indicating that learning in the frequency domain plays a key role in enhancing inpainting traces.
The performance is further improved by the two variants with dual-stream feature extraction. Although DSCF-Net and DSAF-Net discard the DFFM and LSAM, respectively, they still exceed the two single-stream variants by approximately 2.5% to 3.5% in average IoU and F1 scores. These results imply that dual-stream feature fusion effectively improves inpainting forensics. Comparing these two variants with the full model reveals the contributions of the DFFM and LSAM: for example, the full model outperforms DSCF-Net by approximately 1.1% in IoU at SNR = 35 dB and yields an even larger margin over DSAF-Net at QF = 75.
Analyzing the remaining three variants shows that the full model performs best regardless of whether the tested images undergo attacks. Comparing the full model with FADS-Net (MCL), the IoU-aware joint loss brings gains of approximately 2.0% in average IoU and 2.1% in average F1; in particular, the full model is approximately 4.5% higher than the variants without the IJL at QF = 75. Thus, we can confirm that the IoU-aware joint loss drives the network to focus on the inpainted regions more than CE loss alone and ensures that the two-stream feature extraction works effectively.
Overall, the ablation results show that every component yields a performance improvement and contributes to the overall performance.
5. Conclusions
In this paper, a novel deep learning method for image inpainting forensics, called FADS-Net, has been presented. To locate regions tampered with by inpainting, FADS-Net follows the encoder–decoder structure. The encoder is a dual-stream network composed of an adaptive frequency recalibration module, two feature extraction sub-networks, and a feature fusion module. The two streams efficiently extract feature maps from the original input and from the input recalibrated by the adaptive frequency recalibration module. These features are then fully fused by the feature fusion module into more comprehensive and effective representations. By introducing the attention mechanism, the decoder restores more spatial information while increasing feature resolution. Finally, we propose an IoU-aware joint loss to guide the training of FADS-Net, in which the IoU loss term takes the forensic performance metric as the optimization objective and the CE loss ensures stable training and the effectiveness of each component.
FADS-Net has been extensively tested on various images inpainted by several typical methods and compared with state-of-the-art forensics methods. Qualitative and quantitative results show that the proposed network locates inpainted regions more accurately and achieves superior performance in terms of IoU and F1-score. Moreover, our network shows excellent robustness against commonly used post-processing, including JPEG compression and AWGN.