Communication

The Circular U-Net with Attention Gate for Image Splicing Forgery Detection

1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450001, China
2 International College, Zhengzhou University, Zhengzhou 450000, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2023, 12(6), 1451; https://doi.org/10.3390/electronics12061451
Submission received: 10 February 2023 / Revised: 8 March 2023 / Accepted: 16 March 2023 / Published: 19 March 2023
(This article belongs to the Special Issue Advanced Techniques in Computing and Security)

Abstract: With the advent and rapid development of image tampering technology, image forgery has become harmful to many aspects of our society, so image tampering detection has become increasingly important. Although current forgery detection methods have achieved some success, the scales of the tampered areas differ from one forgery image to another, and previous methods do not take this into account. In this paper, we argue that the inability of a network to accommodate tampered regions of various sizes is the main reason for low precision. To address this problem, we propose a neural network architecture called CAU-Net, which adds residual propagation and feedback, an attention gate and Atrous Spatial Pyramid Pooling with CBAM to the U-Net. The Atrous Spatial Pyramid Pooling with CBAM captures information at multiple scales and adapts to differently sized target areas. In addition, CAU-Net alleviates the vanishing gradient issue and suppresses the weights of untampered regions, and because CAU-Net is an end-to-end network without redundant image processing, it detects suspicious images quickly. Finally, we optimize the proposed network structure through an ablation study, and the experimental results and visualizations demonstrate that our network performs better on CASIA and NIST16 than state-of-the-art methods.

1. Introduction

With the advent and wide use of image forgery software in recent years, it has become increasingly easy for people to modify and even fabricate the content of images. Image forgery has a negative impact on many aspects of our lives, such as academic fraud and fake images, and these phenomena draw our attention to image forgery technology. There are many image forgery techniques [1,2,3,4], such as compositing, enhancement and retouching, but in general they can be divided into three categories: copy-and-paste forgery, splicing forgery and removal forgery. Different types of tampering are not detected in the same way. In this paper, we focus specifically on splicing forgery detection. Image splicing forgery copies a part of one image into another image to compose a new, tampered image. As Figure 1 shows, a deer from an unknown image is copied and pasted into Figure 1a (host image) to produce a new image, Figure 1b (forgery image). Figure 1c (ground truth) indicates the tampered regions.
To address the splicing forgery issue, many related methods have been proposed. They can be divided into traditional approaches and deep learning-based approaches. The traditional methods mostly depend on a specific feature that can highlight the differences between untampered and tampered areas, such as Color Filter Array (CFA) artifacts [5] and noise inconsistency [6]. However, these hand-designed features have limitations and lack representativeness, and these methods are not robust to various attacks. Recently, the convolutional neural network (CNN) has achieved great success in computer vision. Its ability to extract image features adaptively has led increasing numbers of researchers to apply it to image tampering detection, so many CNN-based image forgery detection methods have been proposed in recent years [7,8,9,10,11,12]. Although these approaches achieve strong performance, they pay more attention to learning the inconsistency between untampered and tampered areas while ignoring the different sizes of tampered areas across forgery images. Since targets of different sizes need different receptive fields to be separated from the background relatively easily, we believe that the inability of a network to accommodate forgery regions of various sizes and capture forgery features at multiple scales is the main reason for low precision. Previous work (e.g., DeepLab v2 [13]) proposed the Atrous Spatial Pyramid Pooling (ASPP) module, which obtains multi-scale information and is useful for segmenting objects of different scales. Benefiting from this, we introduce the ASPP module into the forgery detection network to improve its ability to segment tampered regions of different scales. However, simply adding an ASPP module is not enough.
To address the above issues, we propose a network called the Circular U-Net with Attention Gate (CAU-Net). The network builds on [9] and incorporates the attention gate of [14]. Meanwhile, we introduce ASPP combined with the Convolutional Block Attention Module (CBAM) [15], which we call the CBASPP module, into our network to capture information at multiple scales. We optimize the proposed network structure through an ablation study, and our network performs better on NIST16 [16] and CASIA [17] than previous methods. In summary, the key contributions of our work are as follows.
  • We modify the ASPP module and apply the resulting CBASPP module to enhance the detection performance of CAU-Net. The CBASPP module samples the input feature map in parallel with dilated (atrous) convolutions at different sampling rates and then concatenates the results to expand the number of channels. Thus, the CBASPP module captures the context of an image well at multiple scales.
  • We introduce an ingenious module between the corresponding encode and decode layers called an attention gate. It enlarges the weights of tampered regions while shrinking those of untampered regions, which enables the network to obtain better results.
  • We use ResNet with Efficient Channel Attention (ECA) [18] instead of the regular ResNet to improve tampering detection performance without additional parameters.

2. Related Work

In the following, we briefly review related work and explain how our ideas build on it. Since it was discovered that forgery features can be extracted from the noise stream, an increasing number of forgery detection approaches obtain information from both the noise stream and the RGB stream. Zhou et al. [8] proposed a two-stream network comprising an RGB stream and a noise stream generated by the SRM filter [19]. Hu et al. [20] also use a two-stream structure; in [20], features are fused at an early stage, whereas [8] concatenates features at a later stage. However, when the spliced regions of a forgery image are taken from the same type of camera, the noise information is consistent and the noise stream becomes useless. Several approaches (e.g., MFCN [7], MVSS [11], ET [12]) learn the forgery edge with deep learning. However, gradient degradation arises as the network gets increasingly deep, so the effect of edge learning in deep networks is weak. RRU-Net [9] and MCNL-Net [10] proposed the ringed residual U-Net to solve the gradient degradation issue. Recently, forgery detection networks such as PSCCNet [21] and ObjectFormer [22] have been proposed to localize tampered regions.
Although these methods achieve promising results, most of them overlook the variable scales of the tampered area, which leads to low precision. Our work is closer to RRU-Net, as we apply the CBASPP module to capture multi-scale information and introduce some attention modules to suppress the weight of untampered regions. These modules are described in detail in the following.

3. Network Architecture Overview

We elaborate our network (CAU-Net) in this section. The structure of the proposed method is shown in Figure 2, and we discuss its modules one by one in the following subsections. The purple ring structure represents the residual module, which is composed of residual propagation and residual feedback; these two mechanisms are introduced in Section 3.1 and Section 3.2, respectively. The encode and decode layers are combined by the attention gate instead of being simply summed, which removes redundant information; the details are described in Section 3.3. The ringed structure is downsampled by max pooling in the encode layers and upsampled by transposed convolution in the decode layers. The CBASPP module is applied in the bottom layer to obtain more feature information at multiple scales, and details of its structure are presented in Section 3.4. Finally, we present the structure of the ECA module in Section 3.5.

3.1. Residual Propagation

For image forgery detection, the differences in essential image properties between host images and copied regions are crucial: using these differences, we can detect and locate tampered areas. However, as the network gets deeper, essential image properties fade away, and the minor differences between tampered and untampered regions disappear with them. To address this problem, we introduce a classical mechanism, residual propagation [23], into each component block. The component block, shown in Figure 3, consists of two convolution blocks together with residual propagation. The residual propagation mechanism is defined in Equation (1):
$y_f = F(x, W_i) + M(x)$,  (1)
where $x$ and $y_f$ represent the input and output of the component block, respectively, as shown in Figure 3, and $W_i$ denotes the parameters of layer $i$. The function $F(x, W_i)$ is learnable; in this paper it represents the feature map after two convolutional layers and a ReLU. The function $M$, a learnable linear mapping layer, alters the dimension of the input $x$ to match that of $F(x, W_i)$ so that the two can be added together. Compared to ordinary networks, the residual network introduces a skip connection. Benefiting from this, more information from the previous residual block can flow unimpeded to the next one, which improves the information flow and allows the differences in essential image properties to propagate through the network. Moreover, this avoids the vanishing gradient problem caused by excessive network depth.
The mechanism resembles the human brain's recall mechanism. When we learn a large amount of new knowledge, we may forget part of what we knew before; at that point we need a recall mechanism to help us remember what we learned earlier, and so does the network, which uses the residual propagation module to keep crucial information from being forgotten. Meanwhile, there is no doubt that if the differences in essential image properties between host images and copied regions do not disappear, the performance of the network is greatly improved.
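As a concrete illustration, the following is a minimal PyTorch sketch of a component block implementing Equation (1); the class name, channel arguments and the 1 × 1 convolution standing in for the mapping $M$ are illustrative assumptions rather than the exact CAU-Net configuration.

```python
import torch
import torch.nn as nn

class ResidualPropagation(nn.Module):
    """Minimal sketch of Equation (1): y_f = F(x, W_i) + M(x)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # F(x, W_i): two 3x3 convolutions with a ReLU in between.
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # M(x): learnable linear mapping (1x1 conv) that matches the
        # dimension of x to F(x, W_i) so the two can be added.
        self.mapping = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection: information from the previous block flows
        # unimpeded to the next block, alleviating vanishing gradients.
        return self.body(x) + self.mapping(x)
```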

3.2. Residual Feedback

According to the explanation above, it is obvious that if we can enhance the differences between untampered and tampered regions, the detection accuracy of the network will improve. The method proposed in [8] uses the SRM filter to strengthen the noise feature. This has some effect, but it is useful only for RGB image tampering detection; moreover, when the tampered and untampered areas come from cameras of the same brand, the SRM filter does not work well because the noise is consistent. In addition, residual propagation alone is not enough for the network to learn more differences in essential image properties. To further strengthen these differences, this paper introduces the residual feedback block [9], an auto-learning mechanism, to further improve the performance of the network. A simple and useful attention mechanism is used in the residual feedback block to pay more attention to the input feature map; it exploits the properties of attention to avoid the loss of key information and the redundancy of non-crucial information. The weights acquired from the input feature map are used to enlarge the differences in essential image properties between tampered and untampered regions. Within the component block, the residual feedback method is shown in Figure 4, and its definition is presented in Equation (2):
$y_b = (s(W(y_f)) + 1) \times x$,  (2)
where $x$ and $y_b$ represent the input of the component block and the enhanced input, respectively, and $y_f$ is the output of the residual propagation step in Equation (1). $W$ represents a linear mapping that changes the dimension of $y_f$, and $s$ represents the sigmoid activation function.
Unlike residual propagation, residual feedback is more interested in the differences between untampered and tampered regions in the input feature map. It suppresses the weights of untampered areas while amplifying the weights of tampered areas, so our network can easily segment out the tampered areas. The combination of residual propagation and residual feedback not only makes the network perform better but also shortens its training time. In summary, residual feedback has a meaningful effect (a code sketch follows the list below):
  • It enhances positively labeled features while suppressing negatively labeled features, making the differences in essential image properties between tampered and untampered regions increasingly large.
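The following is a sketch of Equation (2) in the same style; the use of a 1 × 1 convolution for the linear mapping $W$ and the channel arguments are our assumptions.

```python
import torch
import torch.nn as nn

class ResidualFeedback(nn.Module):
    """Sketch of Equation (2): y_b = (s(W(y_f)) + 1) * x."""

    def __init__(self, feat_ch: int, in_ch: int):
        super().__init__()
        # W: learnable linear mapping from the channels of y_f back to
        # the channel dimension of the block input x (assumed 1x1 conv).
        self.mapping = nn.Conv2d(feat_ch, in_ch, kernel_size=1)

    def forward(self, x: torch.Tensor, y_f: torch.Tensor) -> torch.Tensor:
        # s(W(y_f)) lies in (0, 1); adding 1 yields weights in (1, 2),
        # amplifying tampered-region responses without erasing the input.
        weights = torch.sigmoid(self.mapping(y_f)) + 1.0
        return weights * x
```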

3.3. Attention Gate

In the subsections above, we introduced two modules, residual propagation and residual feedback, to improve the performance of our network. However, they only work between the component blocks, while each corresponding pair of decode and encode layers is combined by simple summation, which carries a great deal of redundant information. If this redundant information is not eliminated, the difference between tampered and untampered areas cannot be well represented. To address this issue, we introduce the attention gate block [14] between the encode and decode layers at the same level; the details are shown in Figure 5.
The inputs $g$ and $x_l$ are transformed by two convolutional kernels $W_g$ and $W_x$ to obtain A and B, respectively. A and B are then added to get C, which is passed through a ReLU operation and the convolutional kernel $\Psi$ to obtain E; finally, the attention coefficients ($\alpha$) are obtained by a sigmoid activation function and resampling. Multiplying $\alpha$ with $x_l$ puts the attention on the tampered regions.
This mechanism is similar to the human eye: in daily life, we pay more attention to areas that interest us, and in the forgery detection task, the attention mechanism makes our network more interested in tampered regions. The mechanism uses soft rather than hard attention. Soft attention is differentiable, so a neural network can compute gradients and learn the attention weights through forward propagation and backward feedback, which is used to learn more important features. The mechanism is also useful for identifying small tampered regions, because the attention gate does not focus on the whole image but operates on small local tampered regions. In summary, after the input feature map is weighted by the attention coefficients, the tampered-region weights are amplified and the untampered-region weights become smaller, so the accuracy of the network is improved.
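A minimal sketch of the attention gate, following the structure of Figure 5 and [14]; the intermediate channel count and the assumption that $g$ and $x_l$ arrive at the same spatial resolution are ours, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Sketch of the attention gate of Figure 5 (after Oktay et al. [14])."""

    def __init__(self, g_ch: int, x_ch: int, inter_ch: int):
        super().__init__()
        self.W_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)  # gating signal -> A
        self.W_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)  # skip features -> B
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)     # Psi -> E

    def forward(self, g: torch.Tensor, x_l: torch.Tensor) -> torch.Tensor:
        a = self.W_g(g)                     # assumes g and x_l share H x W
        b = self.W_x(x_l)
        c = F.relu(a + b)                   # C = ReLU(A + B)
        alpha = torch.sigmoid(self.psi(c))  # attention coefficients
        # Resample alpha to the skip-connection resolution if needed.
        alpha = F.interpolate(alpha, size=x_l.shape[2:], mode="bilinear",
                              align_corners=False)
        return alpha * x_l                  # suppress untampered regions
```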

3.4. Atrous Spatial Pyramid Pooling with CBAM

Atrous Spatial Pyramid Pooling (ASPP) was first proposed in DeepLab v2 [13] to acquire multi-scale information and better segment objects at different scales. Moreover, CBAM pays more attention to identifying target regions. Inspired by these, we combine the ASPP module with the CBAM module into the CBASPP module; its exact structure is shown in Figure 6.
Dilated convolutions with dilation rates of 4, 8 and 12 and a kernel size of 3 × 3 are used; the local features of the previous layer are associated with a wider field of view to prevent small target features from being lost during information transfer. As shown in Figure 6, from top to bottom, the first branch is a convolution with a 1 × 1 filter, which maintains the original receptive field. The second to fourth branches are dilated convolutions with different dilation rates, which extract features at different receptive fields. The fifth branch applies global average pooling to the input to obtain global features. Finally, the feature maps of the five branches are stacked along the channel dimension, and the information at different scales is fused by a 1 × 1 convolution to get the new feature map $F$. The feature map $F$ (H × W × C) is then processed by the Channel Attention Module (CAM) and the Spatial Attention Module (SAM) to obtain the final feature map $M_s$. The detailed process is shown in Figure 6, and the formula of CAM is given in Equation (3):
$M_c(F) = \sigma(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max})))$,  (3)
where $\sigma$ represents the sigmoid function, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$, and $r$ is the reduction ratio. $W_0$ and $W_1$ are the MLP weights, shared for both inputs, and a ReLU activation function follows $W_0$. The formula of SAM is given in Equation (4):
$M_s(F) = \sigma(f^{7 \times 7}([F^s_{avg}; F^s_{max}]))$,  (4)
where $\sigma$ represents the sigmoid function and $f^{7 \times 7}$ is a convolution operation with a kernel size of 7 × 7. $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$ are the 2D feature maps obtained by two pooling operations along the channel axis.
Since the scale and location of the tampered areas differ from one forgery image to another, forgery detection is difficult. The CBASPP module above enlarges the receptive field and enriches feature information by sampling in parallel with dilated convolutions at multiple rates, and its image-level features efficiently capture global contextual information. By considering these contextual relationships, the module avoids segmentation errors caused by getting trapped in local features and improves the accuracy of image forgery detection. Moreover, it aggregates contextual information at multiple scales and enhances the network's ability to identify tampered regions of different sizes.
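The following sketch assembles the CBASPP module described above: five parallel branches (1 × 1, dilation rates 4/8/12, global pooling), a 1 × 1 fusion, then CBAM as in Equations (3) and (4). Channel counts, the reduction ratio and the exact fusion placement follow common ASPP/CBAM implementations and are assumptions where the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Channel (Eq. 3) and spatial (Eq. 4) attention after Woo et al. [15]."""

    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                  # shared W_0, W_1 (1x1 convs)
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        m_c = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(f, 1)) +
                            self.mlp(F.adaptive_max_pool2d(f, 1)))
        f = m_c * f
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        m_s = torch.sigmoid(self.spatial(torch.cat(
            [f.mean(dim=1, keepdim=True), f.max(dim=1, keepdim=True)[0]],
            dim=1)))
        return m_s * f

class CBASPP(nn.Module):
    """Sketch of the CBASPP module of Figure 6."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)      # keep original field
        self.branches = nn.ModuleList([                  # dilated 3x3 branches
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in (4, 8, 12)])
        self.pool = nn.Sequential(                       # global-context branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = nn.Conv2d(5 * out_ch, out_ch, 1)     # fuse five branches
        self.cbam = CBAM(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        feats = [self.branch1(x)] + [b(x) for b in self.branches]
        feats.append(F.interpolate(self.pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        f = self.fuse(torch.cat(feats, dim=1))           # new feature map F
        return self.cbam(f)                              # M_c then M_s
```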

3.5. Efficient Channel Attention

ECA [18] is an efficient channel attention module that adopts a local cross-channel interaction strategy without dimensionality reduction, effectively avoiding the effect of dimensionality reduction on channel attention learning. In this paper, we apply the ECA module to the residual propagation to enhance channel features and improve the performance of the forgery detection network. The details are shown in Figure 7.
After two convolution operations with a kernel size of 3 × 3, we obtain aggregated features of size 1 × 1 × C by global average pooling (GAP). The ECA module then produces the channel weights by performing a fast one-dimensional convolution of size $k$, where $k$ is determined adaptively by a mapping of the channel dimension $C$, and $\sigma$ represents the sigmoid activation function.
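A sketch of the ECA re-weighting, following the adaptive kernel-size mapping of [18]; the constants $\gamma = 2$ and $b = 1$ come from the ECA-Net paper, and the module is shown standalone rather than wired into the residual block.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of Efficient Channel Attention [18]: a 1D convolution of
    adaptive size k over the globally pooled channel descriptor."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # k is determined adaptively from the channel dimension C.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                    # force k odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global average pooling -> (N, 1, C) for the fast 1D convolution.
        y = x.mean(dim=(2, 3))                       # (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)     # local cross-channel mix
        w = torch.sigmoid(y).unsqueeze(-1).unsqueeze(-1)
        return x * w                                 # re-weight the channels
```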

4. Experiments

In the sections above, we presented the specific structure and design ideas of our method and analyzed the effectiveness of its modules from a theoretical perspective. To further demonstrate its performance, we conducted comprehensive experiments. The goal of our work is to detect whether a target image is tampered at the pixel level and to locate the tampered regions. Below, we present the details of our experiments: we introduce the experimental setup, such as the dataset and evaluation metrics, in Section 4.1 and then present results compared with previous methods in Section 4.2. Next, we perform an ablation study to demonstrate the effectiveness of each module in Section 4.3. Then, we perform a robustness experiment in Section 4.4. Finally, we show visualizations of the detection results predicted by various methods in Section 4.5.

4.1. Experimental Settings

Experimental Dataset: There are now many tampered-image datasets on the internet. After some consideration, we chose the CASIA [17] and NIST Nimble 2016 (NIST16) [16] datasets for our experiments. CASIA contains two tampering types, copy-and-paste forgery and splicing forgery, and all images in this dataset are artificially produced, complex, realistic and not easily judged by the human eye; using it as part of our dataset makes our work more practical and better demonstrates network performance. NIST16 includes copy-and-paste forgery, splicing forgery and removal forgery, and its tampering is post-processed to hide visible traces. CASIA has 5610 images with sizes from 240 × 600 to 800 × 600. NIST16 has 564 images, and their ground-truth masks are available for evaluation. To guarantee the authenticity of the experiment, we randomly selected 10% of each dataset as the test set, 10% as the validation set and 80% as the training set. We resized all images to a uniform size of 256 × 384. Meanwhile, we kept both TIFF and JPG images in the dataset to make our network more widely applicable.
Experimental Metrics: Appropriate evaluation metrics reflect the performance of a model well. For image tampering detection, the crucial evaluation is the accuracy of locating the forgery areas at the pixel level. In this experiment, we follow the evaluation metrics used in previous related works [22], namely the F1 score and the Area Under Curve (AUC) score. The F1 score combines recall and precision to measure the performance of the network. The AUC is the area under the ROC curve; ROC curves are generally used to evaluate the classification performance of a classifier. In tampering detection, we classify each pixel of an image as tampered or untampered, so we also use AUC as an experimental metric.
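To make the protocol concrete, here is a sketch of how pixel-level F1 and AUC could be computed for a single image with scikit-learn; the 0.5 binarization threshold and the helper name are assumptions, not the paper's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def pixel_metrics(pred_prob: np.ndarray, gt_mask: np.ndarray, thr: float = 0.5):
    """Pixel-level F1 and AUC for one predicted tampering map."""
    prob = pred_prob.ravel()                  # predicted tampering probability
    gt = (gt_mask.ravel() > 0).astype(int)    # 1 = tampered pixel, 0 = authentic
    f1 = f1_score(gt, (prob >= thr).astype(int))
    auc = roc_auc_score(gt, prob)             # area under the ROC curve
    return f1, auc
```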
Compared Methods: In order to test the performance of the proposed method and to ensure the authenticity of the experiment, we chose some previous excellent forgery detection approaches as compared methods, which are RGB-N [8], RRU-Net [9], MCNL-Net [10], PSCCNet [21] and ObjectFormer [22]. We describe them in detail below.
RGB-N adopts two streams, which are the RGB stream and noise stream, in parallel to detect the forgery features and noise inconsistency within an image, respectively.
RRU-Net is an end-to-end image segmentation network for image forgery detection. It uses residual propagation and residual feedback to enhance the capacity of feature extraction and detects tampered images without any pre-processing or post-processing.
The structure of MCNL-Net is similar to that of RRU-Net. It adds the BAM module and the MaxBlurPool module on top of RRU-Net and uses convolution kernels of different sizes for better feature extraction.
PSCCNet performs image tampering localization in a step-by-step manner from coarse to fine.
ObjectFormer models the visually inconsistent information at the object level for tampering detection with the advantage of a visual transformer.
Implementation Details: All images are resized to 256 × 384 before entering the network for training. Our network and the other detection methods were run on a server with an NVIDIA GeForce RTX 2080 Ti GPU. RRU-Net, MCNL-Net and CAU-Net were implemented in PyTorch 1.8.2. The details of the training process of our network are as follows: we used stochastic gradient descent (SGD) as the optimizer with random weights as initial parameters, a batch size of 8, a momentum of 0.9 and a weight decay of 0.0005. For the first 80 epochs, we set the learning rate to 0.01; up to epoch 120 we reduced it to 0.005, and after 120 epochs we set it to 0.001.
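The optimizer and stepped learning-rate schedule described above could be set up as follows; the placeholder model and the total epoch count of 160 are our assumptions, as the paper does not state them.

```python
import torch
import torch.nn as nn

# Stand-in network; in practice this would be the CAU-Net instance.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# SGD with momentum 0.9 and weight decay 0.0005, as described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)

def lr_for_epoch(epoch: int) -> float:
    """Stepped schedule: 0.01 -> 0.005 -> 0.001."""
    if epoch < 80:       # first 80 epochs
        return 0.01
    if epoch < 120:      # epochs 80-119
        return 0.005
    return 0.001         # after 120 epochs

for epoch in range(160):  # total epoch count is an assumption
    for group in optimizer.param_groups:
        group["lr"] = lr_for_epoch(epoch)
    # ... forward/backward passes over the training set (batch size 8) ...
```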

4.2. Compared Detection Methods

To evaluate the practical effects of the method proposed in this paper, we chose SPAN, RGB-N, RRU-Net, MCNL-Net, PSCCNet and ObjectFormer as the compared methods in this experiment. In the above section, we presented the experimental metrics. In order to ensure the fairness of the experiments, the parameters of the above comparison experiments have been tuned to be optimal, and the best detection results of various methods are taken for comparison.
We report the F1 score and AUC (%) in Table 1, from which we can observe that our network performs best on the CASIA dataset with a 58.3% F1 score and 88.4% AUC. On the NIST16 dataset, our method outperforms ObjectFormer by 11.9% in terms of the F1 score, although its AUC on NIST16 is slightly lower than those of PSCCNet and ObjectFormer. Overall, the proposed network performs better than the other related approaches.

4.3. Ablation Study

The attention gate module removes redundant information between the corresponding encode and decode layers, the CBASPP module allows the network to extract image information at multiple scales, and the ECA module uses a 1D convolution to achieve cross-channel information interaction. In the sections above, we analyzed the effectiveness of these three modules theoretically. To evaluate them empirically, the ablation study compared the model with the attention gate removed, with the CBASPP module removed, with the ECA module removed and with all three removed, evaluating forgery detection performance on CASIA.
The experimental results are listed in Table 2. Without the attention gate, CBASPP and ECA modules, the F1 score and AUC decreased by 13.1% and 8.6%, respectively, on the CASIA dataset. For the network without only the attention gate module, the F1 score and AUC decreased by 5.1% and 4.1%; without the CBASPP module, by 6.8% and 7.1%; and without the ECA module, by 3.8% and 7.5%. The degradation of the detection results verifies that these modules effectively improve the performance of our network.

4.4. Robustness Evaluation

Above, we compared our work with other methods and performed an ablation study. To further evaluate the robustness of our network, we apply different image attacks to the original images of the NIST16 dataset: resizing, JPEG compression with quality factor $\eta$ and Gaussian blur with kernel size $\kappa$. The parameters and forgery detection performance (F1 score and AUC) are shown in Table 3, from which we can observe that our network is robust to these attacks.

4.5. Visualization Results

We provided experimental results in Table 1. However, the results are not intuitive from the data alone, so we visualize the tampering detection results of various methods. Since the code of ObjectFormer [22] and PSCCNet [21] is not provided, their detection results are not available. To ensure authenticity, we randomly chose four sets of data from the CASIA test set as examples; the detection results of the various methods are shown in Figure 8.
In Figure 8, each row has a different meaning: the first row shows the forgery images; the second row shows the real tampered regions, i.e., the ground-truth images; the third row shows the predictions of RRU-Net; the fourth row shows the predictions of MCNL-Net; and the fifth row shows the predictions of our proposed method. From these visualizations, it is easy to see that RRU-Net performs well on ordinary tampered areas and can locate them roughly, but it still produces some false detections and missed areas, and it does not perform well on small tampered regions. Compared with RRU-Net, MCNL-Net performs better, with fewer false and missed detections; however, its predicted visualizations show that it also fails to excel on tiny tampered areas. From a subjective perspective, the predicted visualizations demonstrate that our model is the best and most stable of the three methods: it locates the forgery areas more accurately and produces sharper boundaries.
From the experimental metrics and predicted visualization results, we can easily conclude that the network we proposed has a better performance than other compared methods.

5. Conclusions

In this paper, we proposed the Circular U-Net with Attention Gate (CAU-Net), an end-to-end image forgery detection network. Atrous Spatial Pyramid Pooling with CBAM extracts image information at multiple scales to better detect tampered areas of different sizes. The attention gate module suppresses the untampered regions, so our network pays more attention to the tampered regions. In addition, Efficient Channel Attention enhances channel features to improve the network's detection performance. We demonstrated the effectiveness of our network through theoretical analysis and comparative experiments with experimental data and visualization results. Extensive experiments on the public CASIA and NIST16 datasets verify that our network performs better than state-of-the-art methods.

Author Contributions

Conception of the study, C.L. and J.P.; literature search, Y.L. and X.G.; figures, X.G.; data collection, J.P. and X.G.; data analysis, C.L. and J.P.; data interpretation, Y.L.; writing, J.P.; review and editing, C.L. and Y.L.; experiments, J.P. and X.G.; supervision, C.L. and Y.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (No. 2020YFB1712401) and the Chinese Scholarship Council (No. 202007045007).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dhamo, H.; Farshad, A.; Laina, I. Semantic image manipulation using scene graphs. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
  2. Li, B.; Qi, X.; Lukasiewicz, T. Manigan: Text-guided image manipulation. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
  3. Park, T.; Zhu, J.-Y.; Wang, O. Swapping autoencoder for deep image manipulation. In Proceedings of the Neural Information Processing Systems, Online, 6–12 December 2020.
  4. Vinker, Y.; Horwitz, E.; Zabari, N. Deep single image manipulation. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
  5. Ferrara, P.; Bianchi, T.; De Rosa, A. Image forgery localization via fine-grained analysis of CFA artifacts. IEEE Trans. Inf. Forensics Secur. 2012, 7, 1566–1577.
  6. Pan, X.; Lyu, S. Region Duplication Detection Using Image Feature Matching. IEEE Trans. Inf. Forensics Secur. 2010, 5, 857–867.
  7. Salloum, R.; Ren, Y.; Kuo, C.C.J. Image Splicing Localization Using a Multi-task Fully Convolutional Network (MFCN). J. Vis. Commun. Image Represent. 2017, 51, 201–209.
  8. Zhou, P.; Han, X.T.; Morariu, V.I. Learning rich features for image manipulation detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1053–1061.
  9. Bi, X.; Wei, Y.; Xiao, B.; Li, W. The Ringed Residual U-Net for Image Splicing Forgery Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
  10. Wei, Y.; Wang, Z.; Xiao, B. Controlling Neural Learning Network with Multiple Scales for Image Splicing Forgery Detection. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–22.
  11. Chen, X.; Dong, C.; Ji, J.; Cao, J. Image manipulation detection by multi-view multi-scale supervision. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
  12. Sun, Y.; Ni, R. ET: Edge-enhanced Transformer for Image Splicing Detection. IEEE Signal Process. Lett. 2022, 29, 1232–1236.
  13. Chen, L.C.; Papandreou, G.; Kokkinos, I. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
  14. Oktay, O.; Schlemper, J.; Folgoc, L.L. Attention U-Net: Learning where to Look for the Pancreas. In Proceedings of the International Conference on Medical Imaging with Deep Learning, Amsterdam, The Netherlands, 4–6 July 2018.
  15. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
  16. NIST. Nimble 2016 Datasets. Available online: https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation (accessed on 5 February 2016).
  17. Dong, J.; Wang, W.; Tan, T.N. CASIA image tampering detection evaluation database. In Proceedings of the 2013 IEEE China Summit and International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013.
  18. Wang, Q.; Wu, B.; Zhu, P. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
  19. Fridrich, J.; Kodovsky, J. Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 2012, 7, 868–882.
  20. Hu, X.; Zhang, Z.; Jiang, Z. SPAN: Spatial pyramid attention network for image manipulation localization. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
  21. Liu, X.; Liu, Y.; Chen, J.; Liu, X. PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7505–7517.
  22. Wang, J.; Wu, Z.; Chen, J. ObjectFormer for Image Manipulation Detection and Localization. In Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
  23. He, K.; Zhang, X.; Ren, S. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Figure 1. The three images represent the host image, the forgery image and the ground-truth. (a) Host image. (b) Forgery image. (c) Ground-truth.
Figure 2. The overview of the proposed CAU-Net structure for the task of the location of tampered regions.
Figure 3. Residual propagation.
Figure 4. Residual feedback.
Figure 5. The architecture of the attention gate.
Figure 6. The architecture of Atrous Spatial Pyramid Pooling with CBAM.
Figure 7. The Residual propagation with Efficient Channel Attention module.
Figure 8. Visualization of the forgery detection results predicted by various methods. From top to bottom, we show the forgery image, the GT mask and predicted results of RRU-Net, MCNL-Net and CAU-Net.
Table 1. The results (%) of the proposed method and the other methods on the testing set. Bold text represents the best results.

| Method | CASIA F1 Score | CASIA AUC | NIST16 F1 Score | NIST16 AUC |
|---|---|---|---|---|
| SPAN [20] | 38.2 | 83.8 | 58.2 | 96.1 |
| RGB-N [8] | 40.8 | 79.5 | 72.2 | 93.7 |
| RRU-Net [9] | 45.2 | 79.8 | 85.1 | 92.3 |
| MCNL-Net [10] | 52.4 | 81.9 | 90.6 | 96.7 |
| PSCCNet [21] | 55.4 | 87.5 | 81.9 | **99.6** |
| ObjectFormer [22] | 57.9 | 88.2 | 82.4 | **99.6** |
| Ours | **58.3** | **88.4** | **94.3** | 97.3 |
Table 2. Ablation study results on the CASIA dataset (✓ = module included, × = module removed).

| Attention Gate | CBASPP | ECA | F1 Score | AUC |
|---|---|---|---|---|
| × | × | × | 45.2 | 79.8 |
| × | ✓ | ✓ | 53.2 | 84.3 |
| ✓ | × | ✓ | 51.5 | 81.3 |
| ✓ | ✓ | × | 54.5 | 80.9 |
| ✓ | ✓ | ✓ | 58.3 | 88.4 |
Table 3. The forgery detection results (%) under various attacks on the NIST16 dataset. F1 score and AUC are reported.

| Attack | F1 Score | AUC |
|---|---|---|
| No attack | 94.3 | 97.3 |
| Resize (0.8×) | 90.8 | 94.5 |
| Resize (0.6×) | 86.7 | 92.4 |
| GaussianBlur (κ = 3) | 86.9 | 92.8 |
| GaussianBlur (κ = 7) | 80.7 | 87.6 |
| JPEGCompress (η = 100) | 92.8 | 96.4 |
| JPEGCompress (η = 80) | 91.6 | 95.9 |
