Article

Hierarchical Feature Fusion and Enhanced Attention Mechanism for Robust GAN-Generated Image Detection

1
Faculty of Data Science, City University of Macau, Macau SAR, China
2
Belt and Road School, Beijing Normal University at Zhuhai, Zhuhai 519088, China
3
School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang 621010, China
*
Authors to whom correspondence should be addressed.
Mathematics 2025, 13(9), 1372; https://doi.org/10.3390/math13091372
Submission received: 3 March 2025 / Revised: 12 April 2025 / Accepted: 16 April 2025 / Published: 23 April 2025

Abstract

In recent years, with the rapid advancement of deep learning technologies such as generative adversarial networks (GANs), deepfake technology has become increasingly sophisticated. As a result, the generated fake images are becoming more difficult to visually distinguish from real ones. Existing deepfake detection methods primarily rely on training models with specific datasets. However, these models often suffer from limited generalization when processing images of unknown origin or across domains, leading to a significant decrease in detection accuracy. To address this issue, this paper proposes a deepfake image-detection network based on feature aggregation and enhancement. The key innovation of the proposed method lies in the integration of two modules: the Feature Aggregation Module (FAM) and the Attention Enhancement Module (AEM). The FAM effectively aggregates both deep semantic information and shallow detail features through a multi-scale feature-fusion mechanism, overcoming the limitations of traditional methods that rely on a single-level feature. Meanwhile, the AEM enhances the network’s ability to capture subtle forgery traces by incorporating attention mechanisms and filtering techniques, significantly boosting the model’s efficiency in processing complex information. The experimental results demonstrate that the proposed method achieves significant improvements across all evaluation metrics. Specifically, on the StarGAN dataset, the model attained outstanding performance, with accuracy (Acc) and average precision (AP) both reaching 100%. In cross-dataset testing, the proposed method exhibited strong generalization ability, raising the overall average accuracy to 87.0% and average precision to 92.8%, representing improvements of 5.2% and 6.7%, respectively, compared to existing state-of-the-art methods. These results show that the proposed method can not only achieve optimal performance on data with the same distribution, but also demonstrate strong generalization ability in cross-domain detection tasks.

1. Introduction

The widespread abuse of fake images presents considerable challenges in both political and economic spheres. In the political realm, Deepfake images and videos have been exploited to spread disinformation, manipulate public opinion, and even damage the reputations of public figures. This can lead to increased societal polarization and undermine trust in media and communication channels. In the economic domain, fake images have the potential to manipulate stock markets, tarnish corporate reputations, and erode consumer confidence. As these risks continue to grow, it becomes crucial to develop robust and reliable methods for detecting Deepfake content across a wide range of applications. To tackle this issue, Deepfake detection methods need to exhibit strong generalization capabilities, which means they must perform well not only on known datasets but also when encountering images from unknown or novel sources. This is a substantial challenge in the field because of the rapid evolution of Deepfake generation techniques. While significant advancements have been made in detecting face-based Deepfakes, general-purpose detection, which extends beyond face images to include a variety of categories and contexts, remains a formidable problem.
Several studies have focused on specific aspects of fake-image detection. A number of research efforts [1,2,3] have concentrated on detecting fake face images, leveraging the relatively structured nature of facial features and artifacts. These studies often employ techniques such as analyzing local inconsistencies or unnatural facial movements, which are easier to detect in a well-defined domain such as human faces. Generalizing beyond faces to other image categories, such as objects, scenery, or animals, introduces additional complexity, as the visual patterns in these categories are more varied and unpredictable; other research [4,5,6] has therefore dealt with various categories of images. These methods mainly exploit local-area artifacts [7,8], blending boundaries [9], global texture [2], and frequency-level artifacts [3,4,5,10]. FakeInversion [12] extracts features with a VLM and then uses a ResNet50 as the classifier, while Whodunit [13] employs the discrete cosine transform, power spectral density, and autocorrelation as features. Some previous works [5,6] have employed preprocessing, data augmentation, and techniques that reduce the effects of frequency-level artifacts in order to build more robust detectors. Despite these advances, most of these techniques rely heavily on the training setup and the specific types of images used in model development, which leaves them unable to detect images from unknown categories or unseen GAN models and makes them less effective in real-world scenarios. Test images encountered in practical applications often come from unknown sources or techniques that differ from those seen during training [11], and this domain shift makes it challenging to develop a universal detector that works across all types of fake images. Existing GAN-based detection methods typically rely on surface-level inconsistencies, such as unnatural texture or color distribution, but they tend to perform poorly when visual artifacts are subtle or absent. Figure 1 illustrates the strategy devised to address this challenge, which serves as the primary motivation for our research.
In the Deepfake detection task, identifying effective features and employing a model that can efficiently leverage them are crucial. This paper introduces gradient images as the input and proposes a novel network based on Multi-type Stepwise Pooling and Attention Enhancement Module. The proposed approach is designed to improve the focus on relevant features, ultimately enhancing detection accuracy.
The contributions of this paper are as follows:
(1) Propose a ResNet50-based attention-filtering network designed to effectively focus on and learn more valuable features, enabling the efficient detection of forged images.
(2) Propose Multi-type Stepwise Pooling (MSP), which enables the model to learn shallow and deep features simultaneously. The network can thus capture both low-level details, which may reveal subtle artifacts, and high-level abstractions that aid in understanding the broader image context, enriching the overall feature representation.
(3) Propose the Attention Enhancement Module (AEM). Existing attention mechanisms often re-weight features without explicitly suppressing irrelevant information, leading to retained low-value responses that hinder focus. AEM addresses this by filtering out low-weight attention values, enabling the model to better attend to semantically meaningful regions.
The experimental results show that the proposed method has achieved excellent detection performance on multiple datasets, especially showing strong generalization ability in cross-dataset testing. This shows that our method can effectively alleviate the dependence of existing detection methods on the distribution of training data.
We introduce a unified detection framework combining a ResNet50-based attention-filtering network, MSP for multi-level feature learning, and AEM for refined attention, enabling robust detection of both GAN and diffusion-generated forgeries.

2. Related Work

This section discusses methods related to generative image detection and the development of GAN models. Earlier generative image-detection methods have sought to leverage spatial information or frequency artifacts as representations of the cues generated by the generative model for detecting fake images. The development of GAN models has further advanced the capabilities of generative image generation.

2.1. Spatial-Based Fake-Image Detection

In the early phase of deep forgery detection, researchers utilized the color space [3,14,15] and natural texture features [2,16] of images. Rossler et al. [17] proposed FaceForensics++, a face dataset, and employed Xception as a backbone network to recognize fake images. Li et al. [18] leveraged physiological blinking patterns to identify synthetic face manipulations in videos. ForensicTransfer [19] introduced an encoder designed to extract discriminative features. Yu et al. [8] and Marra et al. [20] extracted unique features from the outputs of individual generative models to improve detection accuracy. Bayar [21] designed a CNN architecture to suppress native image content, thereby amplifying manipulation-specific patterns and improving generalization across diverse forgery types. Gram-Net [2] exposed inconsistencies in the global statistical properties of synthetic facial images. Face X-ray [9] used blending-boundary techniques to detect fake images. Chai et al. [7] proposed a local artifact amplification framework in which the restricted receptive field captures tampering-induced statistical regularities, thereby prioritizing region-specific forgery traces. Yu et al. [3] extracted intrinsic cues from the camera’s imaging process, such as channel and spectral inconsistencies. Wang et al. [6] trained a classifier on a large number of images and improved its robustness through data augmentation strategies. PCL [22] used source-specific features for detection. He et al. [1] designed a decomposition and reconstruction verification process that robustly detects facial manipulation traces.

2.2. Frequency-Based Fake Image Detection

Since GANs rely on upsampling, the resulting images often contain frequency-level artifacts that differ from real images [23,24,25]. Several studies [13,20,21,23,26,27,28] utilized the spectral distribution of images to capture artifacts, which were then used for detecting fake images. Masi et al. [27] proposed a two-stream architecture designed to fuse chromatic consistency features with spectral residual maps, which can accurately localize facial manipulation traces. F3-Net [28] employed a dual-path architecture that reveals manipulation footprints through the collaborative analysis of decomposed frequency features and local probabilities, thereby detecting both subtle and advanced forgery traces. BiHPF [4] combined a frequency-level high-pass filter (HPF) that amplified artifact magnitudes in high-frequency components with a pixel-level HPF that emphasized background pixel values, jointly enhancing artifact visibility. FrePGAN [5] observed that frequency artifacts in generated images are very sensitive to the generator distribution and object categories, which impairs model generalization; it therefore destroys artifact correlations by injecting controllable spectral perturbations during training to enhance generalization.

2.3. Development of GAN

In 2014, Mirza et al. proposed CGAN [29], which introduced additional conditions to control the generated image, ensuring that only images satisfying the conditions passed through the discriminator and were output, as opposed to random images. In 2015, Alec Radford et al. introduced DCGAN [30], which combined CNNs with unsupervised GANs and conducted a qualitative analysis of the GAN classifier. In 2016, Odena proposed SGAN [31], which aimed to complete a classification task while training a generative model. Brock et al. introduced BigGAN [32], which addressed the challenge of generating high-resolution and diverse samples. Since 2017, NVIDIA [33,34,35] has pioneered groundbreaking innovations in GAN research, evolving from style-transfer solutions for uncontrolled image synthesis to tackling intricate physical phenomena such as liquid surface distortions. In 2023, NVIDIA proposed StyleGAN-T [36], a model capable of generating 512 × 512 resolution images in 0.1 s, with a training cost much lower than the diffusion models released that same year. Minguk Kang et al. proposed GigaGAN [37], a model that generates 512 × 512 resolution images in 0.13 s, surpassing the diffusion models of that year. Parallel developments include 3D-aware architectures [38,39] and efficiency-oriented GAN variants [40,41] that outperform diffusion models in inference speed. R3GAN [42] establishes a modern framework that addresses the inherent training instability of GANs while demonstrating superior performance over diffusion models across key benchmarks.

3. Methodology

Our work is dedicated to efficiently extracting gradient information to distinguish real and fake images, since the key to fake-image detection is developing a generalized representation of artifacts. Gradient images were proposed by Tan et al. [43], who showed that they perform well in fake-image-detection tasks. Inspired by this work, this paper uses gradient images as input features to improve the classifier’s sensitivity to small artifacts in fake images, and introduces a gradient-focused network to extract gradient information.

3.1. Overview of the Proposed Network

The proposed method is based on the ResNet50 architecture, as shown in Figure 2. Following Tan et al. [43], we adopt ResNet50 for gradient feature extraction, as their work demonstrated its effectiveness in capturing GAN-specific anomalies across diverse datasets. In this network, gradient images serve as the input: every image is converted into a gradient image through a transformation model, denoted $M$. The dataset is defined as $I = \{(I_i, y_i)\}_{i=0}^{N-1}$, where $y_i \in \{0, 1\}$ is the label, with real images labeled 0 and fake images labeled 1. For the gradient computation, we first pass an input image $I_i$ through the model $M$, obtaining an output tensor $l \in \mathbb{R}^{n \times c}$. We then compute the gradient:
$$G = \frac{\partial\, \mathrm{sum}(l)}{\partial I_i},$$
where $G$ depends on $M$. Since this work employs a single instantiation of $M$, the computation of $G$ remains fixed throughout the process. $G$ serves as the input to our network.
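For illustration, the gradient image can be obtained with a few lines of automatic differentiation. The sketch below assumes a PyTorch setting in which `transform_model` stands in for the fixed transformation model $M$ (for instance, a pretrained GAN discriminator); the min-max normalization at the end is a common convention for storing $G$ like an image rather than a detail specified above.

```python
import torch

def to_gradient_image(images: torch.Tensor, transform_model: torch.nn.Module) -> torch.Tensor:
    """Compute gradient images G = d(sum(M(I))) / dI for a batch of RGB images.

    `transform_model` plays the role of the fixed model M; its parameters are
    never updated here, we only keep the gradients flowing to the input pixels.
    """
    transform_model.eval()
    images = images.detach().clone().requires_grad_(True)  # track gradients w.r.t. pixels
    out = transform_model(images)                           # output tensor l of shape (n, c)
    out.sum().backward()                                    # backpropagate d(sum(l)) / dI
    grads = images.grad.detach()                            # gradient image G, same shape as input
    # Min-max normalize each image so G can be stored and inspected like an ordinary image.
    flat = grads.flatten(1)
    g_min = flat.min(dim=1).values.view(-1, 1, 1, 1)
    g_max = flat.max(dim=1).values.view(-1, 1, 1, 1)
    return (grads - g_min) / (g_max - g_min + 1e-8)
```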
The operation process of the Feature Aggregation Module (FAM) is as follows. The outputs of Block1 and Block2 are processed by MSP1 and MSP2 and then aggregated with the output of Block4 through a shortcut connection to form a comprehensive feature representation. The aggregated result is used as the input of the next layer, which enhances the network’s ability to process multi-scale features by effectively integrating feature information from different scales and improving the network’s expressiveness and learning efficiency. Let the outputs of the ResNet blocks be denoted $F_1$, $F_2$, and $F_4$ for Layer1, Layer2, and Layer4, respectively. $F_1$ and $F_2$ are resized to the spatial size of $F_4$ using MSP, producing $\tilde{F}_1$ and $\tilde{F}_2$. Each feature map is projected to the same number of channels $C$ via a $1 \times 1$ convolution. The aggregated feature is then computed as:
$$F_{agg} = \mathrm{Concat}([\tilde{F}_1, \tilde{F}_2, F_4]).$$
This fused tensor is passed to the AEM for attention enhancement, forming the final feature $F_{final}$.
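A minimal sketch of this fusion step is shown below. It assumes the standard torchvision ResNet50 channel widths (256, 512, and 2048 for Layer1, Layer2, and Layer4) and treats the MSP stacks as injected modules (see Section 3.2); the projection width `channels` is an illustrative choice rather than a value specified in the text.

```python
import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    """Sketch of the FAM fusion: pool shallow features to the Layer4 resolution,
    project every branch to the same channel count C, and concatenate."""

    def __init__(self, msp1: nn.Module, msp2: nn.Module, channels: int = 256):
        super().__init__()
        self.msp1, self.msp2 = msp1, msp2
        self.proj1 = nn.Conv2d(256, channels, kernel_size=1)   # Layer1 output -> C channels
        self.proj2 = nn.Conv2d(512, channels, kernel_size=1)   # Layer2 output -> C channels
        self.proj4 = nn.Conv2d(2048, channels, kernel_size=1)  # Layer4 output -> C channels

    def forward(self, f1, f2, f4):
        f1 = self.proj1(self.msp1(f1))         # F~1: pooled to the spatial size of F4
        f2 = self.proj2(self.msp2(f2))         # F~2: pooled to the spatial size of F4
        f4 = self.proj4(f4)
        return torch.cat([f1, f2, f4], dim=1)  # F_agg = Concat([F~1, F~2, F4])
```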
In terms of the model’s attention mechanism, we propose an Attention Enhancement Module (AEM). The main function of this module is to better activate the spatial attention mechanism so that the network can process richer information across multiple dimensions. Specifically, the AEM weights features spatially by enhancing the model’s attention to important features, thereby strengthening the influence of valuable information and suppressing that of low-value information. In this way, the network can capture key features more accurately, improve discrimination, and significantly boost performance.
In our experiments, the AEM enhances feature expressiveness by combining the spatial attention and attention filtering (AF) modules, allowing the network to perform more effectively across a broader range of application scenarios. This aggregated tensor is processed by the Attention Enhancement Module (AEM), denoted as $A(\cdot)$, which outputs the refined features:
$$F_{AEM} = A(F_{agg}).$$
By integrating multi-level feature aggregation with attention mechanisms, our network architecture demonstrates significant advantages in image forgery-detection tasks.

3.2. Multi-Type Stepwise Pooling (MSP)

We construct a hybrid pooling layer by alternating average pooling and max pooling to optimize feature map processing and enhance the model’s overall performance.
At the output of Layer1, a combination of average pooling, max pooling, and average pooling is employed to adjust the output size to match that of Layer4. Average pooling computes the mean value within a region, reducing noise and preserving overall information, while max pooling emphasizes the most salient feature points. By alternating these pooling operations, we overcome the limitations of using a single pooling type and ensure consistent output sizes. A similar strategy is applied to the output of Layer2, alternating between average pooling and max pooling so that its output size also matches that of Layer4. Finally, the processed outputs from Layer1 and Layer2 are aggregated with the output from Layer4. Through this shortcut connection, shallow details and deep abstract information are integrated, fully utilizing information from each layer, enhancing the model’s ability to capture complex patterns, and improving robustness in challenging tasks. Regarding computational cost, let $H$, $W$, and $C$ denote the height, width, and number of channels of the input feature map, respectively. Each pooling operation has a complexity of $O(H \times W \times C)$, so for $N$ alternating pooling operations the overall time complexity of MSP is $O(N \times H \times W \times C)$. Although each pooling step reduces the feature-map size, the computational overhead remains manageable due to the local nature of pooling operations. Our experiments show that MSP has minimal impact on inference speed, maintaining a balance between performance improvement and computational efficiency.
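The alternating pattern can be written as a small pooling stack, as in the hedged sketch below. The paper fixes the order of operations and the target resolution; the 2 × 2 windows with stride 2 (each step halving the spatial size) are our assumption about how the Layer1 and Layer2 outputs of a 224 × 224 input (56 × 56 and 28 × 28) are brought down to the 7 × 7 Layer4 resolution.

```python
import torch.nn as nn

def make_msp(num_steps: int) -> nn.Sequential:
    """Multi-type Stepwise Pooling: alternate average and max pooling.

    A 2x2 window with stride 2 per step (halving the spatial size) is assumed;
    the text only specifies the alternating order and the target resolution.
    """
    layers = [
        nn.AvgPool2d(2, stride=2) if i % 2 == 0 else nn.MaxPool2d(2, stride=2)
        for i in range(num_steps)
    ]
    return nn.Sequential(*layers)

# For 224x224 inputs, ResNet50 yields 56x56 (Layer1), 28x28 (Layer2), 7x7 (Layer4):
msp1 = make_msp(3)  # avg -> max -> avg : 56 -> 28 -> 14 -> 7
msp2 = make_msp(2)  # avg -> max       : 28 -> 14 -> 7
```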
In summary, we introduce a hybrid pooling layer that alternates average pooling and max pooling, which can optimize feature map processing. Additionally, we promote the fusion of shallow features and deep features through skip connections. This design enhances the model’s ability to fuse multi-layer features, ultimately improving the accuracy and reliability of the task.

3.3. Attention Enhancement Module (AEM)

The AEM contains a spatial attention mechanism and an algorithm to optimize attention. This mechanism allows the model to focus on information-rich, meaningful, and relevant areas, thereby highlighting valuable features. Usually, the attention mechanism cannot completely avoid retaining some worthless information. In the attention weight matrix W , the value of the weight reflects the contribution of the corresponding feature to the task. Lower weights usually mean that the corresponding feature is less relevant to the task, carries less information value, and may even have a negative impact on the task. Therefore, we propose a filtering algorithm (as shown in Algorithm 1) to automatically remove the low-value information and reduce the interference of worthless information.
Algorithm 1 Attention Filtering
Input: weight matrix $W \in \mathbb{R}^{m \times n}$; filter factor $0 < p < 1$
Output: filtered weight matrix $W'$
1: $W' \leftarrow W$
2: Step 1: Flatten $W$ into a vector $V$ and sort it in ascending order: $V \leftarrow [V_1, V_2, \ldots, V_{m \times n}]$, where $V_1 \leq V_2 \leq \cdots \leq V_{m \times n}$
3: Step 2: Compute the number of elements to filter: $k \leftarrow (m \times n) \times p$
4: Step 3: Retrieve the $k$-th smallest element $V_k$ from $V$
5: For each element $w$ in $W'$: if $w < V_k$, set $w \leftarrow 0$
6: Return: $W'$
Specifically, $W$ is first copied to $W'$, and the filtering factor $p$ is set, representing the percentage of low-weight entries to be filtered from the original matrix. Next, the attention weight matrix $W$ is flattened into a one-dimensional vector $V$, which is sorted in ascending order to find the $(m \times n \times p)$-th element, recorded as $V_k$, where $m \times n$ is the total number of elements in the matrix $W$. All elements in $W'$ lower than $V_k$ are then set to 0 to remove invalid features. Through this process, we obtain a new weight matrix $W'$ that retains only high-value information. The filtering operation in AEM involves sorting the spatial attention weights to determine a percentile threshold, which has a computational complexity of $O(n \log n)$, where $n$ is the number of spatial positions. Given the small size of the attention maps (e.g., $14 \times 14$), this cost is negligible in practice and does not impact inference efficiency.
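For reference, the filtering step maps onto a few tensor operations. The sketch below is one possible PyTorch rendering of Algorithm 1, under the assumption that the attention map is a single 2D matrix (batch and channel dimensions would be handled analogously).

```python
import torch

def attention_filter(weights: torch.Tensor, p: float = 0.6) -> torch.Tensor:
    """Zero out the lowest-valued fraction p of an attention weight matrix W."""
    assert 0.0 < p < 1.0
    values = weights.flatten()
    k = max(1, int(values.numel() * p))           # number of low-value entries to drop
    threshold = torch.kthvalue(values, k).values  # k-th smallest weight V_k
    filtered = weights.clone()                    # W' <- W
    filtered[filtered < threshold] = 0.0          # suppress entries below V_k
    return filtered
```

Inside the AEM, the filtered matrix would then replace the raw attention weights when re-weighting the aggregated features.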
After the calculation in Algorithm 1, the histogram of attention weights is shown in Figure 3b. As observed, all weights below 0.5 are set to 0. As mentioned earlier, low values in the attention matrix typically correspond to features with limited relevance to the task. By setting these values to 0, the model effectively ignores or bypasses this low-value information during learning, thereby achieving an explicit filtering effect. This not only reduces noise but also guides the model to focus more on semantically meaningful and task-relevant regions. Our method can adjust the amount of retained data according to task requirements. AEM effectively enhances the influence of valuable information in the task, while reducing the impact of irrelevant data, and improves the overall performance of the model.

4. Experimental Results

This section will describe the experimental setup and verify the effectiveness and robustness of our method through comparative experiments, heatmaps, and ablation experiments.

4.1. Experimental Setup

Dataset. We adopt the dataset provided by Tan et al. [43]. The generated images contain content created by eight distinct generative models. The models involved include StyleGAN [34] and StyleGAN2 [35] from the StyleGAN series; BigGAN [32] based on large-scale training; unpaired style transfer models such as CycleGAN [44] and StarGAN [45]; the progressively growing model ProGAN [33]; the facial manipulation technology Deepfake [17]; and the semantic image synthesis model GauGAN [46]. The real images are curated from LSUN [47], ImageNet [48], CelebA [49], CelebA-HQ [33], COCO [50], and FaceForensics++ [17,46]. The test set consists of 62,003 images, covering all categories generated by the aforementioned GAN models and real images. The training set also contains both real and fake images, but only includes four categories: horses, chairs, cats, and cars. Additionally, the fake images are solely generated by ProGAN. All of these datasets have been processed by gradient transformation.
Implementation Details. Our modified network is a ResNet50 pre-trained on ImageNet. Based on both Tan et al. [43] and our empirical observations, we adopt the Adam optimizer with a learning rate of $5 \times 10^{-4}$. After every ten epochs, the learning rate decreases by 20%, until the full 100 epochs are completed. A batch size of 16 is adopted for this model. Random crops are applied to the training images, and the full images are used for testing. During training, input images are resized to $224 \times 224$.
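The reported schedule corresponds, for example, to the following hedged PyTorch sketch; the dummy tensors standing in for the gradient-image loader, the plain torchvision ResNet50 head without the FAM/AEM additions, and the cross-entropy loss are illustrative assumptions (a recent torchvision is assumed for the `weights` argument).

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Placeholder loader; in practice it would yield 224x224 gradient-image crops and labels.
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 2, (64,))),
    batch_size=16, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # ImageNet pre-training
model.fc = nn.Linear(model.fc.in_features, 2)                           # real vs. fake head

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)  # -20% every 10 epochs

for epoch in range(100):
    for grads, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(grads), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```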
Evaluation. Following the baseline, we use the average precision score (AP) and accuracy (Acc) as evaluation metrics for the proposed method.
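Both metrics can be computed with scikit-learn as sketched below; the toy labels and scores are purely illustrative, and the 0.5 threshold used to binarize scores for Acc is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Illustrative labels (0 = real, 1 = fake) and fake-class probabilities.
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.6, 0.4, 0.8, 0.9])

acc = accuracy_score(labels, (scores > 0.5).astype(int))  # Acc: thresholded predictions
ap = average_precision_score(labels, scores)              # AP: area under the precision-recall curve
print(f"Acc = {acc:.3f}, AP = {ap:.3f}")
```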

4.2. Detection Performance

To demonstrate the state-of-the-art performance of our approach, we adopt the StyleGAN discriminator trained on the LSUN-Bedroom dataset as our transformation model. This choice is motivated by its strong feature-representation capability and widespread use in previous works. In particular, this setup aligns with the best-performing configuration reported in Tan et al. [43], allowing for a fair and direct comparison. Similarly, we train the classifier under three different training settings: one-class (horse), two-class (chair, horse), and four-class (car, cat, chair, horse). We find that the proposed method outperforms its peers in terms of Acc and AP, performing better than the other methods on most test models (the exceptions being StyleGAN2, BigGAN, and Deepfake). The results are shown in Table 1.
In particular, our method achieves its best performance in the four-class setting. Compared to LGrad, the Acc on the StyleGAN, BigGAN, CycleGAN, and GauGAN models improved to 96.0%, 84.4%, 88.5%, and 74.4%, respectively. We obtain comparable results on ProGAN, StarGAN, and GauGAN. It is worth noting the poor performance on Deepfake, with an Acc of only 59.5%. It can also be observed that the AP remains relatively high, achieving the second-best performance, which suggests that the model is still reasonably accurate in classifying real images. However, it must be acknowledged that the classification performance on fake images is poor, which contributes to the low Acc.
To further evaluate the performance of our model across diverse generative architectures, we conduct a stratified analysis based on GAN type. Since the detector is trained solely on ProGAN-generated images, it naturally achieves strong performance on ProGAN and similar models, and the results on StarGAN remain comparable. On StyleGAN and StyleGAN2, performance shows a moderate decline but still maintains high accuracy. When tested on structurally distinct models such as BigGAN, CycleGAN, GauGAN, and Deepfake, a performance drop is observed due to distribution shifts. However, the magnitude of this decline is significantly smaller than that of recent baselines such as LGrad, BiHPF, and FrePGAN. This indicates that our approach enhances cross-architecture robustness and mitigates overfitting to architecture-specific artifacts.
The proposed method consistently leads in the four-class setting, while Tan’s method excels in the one-class setting. This demonstrates that the proposed approach can effectively handle a broader range of information, thereby enhancing performance, whereas LGrad tends to degrade as it utilizes more data. While training on a single image category is effective for detecting all categories, incorporating two or three categories may yield better results. Nevertheless, it remains crucial to maintain the strategy of using a smaller set of categories for training while ensuring robust detection across most categories. Compared to existing methods, our model shows superior accuracy and average precision across multiple GAN types. While FrePGAN [5] and BiHPF [4] rely on frequency-based cues that may not generalize well, our method exploits a gradient representation, which is more robust to domain shift. Combining FAM with AEM enables our network to fuse global structure and local cues more effectively than LGrad [43] or F3Net [10]. Based on these data, it can be concluded that this method can identify fake images of unknown sources in real, complex scenes.
The results of this study not only demonstrate strong performance on benchmark datasets, but also indicate practical value in real-world scenarios. As GAN-generated images become increasingly realistic, distinguishing authenticity with the naked eye becomes difficult, threatening the credibility of content in social media, journalism, forensics, and medical imaging. Our method shows robust detection across various generative models, contributing to digital trust and mitigating the societal risks posed by synthetic imagery.

4.3. Visualization

In this section, we choose CAM as a visualization tool to more intuitively highlight the key areas that our method focuses on during detection. All color images are from Wang [6]’s dataset.
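The heatmaps in the following figures can be reproduced with a Grad-CAM-style procedure such as the hook-based sketch below. The text only states that CAM is used, so the exact variant, the choice of `target_layer` (for example, the last block of Layer4 in the ResNet50), and the use of the fake-class logit (`class_idx = 1`) are our assumptions.

```python
import torch
import torch.nn.functional as F

def cam_heatmap(model, image, target_layer, class_idx=1):
    """Grad-CAM-style map: weight the target layer's activations by the
    spatially pooled gradients of the chosen class score, then upsample."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, inp, out: activations.append(out))
    h2 = target_layer.register_full_backward_hook(lambda m, gin, gout: gradients.append(gout[0]))
    try:
        score = model(image.unsqueeze(0))[0, class_idx]        # logit of the "fake" class
        model.zero_grad()
        score.backward()
        weights = gradients[0].mean(dim=(2, 3), keepdim=True)  # channel importance
        cam = F.relu((weights * activations[0]).sum(dim=1))    # weighted activation map
        cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                            mode="bilinear", align_corners=False)[0, 0]
        return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8)).detach()
    finally:
        h1.remove()
        h2.remove()
```

The returned map lies in [0, 1] and can be overlaid on the input image as a heatmap.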

4.3.1. Fake-Image CAM Visualization

In Figure 4 and Figure 5, the class activation map (CAM) visualizations produced by our method and LGrad are shown. Comparing the heatmaps, our method behaves similarly to LGrad, with most areas of interest concentrated on the edges of the fake images. The key difference is that our method filters out more relevant regions from those highlighted by LGrad, while also emphasizing areas that LGrad does not address. For example, in the StyleGAN image, both methods attend to the lower-right corner, but the region attended to by our method is slightly smaller and its weight distribution differs. Our method also attends to the upper-right corner and the upper-middle area; LGrad covers these areas as well but assigns them little importance. We attribute this to our method’s aggregation of shallow and deep features: shallow features help the model capture more detailed local information, so it can attend to important areas more comprehensively and accurately.
It is also observed that in Deepfake images, the proposed method continues to focus on areas other than the face. This is likely one of the key reasons for the relatively poor performance in Deepfake detection: since Deepfake is a forgery technique that primarily affects the face region, relying on background information to detect Deepfake-generated images proves ineffective. Addressing this limitation will be the focus of future work.

4.3.2. Real-Image CAM Visualization

For real images, our method is more effective in focusing on important areas. As shown in Figure 6, our heatmaps exhibit a clearer and more concentrated focus, while LGrad spreads broadly over most of the detected object without concentrating on any region. Our method focuses on key parts, such as the eyes and nose in the face image, which are critical regions, whereas LGrad spreads its attention across most of the face without emphasizing the crucial areas. This ability to pinpoint essential regions significantly helps improve the accuracy of forgery detection. Additionally, in the cat image, our method also focuses on the background. We believe this is not an incorrect decision, as our goal is to detect the image’s authenticity. The background, blurred due to the camera’s focus, represents information that might not be accurately reproduced in a forged image. Therefore, by using this background area as supplementary information, the model can more comprehensively assess the image’s authenticity and improve its generalization across different images.
The proposed method demonstrates a consistent attention distribution across both real and fake images, remaining uniform across different image categories. Whether processing images generated by various models or handling data from different categories, the model consistently follows the same attention pattern. This indicates that the model can effectively manage the diversity of fake images, even in complex real-world scenarios, and that its performance is unaffected by the generation model or the category of the inputs. Moreover, this consistency suggests that the model has developed a generalized feature-extraction strategy, enabling it to perform effectively across any image type without needing additional training for specific categories, which enhances its robustness and generalization in the task of detecting fake images. The similarity of this attention distribution also simplifies the complex image-processing task, allowing the model to efficiently learn the distinctions between real and fake images and to determine image authenticity quickly and accurately, further improving detection speed and precision. Since the fake samples in our study are fully generated and the real samples are completely real, there are no fixed or predefined fake regions in the images. Therefore, we believe that any attention regions identified by the model are valid indicators of synthetic content. Compared with previous methods, our model exhibits a unique attention distribution: the Attention Enhancement Module (AEM) enables the network to capture more representative and discriminative features, usually focusing on structural inconsistencies or high-frequency artifacts. This targeted attention mechanism leads to better generalization and higher detection performance even in the absence of explicitly annotated artifact regions.

4.4. Ablation Studies

4.4.1. Effectiveness of Different Components

In this section, we perform ablation studies on the different proposed components to assess their effectiveness. All experiments in this part use the four-class setting. The results are shown in Table 2.
It is evident that when we use only the AEM, as in Method C, there is a significant improvement compared to Method A: Acc increases by 0.6% and AP improves by 1.7%. This demonstrates that the AEM enhances the model’s performance. However, it is important to note that when we use only the FAM, as in Method B, the Acc is worse than when no improvements are applied. This is because FAM enables the network to learn not only the features of the deep layers, but also the features of the shallow layers, significantly expanding the feature space. However, this approach may introduce redundant information, which could cause the model to overfit to these additional details and degrade Acc. AEM solves this problem: it extracts from the FAM output the information that is beneficial for our task and discards information that has a negative impact. At the same time, the FAM output contains a large amount of information, so when all modules work together our model ultimately learns more valuable features. We can see that Method D has the best performance when all modules work together.
For the comparison between Method C and Method D, although the AP of Method C is higher, it does not necessarily indicate better performance. The higher AP of Method C is mainly due to its ability to correctly classify real images as real. However, the lower Acc of Method C reveals that the model tends to misclassify fake images as real, which increases the number of real-image classifications while also including fake images in the real category. As a result, Method C is less effective in detecting fake images compared to Method D. In contrast, Method D demonstrates a more accurate classification of both real and fake images, offering a more robust overall performance and stronger generalization capability.

4.4.2. Effectiveness of AF

As shown in Figure 7, the proposed AF mechanism significantly improves the performance index of the model. Specifically, the model’s Acc improved from 86.3% to 87.0%, which represents a 0.7% improvement in accuracy. In addition, the average AP is also improved by 0.4%. These data not only demonstrate the effectiveness of our proposed attention-filtering mechanism, but also demonstrate its potential in optimizing model performance.
The AF can significantly improve the model’s ability to extract key information by accurately focusing on the most important parts of the input data. This mechanism is especially important when dealing with complex tasks, because it helps reduce information redundancy and improves the model’s sensitivity to key features, allowing it to make more accurate judgments and predictions in practical applications. The improvement in accuracy means that the model performs more reliably in the overall fake-image-detection task, while the improvement in average precision suggests that the model is more effective at classifying real images.
Overall, this result shows that the AF not only optimizes the performance of the model, but also improves its practicality and reliability in practical applications. This further validates the important role of this mechanism in improving model performance.

4.4.3. Impact of p

We conducted experiments with different values of p, keeping all other experimental settings consistent with the previous ones. The results are presented in Table 3. By comparing these outcomes, we found that the model achieved optimal performance when the p-value was set to 0.6.
A detailed analysis reveals that when 80% of the data are filtered out, the model’s performance reaches its lowest point. The loss of the majority of critical information leaves only a small portion of relevant data, making it difficult for the model to learn meaningful patterns. As a result, AP drops sharply. When the filtering ratio is set to 40%, the model retains a greater proportion of the original data; however, the improvement in performance remains limited. This suggests that while a 40% filtering ratio reduces the extent of information loss, it does not achieve optimal performance. One possible explanation is that the amount of filtered data is still insufficient, leaving a significant amount of low-value information in the dataset. This retained low-value information likely restricts the potential for further performance improvements. Through experimentation, it was found that filtering 60% of the information offers the best balance. This filtering ratio enables the model to retain sufficient original data while effectively removing redundant information, thus enhancing performance. The 60% filtering ratio strikes a balance, preventing excessive information loss while preserving key features, which leads to significant improvements in both Acc and AP. This result demonstrates that filtering 60% of the data not only meets the model’s informational requirements but also yields considerable performance gains.
In summary, our experimental results clearly show that filtering 60% of the information is the optimal choice. This filtering ratio not only effectively retains necessary information, but also optimizes the performance of the model to achieve the best results.

5. Conclusions

This paper proposes a novel attention-filtering network for deepfake image detection. Specifically, the pre-trained ResNet50 is modified by integrating deep and shallow features and optimizing the attention mechanism. This approach enhances information extraction and improves feature-utilization efficiency. The experimental results demonstrate that the proposed method significantly boosts performance. Additionally, the effectiveness of each component is validated, with the AEM contributing to enhanced network performance. However, the model performs suboptimally on Deepfake images.
Future work will address this limitation by incorporating Deepfake-specific submodules and more diverse data augmentation to enhance generalization. We also plan to explore alternative generative architectures, including hybrid diffusion and traditional GAN models, which produce features in novel regions of the feature space—posing new challenges for detectors trained on conventional distributions.

Author Contributions

Conceptualization, S.C.; Methodology, W.Z.; Software, W.Z.; Validation, W.Z.; Formal analysis, S.C.; Investigation, S.C., B.C., H.Z. and Q.Z. (Qi Zhong); Resources, S.C.; Data curation, W.Z.; Writing—original draft, W.Z.; Writing—review & editing, S.C., Q.Z. (Qi Zhang), B.C., H.Z. and Q.Z. (Qi Zhong); Visualization, W.Z.; Supervision, S.C. and Q.Z. (Qi Zhang); Project administration, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Start-up Scientific Research Project for Introducing Talents of Beijing Normal University at Zhuhai under grant number 312200502504, and by the Opening Project of the Guangdong Province Key Laboratory of Information Security Technology under grant number 2023B1212060026.

Data Availability Statement

The original data presented in the study are openly available in chuangchuangtan/LGrad at https://github.com/chuangchuangtan/LGrad, accessed on 15 April 2025.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, Y.; Yu, N.; Keuper, M.; Fritz, M. Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Montreal, QC, Canada, 19–27 August 2021; Zhou, Z.H., Ed.; International Joint Conferences on Artificial Intelligence Organization: San Francisco, CA, USA, 2021; pp. 2534–2541. [Google Scholar] [CrossRef]
  2. Liu, Z.; Qi, X.; Torr, P.H. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8060–8069. [Google Scholar]
  3. Yu, Y.; Ni, R.; Zhao, Y. Mining generalized features for detecting ai-manipulated fake faces. arXiv 2020, arXiv:2010.14129. [Google Scholar]
  4. Jeong, Y.; Kim, D.; Min, S.; Joe, S.; Gwon, Y.; Choi, J. Bihpf: Bilateral high-pass filters for robust deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 48–57. [Google Scholar]
  5. Jeong, Y.; Kim, D.; Ro, Y.; Choi, J. Frepgan: Robust deepfake detection using frequency-level perturbations. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1060–1068. [Google Scholar]
  6. Wang, S.Y.; Wang, O.; Zhang, R.; Owens, A.; Efros, A.A. CNN-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8695–8704. [Google Scholar]
  7. Chai, L.; Bau, D.; Lim, S.N.; Isola, P. What makes fake images detectable? Understanding properties that generalize. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVI 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 103–120. [Google Scholar]
  8. Yu, N.; Davis, L.S.; Fritz, M. Attributing fake images to gans: Learning and analyzing gan fingerprints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7556–7566. [Google Scholar]
  9. Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-Ray for More General Face Forgery Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  10. Frank, J.; Eisenhofer, T.; Schönherr, L.; Fischer, A.; Kolossa, D.; Holz, T. Leveraging frequency analysis for deep fake image recognition. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 3247–3258. [Google Scholar]
  11. Hulzebosch, N.; Ibrahimi, S.; Worring, M. Detecting CNN-generated facial images in real-world scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 642–643. [Google Scholar]
  12. Cazenavette, G.; Sud, A.; Leung, T.; Usman, B. FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion. arXiv 2024, arXiv:2406.08603. [Google Scholar]
  13. Wißmann, A.; Zeiler, S.; Nickel, R.M.; Kolossa, D. Whodunit: Detection and Attribution of Synthetic Images by Leveraging Model-specific Fingerprints. In Proceedings of the 3rd ACM International Workshop on Multimedia AI against Disinformation, Phuket, Thailand, 10–14 June 2024; MAD ’24. pp. 65–72. [Google Scholar] [CrossRef]
  14. Uhlenbrock, L.; Cozzolino, D.; Moussa, D.; Verdoliva, L.; Riess, C. Did You Note My Palette? Unveiling Synthetic Images Through Color Statistics. In Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, Baiona, Spain, 24–26 June 2024; IH&MMSec ’24; pp. 47–52. [Google Scholar] [CrossRef]
  15. Tan, C.; Liu, H.; Zhao, Y.; Wei, S.; Gu, G.; Liu, P.; Wei, Y. Rethinking the Up-Sampling Operations in CNN-Based Generative Network for Generalizable Deepfake Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 28130–28139. [Google Scholar] [CrossRef]
  16. He, Z.; Chen, P.Y.; Ho, T.Y. RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image Detection. arXiv 2024, arXiv:2405.20112. [Google Scholar]
  17. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
  18. Li, Y.; Chang, M.C.; Lyu, S. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; pp. 1–7. [Google Scholar]
  19. Cozzolino, D.; Thies, J.; Rössler, A.; Riess, C.; Nießner, M.; Verdoliva, L. Forensictransfer: Weakly-supervised domain adaptation for forgery detection. arXiv 2018, arXiv:1812.02510. [Google Scholar]
  20. Marra, F.; Gragnaniello, D.; Verdoliva, L.; Poggi, G. Do gans leave artificial fingerprints? In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 28–30 March 2019; pp. 506–511. [Google Scholar]
  21. Bayar, B.; Stamm, M.C. A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, Vigo, Spain, 20–22 June 2016; pp. 5–10. [Google Scholar]
  22. Zhao, T.; Xu, X.; Xu, M.; Ding, H.; Xiong, Y.; Xia, W. Learning self-consistency for deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15023–15033. [Google Scholar]
  23. Durall, R.; Keuper, M.; Keuper, J. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7890–7899. [Google Scholar]
  24. Bergmann, S.; Moussa, D.; Brand, F.; Kaup, A.; Riess, C. Forensic analysis of AI-compression traces in spatial and frequency domain. Pattern Recognit. Lett. 2024, 180, 41–47. [Google Scholar] [CrossRef]
  25. Zhang, Y.; Xu, X. Diffusion Noise Feature: Accurate and Fast Generated Image Detection. arXiv 2023, arXiv:2312.02625. [Google Scholar]
  26. Bappy, J.H.; Simons, C.; Nataraj, L.; Manjunath, B.; Roy-Chowdhury, A.K. Hybrid lstm and encoder–decoder architecture for detection of image forgeries. IEEE Trans. Image Process. 2019, 28, 3286–3300. [Google Scholar] [CrossRef] [PubMed]
  27. Masi, I.; Killekar, A.; Mascarenhas, R.M.; Gurudatt, S.P.; AbdAlmageed, W. Two-branch recurrent network for isolating deepfakes in videos. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 667–684. [Google Scholar]
  28. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 86–103. [Google Scholar]
  29. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  30. Radford, A. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  31. Odena, A. Semi-supervised learning with generative adversarial networks. arXiv 2016, arXiv:1606.01583. [Google Scholar]
  32. Brock, A. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2018, arXiv:1809.11096. [Google Scholar]
  33. Karras, T. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  34. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4401–4410. [Google Scholar]
  35. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119. [Google Scholar]
  36. Sauer, A.; Karras, T.; Laine, S.; Geiger, A.; Aila, T. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 30105–30118. [Google Scholar]
  37. Kang, M.; Zhu, J.Y.; Zhang, R.; Park, J.; Shechtman, E.; Paris, S.; Park, T. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10124–10134. [Google Scholar]
  38. Trevithick, A.; Chan, M.; Takikawa, T.; Iqbal, U.; Mello, S.D.; Chandraker, M.K.; Ramamoorthi, R.; Nagano, K. What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–24 June 2024; pp. 22765–22775. [Google Scholar]
  39. Lei, B.; Yu, K.; Feng, M.; Cui, M.; Xie, X. DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 10487–10497. [Google Scholar] [CrossRef]
  40. Pan, X.; Tewari, A.; Leimkühler, T.; Liu, L.; Meka, A.; Theobalt, C. Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023. SIGGRAPH ’23. [Google Scholar] [CrossRef]
  41. Boroujeni, S.P.H.; Razi, A. IC-GAN: An Improved Conditional Generative Adversarial Network for RGB-to-IR image translation with applications to forest fire monitoring. Expert Syst. Appl. 2024, 238, 121962. [Google Scholar] [CrossRef]
  42. Huang, N.; Gokaslan, A.; Kuleshov, V.; Tompkin, J. The GAN is dead; long live the GAN! A Modern GAN Baseline. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
  43. Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Wei, Y. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12105–12114. [Google Scholar]
  44. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  45. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar]
  46. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2337–2346. [Google Scholar]
  47. Yu, F.; Seff, A.; Zhang, Y.; Song, S.; Funkhouser, T.; Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv 2015, arXiv:1506.03365. [Google Scholar]
  48. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  49. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  50. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
Figure 1. Deepfake detection using gradient images. The gradient images are generated using a conversion model, which is trained on both real images and images generated by GANs. After training, the classifier is able to infer whether a given gradient image is real or fake.
Figure 2. An illustration of the proposed classifier. The gradient image is input into ResNet50, and the shallow and deep features are aggregated using concatenation. The aggregated features are then processed by the AEM for further enhancement, and the final representation is used for classification.
Figure 3. (a) Histogram of the Original Attention Weight. (b) Histogram of the Optimized Attention Weight. They show the original attention matrix and the attention matrix processed by the AF algorithm.
Figure 4. Fake-image CAM visualization of our method and LGrad (part 1).
Figure 5. Fake-image CAM visualization of our method and LGrad (part 2).
Figure 6. Real-image CAM visualization of our method and LGrad.
Figure 7. Effectiveness of AF.
Table 1. Classification across models and categories.
Methods | Class | ProGAN | StyleGAN | StyleGAN2 | BigGAN | CycleGAN | StarGAN | GauGAN | Deepfake | Mean
        |       | Acc/AP | Acc/AP | Acc/AP | Acc/AP | Acc/AP | Acc/AP | Acc/AP | Acc/AP | Acc/AP
Wang [6] | 1 | 50.4/63.8 | 50.4/79.3 | 68.2/94.7 | 50.2/61.3 | 50.0/52.9 | 50.0/48.2 | 50.3/67.6 | 50.1/51.5 | 52.5/64.9
Frank [10] | 1 | 78.9/77.9 | 69.4/64.8 | 67.4/64.0 | 62.3/58.6 | 67.4/65.4 | 60.5/59.5 | 67.5/69.0 | 52.4/47.3 | 65.7/63.3
Durall [23] | 1 | 85.1/79.5 | 59.2/55.2 | 70.4/63.8 | 57.0/53.9 | 66.7/61.4 | 99.8/99.6 | 58.7/54.8 | 53.0/51.9 | 68.7/65.0
BiHPF [4] | 1 | 82.5/81.4 | 68.0/62.8 | 68.8/63.6 | 67.0/62.5 | 75.5/74.2 | 90.1/90.1 | 73.6/92.1 | 51.6/49.9 | 72.1/72.1
FrePGAN [5] | 1 | 95.5/99.4 | 80.6/90.6 | 77.4/93.0 | 63.5/60.5 | 59.4/59.9 | 99.6/100.0 | 53.0/49.1 | 70.4/81.5 | 74.9/79.3
LGrad [43] | 1 | 99.4/99.9 | 96.0/99.6 | 93.8/99.4 | 79.5/88.9 | 84.7/94.4 | 99.5/100.0 | 70.9/81.8 | 66.7/77.9 | 86.3/92.7
Ours | 1 | 99.1/99.9 | 91.9/99.4 | 93.3/99.5 | 79.8/89.5 | 87.2/94.2 | 98.0/99.9 | 71.0/76.9 | 70.9/82.0 | 86.4/92.7
Wang [6] | 2 | 64.6/92.7 | 52.8/82.8 | 75.7/96.6 | 51.6/70.5 | 58.6/81.5 | 51.2/74.3 | 53.6/86.6 | 50.6/51.5 | 57.3/79.6
Frank [10] | 2 | 85.7/81.3 | 73.1/68.5 | 75.0/70.9 | 76.9/70.8 | 86.5/80.8 | 85.0/77.0 | 67.3/65.3 | 50.1/55.3 | 75.0/71.2
Durall [23] | 2 | 79.0/73.9 | 63.6/58.8 | 67.3/62.1 | 69.5/62.9 | 65.4/60.8 | 99.4/99.4 | 67.0/63.0 | 50.5/50.2 | 70.2/66.4
BiHPF [4] | 2 | 87.4/87.4 | 71.6/74.1 | 77.0/81.1 | 82.6/80.6 | 86.0/86.6 | 93.8/80.8 | 75.3/88.2 | 53.7/54.0 | 78.4/79.1
FrePGAN [5] | 2 | 99.0/99.9 | 80.8/92.0 | 72.2/94.0 | 66.0/61.8 | 69.1/70.3 | 98.5/100.0 | 53.1/51.0 | 62.2/80.6 | 75.1/81.2
LGrad [43] | 2 | 99.8/100.0 | 94.8/99.7 | 92.4/99.6 | 82.5/92.4 | 85.9/94.7 | 99.7/99.9 | 73.7/83.2 | 60.6/67.8 | 86.2/92.2
Ours | 2 | 99.6/100.0 | 94.2/99.6 | 86.7/99.2 | 89.0/95.5 | 85.7/94.8 | 99.7/100.0 | 78.8/85.1 | 54.7/60.4 | 86.1/91.8
Wang [6] | 4 | 91.4/99.4 | 43.8/91.4 | 76.4/97.5 | 52.9/73.3 | 72.7/88.6 | 63.8/90.8 | 63.9/92.2 | 51.7/62.3 | 67.1/86.9
Frank [10] | 4 | 90.3/85.2 | 74.5/72.0 | 73.0/71.4 | 88.7/86.0 | 75.5/71.2 | 99.5/99.5 | 69.2/77.4 | 60.7/49.1 | 78.9/76.5
Durall [23] | 4 | 81.1/74.4 | 54.4/52.6 | 66.8/62.0 | 60.1/56.3 | 69.0/64.0 | 98.1/98.1 | 61.9/57.4 | 50.2/50.0 | 67.7/64.4
BiHPF [4] | 4 | 90.7/86.2 | 76.9/75.1 | 76.2/74.7 | 84.9/81.7 | 81.9/78.9 | 94.4/94.4 | 69.5/78.1 | 54.4/54.6 | 78.6/77.9
FrePGAN [5] | 4 | 99.0/99.9 | 80.7/89.6 | 84.1/98.6 | 69.2/71.1 | 71.1/74.4 | 99.9/100.0 | 60.3/71.7 | 70.9/91.9 | 79.4/87.2
LGrad [43] | 4 | 99.9/100.0 | 94.8/99.9 | 96.0/99.9 | 82.9/90.7 | 85.3/94.0 | 99.6/100.0 | 72.4/79.3 | 58.0/67.9 | 86.1/91.5
Ours | 4 | 99.9/100.0 | 96.0/99.9 | 93.5/99.7 | 84.4/93.1 | 88.5/96.1 | 99.7/100.0 | 74.4/83.2 | 59.5/70.6 | 87.0/92.8
In the overall summary, red is used to mark the best result and blue to mark the suboptimal result. For each individual detection task, the best result is highlighted in bold, and the suboptimal result is underlined.
Table 2. Role of different components.
Method | FAM | AEM | Acc | AP
A |   |   | 86.1 | 91.5
B | * |   | 85.8 | 91.9
C |   | * | 86.7 | 93.2
D | * | * | 87.0 | 92.8
The ones marked with ‘*’ contain components, while those without the symbol do not.
Table 3. Effects of different p-values.
p | Acc | AP
0.4 | 86.4 | 91.6
0.6 | 87.0 | 92.8
0.8 | 85.3 | 90.9

Share and Cite

MDPI and ACS Style

Zhang, W.; Cui, S.; Zhang, Q.; Chen, B.; Zeng, H.; Zhong, Q. Hierarchical Feature Fusion and Enhanced Attention Mechanism for Robust GAN-Generated Image Detection. Mathematics 2025, 13, 1372. https://doi.org/10.3390/math13091372

