In this section, we first introduce the evaluation metrics used to measure performance. We then report the experimental results on our RSI dataset, assessing how effectively the proposed methods detect and localize GAN-generated images. Next, an ablation study demonstrates how well the proposed approach recognizes GAN-synthesized images across different input modalities (text, sketch, or image) and different GAN-based image synthesis models.
4.2. Experimental Results on RSI
A comprehensive performance evaluation was conducted on our RSI dataset to validate the performance of our approach in detecting GAN-generated images. In particular, the testing set of RSI, which consisted of 12,000 images split equally between GAN-generated and real images, based on the COCO-Stuff dataset [
36], was used during the evaluation. The eight evaluation metrics were computed on the testing set of our dataset and are reported in
Table 2. As can be seen, EfficientNetB4 achieved the best performance in recognizing and localizing GAN-generated images on the RSI dataset, with 100% accuracy, followed by InceptionV3 with 98%. Examples of GAN-generated and real images from the RSI dataset that were correctly classified by our best model, EfficientNetB4, are illustrated in
Figure 5.
Furthermore, to locate the regions our model attended to during classification, several types of Class Activation Maps (CAMs) were integrated. Specifically, GradCAM [
52], AblationCAM [
53], LayerCAM [
54], and Faster ScoreCAM [
55] were adopted to inspect the images by identifying which parts of an image contributed most to the model's classification decision. A visualization of the four CAM methods is shown in
Figure 6.
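The CAM overlays in Figure 6 can be reproduced with off-the-shelf implementations. The following is a minimal sketch using the pytorch-grad-cam library with a torchvision EfficientNet-B4 backbone standing in for our trained detector; the backbone configuration, input size, target layer, and class index are illustrative assumptions, and the library's ScoreCAM class is used here in place of its faster approximation.

import torch
from torchvision.models import efficientnet_b4
from pytorch_grad_cam import GradCAM, AblationCAM, LayerCAM, ScoreCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Hypothetical two-class detector (0 = real, 1 = GAN-generated); in practice the
# trained weights of our detector would be loaded here.
model = efficientnet_b4(num_classes=2).eval()
target_layers = [model.features[-1]]        # last convolutional block
input_tensor = torch.randn(1, 3, 380, 380)  # placeholder for a preprocessed test image
targets = [ClassifierOutputTarget(1)]       # explain the "GAN-generated" class

heatmaps = {}
for cam_cls in (GradCAM, AblationCAM, LayerCAM, ScoreCAM):
    cam = cam_cls(model=model, target_layers=target_layers)
    # Each call returns a (batch, H, W) saliency map in [0, 1]; overlaying it on the
    # input image with pytorch_grad_cam.utils.image.show_cam_on_image yields
    # Figure 6-style visualizations.
    heatmaps[cam_cls.__name__] = cam(input_tensor=input_tensor, targets=targets)[0]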
As shown, the model focused mostly on the background of the GAN-generated images, but it also attended to regions where distortions and anomalies appeared in the synthetic images. To faithfully explain the predictions of our method, the explanation technique LIME [
56] was leveraged, as shown in
Figure 7.
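The LIME explanations in Figure 7 follow the standard image-explanation workflow of the lime package: the input is perturbed at the superpixel level, the detector scores the perturbed copies, and a local surrogate model highlights the superpixels that most influenced the prediction. A minimal sketch is given below; the classifier wrapper and the placeholder input are hypothetical and would be replaced by our trained detector and an actual test image.

import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def classifier_fn(images):
    # Hypothetical wrapper around the trained detector: takes a batch of
    # H x W x 3 arrays and returns (N, 2) probabilities for [real, GAN-generated].
    # The random scores below are placeholders for preprocessing + model inference.
    probs = np.random.rand(len(images), 2)
    return probs / probs.sum(axis=1, keepdims=True)

image = np.random.randint(0, 255, (380, 380, 3), dtype=np.uint8)  # placeholder test image

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, classifier_fn, top_labels=2, hide_color=0, num_samples=1000)

# Keep the superpixels that most support the predicted class (Figure 7-style view).
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False)
overlay = mark_boundaries(temp / 255.0, mask)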
4.3. Effectiveness of Our Model
To show the effectiveness of our model in recognizing images produced by GAN models it was not trained on, we conducted a further experiment in which three detectors were trained separately according to the input modality. In the first setting (S2I_T2I), a detector was trained on images from the sketch-to-image and text-to-image models, eight models in total across the two modalities, and then tested on the image-to-image modality, which consisted of four models. In the second setting (I2I_T2I), the detector was trained on the image-to-image and text-to-image models and tested on the sketch-to-image models. The same procedure was followed in the third setting (I2I_S2I), where the detector was trained on the image-to-image and sketch-to-image models and tested on the text-to-image models. The results of these experiments are reported in
Table 3. As can be seen, the S2I_T2I model (trained without I2I images) achieved an accuracy of 0.99, and the I2I_S2I model reached 0.95. The I2I_T2I model achieved a lower accuracy of 0.83 due to the image improvement step used in the sketch-to-image models. This step enhances the output by replacing the generated background with a real background that is consistent and aligned with the context and objects in the image. Even with this enhancement, our GAN detection model still detected the GAN-generated images with a high accuracy of 0.83.
Consequently, our model demonstrated its ability to distinguish GAN-generated images from real ones with high accuracy, even for generators and modalities it was not trained on. This suggests that our GAN detection model should also be able to recognize images produced by GAN models proposed in the future.
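The cross-modality protocol above amounts to a leave-one-modality-out loop: train on the images from two modalities and test on the held-out one. The sketch below summarizes this procedure; the directory layout, folder names, and file extension are hypothetical.

from pathlib import Path

# Hypothetical layout: RSI/<modality>/.../*.png, with both GAN-generated and real images.
MODALITIES = ("image_to_image", "sketch_to_image", "text_to_image")
SHORT = {"image_to_image": "I2I", "sketch_to_image": "S2I", "text_to_image": "T2I"}

def leave_one_modality_out(root="RSI"):
    root = Path(root)
    for held_out in MODALITIES:
        train_mods = [m for m in MODALITIES if m != held_out]
        name = "_".join(SHORT[m] for m in train_mods)            # e.g., "S2I_T2I"
        train_files = [p for m in train_mods for p in (root / m).rglob("*.png")]
        test_files = list((root / held_out).rglob("*.png"))
        yield name, train_files, test_files  # train a detector on train_files, test on test_files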
4.4. Ablation Study
To demonstrate the effectiveness of our method in detecting GAN-based synthetic images, we evaluated our second-best model, InceptionV3, on the testing set of each input modality separately, and then on the testing set of each image synthesis model individually. The second-best model was used in this ablation study because our best model, EfficientNetB4, achieved 100% accuracy, so its performance would not vary across subsets of the dataset.
While
Table 4 reports the performance based on the eight common evaluation metrics,
Figure 8 presents the confusion matrices of our second-best model on the testing sets generated by the image-to-image, sketch-to-image, and text-to-image synthesis models. The highest accuracy, 98.35%, was achieved on synthetic images generated from text, whereas the lowest accuracy, 96.90%, was obtained when the inputs to the image synthesis models were semantic segmentation maps.
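The per-modality confusion matrices in Figure 8 can be produced by grouping the test predictions by the input modality of the synthesis model that generated each sample. A minimal sketch with scikit-learn follows, assuming the label convention 0 = real and 1 = GAN-generated.

from collections import defaultdict
from sklearn.metrics import confusion_matrix, accuracy_score

def per_modality_results(y_true, y_pred, modalities):
    # y_true / y_pred: 0 = real, 1 = GAN-generated; modalities: modality tag per test sample.
    groups = defaultdict(lambda: ([], []))
    for t, p, m in zip(y_true, y_pred, modalities):
        groups[m][0].append(t)
        groups[m][1].append(p)
    return {m: {"confusion_matrix": confusion_matrix(t, p),
                "accuracy": accuracy_score(t, p)}
            for m, (t, p) in groups.items()}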
Moreover, we studied the performance of our second-best model on the GAN-synthesized images produced by each image synthesis model separately.
Table 5 provides a comprehensive performance evaluation of the models in detecting GAN-generated images produced by a single image synthesis model. From
Table 5, we can conclude that detecting GAN-generated images obtained from semantic segmentation maps may be more challenging than detecting ones produced from natural language descriptions. Because the layout, shape, size, and semantic information are preserved in image-to-image synthesis, the generated images are more photo-realistic and natural-looking than those produced by text-to-image synthesis.
4.5. Experimental Results on Other Datasets
To show our model’s efficacy in distinguishing generated images from real ones, we ran an additional experiment in which the first model of each modality from
Table 1 was selected. More specifically, OASIS [
14], S2I-DetectoRS [
18], and AttnGAN [
19] were used as image-to-image, sketch-to-image, and text-to-image synthesis models, respectively. Then, different datasets, which our model was not trained on, were utilized to generate synthetic images. Particularly, the ADE20K [
57], Sketchy [
58], and Caltech-UCSD Birds-200-2011 (CUB-200-2011) [
59] datasets were leveraged. Following this, both generated and real images were fed into our best detector, EfficientNetB4, to distinguish synthetic images from real ones. Finally, eight common evaluation metrics were used to measure our model’s performance, namely Precision, Recall, F1 score, Accuracy, Average Precision (AP), Area Under the Receiver Operating Characteristic Curve (ROC-AUC), False Positive Rate (FPR), and False Negative Rate (FNR). The details of each setting are explained below.
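As a reference, all eight metrics can be computed directly from the detector’s predicted scores; a minimal scikit-learn sketch is given below, assuming the label convention 0 = real and 1 = GAN-generated, with FPR and FNR derived from the confusion matrix.

import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score, accuracy_score,
                             average_precision_score, roc_auc_score, confusion_matrix)

def detection_metrics(y_true, y_score, threshold=0.5):
    # y_true: 0 = real, 1 = GAN-generated; y_score: predicted probability of class 1.
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "Accuracy": accuracy_score(y_true, y_pred),
        "AP": average_precision_score(y_true, y_score),
        "ROC-AUC": roc_auc_score(y_true, y_score),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }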
With regard to the OASIS image-to-image synthesis model [
14], the testing set of the ADE20K dataset [
57], which consisted of 2000 semantic segmentation images, was fed into the pre-trained OASIS model [
14]. This step produced synthesized images from the semantic segmentation maps. Then, the generated images and the real images of the ADE20K testing set [
57] were input into our best model, and the performance was evaluated and recorded.
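Schematically, this experiment is a generate-then-detect loop over the ADE20K testing set. In the sketch below, oasis_generate stands in for the released OASIS inference code and detector_predict for our trained EfficientNetB4 detector; both wrappers, as well as the directory layout, are hypothetical.

from pathlib import Path

def oasis_generate(label_map_path):
    # Hypothetical wrapper: run the pre-trained OASIS model on one ADE20K semantic
    # segmentation map and return the synthesized image (or its path).
    raise NotImplementedError

def detector_predict(image_path):
    # Hypothetical wrapper: return 1 if EfficientNetB4 classifies the image as GAN-generated.
    raise NotImplementedError

ade20k_test = Path("ADE20K/validation")                       # assumed layout
label_maps = sorted(ade20k_test.glob("annotations/*.png"))    # 2000 segmentation maps
real_images = sorted(ade20k_test.glob("images/*.jpg"))

fake_preds = [detector_predict(oasis_generate(m)) for m in label_maps]  # expected to be 1
real_preds = [detector_predict(img) for img in real_images]             # expected to be 0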
As for S2I-DetectoRS [
18], a subset of the Sketchy dataset [
58], which contained 2470 sketches, was used as an input into the S2I-DetectoRS sketch-to-image synthesis model [
18]. This subset of the Sketchy dataset [
58] was used because S2I-DetectoRS [
18] was trained on the MS COCO dataset [
27], and only 35 classes from the Sketchy dataset [
58] matched the classes in the MS COCO dataset [
27]. These classes were airplane, alarm_clock, apple, banana, bear, bench, bicycle, car_(sedan), cat, chair, couch, cow, cup, dog, door, elephant, eyeglasses, giraffe, hat, horse, hotdog, knife, motorcycle, pickup_truck, pizza, sailboat, scissors, sheep, shoe, spoon, table, teddy_bear, umbrella, window, and zebra. In the generation phase, the sketches were converted into colored images. After that, both generated and real images were fed into our best model, and the eight evaluation metrics were computed.
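The class selection described above reduces to intersecting the Sketchy category names with the MS COCO class names after normalizing naming differences; the snippet below illustrates the idea, although the alias mapping and the truncated COCO class list shown here are illustrative rather than the exact ones used.

# Illustrative: keep only Sketchy categories whose normalized name matches a COCO class.
COCO_CLASSES = {"airplane", "apple", "banana", "bear", "bench", "bicycle", "cat", "chair",
                "couch", "cow", "cup", "dog", "elephant", "giraffe", "horse", "knife",
                "motorcycle", "pizza", "scissors", "sheep", "spoon", "umbrella",
                "zebra"}                                  # truncated for brevity
ALIASES = {"car_(sedan)": "car", "teddy_bear": "teddy bear"}  # example name mapping

def matching_sketchy_classes(sketchy_classes):
    kept = []
    for name in sketchy_classes:
        normalized = ALIASES.get(name, name.replace("_", " "))
        if normalized in COCO_CLASSES:
            kept.append(name)
    return kept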
With respect to the AttnGAN text-to-image synthesis model [
19], the same procedure was followed. The only difference was the dataset, where the testing set of Caltech-UCSD Birds-200-2011 (CUB-200-2011) [
59] was used. The testing set was composed of 5794 captions/descriptions. After generating the corresponding images for the input text descriptions, both synthetic and real images were fed into our best model, i.e., EfficientNetB4. In the final step, the performance was assessed based on our adopted evaluation metrics.
The experimental results were recorded and are shown in
Table 6. As can be seen, our model was able to distinguish GAN-generated images from real ones even on datasets that the classifier was not trained on. While the highest accuracy reached roughly 98% with the CUB-200-2011 dataset [
59] and the AttnGAN [
19] text-to-image synthesis model, our model achieved 89% accuracy with the ADE20K dataset [
57] and the OASIS [
14] image-to-image synthesis model. This may be attributed to the dataset complexity and generator capability. Hence, our model could be used as an evaluation tool regardless of the input modality.