4.5. Qualitative Evaluations
Figure 5 shows the qualitative comparison results on the Paris StreetView dataset. The EC model generates an edge map corresponding to the final output and captures the global arrangement and long-range features by employing dilated convolutional layers. However, it fails to extract minute textural details, resulting in mismatches in local regions. For example, in the first row, the straight line above the front sign is not recovered, which corrupts the distinct boundary. In the second to fourth rows, the texture of the tree, the brick structure between the middle highest windows, and the pattern of the building surface are distorted. RFR sequentially fills in the missing pixels through recurrent feature reasoning, generating high-fidelity visual effects; however, it also produces serious artifacts. For example, in the first to third rows, wrinkled patterns appear in the large masked regions. In the fourth row, checkerboard artifacts are found on the bricks, a common problem of transposed convolutional layers [69]. CTSDG binds texture and structure to each other; however, the boundaries are blurred owing to the implicit usage of the structure. For instance, in the first to third rows, the straight lines are obscured, distorting the distinct boundaries of semantically different regions. In the fourth row, cross-patterned deformities develop throughout the region. SPL performs knowledge distillation on pretext models and adapts their features for image inpainting, which helps in understanding the global context while providing structural supervision for the restoration of local texture. Nonetheless, the local texture is smoothed out, resulting in blurring effects. For example, although solid lines are retained in all rows, the texture of the leaves and the brick patterns are not preserved in the second to fourth rows. In contrast, our proposed method restores images with a suitable balance of low- and high-level features. In all rows, the pixels are filled in with clear boundaries and a semantically plausible texture, particularly in the second row. This result is attributed to the use of ESA, which allows the model to obtain hints of the texture from all areas of the corresponding feature maps.
Figure 6 shows the qualitative comparison results on the CelebA dataset. EC obtained unsatisfactory results, i.e., the facial structures were severely distorted. For instance, in the first row, the position of the left eye is not symmetrical with the right eye. In the second row, the nose does not have the appropriate shape, while the mouth is barely visible. In the fourth row, although the eyes and nose have the proper silhouettes, the mouth can hardly be seen. RFR provides better results than EC, though its outputs are still unsatisfactory: although the model generates eyes with a normal shape, the mouths in all rows are distorted, which degrades the quality of the restoration. CTSDG had the least favorable results of all. In all rows, the facial structures are not retained due to blurring effects, and checkerboard artifacts are found in all inpainted regions. While SPL sufficiently recovered the images, a few implausible regions remained. For instance, in the first row, the left eye is noticeably smaller than the right eye. In the fourth row, the wrinkles and beard on the face disappear owing to excessive smoothing. In contrast, our model generated images of the best quality. For example, in the first row, the left eye matches the right eye in size and is at a suitable location. In the third row, unlike the other models, our method generates a mouth with teeth, very close to the ground-truth image. In the fourth row, the wrinkles and beard are retained along with a proper mouth, and there is less perceived difference from the ground truth than with the other methods.
Figure 7 shows the qualitative comparison results on the Places2 dataset. EC restored images of acceptable quality using an edge map; however, certain areas are not appealing. For example, in the third row, the rails of the roller coaster are connected by curved lines, which is unrealistic. In the fourth row, the leaves filled with generated pixels do not have a consistent color compared with the other regions. CTSDG produced images with indistinct boundaries, i.e., unrealistic results. For instance, in the second row, the structure of the window is not fully retained owing to the blurriness of the bottom-left region. In the third row, the ride paths appear disconnected, which is unrealistic. In the fourth row, the texture of the leaves contrasts with the other regions and is not harmonized with the surrounding areas. CR-Fill trains the generator with an auxiliary contextual reconstruction task that encourages the generated output to remain plausible even when it is reconstructed from the surrounding regions. Hence, CR-Fill produced images of acceptable quality; however, a few regions can be perceived as different. For instance, in the first and third rows, the boundaries of the trees are not obvious, and the color of the middle-right part of the ride is inconsistent. SPL produced outputs with distinct lines connecting the masked regions; however, key textures and patterns were lost owing to excessive smoothing. For example, in the first, third, and fourth rows, the textures of the objects are blurred. The generated image in the second row contains checkerboard artifacts that distort the texture and quality of the image. In contrast to the other methods, our proposed model achieved a balance between the apparent boundaries and textures of various objects. For instance, all rows have straight lines separating semantically different areas. Furthermore, the textures of the objects were effectively restored, leading to plausible results.
In summary, our proposed method effectively balances low- and high-level feature restoration; these qualitative evaluations demonstrate the generalizability of the proposed method.
4.6. Quantitative Evaluations
To analyze the inpainting results of our proposed method and those of other models, we applied four different metrics: the Fréchet inception distance (FID) [70], learned perceptual image patch similarity (LPIPS) [71], structural similarity (SSIM), and peak signal-to-noise ratio (PSNR). The FID is a widely used quantitative metric in the field of image generation; it measures the Wasserstein-2 distance between the generated and target images utilizing a pretrained Inception-V3 model [46]. Except for the FID, the other metrics are full-reference image quality assessments, in which restored images are compared with their corresponding ground-truth images. The LPIPS evaluates the restoration effect by computing the similarity between the deep features of two images using AlexNet [72]. The SSIM calculates the difference between two images in terms of their brightness, contrast, and structure. Finally, the PSNR analyzes the restoration performance by measuring the distances between the pixels of two images. The quantitative comparison results on the Paris StreetView, CelebA, and Places2 datasets are listed in Table 2, Table 3, and Table 4, respectively. For all the results, the first and second highest values are labeled in bold and underlined (↓ indicates that lower is better; ↑ indicates that higher is better).
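As a concrete reference, the simpler metrics can be sketched in a few lines of NumPy. This is a minimal illustration, not the evaluation code used here: the SSIM below is a simplified single-window variant (practical SSIM averages the statistic over local sliding windows), and the Fréchet distance is shown for the special case of diagonal covariances; the full FID and LPIPS additionally require deep features from pretrained Inception-V3 and AlexNet networks, respectively.

```python
import numpy as np

def psnr(gt, pred, max_val=255.0):
    """Peak signal-to-noise ratio between two images (higher is better)."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(gt, pred, max_val=255.0):
    """Simplified SSIM computed over the whole image (higher is better).

    Compares brightness (means), contrast (variances), and structure
    (covariance); practical SSIM averages this over local windows.
    """
    x, y = gt.astype(np.float64), pred.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2  # stability constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def fid_diag(mu1, var1, mu2, var2):
    """Fréchet (Wasserstein-2) distance between two Gaussians, assuming
    diagonal covariances for simplicity (lower is better). The real FID
    uses full covariances of Inception-V3 features and a matrix sqrt."""
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))
```

Identical images give an infinite PSNR and an SSIM of 1, and identical feature statistics give a Fréchet distance of 0, matching the directions (↑/↓) reported in the tables.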
On the Paris StreetView dataset, our PEIPNet method ranked first or second for all metrics. For the two smaller mask rates, PEIPNet achieved the best results in terms of the FID and LPIPS. However, for the two larger mask rates, RFR had the best FID and LPIPS results, while PEIPNet had the second-best. For the SSIM and PSNR, SPL and PEIPNet had the best and second-best results, respectively, for all mask rates. The excellent performance of PEIPNet is attributed to the low number of artifacts in the generated images. The textures of different objects were retained as well, to which the FID and LPIPS are highly sensitive. Hence, PEIPNet excels at filling in small masked regions, though its advantage diminishes on the large-hole inpainting task. This is because the DDCM and ESA encourage PEIPNet to obtain various meaningful hints from different regions of the feature maps by identifying global long-range and local features through dilated convolution and nonlocal attention; with small masked areas, many valid regions are available. When there are insufficient regions from which to obtain information, this mechanism yields reduced performance.
On the CelebA dataset, PEIPNet ranked first or second for the LPIPS, SSIM, and PSNR. EC had the best FID outcomes for all mask rates, followed by SPL. However, the FID difference between SPL and PEIPNet was very small, except at the largest mask rate. For the LPIPS, PEIPNet had the best results with the first three mask rates and the second-best with the highest mask rate; RFR showed the opposite pattern, ranking second-best for the first three mask rates and best for the highest. For the SSIM and PSNR, SPL had the best values, followed by PEIPNet. As on the Paris StreetView dataset, the disparity in inpainting performance relative to the best method continued to increase as the mask rate increased, for the same reason mentioned previously.
On the Places2 dataset, PEIPNet again had either the best or second-best LPIPS, SSIM, and PSNR. Unlike on the CelebA dataset, PEIPNet had the second-best FID outcome for two of the four mask rates. For the LPIPS, PEIPNet had the lowest values with the first three mask rates and the second lowest with the highest mask rate; SPL showed the opposite pattern. For the SSIM and PSNR, PEIPNet had the second highest values for all mask rates, while SPL had the best outcomes. The same widening gap relative to the best result as the mask rate increased was observed on the Places2 dataset as well.
The proposed PEIPNet method showed exceptional performance for all metrics: FID, LPIPS, SSIM, and PSNR. In most cases, PEIPNet had the best or second-best outcome; this tendency was not observed for the other methods. Specifically, PEIPNet achieved at least the second-best results on the Paris StreetView dataset, indicating the advantage of having a small number of model parameters when training with a limited number of samples. Thus, our quantitative evaluations confirmed the generalizability of the proposed method.
4.7. Ablation Studies
To verify the effects of the introduced DDCM and ESA, ablation studies of our method were conducted on the Paris StreetView dataset. Specifically, we divided the DDCM into two parts for analysis, namely, the dilated convolution and the dense block. To reduce the training time, we set the batch size to eight for all combinations.
The quantitative results with different combinations of the DDCM and ESA on the Paris StreetView dataset are listed in Table 5. For the DDCM, eliminating the entire module degraded model performance: the average FID and LPIPS increased by 5.3607 and 0.0102, while the SSIM and PSNR decreased by 0.0083 and 0.4658, respectively, compared with the original model. Comparing the two parts of the DDCM, retaining the dilated convolutional layers yielded better results for all metrics, which indicates the importance of long-range feature extraction in the image-inpainting task. ESA also plays a crucial role: without it, the average FID and LPIPS increased by 2.32 and 0.0034, while the SSIM and PSNR decreased by 0.0025 and 0.0676, respectively. However, the decline in performance without ESA was smaller than that without the DDCM, indicating that the DDCM is the more dominant component of the proposed method.
The qualitative results with different combinations of the DDCM and ESA on the Paris StreetView dataset are shown in Figure 8. Unlike the original model, the other combinations did not retain the streetlight structure; specifically, the pillar was disconnected from the head of the lamp, which is unrealistic. The original model also provided the best restoration of the texture of the leaves, demonstrating the strength of the proposed modules.
Finally, we calculated the complexities of the different model combinations, as described in Section 4.4 and summarized in Table 6. The contribution of dilated convolution was minor, with almost no change in memory when it was eliminated. Removing the dense block had a greater impact on memory than removing dilated convolution, though the change remained insignificant. On the other hand, eliminating ESA had a significant impact, reducing the computational cost by 4.51%. Thus, despite its structural efficiency, adopting self-attention remains costly.