Article

Ancient Painting Inpainting with Regional Attention-Style Transfer and Global Context Perception

1 Chinese Painting Department, Xi’an Academy of Fine Arts, Xi’an 710065, China
2 Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8777; https://doi.org/10.3390/app14198777
Submission received: 6 September 2024 / Revised: 24 September 2024 / Accepted: 26 September 2024 / Published: 28 September 2024
(This article belongs to the Special Issue Advanced Technologies in Cultural Heritage)

Abstract

Ancient paintings, as a vital component of cultural heritage, encapsulate a profound depth of cultural significance. Over time, they often suffer from various forms of degradation, leading to damage. Existing ancient painting inpainting methods struggle with semantic discontinuity and with blurred textures and details in missing areas. To address these issues, this paper proposes a generative adversarial network (GAN)-based ancient painting inpainting method named RG-GAN. Firstly, to address the inconsistency between the styles of missing and non-missing areas, we propose a Regional Attention-Style Transfer Module (RASTM) that achieves complex style transfer while maintaining the authenticity of the content. Meanwhile, a Multi-scale Fusion Generator (MFG) is proposed, whose multi-scale residual downsampling modules reduce the size of the feature map while effectively extracting and integrating features of different scales. Secondly, a multi-scale fusion mechanism leverages the Multi-scale Cross-layer Perception Module (MCPM) to enhance the feature representation of the filled areas and resolve the semantic incoherence of the missing regions. Finally, the Global Context Perception Discriminator (GCPD) is proposed to address deficiencies in capturing detailed information; it strengthens cross-dimensional information interaction and improves the discriminator’s ability to identify specific spatial areas and extract critical details. Experiments on the ancient painting and ancient Huaniao++ datasets demonstrate that our method achieves the highest PSNR values of 34.62 and 23.46 and the lowest LPIPS values of 0.0507 and 0.0938, respectively.

1. Introduction

Ancient paintings, as a form of art that integrates color and form with architecture, are subject to localized damage from environmental influences, major natural disasters, and intentional destruction [1], leaving this treasure trove of artistic legacy exceedingly fragile. Traditional ancient painting inpainting [2] is mainly carried out by experienced experts through manual reproduction. This approach depends heavily on the expertise of the restorer and is inefficient; more importantly, subjective misinterpretation of a painting’s content can cause irreversible damage. Ancient painting inpainting has therefore recently emerged as a fundamental research topic in computer vision [3].
Since the advent of deep learning [4], numerous studies in image processing and computer vision have leveraged its powerful capabilities for feature extraction and learning, achieving groundbreaking results. The integration of deep learning into image inpainting has addressed the limitations of traditional methods, which often fail to consider comprehensive image information. Among these advancements, generative adversarial networks (GANs [5]) stand out due to their distinctive adversarial approach and superior generative capabilities, and they have been extensively applied across various image inpainting applications. GAN-based image inpainting technology [6], which combines a potent generative network with a discriminative network, has demonstrated significant advantages. The generative network is tasked with creating realistic image content, while the discriminative network assesses the authenticity of the output; this adversarial training significantly enhances the quality and precision of the details in the restored images. Because the excellent feature fusion of the U-Net [7] network can learn different levels of information in an image, researchers have applied it to image inpainting tasks and achieved excellent results. In addition, Ouyang et al. [8] used a GAN for image refinement, combined with a Transformer [9] module to restore refined features, achieving superior performance in image recovery tasks. Chahi et al. [10] proposed a generic multi-kernel filter-based adversarial generative network to extract more relevant and discriminating features for image tasks. Sun et al. [11] proposed an edge-guided ancient painting inpainting method that uses edge images to guide network training and self-attention to strengthen the focus on edge and detail features. Furthermore, Mardieva et al. [12] demonstrated significant improvements in image quality using a lightweight image super-resolution approach with deep residual feature distillation, which is relevant to enhancing performance in image restoration tasks.
Ancient painting inpainting requires a delicate balance between preserving the original artistic integrity and addressing the damages incurred over time. While considerable progress has been made in the field, existing GAN-based methods often struggle to effectively capture intricate details and maintain semantic coherence, especially in severely damaged or missing regions. To address these challenges, we propose a novel ancient painting inpainting network, RG-GAN, which introduces three key modules—RASTM (Regional Attention-Style Transfer Module), MFG (Multi-scale Fusion Generator), and GCPD (Global Context Perception Discriminator)—each designed to target specific limitations of prior inpainting methods. Firstly, RASTM improves the preservation of stylistic consistency by incorporating regional attention mechanisms, ensuring that transferred styles match the original artistic context more closely. Secondly, MFG enhances the reconstruction of fine-grained details across various scales, addressing the challenge of perceptual inconsistency between the inpainted areas and the original artwork. Finally, the GCPD further strengthens the model’s ability to capture global contextual relationships, enabling a more accurate and detailed inpainting, even in areas with extensive damage. By explicitly addressing these gaps, RG-GAN offers a more comprehensive and nuanced solution to ancient painting inpainting. The experimental results on both the ancient paintings and the ancient Huaniao++ datasets demonstrate that our method achieves superior semantic coherence and detail restoration compared to existing approaches.
The contributions of this article can be summarized as follows:
  • To accurately match the features of the missing and normal regions, we design a Regional Attention-Style Transfer Module that uses an attention mechanism to transfer the feature distribution of the normal-area content to the missing area, achieving fine-grained alignment between the two regions and enabling complex style transfer while preserving the authenticity of the content.
  • A new Multi-scale Fusion Generator (MFG) is designed. Its encoder uses multi-scale residual downsampling modules to reduce the feature map size while effectively extracting and integrating features of different scales, providing a rich feature representation for the network. Similar semantic features at different scales extracted by the encoder are then fused, with information exchanged through skip connections.
  • We propose the Multi-scale Cross-layer Perception Module (MCPM), which leverages deep semantic features across multiple scales from the encoder, enriching the feature representation of the inpainted areas. By integrating multi-level semantic information from the encoder, the decoder can fully exploit the encoder’s rich features, leading to more coherent and detailed inpainting results.
  • We propose the Global Context Perception Discriminator (GCPD), which combines a dedicated feature extractor with a global attention mechanism to deepen feature extraction, allowing the network to capture more subtle and complex features of ancient painting images. It effectively retains channel and spatial information, strengthening cross-dimensional interaction and significantly improving the discriminator’s ability to locate specific spatial areas and extract critical detail information.

2. Related Work

2.1. Ancient Painting Inpainting

Ancient painting is not only of great artistic value but also a testimony to history and culture, recording the development of human society and reflecting the social landscape of different historical periods [13]. The preservation of these precious works of art is important because they are an irreplaceable reserve of civilization and cultural diversity. Traditional ancient painting inpainting methods have a long history and rely mainly on the hand craftsmanship of restorers and their deep understanding of materials and techniques. The inpainting process usually involves careful examination of the painting, as well as cleaning, decontaminating, coloring, and repairing damage to the canvas or paper, and these methods emphasize respect for the original. Traditional inpainting methods have accumulated a wealth of experience in practice, but they have certain limitations, such as the high demand on restorers’ skills and the long time required for the inpainting process.
As deep learning has swiftly advanced [14,15], it has demonstrated an unmatched ability to revitalize the domain of ancient painting inpainting [3]. Ancient painting inpainting requires not only the repair of a painting’s physical form but also the preservation of its artistic value and historical information. Deep learning enables the intelligent inpainting of missing regions [16]: the filled content should not only integrate visually with the original but also maintain a high degree of consistency in artistic style and historical context [8]. Refs. [3,17] utilized neural networks to identify and understand complex features in ancient paintings, such as textures, edges, and styles, performing in-depth analysis and highly accurate inpainting. In addition, generative adversarial networks (GANs [5]) have proven adept at the inpainting of ancient paintings. Comprising a generator that produces the reconstructed image and a discriminator that assesses the image’s genuineness, GANs employ an adversarial training approach, and this dynamic balance ensures the restored paintings are visually indistinguishable from their authentic counterparts [18]. Additionally, ancient paintings often contain fine lines and complex patterns; a GAN can learn these detailed features and seamlessly generate them for the inpainted regions based on the original painting [19], which is challenging with conventional methods. The evolution of ancient painting inpainting methods exemplifies a transition from artisanal skills to the incorporation of contemporary technological advancements.

2.2. Multi-Scale Fusion

Multi-scale fusion [20] plays a pivotal role in image inpainting within the framework of generative adversarial networks (GANs). This approach operates at various resolution levels to comprehensively reconstruct both the structure and content of an image. At coarser scales, the network first captures and reconstructs the fundamental framework and principal elements of the image, ensuring macroscopic accuracy and consistency of the inpainting results [21]. At finer scales, the network then concentrates on recovering subtle image features such as edges, textures, and local patterns, which are essential for enhancing the authenticity of the restored images [22]. In image inpainting, multi-scale fusion typically encompasses a spectrum of features from coarse to fine, facilitating multi-level processing [23]; at the lower levels, the focus shifts towards the meticulous reconstruction of details and textures. The advantage of multi-scale fusion lies in its ability to balance global and local features, thereby generating richer and more realistic image outputs. Its application in GAN-based image inpainting not only significantly enhances the visual quality and detail richness of the restored images but also contributes to improved model stability and training efficiency, which is particularly beneficial when addressing complex and large-scale image inpainting tasks [24].

3. RG-GAN

3.1. Overview

RG-GAN is tailored to restore images with a high degree of fidelity, focusing particularly on ancient paintings, where the preservation of historical and artistic integrity is paramount. As shown in Figure 1, RG-GAN consists of a Multi-scale Fusion Generator and a Global Context Perception Discriminator. The generator’s encoder converts damaged or incomplete ancient images into a set of feature representations, which the generator’s decoder then uses to reconstruct the image and fill in the missing areas [25]. This design ensures that the high-level semantic features captured by the encoder are effectively transmitted and maintained during reconstruction, addressing the problems of semantic incoherence and missing high-level feature information. The discriminator identifies subtle differences between the inpainting results and real ancient painting images [26] and provides feedback that guides the generator to produce more realistic and consistent images, further improving the quality and artistry of the inpainting results.

3.2. Multi-Scale Fusion Generator

In image inpainting, the U-Net architecture typically faces challenges, particularly when processing images with missing areas [27]. This is because the skip connections have almost no input signal in the center of the missing area, making it difficult to effectively pass texture information to the decoder, which can result in texture blurring and structural distortion in the repaired image. Additionally, the U-Net model has deficiencies in both global semantic consistency and local feature continuity, leading to semantically incoherent inpainting results.
In this paper, we propose an innovative Multi-scale Fusion Generator (MFG). By incorporating the Regional Attention-Style Transfer Module, it realizes fine-grained alignment between the content in the missing area and the content in the normal area, achieving complex style transfer while maintaining the authenticity of the content and improving the quality of the information passed through the skip connections. The MFG further leverages multi-scale fusion to progressively integrate the filled high-level features with low-level features, ensuring that basic features can access and integrate semantic insights from higher-level features and thereby improving the conveyance and use of feature information. Additionally, the MFG decoder outputs a repaired image at each scale, which is used to calculate pyramid losses, optimizing the filling of the missing area at every scale. The multi-scale fusion generator, combined with the regional attention-style transfer module, can adapt to various image inpainting scenarios, thereby improving the quality of inpainting.
As shown in Figure 1, the MFG encoder consists of six multi-scale residual downsampling (MRD) layers, which reduce the size of the input feature map from 128 × 128 to 4 × 4. The multi-scale residual downsampling module is shown in Figure 2. The decoder also contains a six-layer structure and interacts with the encoder through skip connections. The MRD module consists of two branches: the left branch uses 1 × 1 and 3 × 3 convolution modules to extract fine-grained features, while the right branch uses depth-wise convolution to improve the inpainting of ancient painting images while reducing the model parameters. Finally, the output features of the two branches are fused, effectively extracting and integrating features of different scales, providing a rich feature representation for the network and reusing the input features through a residual connection. The MRD can be expressed as:
X1 = Conv1(Conv1(Conv3(Conv1(F)))) ⊕ F,
X2 = PWConv1(DWConv3(F)),
X = Conv1(Cat(X1, X2)),
where DWConv3 refers to the 3 × 3 depth-wise convolution, PWConv1 denotes the 1 × 1 pointwise convolution, Conv3 and Conv1 represent the standard 3 × 3 and 1 × 1 convolutions, respectively, Cat is the concatenation operation, and ⊕ indicates element-wise addition.
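For clarity, the PyTorch sketch below shows one possible realization of the MRD block described above. The channel widths, the use of average pooling for the downsampling step, and the absence of normalization layers are our assumptions; only the two-branch layout, the residual addition, and the final concatenation-plus-1 × 1 fusion follow the equations.

```python
import torch
import torch.nn as nn

class MRD(nn.Module):
    """Minimal sketch of the multi-scale residual downsampling (MRD) block.

    Channel sizes and the downsampling operator are assumptions; the paper
    specifies only the two-branch structure given in the equations above.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Left branch: Conv1 -> Conv3 -> Conv1 -> Conv1, fused with the input
        # through a residual (element-wise) addition.
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1),
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
            nn.Conv2d(in_ch, in_ch, 1),
            nn.Conv2d(in_ch, in_ch, 1),
        )
        # Right branch: 3x3 depth-wise conv followed by 1x1 point-wise conv.
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # DWConv3
            nn.Conv2d(in_ch, in_ch, 1),                           # PWConv1
        )
        # Fuse the concatenated branches with a 1x1 convolution, then halve
        # the spatial resolution (average pooling is an assumed choice).
        self.fuse = nn.Conv2d(2 * in_ch, out_ch, 1)
        self.down = nn.AvgPool2d(2)

    def forward(self, f):
        x1 = self.left(f) + f                       # X1 with residual addition
        x2 = self.right(f)                          # X2
        x = self.fuse(torch.cat([x1, x2], dim=1))   # X = Conv1(Cat(X1, X2))
        return self.down(x)
```

For example, `MRD(64, 128)(torch.randn(1, 64, 128, 128))` returns a tensor of shape (1, 128, 64, 64).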

3.3. Multi-Scale Cross-Layer Perception Module

Regional Attention-Style Transfer Module (RASTM): To match the characteristics of the missing and normal areas more accurately and ensure that the style of the missing region is consistent with that of the undamaged image, this paper designs a Regional Attention-Style Transfer Module (RASTM), as shown in Figure 3. First, the input feature F is multiplied by the Mask to obtain the missing-area content and by 1-Mask to obtain the normal-area style content. The attention mechanism then transfers the feature distribution of the normal area to the missing area, achieving fine-grained alignment between the two regions so that their contents are matched accurately at different spatial positions, realizing complex style transfer while maintaining the authenticity of the content.
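As an illustration, the sketch below implements this masked attention transfer in PyTorch. The scaled dot-product attention form, the 1 × 1 query/key/value projections, and the final recombination of the two regions are assumptions; the text specifies only that the normal-area feature distribution is transferred to the missing area via attention.

```python
import torch
import torch.nn as nn

class RASTM(nn.Module):
    """Sketch of the Regional Attention-Style Transfer Module.

    The exact attention formulation is not given in the paper, so a standard
    scaled dot-product attention from missing-area queries to normal-area
    keys/values is assumed here.
    """
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        self.out = nn.Conv2d(dim, dim, 1)

    def forward(self, feat, mask):
        # mask: 1 inside the missing region, 0 in the normal region.
        missing = feat * mask            # content of the missing area
        normal = feat * (1.0 - mask)     # style content of the normal area

        b, c, h, w = feat.shape
        q = self.q(missing).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.k(normal).flatten(2)                    # (B, C, HW)
        v = self.v(normal).flatten(2).transpose(1, 2)    # (B, HW, C)

        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)   # (B, HW, HW)
        transferred = (attn @ v).transpose(1, 2).reshape(b, c, h, w)

        # Fill the missing region with the transferred style while keeping
        # the normal region unchanged (recombination is an assumed choice).
        return self.out(transferred) * mask + feat * (1.0 - mask)
```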
Conventional multi-scale fusion mechanisms often convey feature data from only a single layer of the encoder to the decoder, thereby limiting the low-level features’ ability to discern semantic cues from higher-level features. This limitation leads to discontinuity in both the features and the semantic context within the restored area. To overcome this limitation, we adopt an improved multi-scale fusion mechanism that enhances the transfer of feature maps of missing areas filled by the Regional Attention-Style Transfer Module (RASTM). This enables the decoder to use the encoder’s multi-layer features more effectively, thereby improving the lack of high-level feature information in the inpainting results.
Multi-scale Cross-layer Perception Module (MCPM): As shown in Figure 4, the MCPM progressively incorporates the already filled high-level features into the low-level features, thereby improving feature integration. For example, let F1 represent high-level features in the encoder and F2 and F3 represent mid-level features. After processing by the RASTM, SF1, SF2, and SF3 are obtained. SF1 is then upsampled and fused with SF2, the fused feature is refined by 1 × 1 and 3 × 3 convolution operations, and a Softmax is applied to obtain DF2, which is passed to the decoder and further upsampled to blend with subsequent feature maps. This skip connection not only transmits information from a single feature map but also allows the low-level features to perceive the semantic information of the high-level features.
Firstly, the continuous features are calculated by the regional attention-style transfer module as follows:
SF1 = RASTM(F1),
SF2 = RASTM(F2),
SF3 = RASTM(F3),
where RASTM is the Regional Attention-Style Transfer Module, which uses attention mechanisms to enhance specific regional features during style transfer, and SF1, SF2, and SF3 are the features obtained from the RASTM.
Secondly, fusion features are obtained using skip connections and convolution operations as follows:
DF2 = Softmax(Conv3(Conv1(SF2 ⊕ Up(SF1)))),
DF3 = Softmax(Conv3(Conv1(SF3 ⊕ Up(SF2)))),
where ⊕ denotes element-wise addition at the pixel level, Up refers to upsampling, and Softmax is the activation function used to normalize the output into a probability distribution.
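A minimal sketch of one MCPM fusion step (the DF2 computation above) might look as follows; the channel-alignment projection, the bilinear upsampling mode, and the choice of softmax dimension are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCPMFusion(nn.Module):
    """One cross-layer fusion step of the MCPM:
    DF = Softmax(Conv3(Conv1(SF_low + Up(SF_high)))).
    """
    def __init__(self, high_ch, low_ch):
        super().__init__()
        self.proj = nn.Conv2d(high_ch, low_ch, 1)    # align channels (assumed)
        self.conv1 = nn.Conv2d(low_ch, low_ch, 1)
        self.conv3 = nn.Conv2d(low_ch, low_ch, 3, padding=1)

    def forward(self, sf_high, sf_low):
        # Upsample the smaller, higher-level map to the lower level's size.
        up = F.interpolate(sf_high, size=sf_low.shape[-2:],
                           mode='bilinear', align_corners=False)
        fused = sf_low + self.proj(up)               # element-wise addition
        fused = self.conv3(self.conv1(fused))
        # Softmax over spatial positions, producing a normalized map
        # (the normalization axis is our assumption).
        b, c, h, w = fused.shape
        return torch.softmax(fused.flatten(2), dim=-1).view(b, c, h, w)
```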

3.4. Global Context Perception Discriminator

In generative adversarial networks, the discriminator is critical in evaluating the authenticity of the generated image and in providing feedback to guide the generator in creating more realistic images. To improve this process, a Global Context Perception Discriminator (GCPD) is proposed. It utilizes a global attention mechanism to preserve information across channels and spatial dimensions, reinforcing the importance of cross-dimensional interactions. This enables the discriminator to more accurately understand and evaluate inpainted images, thereby enhancing the model’s performance in diverse inpainting tasks.
The structure of the GCPD is shown in Figure 5. It consists of several key parts: a feature extractor, a global attention module (GAM), a convolution module, and a fully connected (FC) layer. Specifically, the feature extractor is composed of two branches: the left branch consists of three consecutive 3 × 3 convolution layers that extract initial visual features from the input ancient painting images, while the right branch deepens the feature extraction process with 3 × 3 and 1 × 1 convolutions so that the network can capture more subtle and complex features of ancient painting images. The features extracted by the two branches are then fused, and the number of channels is adjusted by a 1 × 1 convolution. Finally, the global attention mechanism is applied to capture the key multi-dimensional features of the image while retaining channel and spatial information. This design not only reduces the model size of the convolutional neural network but also improves its performance, making the GCPD more efficient and accurate in evaluating generated images.
As depicted in Figure 6, the global attention module (GAM) sequentially combines channel attention with spatial attention. The channel attention component first preserves the three-dimensional (channel, height, width) information of the feature map; the inter-dimensional channel and spatial correlations are then amplified by a multi-layer perceptron (MLP), and this enhancement is fused with the initial input features via a residual link, boosting the features’ descriptive power.
In the spatial attention section, two 7 × 7 convolution layers are used to effectively capture and blend spatial information. GAM is designed to accurately identify and emphasize key channel and spatial features in images, providing a richer and more accurate representation of features for subsequent image processing tasks, which can be expressed as:
F1 = σ(InversePermute(MLP(Permute(F)))) ⊙ F,
F2 = σ(Conv7(Conv7(F1))) ⊙ F1,
where σ is the sigmoid activation function, defined as σ(x) = 1 / (1 + e^(−x)), MLP refers to a shared multi-layer perceptron, and Permute and InversePermute rearrange the order of the feature dimensions: Permute reorders the dimensions of the input tensor to facilitate the MLP operation, and InversePermute restores the original order afterwards. The symbol ⊙ represents element-wise (pixel-wise) multiplication, and Conv7 denotes a convolution with a 7 × 7 kernel.
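The following sketch mirrors the GAM equations above in PyTorch. The reduction ratio r and the intermediate channel width of the spatial branch are assumptions; the permute–MLP–inverse-permute channel attention and the two 7 × 7 convolutions follow the description.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Sketch of the global attention module used by the GCPD."""
    def __init__(self, channels, r=4):
        super().__init__()
        # Channel attention: MLP applied along the channel dimension.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        # Spatial attention: two 7x7 convolutions.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // r, 7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 7, padding=3),
        )

    def forward(self, f):
        # Channel attention: Permute -> MLP -> InversePermute -> sigmoid, then ⊙ F.
        perm = f.permute(0, 2, 3, 1)                      # (B, H, W, C)
        ca = torch.sigmoid(self.mlp(perm).permute(0, 3, 1, 2))
        f1 = ca * f
        # Spatial attention: two 7x7 convolutions -> sigmoid, then ⊙ F1.
        sa = torch.sigmoid(self.spatial(f1))
        return sa * f1
```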

3.5. Loss Function

Within the framework of the RG-GAN, the loss function comprises several components, reflecting the integration of regional attention-style transfer and multi-scale fusion technology as well as the multi-scale repair of the image from the decoder outputs. Specifically, it consists of two key components: the adversarial loss and the reconstruction loss.

3.5.1. Adversarial Loss

In GAN training, the discriminator may stop providing useful updates once it approaches an optimal state, resulting in erratic training. To address this, WGAN [28] was introduced, which uses weight clipping to stabilize training. However, this approach has limitations because it relies on a fixed weight range, restricting the expressive power of the network. WGAN-GP [29] was subsequently proposed; it enhances training stability by adding a per-sample gradient penalty term to the objective function. In this paper, WGAN-GP is used to calculate the adversarial loss, enabling more stable and efficient training and balancing the generator and discriminator to promote high-quality image generation.
L_adv = E[D(x)] − E[D(G(z))] + λ_gp · L_gp,
where E denotes the expectation operator, x represents the real image, z denotes the image to be repaired, D(x) is the output of the discriminator applied to the real data x, and D(G(z)) is its output on the generated data G(z), with G being the generator. The term L_gp is the gradient penalty, computed from the gradient of the discriminator’s output with respect to its input, which enforces smoothness and regularizes the training process, and λ_gp is its weighting coefficient.
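For reference, a standard WGAN-GP critic loss with gradient penalty can be sketched as below; the penalty weight λ_gp = 10 and the sign convention for minimization are common WGAN-GP choices rather than settings reported in the paper.

```python
import torch

def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    """Gradient penalty term L_gp, evaluated on random interpolations
    between real and generated images (standard WGAN-GP recipe)."""
    b = real.size(0)
    eps = torch.rand(b, 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_hat = discriminator(x_hat)
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat,
                                create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return lambda_gp * gp

def discriminator_adv_loss(discriminator, real, fake):
    # Critic loss to minimize: E[D(fake)] - E[D(real)] + gradient penalty;
    # the generator minimizes -E[D(fake)] in its own update step.
    fake = fake.detach()
    return (discriminator(fake).mean() - discriminator(real).mean()
            + gradient_penalty(discriminator, real, fake))
```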

3.5.2. Reconstruction Loss

This paper introduces the reconstruction loss to ensure that the details of the inpainting area are preserved while maintaining the original semantics. The reconstruction loss is defined as:
L_rec = E_{x_a} [ ‖ x_a − G(x_a, a) ‖_1 ],
where ‖·‖_1 denotes the L1 norm, which suppresses the blurring effect of the image, a is the original property label, and x_a is the reconstructed image.

3.5.3. Total Loss Function

L_total = u1 · L_adv + u2 · L_rec,
where L_adv is the adversarial loss, L_rec is the reconstruction loss, and u1 and u2 are the weights of the corresponding loss functions, with u1 > 0 and u2 > 0 (indicating that neither L_adv nor L_rec can be excluded from L_total).
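A minimal sketch of the generator-side total loss is given below; the placeholder weights u1 = 1 and u2 = 10 are illustrative only, since the paper states only that both weights are positive.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def generator_total_loss(d_fake_score, restored, target, u1=1.0, u2=10.0):
    """Total generator loss: u1 * L_adv + u2 * L_rec.

    d_fake_score is the discriminator output on the generated image;
    restored and target are the inpainted and ground-truth images.
    """
    adv = -d_fake_score.mean()      # generator adversarial term (WGAN form)
    rec = l1(restored, target)      # L1 reconstruction loss
    return u1 * adv + u2 * rec
```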

4. Experiments

4.1. Datasets

In this paper, we build two thematic datasets of ancient paintings for the study of ancient painting inpainting. The first dataset consists of 5798 landscape-themed ancient paintings from several historical periods, each resized to a uniform 256 × 256 pixels. The second dataset contains 4638 ancient paintings of flower and bird subjects, also from different historical periods and styles, likewise resized to 256 × 256 pixels. We divide both datasets into training and validation sets in an 8:2 ratio, as shown in Table 1. Finally, a script is used to resize each image to the specified size and automatically generate multiple masks, pairing every image with a different mask image so that each image has a corresponding mask.
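A hypothetical version of such a mask-pairing script is sketched below; the rectangular stroke shapes, stroke counts, and function names are purely illustrative, as the paper does not describe its mask generator in detail.

```python
import numpy as np
from PIL import Image

def make_random_mask(size=256, max_strokes=8, rng=None):
    """Illustrative mask generator: draws a few random rectangles and
    marks them as missing (255). The real mask shapes are unspecified."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = np.zeros((size, size), dtype=np.uint8)
    for _ in range(int(rng.integers(1, max_strokes + 1))):
        x, y = rng.integers(0, size, 2)
        w, h = rng.integers(16, 96, 2)
        mask[y:min(y + h, size), x:min(x + w, size)] = 255
    return Image.fromarray(mask)

def pair_image_with_mask(src_path, dst_img_path, dst_mask_path, size=256):
    # Resize the painting to the target size and write a freshly drawn mask
    # alongside it, so every image is paired with its own mask file.
    Image.open(src_path).convert('RGB').resize((size, size)).save(dst_img_path)
    make_random_mask(size).save(dst_mask_path)
```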

4.2. Experimental Setup

The experiments were conducted on an NVIDIA RTX 4090 GPU with 24 GB of memory, using torch 2.0.0 and CUDA 11.8. The Adam algorithm was used to optimize the inpainting model parameters, with an additional hyperparameter of 0.00001, a learning rate of 0.0001, a batch size of 16, and 400 epochs.
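The optimizer setup can be reproduced roughly as follows; interpreting the additional 0.00001 hyperparameter as Adam's weight decay is our assumption, and the placeholder modules stand in for the actual RG-GAN generator and discriminator.

```python
import torch
import torch.nn as nn

# Placeholder networks; substitute the real RG-GAN generator/discriminator.
generator = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
discriminator = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))

# Adam with lr = 1e-4; weight_decay = 1e-5 is our reading of the extra
# hyperparameter stated above.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, weight_decay=1e-5)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, weight_decay=1e-5)

BATCH_SIZE, EPOCHS = 16, 400
```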

4.3. Evaluation Metrics

In the conducted experiment, a suite of metrics was employed to assess the model’s efficacy, including the mean absolute error (MAE) for measuring the average magnitude of the error, the peak signal-to-noise ratio (PSNR) to evaluate the quality of the reconstruction, the structural similarity index measure (SSIM) for gauging the similarity in structural patterns between the model’s output and the reference, and the learned perceptual image patch similarity (LPIPS) to quantify perceptual similarity based on learned features.
MAE is expressed by
MAE = (1 / MN) Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} | I(i, j) − K(i, j) |,
where I(i, j) and K(i, j) are the pixel values of the original image I and the reconstructed image K at position (i, j), and M and N are the image dimensions.
PSNR is expressed by
MSE = (1 / MN) Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} [ I(i, j) − K(i, j) ]²,
PSNR = 10 · log10( MAX_I² / MSE ),
where MAX_I is the maximum possible pixel value of the image.
SSIM can be expressed by
SSIM = l(x, y)^α · c(x, y)^β · s(x, y)^γ = [ (2 μx μy + C1)(2 τxy + C2) ] / [ (μx² + μy² + C1)(τx² + τy² + C2) ],
where l(x, y), c(x, y), and s(x, y) represent the similarity in luminance, contrast, and structure, respectively. Additionally, μx and μy are the mean values of x and y, τx and τy are their standard deviations, τxy is their covariance, and C1 and C2 are constants that maintain stability.
LPIPS is expressed by
LPIPS(x, y) = Σ_l w_l · ‖ f_l(x) − f_l(y) ‖_2²,
where f_l(x) and f_l(y) are the features extracted from images x and y at layer l of a pre-trained deep neural network, ‖·‖_2 denotes the Euclidean distance between f_l(x) and f_l(y), w_l is the weight of the l-th layer, and the summation is performed over the selected feature layers.
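As a usage example, the per-image metrics can be computed with standard tooling as sketched below (MAE by hand, PSNR and SSIM via scikit-image); LPIPS is omitted because it requires a pretrained perceptual network, such as the one provided by the lpips package.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(original, restored):
    """Compute MAE, PSNR, and SSIM for one image pair.

    Both inputs are uint8 RGB arrays of identical shape (H, W, 3).
    """
    orig = original.astype(np.float64)
    rest = restored.astype(np.float64)
    mae = np.abs(orig - rest).mean()
    psnr = peak_signal_noise_ratio(original, restored, data_range=255)
    ssim = structural_similarity(original, restored,
                                 channel_axis=-1, data_range=255)
    return {"MAE": mae, "PSNR": psnr, "SSIM": ssim}
```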

4.4. Experiments on the Ancient Painting Dataset

4.4.1. Quantitative Comparison

In this paper, we compare four models: PI [30], RFR [31], EC [32], and FcF [33]. As can be seen from Table 2, our RG-GAN model achieves a PSNR value of 34.62, the highest among the models, indicating that it attains the best quality in reconstructing and restoring images and recovers the original information of ancient images more accurately. Additionally, the RG-GAN model has an LPIPS value of 0.0507, the lowest among all models, indicating the smallest perceptible difference. This is because, by combining regional attention-style transfer and multi-scale cross-layer fusion technology, the model can realize complex style transfer while maintaining the authenticity of the restored content, and the decoder can comprehensively utilize feature information from different levels, alleviating the quality degradation caused by a lack of high-level semantic features. In addition, the proposed Global Context Perception Discriminator enhances, through its global attention mechanism, the discriminator’s ability in spatial region localization and detail extraction, significantly improving the performance of the model in various inpainting tasks.
From the comparison of the five methods on the PSNR indicator in Figure 7, it can be observed that RG-GAN achieves the best results during training. The PSNR and SSIM metrics of our method reach 34.37 and 0.928, respectively, the highest values among all models, while the MSE and LPIPS metrics reach the lowest values. This is because the model combines the regional attention-style transfer and multi-scale cross-layer fusion techniques, which address the U-Net model’s propagation of invalid pixels in missing areas and enable the decoder to leverage feature information from different levels, alleviating the quality problems caused by a lack of high-level semantic features. In addition, the proposed Global Context Perception Discriminator strengthens, through its global attention mechanism, the discriminator’s ability in spatial localization and detail extraction, improving the performance of the model in various inpainting tasks.

4.4.2. Qualitative Comparison

Figure 8 shows the visualization results of different methods on the landscape painting dataset. As can be seen, Figure 8c,d lose details and exhibit blurred edges, resulting in poor inpainting results, while Figure 8e,f show incoherent content in the missing areas. In contrast, our method produces higher-quality inpainted images than the other methods, with particularly prominent preservation of edge details, yielding the best inpainting results.

4.5. Experiments on the Ancient Huaniao++ Dataset

4.5.1. Quantitative Comparison

To verify the generalization of our method, we compared it with four mainstream models on the ancient Huaniao++ dataset. The experimental results are shown in Table 3. This model achieves the highest PSNR and SSIM values, 23.46 and 0.857, respectively, indicating that our RG-GAN excels at maintaining image structures and textures similar to the original. The repaired image is visually closer to the original and has the lowest MSE and LPIPS values. Our model’s performance on the ancient flowers and birds image dataset has proven to be superior to other mainstream image inpainting models.
As can be seen from the visualization of the results of the five methods on the ancient Huaniao++ dataset in Figure 9, our RG-GAN model achieves the highest PSNR value, and the widening gap with the other models illustrates the excellent performance of our method on the ancient painting inpainting task. Furthermore, our RG-GAN, with a model size of 16.7 M, processes each 256 × 256 image in approximately 72 ms. By comparison, the RFR [31] model, with a larger size of 33 M, takes 95 ms for the same image size. This highlights the efficiency of our approach: it not only reduces computational overhead but also delivers faster inference, making it well suited for real-world applications.

4.5.2. Qualitative Comparison

Figure 10 shows the visualization results of different methods on the ancient Huaniao++ dataset. As can be seen, the results in Figure 10c–f reduce the blurriness of the repaired images only to a certain degree and still lose detail features. In contrast, our method produces better results than the other image inpainting methods, showing semantically consistent repair of missing regions and excellent reproduction of image detail and edge texture.

4.6. Ablation Study

Table 4 illustrates the impact of various models on the inpainting of ancient paintings.
Effectiveness of the Regional Attention-Style Transfer Module (RASTM). As observed in the second row of Table 4, the Regional Attention-Style Transfer Module uses an attention mechanism to transfer the feature distribution of the normal-area style content to the missing-area content, achieving fine-grained alignment between the two regions and realizing complex style transfer while maintaining the authenticity of the content. Consequently, the PSNR and SSIM metrics increase by 0.87 and 0.010, respectively. Figure 11c demonstrates that local fuzziness appears in the inpainted image when the RASTM module is omitted.
Effectiveness of the Multi-scale Cross-layer Perception Module (MCPM). This paper introduces the Multi-scale Cross-layer Perception Module to ensure that skip connections do not rely solely on single-scale features but also incorporate multiple levels of semantic information from the encoder. This allows the decoder to fully exploit the encoder’s deep semantic features, enhancing the semantic coherence and detail richness of the repaired images. Figure 11d shows that semantic inconsistency appears in the missing regions when the MCPM module is omitted; with the MCPM, the PSNR and SSIM indicators improve by 2.38 and 0.029, respectively.
Effectiveness of the Global Context Perception Discriminator (GCPD). The Global Context Perception Discriminator (GCPD) addresses the shortcomings of existing models in capturing details during the inpainting of ancient paintings. By utilizing a global attention mechanism, it effectively retains information across channel and spatial dimensions, enhancing cross-dimensional information interaction and significantly improving inpainting quality. As depicted in Figure 11e, the results degrade without the GCPD module; adding it increases the PSNR and SSIM metrics by 1.19 and 0.018, respectively.
Figure 11f displays the image inpainting effect of our method. The repaired image produced by our method closely resembles the original, validating the effectiveness of our method.

5. Conclusions

To address the challenges in ancient painting inpainting, this paper proposes a novel inpainting method with multi-scale fusion and global context perception to counteract the damage suffered by ancient paintings over time. Firstly, to deal with the inconsistency between the styles of missing and non-missing areas, a Regional Attention-Style Transfer Module is proposed to achieve fine-grained alignment between the contents of the missing and normal areas and to realize complex style transfer while maintaining the authenticity of the content. By providing reasonable initial values for the missing areas of the feature map passed to the decoder, semantic consistency and feature continuity are ensured, improving the accuracy of the inpainting. Secondly, to address the lack of detailed information extraction, we design a multi-scale fusion mechanism that progressively refines the feature representation of the filled area, enabling the decoder to fully utilize the multi-level semantic information from the encoder and effectively improving the semantic coherence and detail richness of the repaired image. Additionally, a Global Context Perception Discriminator is proposed to overcome the limitations of existing models in detail capture; its global attention module enhances the discriminator’s ability to locate spatial areas and extract critical detail information, further improving the overall performance of the model. Experiments on the ancient painting and ancient Huaniao++ datasets demonstrate that our method achieves the highest PSNR values of 34.62 and 23.46, along with the lowest LPIPS scores of 0.0507 and 0.0938, respectively. In future work, we will investigate finer-grained ancient painting inpainting methods to achieve complete restoration of detailed textures.

Author Contributions

Project administration, X.L.; data curation, X.L.; writing—X.L.; methodology, X.L. and J.W.; investigation, N.W.; writing—review and editing, X.L., J.W. and N.W.; validation, X.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities (2020YJS031).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

The authors would like to thank all the reviewers for their insightful comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Barani, S.; Poornapushpakala, S.; Subramoniam, M.; Vijayashree, T.; Sudheera, K. Analysis on image inpainting of ancient paintings. In Proceedings of the 2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), Chennai, India, 28–29 January 2022; pp. 1–8. [Google Scholar]
  2. Baiandin, S.; Ivashko, Y.; Dmytrenko, A.; Bulakh, I.; Hryniewicz, M. Use of historical painting concepts by modern methods in the inpainting of architectural monuments. Int. J. Conserv. Sci. 2022, 13, 2. [Google Scholar]
  3. Baath, H.; Shinde, S.; Keniya, J.; Mishra, P.; Saini, A.; Dhiraj. Damage segmentation and inpainting of ancient wall paintings for preserving cultural heritage. In Proceedings of the International Conference on Computer Vision and Image Processing, Okinawa, Japan, 12–14 May 2023; pp. 102–113. [Google Scholar]
  4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  5. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  6. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
  7. Oktay, O.; Schlemper, J.; Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.; Kainz, B.; et al. Attention U-Net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  8. Ouyang, X.; Chen, Y.; Zhu, K.; Agam, G. Image Inpainting Refinement with Uformer GAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 5919–5928. [Google Scholar]
  9. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  10. Chahi, A.; Kas, M.; Kajo, I.; Ruichek, Y. MFGAN: Towards a generic multi-kernel filter based adversarial generator for image inpainting. Int. J. Mach. Learn. Cybern. 2024, 15, 1113–1136. [Google Scholar] [CrossRef]
  11. Sun, Z.; Lei, Y.; Wu, X. Chinese Ancient Paintings Inpainting Based on Edge Guidance and Multi-Scale Residual Blocks. Electronics 2024, 13, 1212. [Google Scholar] [CrossRef]
  12. Mardieva, S.; Ahmad, S.; Umirzakova, S.; Rasool, M.A.; Whangbo, T.K. Lightweight image super-resolution for IoT devices using deep residual feature distillation network. Knowl.-Based Syst. 2024, 285, 111343. [Google Scholar] [CrossRef]
  13. Luo, R.; Guo, L.; Yu, H. An ancient Chinese painting inpainting method based on improved generative adversarial network. J. Phys. Conf. Ser. 2022, 2400, 012005. [Google Scholar] [CrossRef]
  14. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [Google Scholar] [CrossRef]
  15. Messeri, L.; Crockett, M. Artificial intelligence and illusions of understanding in scientific research. Nature 2024, 627, 49–58. [Google Scholar] [CrossRef] [PubMed]
  16. Zeng, Y.; Gong, Y. Nearest neighbor-based digital inpainting of damaged ancient Chinese paintings. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), Shanghai, China, 19–21 November 2018; pp. 1–5. [Google Scholar]
  17. Rakhimol, V.; Maheswari, P. Inpainting of ancient temple murals using cGAN and PConv networks. Comput. Graph. 2022, 109, 100–110. [Google Scholar] [CrossRef]
  18. Wenjun, Z.; Benpeng, S.; Ruiqi, F.; Xihua, P.; Shanxiong, C. EA-GAN: Inpainting of text in ancient Chinese books based on an example attention generative adversarial network. Herit. Sci. 2023, 11, 42. [Google Scholar] [CrossRef]
  19. Ren, H.; Sun, K.; Zhao, F.; Zhu, X. Dunhuang murals image inpainting method based on generative adversarial network. Herit. Sci. 2024, 12, 39. [Google Scholar] [CrossRef]
  20. Niu, A.; Zhu, Y.; Zhang, C.; Sun, J.; Wang, P.; Kweon, I.; Zhang, Y. MS2Net: Multi-scale and multi-stage feature fusion for blurred image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5137–5150. [Google Scholar] [CrossRef]
  21. Liu, J.; Fan, X.; Jiang, J.; Liu, R.; Luo, Z. Learning a deep multi-scale feature ensemble and an edge-attention guidance for image fusion. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 105–119. [Google Scholar] [CrossRef]
  22. Shi, B.; Xiong, B.; Yu, Y. Research on a multi-scale degradation fusion network in all-in-one image inpainting. IET Image Process. 2024, 18, 3070–3081. [Google Scholar] [CrossRef]
  23. Lv, X.; Wang, C.; Fan, X.; Leng, Q.; Jiang, X. A novel image super-resolution algorithm based on multi-scale dense recursive fusion network. Neurocomputing 2022, 489, 98–111. [Google Scholar] [CrossRef]
  24. Yeh, C.; Lin, C.; Lin, M.; Kang, L.; Huang, C.; Chen, M. Deep learning-based compressed image artifacts reduction based on multi-scale image fusion. Inf. Fusion 2021, 67, 195–207. [Google Scholar] [CrossRef]
  25. Cer, D.; Yang, Y.; Kong, S.; Hua, N.; Limtiaco, N.; John, R.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal sentence encoder for English. In Proceedings of the 2018 Conference On Empirical Methods In Natural Language Processing: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; pp. 169–174. [Google Scholar]
  26. Nguyen, T.; Le, T.; Vu, H.; Phung, D. Dual discriminator generative adversarial nets. Adv. Neural Inf. Process. Syst. 2017, 30, 11. [Google Scholar]
  27. Yan, L.; Zhao, M.; Liu, S.; Shi, S.; Chen, J. Cascaded transformer U-Net for image inpainting. Signal Process. 2023, 206, 108902. [Google Scholar] [CrossRef]
  28. Weng, L. From gan to wgan. arXiv 2019, arXiv:1904.08994. [Google Scholar]
  29. Li, J.; Niu, K.; Liao, L.; Wang, L.; Liu, J.; Lei, Y.; Zhang, M. A generative steganography method based on WGAN-GP. In Proceedings of the Artificial Intelligence and Security: 6th International Conference, ICAIS 2020, Hohhot, China, 17–20 July 2020; Proceedings, Part I 6. pp. 386–397. [Google Scholar]
  30. Zheng, C.X.; Cham, T.J.; Cai, J.F. Pluralistic image completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 1438–1447. [Google Scholar]
  31. Li, J.Y.; Wang, N.; Zhang, L.F.; Du, B.; Tao, D.C. Recurrent feature reasoning for image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7760–7768. [Google Scholar]
  32. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.Z.; Ebrahimi, M. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
  33. Jain, J.; Zhou, Y.; Yu, N. Keys to better image inpainting: Structure and texture go hand in hand. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 208–217. [Google Scholar]
Figure 1. An overview of our proposed RG-GAN, including the Multi-scale Fusion Generator and Global Context Perception Discriminator.
Figure 2. Multi-scale residual downsampling module.
Figure 3. Regional Attention-Style Transfer Module.
Figure 4. Multi-scale Cross-layer Perception Module structure diagram.
Figure 5. Global Context Perception Discriminator.
Figure 6. Global attention module.
Figure 7. Comparative visualization of different methods on PSNR indicators.
Figure 8. Image inpainting effects of different methods on the ancient paintings dataset: (a) the ground truth image, (b) the input image, (cg) image inpainting results for PI, RFR, EC, FcF, and our method, respectively. The side-by-side comparison, highlighted with red boxes, demonstrates the superior performance of RG-GAN in capturing intricate patterns and textures of the ancient painting dataset compared to other models.
Figure 9. Comparative visualization of different methods on the PSNR indicators of the ancient Huaniao++ dataset.
Figure 10. Image inpainting effects of different methods on the ancient Huaniao++ dataset: (a) the ground truth image, (b) the input image, (cg) the image inpainting results for PI, RFR, EC, FcF, and our method, respectively. The visual comparison, with red boxes highlighting key areas, clearly shows RG-GAN’s ability to handle the intricate patterns and textures of traditional Huaniao paintings more effectively than the other models.
Figure 11. Visual comparisons of inpainting results between our method and baselines (ablation study): (a) the input image, (b) the repaired image with baseline, (c) the repaired image without RASTM, (d) the repaired image without MCPM, (e) the repaired image without GCPD, and (f) the repaired image using our method.
Table 1. Distribution of different datasets.
                  Ancient Painting Dataset    Ancient Huaniao++ Dataset
Total             5798                        4638
Training set      4638                        3710
Validation set    1160                        928
Size              256 × 256                   256 × 256
Table 2. Comparison of different methods applied to the ancient painting dataset.
Model            PSNR↑    SSIM↑    MSE↓      LPIPS↓
PI [30]          32.41    0.892    0.0009    0.0762
RFR [31]         33.42    0.921    0.0008    0.0720
EC [32]          33.53    0.913    0.0007    0.0523
FcF [33]         33.72    0.914    0.0008    0.0532
RG-GAN (Ours)    34.62    0.930    0.0005    0.0507
Table 3. Comparison of different models on the ancient Huaniao++ dataset.
Model            PSNR↑    SSIM↑    MSE↓      LPIPS↓
PI [30]          19.15    0.723    0.0182    0.1896
RFR [31]         20.12    0.745    0.0151    0.1637
EC [32]          19.52    0.724    0.0154    0.1756
FcF [33]         21.34    0.793    0.0095    0.1874
RG-GAN (Ours)    23.46    0.857    0.0040    0.0938
Table 4. Ablation study of different models on the ancient painting dataset.
Model            PSNR↑    SSIM↑    MSE↓      LPIPS↓
Baseline         30.56    0.890    0.0065    0.0558
w/o RASTM        33.75    0.920    0.0024    0.0532
w/o MCPM         32.24    0.901    0.0042    0.0546
w/o GCPD         33.43    0.912    0.0032    0.0538
RG-GAN (Ours)    34.62    0.930    0.0005    0.0507
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
