Article

Context-Encoder-Based Image Inpainting for Ancient Chinese Silk

1 College of Textile Science and Engineering (International Silk Institute), Zhejiang Sci-Tech University, Hangzhou 310018, China
2 International Silk and Silk Road Research Center, Hangzhou 310018, China
3 School of Arts and Archaeology, Zhejiang University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6607; https://doi.org/10.3390/app14156607
Submission received: 17 June 2024 / Revised: 17 July 2024 / Accepted: 24 July 2024 / Published: 28 July 2024
(This article belongs to the Special Issue AI-Based Image Processing: 2nd Edition)

Abstract

The rapid advancement of deep learning technologies presents novel opportunities for restoring damaged patterns in ancient silk, which is pivotal for the preservation and propagation of ancient silk culture. This study systematically examines the evolution of image inpainting algorithms, with particular emphasis on those built on the Context-Encoder structure. A curated dataset comprising 6996 samples of ancient Chinese silk (256 × 256 pixels) was assembled for this purpose. Three Context-Encoder-based image inpainting models—LISK, MADF, and MEDFE—were then applied to inpaint damaged patterns, and the resulting restorations were rigorously evaluated, yielding a comprehensive analysis of the strengths and limitations of each model. This study not only provides a theoretical foundation for adopting image restoration algorithms grounded in the Context-Encoder structure but also offers ample scope for further exploration toward more effective restoration of damaged ancient silk.

1. Introduction

Ancient silk patterns, developed over centuries, are culturally significant visual symbols that embody aesthetic values and the diverse ethos of civilizations. Despite the extensive corpus of ancient silk artifacts, many suffer degradation from aging and mechanical damage. Currently, restoration is largely a manual process, reliant on skilled conservators with access to detailed pattern information and extensive experience. This reliance can lead to deviations from the original patterns owing to indistinct textures or misinterpretations. Additionally, the specialized nature of textile conservation and the lengthy training required limit the pool of qualified restorers. Therefore, advancing digital restoration technology is crucial for accurately reconstructing damaged ancient silk patterns, preserving their historical authenticity, and ensuring the continuation of silk heritage.
Image inpainting [1] encompasses techniques for reconstructing missing or corrupted regions in images from the available information; it has extensive applications in artifact restoration [2], image enhancement [3], and content removal [4]. Image inpainting algorithms are traditionally categorized into pixel-wise [5] and patch-based methodologies [6], alongside contemporary deep-learning-based approaches. Classical algorithms are adept at repairing small-scale damage, yet they falter with extensive deterioration owing to their limited high-level comprehension of image content and structure, often failing to generate semantically meaningful reconstructions. With the rise of deep learning technologies, researchers have made significant strides in integrating these models into computer vision tasks [7,8], with notable success. Consequently, deep-learning-based image inpainting algorithms [9] have been developed, optimizing image reconstruction by incorporating various constraints within deep neural network frameworks. These models can learn high-level image features, thus ensuring the structural and textural coherence of the restored images. Notably, Context-Encoder models have demonstrated remarkable efficacy in the image restoration domain [10].
This study delineates a systematic review of advancements in Context-Encoder-based restoration models, detailing an assembled dataset comprising 6996 ancient Chinese silk samples with a resolution of 256 × 256 pixels. This research implements three distinct Context-Encoder-based models—LISK, MADF, and MEDFE—for restoring damaged patterns in ancient Chinese silks and provides a comprehensive assessment of the results. The efficacy and limitations of each model are critically evaluated and discussed. This study offers guidance and reference for further improvements and developments in ancient silk pattern restoration techniques based on the Context-Encoder model.

2. Related Work

The inception of deep learning in the image inpainting domain was marked by the introduction of Context-Encoders [11], which leverage the encoder–decoder neural network framework originally posited by Rumelhart et al. in 1986 [12]. This seminal architecture, comprising an input layer, multiple hidden layers, and an output layer, was initially conceptualized for extracting salient features and reducing data dimensionality. Its subsequent adaptation to image inpainting tasks has underscored its versatility and efficacy in reconstructing coherent structures within damaged or incomplete image data. In the context of image defect restoration, the Context-Encoder adeptly leverages both local and global image information, seamlessly aligning the generated data with the original image, and algorithms grounded in the Context-Encoder structure have gained prominence. A structural representation is shown in Figure 1. The channel-wise fully connected layer between the encoder and decoder effectively reduces the number of network parameters. The Context-Encoder integrates fine image details around the defective area with semantic information spanning the entire image. During the training phase, Context-Encoders use a reconstruction loss to keep the filled content consistent with the overall structure and context of the missing regions; an adversarial loss is additionally introduced to increase the realism and flexibility of the generated content. The overall loss is a weighted combination of the reconstruction and adversarial losses.
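As a concrete illustration of this weighted combination, the following PyTorch-style sketch computes a joint reconstruction-plus-adversarial objective for a generic inpainting network. The module names (encoder_decoder, discriminator) and the loss weights are placeholders, not the exact configuration of the original Context-Encoder.

```python
import torch
import torch.nn.functional as F

def context_encoder_losses(encoder_decoder, discriminator, image, mask,
                           lambda_rec=0.999, lambda_adv=0.001):
    """Weighted reconstruction + adversarial objective of a Context-Encoder (sketch).

    `image` is the ground-truth batch (N, C, H, W); `mask` is 1 inside the
    missing region and 0 elsewhere. The weights are illustrative placeholders.
    """
    corrupted = image * (1.0 - mask)          # zero out the missing region
    predicted = encoder_decoder(corrupted)    # network fills in the hole

    # Reconstruction loss restricted to the missing region
    loss_rec = F.mse_loss(predicted * mask, image * mask)

    # Adversarial term: the generator tries to make the discriminator
    # label the inpainted content as real
    fake_logits = discriminator(predicted)
    loss_adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))

    return lambda_rec * loss_rec + lambda_adv * loss_adv
```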
Subsequently, Liao et al. [13] extended the framework of Context-Encoders by incorporating an edge-centric context decoder. Their methodology commences with extracting edge information from the image with Fully Convolutional Networks (FCNs) [14] to restore missing details. The recovered edge information is then fused with the incomplete image through the Context-Encoder, contributing to further refinement. To address challenges associated with pronounced negative responses during training, Yang et al. [15] replaced the ReLU and Leaky ReLU layers in Context-Encoders with ELU layers, strengthening overall model stability. Furthermore, Vo et al. [16] introduced a novel linear combination of pixel reconstruction loss and feature reconstruction loss, designated as structural loss, building on the foundational Context-Encoder framework to significantly improve the precision of the model.
To enhance the consistency between structure and texture in the restored image, Liu et al. [17] recently proposed the Mutual Encoder–Decoder with Feature Equalizations (MEDFE), anchored in a mutual encoding–decoding network structure. Shallow encoder features represent texture, while deep features comprehensively encapsulate structural details. A feature equalization method was introduced to ensure consistency between the structural and texture features, which involves reweighting channels after feature concatenation and introducing a bilateral propagation activation function. These modifications effectively remove the blurriness and artifacts that arise from inconsistent structural and texture features. Continuing this line of work, Yang et al. [18] introduced an image inpainting model with Learning to Incorporate Structure Knowledge (LISK), incorporating a structure-embedding scheme for enhanced fusion of structural information. The model implements an attention mechanism for nuanced adjustments to the generated structure and content, accompanied by a novel pyramid structure loss designed to supervise the acquisition of structural knowledge during image restoration. Meanwhile, Zhu et al. [19] proposed a Mask-Aware Dynamic Filtering (MADF) module for efficient feature learning in the encoder. Additionally, they introduced point-wise normalization (PN) to refine batch normalization, preventing the semantic layout from vanishing. By employing a multi-scale feature space in an end-to-end manner, this restoration model underscores the efficacy of the proposed cascaded refinement network structure in achieving stable image restoration.

3. Approach

3.1. The Principle of the LISK Model

The LISK [18] restoration model excels in acquiring and integrating structural knowledge for image restoration. Employing a multitask learning approach, it simultaneously generates both complete images and corresponding structural information, enriching our comprehension of image content. Including structural embedding and attention mechanisms further enhances the use of structural information in the restoration process, as depicted in Figure 2.
The LISK restoration model generator comprises a spatial context encoder featuring two downsampling stages, eight residual blocks, and a decoder, as delineated in the figure. The spatial Context-Encoder captures the image’s spatial information through downsampling, while the decoder, leveraging residual blocks, upscales to produce an image of the original size. Importantly, the encoder is shared between image and structure generation, facilitating the exchange of learned features. The decoder is tailored to a multi-scale style for embedding and outputting structure information at different scales. Two modules leverage structure information: the structural embedding layer and the attention layer.
The structural embedding layer integrates structural features into the various decoding stages, where they serve as priors for image generation. This process involves separating and predicting structural features and subsequently merging them through concatenation using standard residual blocks. The attention layer, inspired by non-local mean mechanisms, computes the response at each position of the output feature map as a weighted sum over the entire input feature map. Through attention, similar features from the surroundings are conveyed to the missing region, enhancing the generated content and structure, reducing smooth artifacts, and improving details.
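The attention idea can be sketched with a generic non-local layer in which every output position is a similarity-weighted sum over all input positions; this is the naive formulation of the mechanism described above, not the exact attention layer implemented in LISK.

```python
import torch

def non_local_attention(features):
    """Naive non-local attention over a feature map (N, C, H, W) — a sketch.

    Each output position is a softmax-weighted sum of all input positions, so
    features from similar, intact regions can be propagated into missing areas.
    Memory grows with (H*W)^2, so this form is only practical on small maps.
    """
    n, c, h, w = features.shape
    flat = features.view(n, c, h * w)                   # (N, C, HW)
    similarity = torch.bmm(flat.transpose(1, 2), flat)  # (N, HW, HW) dot products
    weights = torch.softmax(similarity, dim=-1)
    out = torch.bmm(flat, weights.transpose(1, 2))      # weighted sums per position
    return out.view(n, c, h, w)
```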
The LISK restoration model is trained using a combined loss approach, encompassing pyramid structural loss (Lstructure) and blended image loss (Limage). This methodology aims to capture structural knowledge and supervise the image restoration process. The overall loss function is defined as follows:
L_{total} = L_{image} + \alpha L_{structure}
The pyramid structural loss (Lstructure) is primarily employed to guide the generation and embedding of structures, incorporating structural information into the generation process. The formula for Lstructure is defined as follows:
L_{structure} = \sum_{s=1}^{n_s} \left[ \left\| C_{pred}^{(s)} - C^{(s)} \right\|_1 + \beta L_{edge}^{(s)} \right]
where Ledge(s) denotes the regularization term, with β as its corresponding coefficient, and ns represents the total number of scales. Cpred(s) represents the predicted gradient map at scale s, while C(s) represents the corresponding ground-truth gradient map at the same scale. The parameters α and β are hyperparameters that balance the contributions of the different loss terms; empirically, α = 0.1 and β = 100. The regularization term Ledge(s) is expressed as follows:
L_{edge}^{(s)} = \left\| C_{pred}^{(s)} - C^{(s)} \right\|_1 \odot M_E^{(s)}
where ME(s) denotes the weighted edge mask. To apply regularization to the edge structures, the binary ground-truth edge map E(s) is convolved with a Gaussian filter g, as follows:
M_E^{(s)} = g \ast E^{(s)}
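A minimal sketch of the pyramid structural loss defined above is given below, assuming the per-scale predicted gradient maps, ground-truth gradient maps, and binary edge maps are already available as lists of tensors; the Gaussian kernel size and sigma are illustrative assumptions, and the exact gradient and edge extraction of LISK is not reproduced.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(edge_map, kernel_size=5, sigma=1.0):
    """Convolve a binary edge map (N, 1, H, W) with a Gaussian filter g (sketch)."""
    coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g1d = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g1d = (g1d / g1d.sum()).to(edge_map)
    kernel = torch.outer(g1d, g1d).view(1, 1, kernel_size, kernel_size)
    return F.conv2d(edge_map, kernel, padding=kernel_size // 2)

def pyramid_structure_loss(pred_grads, true_grads, true_edges, beta=100.0):
    """Per-scale gradient L1 terms plus an edge-weighted regularization term.

    `pred_grads`, `true_grads`, `true_edges` are lists over scales (coarse to fine).
    """
    loss = 0.0
    for c_pred, c_true, e_true in zip(pred_grads, true_grads, true_edges):
        diff = torch.abs(c_pred - c_true)
        edge_mask = gaussian_blur(e_true)          # M_E(s) = g * E(s)
        loss = loss + diff.mean() + beta * (diff * edge_mask).mean()
    return loss
```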
The composite image loss (Limage) encompasses pixel-level reconstruction loss (Lrec), perceptual loss (Lperc), style loss (Lstyle), and adversarial loss (LG), outlined as follows:
L_{image} = \lambda_r L_{rec} + \lambda_p L_{perc} + \lambda_s L_{style} + \lambda_G L_G
where λr, λp, λs, and λG are hyperparameters that balance the contributions of different loss terms. The model empirically sets λr = 1, λp = 0.1, λs = 250, and λG = 0.4.
The pixel-level reconstruction loss measures the difference between the generated image (Iout) and its corresponding ground-truth image (Igt) using the pixel-wise L1 distance, as indicated in the following formula:
L_{rec} = \left\| I_{out} - I_{gt} \right\|_1
Perceptual loss is computed as the L1 distance between the generated image (Iout) and the ground truth (Igt) in the feature space of a VGG-19 network [20] pretrained on the ImageNet dataset [21]. The formula is defined as:
L_{perc} = \sum_i \left\| \varphi_i(I_{out}) - \varphi_i(I_{gt}) \right\|_1
where ϕi represents the feature map of the i-th layer in VGG-19. In this study, layers relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1 were used.
Style loss likewise compares the two images in feature space with an L1 distance, but it first computes the Gram matrix [22] of each selected feature map. The formula is defined as:
L_{style} = \sum_i \left\| G_{\varphi_i}(I_{out}) - G_{\varphi_i}(I_{gt}) \right\|_1
where Gφi is the Ci × Ci Gram matrix constructed from the feature map ϕi of size Hi × Wi × Ci.
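The perceptual and style losses above can be sketched as follows with a pretrained VGG-19 feature extractor from torchvision (a recent version is assumed); the layer indices standing in for relu1_1 through relu5_1 and the Gram-matrix normalization are assumptions of this sketch rather than the exact choices of LISK.

```python
import torch
import torchvision.models as models

# Frozen VGG-19 feature extractor; the indices below approximate relu1_1 ... relu5_1.
_VGG = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
_LAYER_IDS = {1, 6, 11, 20, 29}

def _vgg_features(x):
    feats, out = [], x
    for idx, layer in enumerate(_VGG):
        out = layer(out)
        if idx in _LAYER_IDS:
            feats.append(out)
    return feats

def gram(feat):
    """Ci x Ci Gram matrix of a feature map (N, C, H, W), normalized by C*H*W."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def perceptual_and_style_losses(i_out, i_gt):
    """L1 distances in VGG feature space (perceptual) and between Gram matrices (style)."""
    with torch.no_grad():
        gt_feats = _vgg_features(i_gt)
    out_feats = _vgg_features(i_out)
    l_perc = sum(torch.abs(o - g).mean() for o, g in zip(out_feats, gt_feats))
    l_style = sum(torch.abs(gram(o) - gram(g)).mean() for o, g in zip(out_feats, gt_feats))
    return l_perc, l_style
```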
Adversarial loss: the LISK restoration model also employs an adversarial training strategy, using PatchGAN [23] as the discriminator D. The discriminator loss is:
L_D = \mathbb{E}_{I}\left[\log D(I)\right] + \mathbb{E}_{I_{comp}}\left[\log\left(1 - D(I_{comp})\right)\right]
The adversarial loss for the generator of this model can be written as:
L_G = \mathbb{E}_{I_{comp}}\left[\log\left(1 - D(I_{comp})\right)\right]
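The two adversarial objectives above can be sketched in the binary cross-entropy form that PatchGAN discriminators are usually trained with; note that the generator term below uses the common non-saturating variant (maximizing log D) rather than the literal log(1 − D) form of the formula, a standard rewrite with the same optimum.

```python
import torch
import torch.nn.functional as F

def patchgan_d_loss(discriminator, real, completed):
    """Discriminator loss: real patches toward 1, completed (fake) patches toward 0."""
    real_logits = discriminator(real)
    fake_logits = discriminator(completed.detach())   # do not backprop into the generator
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def patchgan_g_loss(discriminator, completed):
    """Generator adversarial loss: push completed patches toward the 'real' label."""
    fake_logits = discriminator(completed)
    return F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
```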

3.2. The Principle of the MEDFE Model

The MEDFE [17] restoration model introduces a network characterized by mutual encoding and decoding for image restoration. Within this framework, CNN features from the shallow encoder are designated to represent texture, while deep features comprehensively capture the image’s structural components. To guarantee coherence between structural and textural features, a feature-balancing method is proposed. This method involves concatenating features, reweighting channels, and introducing a bilateral propagation activation function for spatial consistency across the entire feature, effectively mitigating blurring and artifacts resulting from inconsistencies between structural and textural features. A schematic representation of the MEDFE restoration model framework is delineated in Figure 3.
Figure 3 shows that the MEDFE restoration model denotes texture features as Fte and structure features as Fst. During the reorganization process, the model adjusts the sizes of the CNN feature maps from different convolutional layers, ensuring uniformity and appropriate connections. Following this reorganization, the MEDFE restoration model comprises two branches—the structure branch and the texture branch—for hole filling in Fst and Fte, respectively. Both branches share identical structures, each containing three parallel streams for hole filling at multiple scales. Each stream comprises five partial convolutions with the same kernel size, while the kernel size varies among streams. By employing different kernel sizes, the MEDFE model achieves multi-scale hole filling on the input CNN features of each branch. The features filled by the three streams (i.e., three scales) are concatenated and mapped back to the size of the input feature map through a 1 × 1 convolution. The model designates the output of the texture branch as Ffte and the output of the structure branch as Ffst. To ensure that hole filling emphasizes texture and structure, respectively, supervision is applied to Ffte and Ffst.
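The multi-scale filling branch can be sketched as below. Ordinary convolutions stand in for the partial convolutions of MEDFE, and the kernel sizes (3, 5, 7) are illustrative assumptions; what the sketch preserves is the structure of three parallel five-convolution streams whose outputs are concatenated and fused back to the input width by a 1 × 1 convolution.

```python
import torch
import torch.nn as nn

class MultiScaleFillingBranch(nn.Module):
    """Simplified sketch of one MEDFE filling branch (texture or structure).

    Three parallel streams of five convolutions each, one kernel size per stream;
    the concatenated multi-scale features are fused by a 1x1 convolution.
    """

    def __init__(self, channels, kernel_sizes=(3, 5, 7), convs_per_stream=5):
        super().__init__()
        self.streams = nn.ModuleList()
        for k in kernel_sizes:
            layers = []
            for _ in range(convs_per_stream):
                layers += [nn.Conv2d(channels, channels, k, padding=k // 2),
                           nn.ReLU(inplace=True)]
            self.streams.append(nn.Sequential(*layers))
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):
        multi_scale = [stream(x) for stream in self.streams]    # hole filling at 3 scales
        return self.fuse(torch.cat(multi_scale, dim=1))          # back to input width
```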
Ffst and Ffte are mapped to color images Iost and Iote, respectively, through a 1 × 1 convolution. The pixel-level L1 loss is formulated as follows:
L_{rst} = \left\| I_{ost} - I_{st} \right\|_1
L_{rte} = \left\| I_{ote} - I_{gt} \right\|_1
where Igt represents the authentic image and Ist denotes the structural image derived from Igt using the edge-preserving smoothing technique RTV [24], following Structureflow [25].
The hole regions within Fte and Fst undergo distinct filling processes via the structure branch and the texture branch. Nevertheless, the incongruent feature representations in Ffte and Ffst are insufficient in accurately capturing the recovered structure and texture, consequently giving rise to blurriness and artifacts in and around the hole regions. In addressing these concerns, the model concatenates Ffte and Ffst, followed by a straightforward fusion employing a 1 × 1 convolution layer to generate Fsf. Feature normalization is systematically applied at discrete CNN feature levels to enhance the refinement of texture and structure representations within Fsf.
The MEDFE restoration model introduces a suite of loss functions to quantify distinctions between structure and texture. These loss functions encompass pixel reconstruction loss, perceptual loss, style loss, and relative mean LS adversarial loss [26]. The comprehensive loss function is expressed as follows:
L_{total} = \lambda_{st} L_{rst} + \lambda_{te} L_{rte} + \lambda_r L_{rec} + \lambda_p L_{perc} + \lambda_s L_{style} + \lambda_{adv} L_{adv}
where λr, λp, λs, λadv, λst, and λte are weighting parameters. The model empirically sets λst = 1, λte = 1, λr = 1, λp = 0.1, λs = 250, and λadv = 0.2.
Pixel Reconstruction Loss: The model systematically evaluates pixel differences from dual perspectives. The initial perspective encompasses loss terms in Formulas (11) and (12), providing supervision for the texture and structure branches. The second perspective gauges the similarity between the network output and ground truth, as articulated in Formula (6).
Perceptual Loss: To capture high-level semantics and emulate human perception of image quality, the model employs perceptual loss defined by the VGG-16 feature backbone pretrained on ImageNet [27], as delineated in Formula (7). In this context, Φi corresponds to feature maps from layers ReLu1_1, ReLu2_1, ReLu3_1, ReLu4_1, and ReLu5_1.
Style Loss: Because the transposed convolution layers in the decoder can introduce checkerboard-like artifacts, the model adds a style loss, as outlined in Formula (8).
Relative Average Least Squares (LS) Adversarial Loss: Following CA’s [28] methodology for perceptual enhancement with global and local discriminators, the model incorporates relative average LS adversarial loss in the discriminator. For the generator, adversarial loss is defined as follows:
L_{adv} = -\mathbb{E}_{x_r}\left[\log\left(1 - D_{ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log D_{ra}(x_f, x_r)\right]
where Dra(xr, xf) = sigmoid(C(xr) − Exf[C(xf)]) and C(·) is the discriminator without its final sigmoid layer. Pairs of real and generated samples (xr, xf) are drawn from the real and generated data distributions, respectively: xr is a sample of real data, and xf is a sample of generated data.
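Following the formula above, a minimal sketch of the generator-side relativistic average adversarial loss is shown below; critic stands for C(·), the discriminator without its final sigmoid, and the small epsilon for numerical stability is an assumption of the sketch.

```python
import torch

def relativistic_avg_g_loss(critic, real, fake, eps=1e-8):
    """Generator-side relativistic average adversarial loss (sketch).

    Each sample's critic score is compared with the mean score of the opposite
    class before the sigmoid, as in the formula above.
    """
    c_real = critic(real)
    c_fake = critic(fake)
    d_ra_real = torch.sigmoid(c_real - c_fake.mean())   # D_ra(x_r, x_f)
    d_ra_fake = torch.sigmoid(c_fake - c_real.mean())   # D_ra(x_f, x_r)
    return -(torch.log(1.0 - d_ra_real + eps).mean()
             + torch.log(d_ra_fake + eps).mean())
```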

3.3. The Principle of the MADF Model

The MADF [19] restoration model presents a novel paradigm in image processing. Using dynamic convolution filters, it adaptively generates a series of convolution kernels conditioned on the configuration of valid pixels within each local convolution window, thereby incorporating the mask information. Departing from conventional convolutional methodologies, this model not only recalibrates output activations but also tailors an individual set of convolution kernels to each spatial position. This customized approach increases sensitivity and adaptability to local features, enabling the model to accommodate diverse image content at different convolution positions and demonstrating superior efficacy in image restoration.
In Figure 4, the MADF restoration model takes the damaged image (Iin) and its binary mask (M) as inputs. The model includes three components: (a) the encoder (E) encodes the damaged image with a mask into high-level feature maps; (b) the restoration decoder (R) fills in the missing parts of the feature maps; and (c) a series of refinement decoders {F1, F2, …, Fk} refine the feature maps K times and decode the features back to lower-level pixels. Empirically, two refinement decoders balance efficiency and performance. MADFl and Rl are the MADF module and the restoration decoder block for the l-th level, respectively. Fl,k represents the k-th refinement decoder block at the l-th level, and ml is the mask feature map for the l-th level of the encoder. Green-marked convolution operations in the encoder take el−1 as the input, with convolution kernels generated from the corresponding region of ml−1. “DConv” denotes transposed convolution, “LReLU” is leaky ReLU, and “up” signifies increasing the number of channels with a 1 × 1 convolution layer. ul−1 and rl are the inputs to Rl, generating feature map rl−1, and Fl,k takes fl−1,k−1 and fl,k as inputs to generate feature map fl−1,k. Additionally, rl and fL,k are equivalent to uL, and fl,1 equals rl.
The model uses a combination of L1 and perceptual losses (λ1L1 + λpLperc) for the first refinement decoder’s output. Meanwhile, the second refinement decoder’s output is supervised by the Ltotal loss function, incorporating pixel reconstruction, perceptual, style, and total variation losses. The formula for Ltotal is outlined below:
L_{total} = \lambda_1 L_1 + \lambda_p L_{perc} + \lambda_s L_{style} + \lambda_t L_{tv}
where λ1, λp, λs, and λt are weight parameters. According to reference [29], the model empirically sets λ1 = 1, λp = 0.05, λs = 120, and λt = 0.1.
Pixel Reconstruction Loss: For the output of the restoration decoder, the model employs the L1 loss function as follows:
L_1 = L_{valid} + \lambda_h L_{hole}
where λh is a weight parameter. According to reference [29], the model empirically sets λh = 6. The hole pixel loss (Lhole) and the non-hole (valid) pixel loss (Lvalid) are defined as follows:
L_{hole} = \frac{1}{N_{I_{gt}}} \left\| (1 - M) \odot (I_{out} - I_{gt}) \right\|_1
L_{valid} = \frac{1}{N_{I_{gt}}} \left\| M \odot (I_{out} - I_{gt}) \right\|_1
where NIgt represents the number of elements in the ground-truth image Igt, M is the mask, Iout is the generated image, and Igt is the real image. ⊙ denotes element-wise multiplication.
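The two masked L1 terms and their weighted sum can be sketched as follows; the mask convention (1 for known pixels, 0 inside holes) and the normalization by the total number of ground-truth elements follow the definitions above.

```python
import torch

def hole_and_valid_losses(i_out, i_gt, mask):
    """Masked L1 terms for missing (hole) and known (valid) pixels — a sketch."""
    n_elements = i_gt.numel()                                    # N_{I_gt}
    l_hole = torch.abs((1.0 - mask) * (i_out - i_gt)).sum() / n_elements
    l_valid = torch.abs(mask * (i_out - i_gt)).sum() / n_elements
    return l_hole, l_valid

def masked_l1_loss(i_out, i_gt, mask, lambda_h=6.0):
    """L1 = L_valid + lambda_h * L_hole, with lambda_h = 6 as set above."""
    l_hole, l_valid = hole_and_valid_losses(i_out, i_gt, mask)
    return l_valid + lambda_h * l_hole
```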
Perceptual Loss: The model employs loss calculations based on the pool1, pool2, and pool3 layers of the VGG-16 pretrained on ImageNet. See Formula (7) for details.
Style Loss: This loss computes the difference between the style of the reconstructed image and the style of the real image to make the texture of the generated image similar to that of the actual image. Refer to Formula (8) for specifics.
Total Variation Loss: This loss function makes the output smoother, avoiding excessively sharp or noisy edges and textures and enhancing the image quality. The formula is as follows:
L_{tv} = \sum_{(i,j) \in R,\ (i,j+1) \in R} \frac{\left\| I_{com}^{i,j+1} - I_{com}^{i,j} \right\|_1}{N_{I_{com}}} + \sum_{(i,j) \in R,\ (i+1,j) \in R} \frac{\left\| I_{com}^{i+1,j} - I_{com}^{i,j} \right\|_1}{N_{I_{com}}}
where R represents the hole region after a 1-pixel dilation, Icom is the composited output image, and NIcom is the number of elements in Icom.
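A simplified sketch of the total variation term is shown below; it penalizes horizontal and vertical first differences inside a dilated hole map and divides by the total number of elements, with the pairwise region-membership test of the formula reduced to a single mask lookup for brevity.

```python
import torch

def total_variation_loss(i_comp, hole_region):
    """Total variation restricted to the dilated hole region — a sketch.

    `i_comp` is the composited output (N, C, H, W); `hole_region` is a binary map
    (N, 1, H, W) equal to 1 on hole pixels after a 1-pixel dilation.
    """
    n = i_comp.numel()                                            # N_{I_com}
    dh = torch.abs(i_comp[:, :, :, 1:] - i_comp[:, :, :, :-1]) * hole_region[:, :, :, 1:]
    dv = torch.abs(i_comp[:, :, 1:, :] - i_comp[:, :, :-1, :]) * hole_region[:, :, 1:, :]
    return dh.sum() / n + dv.sum() / n
```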

4. Experiments

4.1. Experimental Settings

A dataset comprising 6996 ancient Chinese silk pattern images (256 × 256 pixels) was curated from diverse sources, including the “Jinxiu · World Silk Interactive Map,” the China National Silk Museum, the Palace Museum, the Beijing Art Museum, the Fujian Museum, the Hunan Provincial Museum, the Dunhuang Research Institute, and scholarly publications on ancient Chinese silk art. All images are in PNG format with a bit depth of 24 and RGB color mode. This dataset was the basis for evaluating each model’s proficiency in restoring the collected samples. Figure 5 depicts a selection of images from the dataset, primarily encompassing geometric, botanical, zoological, and natural and utensil patterns, among others. For an equitable comparison, all experimental outcomes are derived exclusively from the pretrained models’ outputs, without any post-processing. The computer used for modeling and analysis was configured with an NVIDIA RTX 3090 (24 GB) GPU (NVIDIA, Santa Clara, CA, USA) and an Intel i5-12600KF CPU (Intel, Santa Clara, CA, USA).
This study used a suite of established objective metrics, including L1, PSNR, SSIM [30], UQI [31], VIF [32], and FID [33], to assess the performance of three pretrained models in reconstructing images obscured by randomly generated masks of varying configurations. The analyses focused on two primary types of occlusions: irregular random test masks, as shown in Figure 6, analogous in design to PConv [29] masks, and square random masks measuring 128 × 128 pixels. The irregular masks were systematically classified into six intervals based on the proportion of occluded image area: (0.01, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.4], (0.4, 0.5], and (0.5, 0.6]. Each interval comprises 1000 masks, differentiated further by the presence or absence of boundary constraints.
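For reference, the binning of irregular masks by occluded-area ratio can be reproduced with a small helper such as the sketch below; the interval edges match those listed above, and the return strings are purely illustrative labels.

```python
import numpy as np

def mask_ratio_interval(mask):
    """Assign a binary mask (1 = occluded) to one of the six area-ratio intervals."""
    ratio = float(np.count_nonzero(mask)) / mask.size
    edges = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo < ratio <= hi:
            return f"({lo}, {hi}]"
    return "outside the evaluated range"
```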
While objective indices provide partial insight into model performance, subjective indices are considered the gold standard in evaluating restoration models. Therefore, this study also undertook extensive qualitative evaluations alongside a triangulated user survey to facilitate a comprehensive comparative analysis. Moreover, it detailed illustrative case studies of partial restorations executed by the three advanced pretrained models on authentic silk materials.

4.2. Evaluation Index

The objective evaluation metrics selected, and the justification for each, are as follows (a brief computation sketch is given after the list):
L1: Predominantly utilized in image restoration, the L1 norm quantifies per-pixel reconstruction accuracy by computing the Mean Absolute Error (MAE) between the restored image and the original.
Peak Signal-to-Noise Ratio (PSNR): A classical metric in image quality assessment, an elevated PSNR value indicates superior image quality.
Structural Similarity (SSIM): This index evaluates the restored images by comparing their brightness, contrast, and structural attributes to those of the original images. An increase in SSIM values correlates with enhanced image restoration effectiveness.
Fréchet Inception Distance (FID): Employed for assessing the plausibility of image representations, FID measures the distance between the feature distributions of authentic and synthesized images. Lower FID scores suggest a closer approximation to the distribution of real images, generally implying higher-quality generated images.
Universal Quality Image Index (UQI): This index gauges the degree of distortion between the content of generated images and real images. Higher UQI values denote improved image quality.
Visual Information Fidelity (VIF): Based on a natural scene statistics (NSS) model, VIF evaluates the extent of image distortion and the fidelity of human visual perception. Higher VIF values indicate better image quality.
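A brief computation sketch for the pixel-level metrics above is given below, using NumPy and scikit-image (a recent version with channel_axis support is assumed); FID, UQI, and VIF require dedicated implementations and are not reproduced here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def basic_inpainting_metrics(restored, original):
    """L1 (MAE), PSNR, and SSIM for one pair of uint8 RGB images of equal shape."""
    restored = restored.astype(np.float64)
    original = original.astype(np.float64)
    l1 = np.mean(np.abs(restored - original))                          # mean absolute error
    psnr = peak_signal_noise_ratio(original, restored, data_range=255)
    ssim = structural_similarity(original, restored,
                                 channel_axis=-1, data_range=255)
    return {"L1": l1, "PSNR": psnr, "SSIM": ssim}
```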

4.3. Qualitative Evaluation

Figure 7 illustrates a qualitative comparative analysis of the LISK, MEDFE, and MADF methodologies using an ancient Chinese silk dataset under irregular random masking conditions. The observations indicate that the MEDFE model, which harmonizes deep structural features with shallow textural features, effectively synthesizes realistic textures, though some artifacts remain evident. By contrast, the LISK model, which integrates a multitask learning framework with an attention mechanism, leverages similar patterns within the images to refine the synthesized structure and content, demonstrating superior performance in generating more authentic structures and textures. Furthermore, the MADF model outperforms the previous models in detail enhancement, maintaining edge clarity and smoothness even in the presence of extensive and complex damage. This improvement can be attributed to MADF’s implementation of mask-aware perception and its cascaded refinement approach. By employing mask perception, MADF enhances its ability to generate more accurate feature representations for image content decoding. Additionally, during the decoding phase, point-wise normalization (PN) is applied to mitigate the effects of covariate shifts during feature normalization, combined with an advanced refinement of the decoding process, rendering the outcomes more credible.
Figure 8 illustrates several exemplars of restoration using square random masks. The results indicate that the efficacy of all three models is compromised in scenarios characterized by extensive missing regions. In such cases, the models predominantly interpolate the voids with homogeneous content derived from adjacent areas; deterioration in structural fidelity and textural resolution is more pronounced toward the center of the voids. Owing to the scant contextual data, the restoration algorithms fail to effectively reconstruct the lost segments.

4.4. Quantitative Evaluation

In this study, L1 loss and PSNR were employed to quantify the pixel-level similarity between pairs of images. SSIM and the UQI assess the level of distortion between the generated image content and the authentic imagery. Furthermore, the VIF, which correlates closely with human visual perception, and the FID, a prevalent metric in the image generation domain, were applied to evaluate the overall visual quality. These metrics were derived through comparative analyses between the restored images and their original counterparts. The outcomes of this study are summarized in Table 1.
The findings reveal that MADF achieves the best result in 32 of the 48 metric entries, whereas LISK leads in the remaining 16; notably, MEDFE does not lead in any entry. More specifically, MADF obtains the best L1, FID, and VIF values across all mask settings, the best SSIM value for masks in the (0.01, 0.1] range, and the best UQI values for all settings except the square masks. LISK leads in PSNR throughout and in SSIM for all other mask settings, as well as in UQI for the square masks. This emphasizes the importance of considering diverse metrics, extending beyond conventional measures such as PSNR and SSIM, for a more comprehensive evaluation of model performance.

4.5. User Study

In response to the divergence between quantitative evaluation metrics and human perceptual judgment, we conducted a user study involving 10 volunteers with expertise in textile pattern research. We employed three distinct methodologies, as follows:
① A sample of 25 images was randomly selected from the test dataset, each featuring a random mask. The restoration outputs of three different models were compared with the original images. The participants were instructed to select the image that appeared more visually coherent and natural from each pair. The preference rate, PRa, was calculated as the proportion of choices favoring the model-generated outcomes over the original images.
② From the test dataset, 50 images with random masks were randomly extracted for a comparative analysis of three distinct model-generated restorations. The participants were requested to identify the most visually coherent and natural image among the trio. The preference rate, PRb, represents the frequency at which the participants selected the optimal restoration from the three models presented.
③ An additional set of 25 images with random masks was drawn from the test dataset for paired A/B testing. In each session, the participants were shown pairs of images produced by two different restoration methods. They were asked to choose the image they deemed more visually coherent and natural. The preference rate, PRc, quantifies the proportion of participant selections favoring each of the three model-generated outcomes. The results are summarized in Table 2.
The data presented in the table demonstrate that the MADF method exhibits superior performance across three distinct investigative approaches. Research Method ① indicates relatively small deviations in PRa for the LISK, MEDFE, and MADF restoration methods compared with the authentic images, with respective values of 14.00%, 15.20%, and 20.40%. Research Method ② entails selecting the optimal result from three restoration attempts, and a significant contrast can be observed between the LISK and MADF methods, with PRb values of 25.00% and 69.00%, respectively. Research Method ③ involves pairwise comparisons of restoration methods and consistently shows a higher preference for the MADF method, with a PRc reaching 82.2%. These findings suggest that Research Methods ② and ③ more discernibly quantify the differences in efficacy among the restoration techniques.

4.6. Practice and Challenge

To further elucidate the generalization capacity of the models, three distinct pretrained architectures were deployed on a dataset comprising damaged silk textiles. Figure 9 provides illustrative examples of the model performances in silk textile restoration, underscoring its significance as a quintessential application in image restoration. Notably, the MADF method demonstrated substantial generalization prowess. These findings suggest that image restoration algorithms leveraging Context-Encoder frameworks hold considerable promise for authentically reconstructing historically significant damaged silk artifacts.
However, it is evident that, although Context-Encoder-based image restoration technology has strong capabilities, targeted training datasets are necessary for the specific task of ancient silk restoration. This requirement arises because ancient silk patterns have unique characteristics in terms of color and texture details. Furthermore, the significant computational cost associated with large network structures currently limits these methods to primarily low-resolution images, making high-resolution image restoration relatively challenging. Thus, extensive research is needed to expedite the training of various network structures. Additionally, recent advancements in GPU computational power are expected to greatly facilitate high-resolution image restoration. Many image restoration methods use a coarse-to-fine progressive approach, leading to increased computational resource consumption, higher training difficulty, and potential adverse impacts between different restoration stages. Therefore, optimizing the internal structure of model networks and adopting fully automated deep neural network architectures for image restoration to establish end-to-end single-step restoration models remains a highly valuable research endeavor.

5. Conclusions

This study explored the application of Context-Encoder-based image restoration algorithms to the restoration of damaged ancient silks, a direction that leaves ample room for further exploration. It primarily reviewed the research progress of Context-Encoder-based image restoration algorithms and applied them to a dataset of ancient Chinese silk samples. Three Context-Encoder-based models—LISK, MADF, and MEDFE—were employed to repair damaged patterns in ancient Chinese silks, followed by a comprehensive evaluation of the results. The findings indicate that MADF outperforms the other two methods in both subjective visual quality and objective quantitative measures. The specific results are as follows:
1. In the realm of subjective visual assessment, the multitask architecture of LISK markedly surpasses the MEDFE method in synthesizing more plausible structures and enriched details. This phenomenon may be attributed to the introduction of a structural embedding scheme and an attention mechanism within the LISK model. The structural embedding scheme provides essential structural information of the image, while the self-attention mechanism leverages similar patterns within the image to further refine the generated structures and textures. Further improvements can be seen with the MADF method, which augments detail while preserving crisp and seamless edges across expansive and intricate areas of damage.
2. Our quantitative analysis showed that, across a suite of 48 evaluation metrics, the MADF method outperforms in 32 metrics, whereas the LISK method leads in 16; notably, the MEDFE method does not predominate in any metrics. This underscores the limitation of relying solely on PSNR and SSIM as quantitative metrics when holistically evaluating restoration efficacy.
3. In user-based evaluations, MADF is favored when compared against real images, when choosing the best among the three models, and during pairwise comparisons. This improvement can be attributed to the adoption of the MADF module, which dynamically generates customized convolution kernels for each convolution window based on the corresponding mask information. Furthermore, the method incrementally refines the restoration results through a cascaded approach, significantly enhancing image restoration performance.
4. For large missing regions, all three models are less effective owing to very limited contextual information; the models tend to fill these areas with smooth content based on the surrounding environment and fail to generate the missing content accurately.

Author Contributions

Conceptualization, Q.W., M.S. and F.Z.; methodology, Q.W., M.S. and F.Z.; software, Q.W.; validation, M.S., S.H. and F.Z.; formal analysis, all authors; investigation, Q.W. and M.S.; resources, M.S. and F.Z.; data curation, all authors; writing—original draft preparation, all authors; writing—review and editing, all authors; visualization, Q.W. and M.S.; supervision, F.Z.; project administration, M.S. and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by The National Key Research and Development Program of China (Grant number: 2019YFC1521301); The National Social Science Fund of China (Grant number: 20WYSB006); and The Zhejiang Science and Technology Projects of Cultural Relics Protection (Grant number: 2021016).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The database and code may be made available for use upon prior request and approval of the authors. The data are not publicly available due to privacy.

Acknowledgments

The first author would like to thank Zhejiang Sci-Tech University for the scholarship given.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000.
2. Sun, X.; Jia, J.; Xu, P.; Ni, J.; Shi, W.; Li, B. Structure-guided virtual restoration for defective silk cultural relics. J. Cult. Heritage 2023, 62, 78–89.
3. Wang, C.; Wu, H.; Jin, Z. Fourllie: Boosting low-light image enhancement by Fourier frequency information. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023.
4. Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Ashukha, A.; Silvestrov, A.; Kong, N.; Goka, H.; Park, K.; Lempitsky, V. Resolution-robust large mask inpainting with Fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, New Orleans, LA, USA, 18–24 June 2022.
5. Telea, A. An image inpainting technique based on the fast marching method. J. Graph. Tools 2004, 9, 23–34.
6. Criminisi, A.; Pérez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212.
7. Xu, Y.; Gu, T.; Chen, W.; Chen, C. Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. arXiv 2024, arXiv:2403.01779.
8. Gong, L.; Zhu, Y.; Li, W.; Kang, X.; Wang, B.; Ge, T.; Zheng, B. Atomovideo: High fidelity image-to-video generation. arXiv 2024, arXiv:2403.01800.
9. Huang, W.; Deng, Y.; Hui, S.; Wu, Y.; Zhou, S.; Wang, J. Sparse self-attention transformer for image inpainting. Pattern Recognit. 2024, 145, 109897.
10. Yu, Y.; Zhan, F.; Lu, S.; Pan, J.; Ma, F.; Xie, X.; Miao, C. Wavefill: A wavelet-based generation network for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
11. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
12. Rumelhart, D.; Hinton, G.; Williams, R. Learning Internal Representations by Error Propagation; MIT Press: Cambridge, MA, USA, 1985.
13. Liao, L.; Hu, R.; Xiao, J.; Wang, Z. Edge-aware context encoder for image inpainting. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018.
14. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
15. Yang, C.; Lu, X.; Lin, Z.; Shechtman, E.; Wang, O.; Li, H. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
16. Vo, H.V.; Duong, N.Q.; Pérez, P. Structural inpainting. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018.
17. Liu, H.; Jiang, B.; Song, Y.; Huang, W.; Yang, C. Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020.
18. Yang, J.; Qi, Z.; Shi, Y. Learning to incorporate structure knowledge for image inpainting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
19. Zhu, M.; He, D.; Li, X.; Li, C.; Li, F.; Liu, X.; Ding, E.; Zhang, Z. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Trans. Image Process. 2021, 30, 4855–4866.
20. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
21. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
22. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
23. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
24. Xu, L.; Yan, Q.; Xia, Y.; Jia, J. Structure extraction from texture via relative total variation. ACM Trans. Graph. 2012, 31, 1–10.
25. Ren, Y.; Yu, X.; Zhang, R.; Li, T.H.; Liu, S.; Li, G. Structureflow: Image inpainting via structure-aware appearance flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
26. Jolicoeur-Martineau, A. The relativistic discriminator: A key element missing from standard GAN. arXiv 2018, arXiv:1807.00734.
27. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
28. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
29. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.-C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
30. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
31. Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84.
32. Sheikh, H.R.; Bovik, A.C. Image information and visual quality. IEEE Trans. Image Process. 2006, 15, 430–444.
33. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
Figure 1. The architecture of the Context-Encoder framework.
Figure 2. Schematic of the LISK multitask framework.
Figure 3. MEDFE framework schematic.
Figure 4. Schematic of the MADF framework.
Figure 5. Presentation of a selection of patterns from the ancient Chinese silk dataset: (a) geometric patterns; (b) botanical patterns; (c) zoological patterns; (d) natural and utensil patterns; and (e) other patterns.
Figure 6. A portion of the dataset containing irregular random masks. The first line displays examples with a border constraint, while the second line showcases examples without such a constraint.
Figure 7. Qualitative comparison results with irregular random masks.
Figure 8. Qualitative comparison results with square random masks.
Figure 9. Application demonstration on damaged ancient Chinese silk.
Table 1. Quantitative comparison results.

Metric   Model   (0.01, 0.1]  (0.1, 0.2]  (0.2, 0.3]  (0.3, 0.4]  (0.4, 0.5]  (0.5, 0.6]  ALL      Square
L1% ¶    LISK    0.87         2.22        4.04        6.18        8.67        12.55       5.66     5.92
         MEDFE   0.73         1.72        3.01        4.44        5.99        8.27        3.96     3.87
         MADF    0.49         1.35        2.46        3.71        5.04        7.06        3.26     3.43
PSNR †   LISK    35.19        30.49       27.56       25.31       23.55       21.48       27.40    24.91
         MEDFE   32.60        27.28       24.34       22.28       20.76       18.92       24.47    22.59
         MADF    34.24        28.57       25.46       23.32       21.80       20.01       25.73    23.11
SSIM †   LISK    0.9752       0.9344      0.8779      0.8124      0.7402      0.6391      0.8322   0.8263
         MEDFE   0.9703       0.9142      0.8374      0.7485      0.6510      0.5001      0.7718   0.7253
         MADF    0.9775       0.9325      0.8685      0.7919      0.7055      0.5565      0.8082   0.7405
FID ¶    LISK    7.77         15.51       24.26       36.19       51.32       83.80       21.26    35.54
         MEDFE   6.77         18.94       37.09       61.10       87.47       126.92      36.51    45.67
         MADF    2.06         6.12        12.35       21.34       33.41       54.72       10.43    12.22
UQI †    LISK    0.9895       0.9864      0.9824      0.9749      0.9652      0.9472      0.9755   0.9748
         MEDFE   0.9960       0.9888      0.9789      0.9671      0.9542      0.9331      0.9704   0.9701
         MADF    0.9974       0.9922      0.9849      0.9758      0.9658      0.9480      0.9782   0.9737
VIF †    LISK    0.9011       0.8376      0.7666      0.6755      0.5754      0.4487      0.7060   0.7164
         MEDFE   0.9496       0.8545      0.7454      0.6369      0.5367      0.4192      0.6939   0.7150
         MADF    0.9612       0.8912      0.8012      0.7038      0.6085      0.4778      0.7437   0.7326
Note: “ALL” indicates measurement over the whole set of irregular random masks. “Square” refers to square random masks measuring 128 × 128 pixels. ¶ means lower is better; † means higher is better.
Table 2. User survey results.

        LISK      MEDFE     MADF
PRa     14.00%    15.20%    20.40%
PRb     25.00%    6.00%     69.00%
PRc     50.40%    17.40%    82.20%

