1. Introduction
Image inpainting refers to inferring and restoring the missing content of an image from its known content so that the inpainted image satisfies human visual expectations as far as possible. As an important component of computer vision and computer graphics, image inpainting is widely used in cultural and lifestyle applications, such as the conservation of damaged paintings, calligraphy and other digital cultural heritage, old photo restoration, and object removal.
Prior to the widespread use of deep learning in image inpainting, traditional techniques were the most common. These methods can be classified into two categories. The first is diffusion-based image inpainting, in which, starting from the edge information of the missing regions, known pixels are propagated in a certain direction into the unknown region to obtain the inpainting result. The most representative models are the BSCB model proposed by Bertalmio et al. [1] and the curvature-driven diffusion (CDD) inpainting model introduced by Chan et al. [2]. However, this approach has some limitations: it is only suitable for relatively narrow missing areas, such as cracks and scratches, and it has difficulty inferring specific texture content. The other is the sample-based image inpainting method [3,4,5], in which similar sample blocks are searched for in the intact region of the image and used to fill the missing region. Among these, the most representative is the block-based texture synthesis algorithm of Criminisi et al. [4]. Criminisi's method can restore some large missing areas and generate clear texture details. However, it is prone to producing repetitive patterns and cannot generate reasonable structures in complex cases.
Deep-learning-based image inpainting methods address the lack of semantic understanding in traditional methods. Pathak et al. [6] combined an encoding-decoding network with a convolutional neural network in 2016 and presented a network model named the context encoder (CE). The model uses adversarial loss [7] to train a contextual encoder to predict the damaged areas of the image, but this method may suffer from edge distortion and local blurring. Building on this, Iizuka et al. [8] introduced a global discriminator into the context encoder (CE) model to make the restored image more consistent with the global semantics. However, this method has difficulty achieving complex texture inpainting, and its postprocessing is complex. Nazeri et al. [9] put forward a two-stage adversarial image inpainting model called EdgeConnect (EC), which uses edge repair results to constrain texture synthesis. Compared with a single-stage strategy that recovers structure and texture together in one forward pass, this method more easily reconstructs clear semantic boundaries and salient objects. The StructureFlow (SF) model proposed by Ren et al. [10] adopts an inpainting strategy similar to that of EC; that is, an edge-preserving smooth image is used to constrain texture synthesis, which alleviates texture distortion and artefacts. The Learning to Incorporate Structure Knowledge (LISK) model proposed by Yang et al. [11] uses gradient maps containing structural information and local texture details to constrain the inpainting procedure, reducing network parameters and improving local details.
The work in this paper is a deep-learning-based image inpainting approach focusing on the cross-layer transfer of attention for missing images based on a U-Net edge generation structure. U-Net is a deep network structure proposed by Ronneberger et al. [12] for image segmentation; thanks to its distinctive upsampling and downsampling paths plus skip connections, it handles the feature extraction task of deep networks well. Much progress has been made in U-Net-based image inpainting [13,14,15,16,17,18]. Barnes et al. [13] put forward a joint optimization framework and modelled global content control and local texture control using a convolutional neural network; they further introduced a multiscale neural network patch synthesis algorithm for high-resolution image inpainting based on the joint optimization framework, which is similar to the style-aware multiscale neural patch synthesis for high-resolution image restoration proposed by Yang et al. [14]. Yan et al. [15] proposed an image inpainting method that introduces a Shift Connection (SC) layer into the U-Net structure; the SC layer replaces the fully connected layer to transfer feature information from the background region of the image. This design can handle arbitrarily shaped missing regions and obtains finer textures and visually reasonable restoration results in less time. To address colour differences, blurring, and edge inconsistencies in restoration results, Liu et al. [16] used partial convolution with automatic mask updating in a U-Net structure to achieve image restoration without any additional postprocessing operations, effectively eliminating the artefact problem. Moreover, Mou et al. [17] proposed a new model, the deep generalized unfolded network (DGU-Net), which integrates a gradient estimation strategy into the steps of the proximal gradient descent (PGD) algorithm but is not successful on images with large missing areas. Wu et al. [18] proposed an end-to-end generative model for boundary and high-texture regions. First, a local binary pattern (LBP) learning network with a U-Net architecture is used to forecast the structural information of the missing areas; guided by this prediction, an upgraded spatial attention mechanism is added to the image inpainting network so that the algorithm can better restore the missing pixels.
Attention mechanisms have also been widely used in image inpainting. Yu et al. [19] developed a contextual attention model that uses sample-block matching on the feature map to reconstruct the damaged part of the feature map from similar undamaged parts. Zeng et al. [20] put forward an attention transfer network, a cross-layer attention mechanism that uses the feature map of the previous layer to compute attention scores for restoring the feature map of the current layer. Zheng et al. [21] introduced a combined short-term and long-term attention mechanism into their image inpainting model, so that feature inpainting can use not only the feature information in the decoder but also the feature information in the encoder. Liu et al. [22] proposed the coherent semantic attention (CSA) module. The CSA layer reconstructs the missing features through two processes, search and generation: first, the most relevant sample blocks are found in the undamaged area to replace the sample blocks in the damaged area; then, the CSA layer guides feature reconstruction by combining the relationship between the sample blocks to be reconstructed and the most relevant sample blocks with the relationship between the sample blocks to be reconstructed and the adjacent sample blocks. Wang et al. [23] and Li et al. [24] introduced pixel-level attention mechanisms into the image inpainting task.
Although the methods based on structural constraints are more suitable for reconstructing images with large irregular missing regions, these methods still have some defects. First, due to the influence of network depth and structure sparsity, the context information of the deep feature space is lost, which leads to the lack of structure in the central region of the inpainting result. Second, in the phase of texture synthesis, incomplete structure reconstruction results may lead to semantic loss and texture ambiguity in the final output.
To solve these problems, a context-aware image inpainting model based on edge and semantic pyramids is presented here. Specifically, similar to state-of-the-art two-stage inpainting strategies, the proposed method is divided into two parts: an edge inpainting network and a content-filling network, with the edge inpainting result used to constrain the image texture synthesis. In the edge inpainting network, a residual U-Net combining convolution and skip connections is used as the edge generator to strengthen the consistency of the contextual structure and alleviate the problem of missing edges in the defective areas. In the content-filling network, a U-Net is used as the image generator to synthesize texture under the constraint of the edge inpainting map. In addition, exploiting the complementarity between adjacent deep features, a cross-layer attention transfer model (ATM) is introduced and applied to all coding layers of the image generator to ensure, to the greatest extent possible, the context consistency of the decoding features at all levels. The skip connections in the edge inpainting network improve contextual structural connectivity, and the content-filling network, guided by the edge inpainting map, fills smooth and homogeneous texture information into the different regions delimited by the edge information. The application of the ATM reduces the loss of structural information and texture details during inpainting. The proposed method can generate high-quality restoration results with complete semantics and coherent structures. Specifically, the main work of this article is as described below:
A semantic pyramid inpainting network guided by image structure is proposed in this study. The edge inpainting map generated by the residual U-Net is fed into the pyramid content-filling network together with the defective image as a prior condition, and the final inpainted image is obtained through the encoder-decoder process.
The ATM module transfers the similarity of feature blocks inside and outside the missing region of the high-level feature map to the lower-level feature map and fills the missing area at the feature level based on this.
Experiments were conducted on multiple standard datasets, and qualitative and quantitative comparisons showed that when the restoration task involved large areas of missing or complex structures, the proposed method had higher restoration quality than the existing mainstream methods.
2. Approach
The overall structure of this article's model is shown in Figure 1. The model is composed of two parts: an edge inpainting network and a texture generation network. First, the edge inpainting network generates a reasonable edge profile in the defective area based on the greyscale values of the pixels surrounding the defective image and the edge information of the undamaged area. Then, the image inpainting network rebuilds the feature map layer by layer from deep to shallow under the prior condition of the edge inpainting map, transfers the ATM feature maps to the respective decoding layers via skip connections, and performs decoding convolution layer by layer to improve the consistency of the decoded feature maps. Finally, the feature map of each layer in the decoding procedure is converted to an RGB image to obtain the final restored result. The parts of the model are described in detail below.
2.1. Edge Inpainting Network
The edge inpainting model is a residual U-Net that combines convolution and skip connections and includes a generator network and a discriminator network.
Figure 2 shows the generator in the edge prediction model of this paper. The left half of the network is the encoder, which downsamples the input image. Each downsampling step is composed of two 3 × 3 convolutions with ReLU activation functions, followed by a 2 × 2 max-pooling layer. The right half is the decoder, which upsamples the features. Each upsampling step consists of a 2 × 2 convolution layer with a ReLU activation function, a concatenation layer that joins the corresponding encoder feature map with the upsampling result, and two 3 × 3 convolution layers. The last layer reduces the number of channels to one with a 1 × 1 convolution. Up to this point the design differs little from the standard U-Net. On this basis, however, a residual connection is added after each upsampling and downsampling step, and the feature maps before and after convolution are added to achieve an effect similar to that of a residual network.
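The following is a minimal PyTorch sketch of one downsampling stage of such a residual U-Net, assuming the residual connection is realised with a 1 × 1 projection when the channel count changes; the module and parameter names are illustrative and not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class ResidualDownBlock(nn.Module):
    """One encoder stage: two 3x3 convs + ReLU, a residual add, then 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 1x1 projection so the residual add works when channel counts differ (assumption).
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.body(x) + self.skip(x)   # residual connection around the conv pair
        return self.pool(feat), feat         # pooled output for the next stage, feat for the skip connection


x = torch.randn(1, 2, 256, 256)              # e.g. masked grey image + masked edge map
down, skip = ResidualDownBlock(2, 64)(x)
print(down.shape, skip.shape)                # torch.Size([1, 64, 128, 128]) torch.Size([1, 64, 256, 256])
```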
The input of the discriminator is the image edge information, and its overall structure is four consecutive downsampling layers. The final output is a scalar indicating the probability that the input edge information is real, which is used to judge the quality of the result produced by the generator.
2.2. Attention Transfer Model
During feature extraction, high-level features carry distinct semantic information, while shallow low-level features contain more structural information and texture details. To fill in pixel information in the defective feature map while ensuring overall semantic consistency, the ATM is proposed in this paper. Taking the patch similarity inside and outside the missing area of the high-level features as guidance and filling valid pixel content into the missing area of the low-level features, the ATM can accurately reconstruct a feature map with complete semantics and rich texture details. The specific operation is shown in Figure 3.
First, the ATM computes the cosine similarity between 3 × 3 patches inside and outside the defective area of the high-level feature map $\Phi_l$ as the basis for the attention scores. Let $p_i^l$ denote the $i$th feature patch in the background (known) region of $\Phi_l$ and $p_j^l$ denote the $j$th feature patch in the missing region of $\Phi_l$. The corresponding cosine similarity is computed as:

$$ s_{i,j} = \left\langle \frac{p_i^l}{\left\| p_i^l \right\|_2}, \frac{p_j^l}{\left\| p_j^l \right\|_2} \right\rangle $$

After obtaining the cosine similarity for each patch in the missing region, the softmax function maps it to a real number in [0, 1], which serves as the attention score of each patch:

$$ \alpha_{i,j} = \frac{\exp\left( s_{i,j} \right)}{\sum_{i=1}^{N} \exp\left( s_{i,j} \right)} $$

After the attention scores for each patch in the missing area of $\Phi_l$ are obtained, the same similarity relationship is applied to the adjacent lower-level feature map $\Phi_{l-1}$ with a higher resolution; that is, patches are copied with weights given by the attention scores to complete the entire missing region. Specifically, let $q_i^{l-1}$ denote the $i$th patch in the background region of $\Phi_{l-1}$ and $q_j^{l-1}$ denote the $j$th patch to be filled in the missing region. The filling process can then be expressed as:

$$ q_j^{l-1} = \sum_{i=1}^{N} \alpha_{i,j}\, q_i^{l-1} $$
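A compact PyTorch sketch of this attention transfer step is given below. It computes patch similarities on the high-level map with a convolution against normalised background patches, in the style of contextual-attention implementations, and pastes weighted background patches into the lower-level map with a transposed convolution. The patch size, the unfold/fold strategy, and all names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def atm(phi_high, phi_low, mask, patch=3):
    """Cross-layer attention transfer (sketch). phi_high: (1,C,H,W) high-level map,
    phi_low: (1,C2,2H,2W) adjacent lower-level map, mask: (1,1,H,W) with 1 = missing."""
    _, C, H, W = phi_high.shape
    _, C2, _, _ = phi_low.shape

    # 1) Patches of the high-level map act as matching kernels (keys are L2-normalised).
    kern_hi = F.unfold(phi_high, patch, padding=patch // 2)            # (1, C*p*p, H*W)
    kern_hi = kern_hi.transpose(1, 2).reshape(H * W, C, patch, patch)
    kern_hi = kern_hi / (kern_hi.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)

    # 2) Cosine-style similarity of every location to every candidate patch.
    scores = F.conv2d(phi_high, kern_hi, padding=patch // 2)           # (1, H*W, H, W)

    # 3) Candidates whose centre lies inside the missing region are excluded before the softmax.
    invalid = mask.flatten().view(1, H * W, 1, 1)
    attn = torch.softmax(scores - 1e4 * invalid, dim=1)                # attention over background patches

    # 4) The corresponding (2x larger) patches of the low-level map are pasted back,
    #    weighted by the attention scores, via a transposed convolution.
    kern_lo = F.unfold(phi_low, 2 * patch, stride=2, padding=patch - 1)
    kern_lo = kern_lo.transpose(1, 2).reshape(H * W, C2, 2 * patch, 2 * patch)
    filled = F.conv_transpose2d(attn, kern_lo, stride=2, padding=patch - 1)
    overlap = F.conv_transpose2d(attn, torch.ones(H * W, 1, 2 * patch, 2 * patch,
                                                  device=attn.device),
                                 stride=2, padding=patch - 1)
    filled = filled / (overlap + 1e-8)                                 # average overlapping pastes

    # 5) Only the missing region of the low-level map is replaced.
    mask_lo = F.interpolate(mask, scale_factor=2, mode="nearest")
    return phi_low * (1 - mask_lo) + filled * mask_lo


phi_hi, phi_lo = torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64)
hole = torch.zeros(1, 1, 32, 32); hole[..., 12:20, 12:20] = 1.0
print(atm(phi_hi, phi_lo, hole).shape)        # torch.Size([1, 128, 64, 64])
```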
2.3. Texture Generation Network
Similar to the edge inpainting network, the texture generation network is made up of a generator and a discriminator. Combining the structural characteristics of the ATM and U-Net, the generator applies the ATM layer by layer from deep to shallow after encoding is complete and then transfers the ATM-reconstructed feature maps to the multiscale decoder through skip connections, where they are fused with the corresponding latent feature maps for layer-by-layer decoding. After decoding, the features of each decoding layer are converted into RGB images of the same size, and an L1 loss is used to measure their difference from the real image, forcing the generator to improve the context consistency of each decoding feature. Finally, the inpainting results are evaluated through perceptual loss, style loss, and adversarial loss, and the network parameters are optimized.
To produce fine-grained inpainting effects, it is important to reduce the loss of local detail in the output features at all levels of the network. Ordinary convolution with a 3 × 3 kernel and a stride of 1 is used in both the encoding and decoding stages of the image inpainting network instead of dilated convolution with different dilation rates. This is because, in the U-Net, the scale of the feature maps decreases step by step as the network depth increases; in feature maps at 64 × 64 and lower scales, the gridding effect of dilated convolution destroys the continuity between adjacent pixels and ignores important local features, which would reduce the effectiveness of the subsequent ATM feature reconstruction.
After encoding, the generator applies the ATM layer by layer to rebuild the feature maps. Given a 6-layer encoder, the output feature maps of the encoding procedure are denoted, from deep to shallow, as $\Phi_6, \Phi_5, \Phi_4, \Phi_3, \Phi_2, \Phi_1$, and $\mathrm{ATM}(\cdot)$ represents the ATM operation. The reconstructed features at the different levels are then:

$$ \hat{\Phi}_{l} = \mathrm{ATM}\left( \hat{\Phi}_{l+1}, \Phi_{l} \right), \quad l = 5, 4, 3, 2, 1, \qquad \hat{\Phi}_{6} = \Phi_{6} $$
The multiscale decoder takes the feature maps reconstructed by the ATM, delivered through skip connections, together with the deep features from the encoder as input. The output features of each layer of the multiscale decoder are denoted as $d_5, d_4, d_3, d_2, d_1$; $T(\cdot)$ is the transposed convolution operation, and $[\cdot\,,\cdot]$ denotes feature concatenation. The decoding process can then be represented as:

$$ d_{l} = T\left( \left[ \hat{\Phi}_{l+1},\, d_{l+1} \right] \right), \quad l = 5, 4, 3, 2, 1, \qquad d_{6} = \Phi_{6} $$
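The reconstruct-then-decode flow can be sketched as a short loop, as below. The per-scale RGB heads concatenate the decoded map with the corresponding ATM feature, mirroring the description in the next paragraph; the 6-level layout, channel widths, kernel sizes, and toy spatial sizes are placeholders, and the final upsampling to 256 × 256 is omitted.

```python
import torch
import torch.nn as nn

class MultiscaleDecoder(nn.Module):
    """Sketch of decoding from deep to shallow; channels[0] is the deepest encoder width."""
    def __init__(self, channels=(512, 512, 256, 128, 64, 32)):
        super().__init__()
        self.ups = nn.ModuleList([           # T(.) applied to [phi_hat_{l+1}, d_{l+1}]
            nn.ConvTranspose2d(2 * channels[i], channels[i + 1], kernel_size=4, stride=2, padding=1)
            for i in range(len(channels) - 1)
        ])
        self.to_rgb = nn.ModuleList([        # 1x1 heads: [d_l, phi_hat_l] -> RGB, tanh output
            nn.Sequential(nn.Conv2d(2 * channels[i + 1], 3, kernel_size=1), nn.Tanh())
            for i in range(len(channels) - 1)
        ])

    def forward(self, phi_hats):
        # phi_hats: ATM-reconstructed features [phi_hat_6, ..., phi_hat_1], deep to shallow
        d, rgbs = phi_hats[0], []            # d_6 = phi_6
        for up, head, phi_prev, phi_cur in zip(self.ups, self.to_rgb, phi_hats, phi_hats[1:]):
            d = up(torch.cat([phi_prev, d], dim=1))             # d_l = T([phi_hat_{l+1}, d_{l+1}])
            rgbs.append(head(torch.cat([d, phi_cur], dim=1)))   # per-scale RGB output for the L1 loss
        return d, rgbs


# Toy input: 6 reconstructed feature maps from 4x4 up to 128x128 (assumed sizes).
phis = [torch.randn(1, c, 4 * 2 ** i, 4 * 2 ** i)
        for i, c in enumerate((512, 512, 256, 128, 64, 32))]
d1, rgbs = MultiscaleDecoder()(phis)
print(d1.shape, [r.shape for r in rgbs])
```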
On the one hand, the feature maps reconstructed by the ATM encode more low-level information for the missing area. On the other hand, even when relevant pixel content cannot be borrowed from outside the missing region, the feature information obtained from the compact latent features through convolution can synthesize new objects in the missing area. Combining these two effects, the multiscale decoder can synthesize new content with complete semantics and realistic texture by using the context information of the image.
After multiscale decoding is complete, the feature maps output at every level of the decoding layers are concatenated with the corresponding ATM feature maps and converted into RGB images. The last RGB image is upsampled to 256 × 256 to obtain the restored image. The tanh activation function is applied in each convolution layer that converts the decoding feature maps at the various scales into RGB images.
The discriminator D1 of the edge inpainting network uses the same network structure and parameter settings as the discriminator D2 of the content-filling network; both improve their ability to distinguish real from generated images by minimizing the adversarial loss. The specific network structure parameters are shown in Table 1.
2.4. Loss Function
The loss function of the model in this article includes two parts, the loss function of the edge inpainting model and the loss function of the texture generation model, as shown below:

$$ L_{total} = L_{edge} + L_{content} $$

where $L_{edge}$ represents the loss function of the edge inpainting model, which is made up of an adversarial loss and a feature matching loss [25], and $L_{content}$ represents the content-filling inpainting loss, which is composed of four parts: an adversarial loss, a perceptual loss [26], a style loss [27], and a reconstruction loss.
2.4.1. Edge Inpainting Loss
The input of the network consists of three parts: the mask, the greyscale version of the missing image, and the edge information of the missing image. $I_{gt}$ is used to represent the original image, and $C_{gt}$ and $I_{gray}$ are used to indicate its edge map and greyscale image. In the edge generator, we can obtain the following input:

$$ \tilde{I}_{gray} = I_{gray} \odot (1 - M), \qquad \tilde{C}_{gt} = C_{gt} \odot (1 - M) $$
In the above equation, $\tilde{I}_{gray}$ and $\tilde{C}_{gt}$ represent the greyscale image and edge map covered by the mask, and $M$ represents the mask, which takes only two values: a value of 1 means the information is missing, and a value of 0 means it is not. The symbol $\odot$ denotes elementwise multiplication. $G_1$ represents the edge generator, and $D_1$ represents the edge discriminator. The edge map of the mask-covered area predicted by the generator can be expressed as:

$$ C_{pred} = G_1\left( \tilde{I}_{gray}, \tilde{C}_{gt}, M \right) $$
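In code, with the same convention (M = 1 marks missing pixels), building the generator input and prediction is straightforward. The snippet below is a minimal sketch; the tensor shapes, the 3-channel input layout, and the stand-in for G1 are assumptions for illustration only.

```python
import torch
import torch.nn as nn

gray = torch.rand(1, 1, 256, 256)                    # greyscale image I_gray
edge = (torch.rand(1, 1, 256, 256) > 0.9).float()    # edge map C_gt
mask = (torch.rand(1, 1, 256, 256) > 0.75).float()   # mask M, 1 = missing

gray_masked = gray * (1 - mask)                      # I~_gray = I_gray ⊙ (1 - M)
edge_masked = edge * (1 - mask)                      # C~_gt  = C_gt  ⊙ (1 - M)

edge_generator = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder for the residual U-Net G1
edge_pred = torch.sigmoid(edge_generator(torch.cat([gray_masked, edge_masked, mask], dim=1)))
print(edge_pred.shape)                               # torch.Size([1, 1, 256, 256])
```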
$C_{pred}$ and $C_{gt}$ are used as the inputs of the discriminator network to determine the veracity of the edge image. The loss of the network is composed of two parts, and the specific expression is as follows:

$$ L_{edge} = \lambda_{adv,1} L_{adv,1} + \lambda_{FM} L_{FM} $$

In the above equation, $L_{adv,1}$ represents the adversarial loss, $L_{FM}$ represents the feature matching loss, and $\lambda_{adv,1}$ and $\lambda_{FM}$ are constant coefficients. The specific definition of $L_{adv,1}$ is as follows:

$$ L_{adv,1} = \mathbb{E}_{\left( C_{gt}, I_{gray} \right)}\left[ \log D_1\left( C_{gt}, I_{gray} \right) \right] + \mathbb{E}_{I_{gray}}\left[ \log\left( 1 - D_1\left( C_{pred}, I_{gray} \right) \right) \right] $$
The feature matching loss stabilizes the training process by comparing the activation maps of the discriminator's intermediate layers, and it is similar to the perceptual loss. The specific definition of $L_{FM}$ is:

$$ L_{FM} = \mathbb{E}\left[ \sum_{i=1}^{L} \frac{1}{N_i} \left\| D_1^{(i)}\left( C_{gt} \right) - D_1^{(i)}\left( C_{pred} \right) \right\|_1 \right] $$

where $L$ represents the number of convolution layers of $D_1$, $N_i$ represents the number of elements in the $i$th activation layer of $D_1$, and $D_1^{(i)}$ represents the activation output of the $i$th layer of $D_1$.
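A sketch of these two edge losses in PyTorch is given below, assuming a discriminator whose forward pass can also return its intermediate activations; the return_feats flag and all tensor names are illustrative assumptions rather than the paper's interface.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def edge_losses(d1, c_gt, c_pred, i_gray):
    """Adversarial + feature matching losses for the edge network (sketch).
    d1(edge, gray, return_feats=True) is assumed to return (logits, [feat_1, ..., feat_L])."""
    real_logits, real_feats = d1(c_gt, i_gray, return_feats=True)
    fake_logits, fake_feats = d1(c_pred, i_gray, return_feats=True)

    # Generator-side adversarial loss: fool D1 into scoring C_pred as real.
    l_adv_g = bce(fake_logits, torch.ones_like(fake_logits))
    # Discriminator-side loss (used when updating D1).
    l_adv_d = bce(real_logits, torch.ones_like(real_logits)) + \
              bce(fake_logits, torch.zeros_like(fake_logits))

    # Feature matching: per-layer mean absolute difference of intermediate activations.
    l_fm = sum(torch.mean(torch.abs(rf.detach() - ff)) for rf, ff in zip(real_feats, fake_feats))
    return l_adv_g, l_adv_d, l_fm
```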
In this paper, after many independent experiments, the final weight of each loss is set to and .
2.4.2. Content-Filling Loss
Using $I_{gt}$ to denote the original image and $M$ to indicate the mask, the missing image $I_{in}$ is expressed as follows:

$$ I_{in} = I_{gt} \odot (1 - M) $$
Using $G_2$ to denote the content-filling generator and $D_2$ to denote the image discriminator, the generated image is represented as:

$$ I_{pred} = G_2\left( I_{in}, C_{pred} \right) $$

The ultimate output of the entire inpainting network is:

$$ I_{out} = I_{in} + I_{pred} \odot M $$
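A minimal sketch of this forward path under the same M = 1 = missing convention follows; the stand-in for G2 and all shapes are assumptions, the real generator being the ATM-based U-Net described above.

```python
import torch
import torch.nn as nn

i_gt = torch.rand(1, 3, 256, 256)                        # original image I_gt
mask = (torch.rand(1, 1, 256, 256) > 0.75).float()       # mask M, 1 = missing
edge_pred = (torch.rand(1, 1, 256, 256) > 0.9).float()   # C_pred from the edge network

content_generator = nn.Conv2d(4, 3, kernel_size=3, padding=1)   # placeholder for G2

i_in = i_gt * (1 - mask)                                             # I_in = I_gt ⊙ (1 - M)
i_pred = torch.tanh(content_generator(torch.cat([i_in, edge_pred], dim=1)))  # I_pred = G2(I_in, C_pred)
i_out = i_in + i_pred * mask                                         # keep known pixels, fill the hole
print(i_out.shape)                                                   # torch.Size([1, 3, 256, 256])
```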
The adversarial loss $L_{adv,2}$ used to train the content-filling network is introduced as follows:

$$ L_{adv,2} = \mathbb{E}_{I_{gt}}\left[ \log D_2\left( I_{gt} \right) \right] + \mathbb{E}_{I_{pred}}\left[ \log\left( 1 - D_2\left( I_{pred} \right) \right) \right] $$
The perceptual loss compares the feature maps obtained by applying the same convolution operations to the real image and the generated image and minimizes the difference between them to improve the high-level semantic consistency of the two images. Specifically, in this paper, the real image and the inpainted image are compared through the activation feature maps of pool-$i$ ($i$ = 1, 2, 3) of a VGG-16 network trained on ImageNet. $N_i$ indicates the number of elements in the $i$th activation layer, and $\psi_i$ represents the activation map of the corresponding layer. The perceptual loss is calculated as follows:

$$ L_{perc} = \mathbb{E}\left[ \sum_{i=1}^{3} \frac{1}{N_i} \left\| \psi_i\left( I_{gt} \right) - \psi_i\left( I_{out} \right) \right\|_1 \right] $$
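A hedged PyTorch sketch of this perceptual loss using torchvision's pretrained VGG-16 follows; the slice indices (ending at features 4, 9, 16) correspond to pool-1/2/3, and normalisation of the inputs to ImageNet statistics is assumed to happen elsewhere.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """L1 distance between pool-1/2/3 activations of a frozen VGG-16 (sketch)."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in feats.parameters():
            p.requires_grad_(False)
        # Slices ending right after pool-1, pool-2, pool-3 (indices 4, 9, 16).
        self.slices = nn.ModuleList([feats[:5], feats[:10], feats[:17]])

    def forward(self, pred, target):
        loss = 0.0
        for sl in self.slices:
            loss = loss + torch.mean(torch.abs(sl(pred) - sl(target)))  # (1/N_i) * ||.||_1
        return loss
```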
The style loss is defined in terms of the correlation between the activation values of the channels of an activation feature map, measured using the activations of layer $k$. In this article, the same network activation layers as in the perceptual loss are chosen, and the correlation is represented by computing the covariance (Gram matrix) of the activation feature maps at different scales. Specifically, the style loss is defined as follows:

$$ L_{style} = \mathbb{E}_{k}\left[ \left\| G_k^{\psi}\left( I_{gt} \right) - G_k^{\psi}\left( I_{out} \right) \right\|_1 \right] $$

where $G_k^{\psi}$ is a Gram matrix of size $C_k \times C_k$ constructed from the activation $\psi_k$. The introduction of the style loss effectively counteracts the blurring artefacts caused by transposed convolution.
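The Gram-matrix computation can be sketched as below; it reuses the same VGG activations as the perceptual loss, and the normalisation by C·H·W is an assumption rather than the paper's exact scaling.

```python
import torch

def gram_matrix(feat):
    """Channel-by-channel correlation of an activation map: (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # normalised Gram matrix (assumed scaling)

def style_loss(feats_pred, feats_target):
    """L1 distance between Gram matrices over the chosen activation layers."""
    return sum(torch.mean(torch.abs(gram_matrix(p) - gram_matrix(t)))
               for p, t in zip(feats_pred, feats_target))
```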
To preserve semantic information and detailed features throughout the decoding process, an L1 loss is applied to the RGB images output by the decoding layers at all levels. Compared with the L1 norm, the L2 norm can partly increase the convergence speed of the model. However, in the early stage of network training, the L2 norm amplifies the differences between the restoration results and the corresponding pixels of the original image, and applying it at all levels of the network would further enlarge these differences and easily cause gradient explosion. In contrast, the L1 norm has a steady gradient for any input value and does not raise gradient explosion concerns. By scaling the real image to the size of the RGB image output at each decoding layer and computing the L1 distance between them, the reconstruction loss between the decoded output at each layer and the real image can be represented as follows:

$$ L_{rec} = \sum_{l} \left\| \mathrm{down}_{l}\left( I_{gt} \right) - \varphi_{l}\left( d_{l} \right) \right\|_1 $$

where $\mathrm{down}_{l}(\cdot)$ scales the real image to the same size as the decoding feature map $d_{l}$, and $\varphi_{l}(\cdot)$ is a 1 × 1 convolution that converts $d_{l}$ into an RGB image of the same size.
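A minimal sketch of this multiscale L1 term follows, assuming the per-scale RGB outputs come from heads like those in the decoder sketch above and that the ground truth is simply bilinearly downscaled to each output size.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(rgb_outputs, i_gt):
    """Sum of L1 distances between each per-scale RGB output and a resized ground truth."""
    loss = 0.0
    for rgb in rgb_outputs:                      # rgb: (B, 3, h_l, w_l), deep to shallow
        target = F.interpolate(i_gt, size=rgb.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(rgb, target)
    return loss
```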
The overall loss function of the content-filling network can be expressed as:

$$ L_{content} = \lambda_{adv,2} L_{adv,2} + \lambda_{perc} L_{perc} + \lambda_{style} L_{style} + \lambda_{rec} L_{rec} $$
In this paper, after many independent experiments, the weight of each loss is finally set to , , , and .
4. Discussion
Compared with the current mainstream algorithms, the model in this paper achieves better results when inpainting multiple irregularly shaped missing regions; it can accurately reconstruct various kinds of high-frequency information in the missing regions, and the synthesized textures are clearer. The specific contributions are reflected in the following aspects. First, to strengthen the edge generation network's ability to generate the unknown structural information of missing regions, skip connections and residual blocks are added to the network, and a feature matching loss is introduced into the loss function so that the generated results are more similar to the real edges. Second, the attention transfer module (ATM) is added to the content-filling module so that the restoration network focuses more on the region to be restored; a reconstruction loss refines the prediction at each scale, and perceptual and style losses are combined for model training to better reconstruct the contour structure and colour texture of the region to be restored. Finally, qualitative and quantitative comparison experiments are conducted against several classical networks, and the results validate the effectiveness of the network designed in this paper.
However, the method also has some limitations. First, some restoration results show obvious restoration traces and local semantic inconsistency with the overall image. Second, the restoration quality drops for images with very large missing areas or complex missing patterns. Finally, this paper adopts a two-stage image restoration model with a long convergence time, so it is time-consuming.
In summary, the next steps are to optimize the model from the following perspectives. The first is to combine a global consistency constraint with a local detail discrimination condition; a new image discriminator would be introduced to constrain the content inpainting network and improve the overall consistency of the inpainting results. The second is to capture long-range feature information to achieve the restoration of large missing regions. The final step is to simplify the overall network structure and reduce the network fitting time while maintaining the same inpainting quality.