1. Introduction
Image inpainting refers to inferring and restoring the missing content of an image from its known content so that the inpainted image satisfies human visual expectations as far as possible. As an important component of computer vision and computer graphics, image inpainting is widely used in cultural and lifestyle applications, such as the conservation of damaged paintings, calligraphy and other digital cultural heritage, old photo restoration, and object removal.
Prior to the widespread use of deep learning in image inpainting, traditional techniques were the most common. These methods can be classified into two categories. The first is diffusion-based image inpainting, in which, starting from the edge information of the missing regions, known pixels are propagated in a certain direction into the unknown region to obtain the inpainting result. The most representative models are the BSCB model proposed by Bertalmio et al. [1] and the curvature-driven diffusion (CDD) inpainting model introduced by Chan et al. [2]. However, this approach has some limitations: it is only suitable for relatively narrow missing areas, such as cracks and scratches, and it has difficulty inferring specific texture content. The other is the sample-based image inpainting method [3,4,5], in which similar sample blocks are searched for in the intact region of the image and used to fill the missing region. Among these, the most representative is the block-based texture synthesis algorithm of Criminisi et al. [4]. Criminisi's method can restore some large missing areas and generate clear texture details. However, it is prone to producing repetitive patterns and cannot generate reasonable structures in complex cases.
Deep-learning-based image inpainting methods address the lack of semantic understanding in traditional methods. Pathak et al. [6] combined an encoding-decoding network with a convolutional neural network in 2016 and presented a network model named the context encoder (CE). The model uses adversarial loss [7] to train a contextual encoder to predict the damaged areas of the image, but this method may suffer from edge distortion and local blurring. Building on this, Iizuka et al. [8] introduced a global discriminator into the context encoder (CE) model to make the restored image more consistent with the global semantics. However, this method has difficulty achieving complex texture inpainting, and its postprocessing is complex. Nazeri et al. [9] put forward a two-stage adversarial image inpainting model called EdgeConnect (EC), which uses edge repair results to constrain texture synthesis. Compared with a single-stage strategy that recovers structure and texture together in one forward pass, this method more easily reconstructs clear semantic boundaries and salient objects. The StructureFlow (SF) model proposed by Ren et al. [10] adopts an inpainting strategy similar to that of EC; that is, an edge-preserving smooth image is used to constrain texture synthesis, which alleviates texture distortion and artefacts. The Learning to Incorporate Structure Knowledge (LISK) model proposed by Yang et al. [11] uses gradient maps containing structural information and local texture details to constrain the inpainting procedure, reducing network parameters and improving local details.
The work in this paper is a deep-learning-based image inpainting approach focusing on the cross-layer transfer of attention for missing images based on a U-Net edge generation structure. U-Net is a deep network structure proposed by Ronneberger et al. [12] for image segmentation; thanks to its distinctive upsampling and downsampling paths plus skip connections, it handles the feature extraction task of deep networks well. Much progress has been made in U-Net-based image inpainting [13,14,15,16,17,18]. Barnes et al. [13] put forward a joint optimization framework and modelled global content control and local texture control using a convolutional neural network; they further introduced a multiscale neural network patch synthesis algorithm for high-resolution image inpainting based on the joint optimization framework, which is similar to the style-aware multiscale neural patch synthesis for high-resolution image restoration proposed by Yang et al. [14]. Yan et al. [15] proposed an image inpainting method that introduces a Shift Connection (SC) layer into the U-Net structure; the SC layer replaces the fully connected layer to transfer feature information from the background region of the image. This design can handle arbitrarily shaped missing regions and obtains finer textures and visually reasonable restoration results in less time. To address colour differences, blurring, and edge inconsistencies in restoration results, Liu et al. [16] used partial convolution with automatic mask updating in a U-Net structure to achieve image restoration without any additional postprocessing operations, effectively eliminating the artefact problem. Moreover, Mou et al. [17] proposed a new model, the deep generalized unfolded network (DGU-Net), which integrates a gradient estimation strategy into the steps of the proximal gradient descent (PGD) algorithm but is not successful on images with large missing areas. Wu et al. [18] proposed an end-to-end generative model for boundary and high-texture regions. First, a local binary pattern (LBP) learning network with a U-Net architecture is used to forecast the structural information of the missing areas; guided by this prediction, an upgraded spatial attention mechanism is added to the image inpainting network so that the algorithm can better restore the missing pixels.
Attention mechanisms have also been widely used in image inpainting. Yu et al. [19] developed a contextual attention model that uses sample-block matching on the feature map to reconstruct the damaged part of the feature map from similar undamaged parts. Zeng et al. [20] put forward an attention transfer network, a cross-layer attention mechanism that uses the feature map of the previous layer to compute attention scores for restoring the feature map of the current layer. Zheng et al. [21] introduced a combined short-term and long-term attention mechanism into their image inpainting model, so that feature inpainting can use not only the feature information in the decoder but also the feature information in the encoder. Liu et al. [22] proposed the coherent semantic attention (CSA) module. The CSA layer reconstructs the missing features through two processes, search and generation: first, the most relevant sample blocks are found in the undamaged area to replace the sample blocks in the damaged area; then, the CSA layer guides feature reconstruction by combining the relationship between the sample blocks to be reconstructed and the most relevant sample blocks with the relationship between the sample blocks to be reconstructed and the adjacent sample blocks. Wang et al. [23] and Li et al. [24] introduced pixel-level attention mechanisms into the image inpainting task.
Although the methods based on structural constraints are more suitable for reconstructing images with large irregular missing regions, these methods still have some defects. First, due to the influence of network depth and structure sparsity, the context information of the deep feature space is lost, which leads to the lack of structure in the central region of the inpainting result. Second, in the phase of texture synthesis, incomplete structure reconstruction results may lead to semantic loss and texture ambiguity in the final output.
To solve these problems, a context-aware image inpainting model based on edge and semantic pyramids is presented here. Specifically, similar to state-of-the-art two-stage inpainting strategies, the proposed method is divided into two parts: an edge inpainting network and a content-filling network, with the edge inpainting result used to constrain the image texture synthesis. In the edge inpainting network, a residual U-Net combining convolution and skip connections is used as the edge generator to strengthen the consistency of the contextual structure and alleviate the problem of missing edges in the defective areas. In the content-filling network, a U-Net is used as the image generator to synthesize texture under the constraint of the edge inpainting map. In addition, exploiting the complementarity between adjacent deep features, a cross-layer attention transfer model (ATM) is introduced and applied to all coding layers of the image generator to ensure, to the greatest extent possible, the context consistency of the decoding features at all levels. The skip connections in the edge inpainting network improve contextual structural connectivity, and the content-filling network, guided by the edge inpainting map, fills smooth and homogeneous texture information into the different regions delimited by the edge information. The application of the ATM reduces the loss of structural information and texture details during inpainting. The proposed method can generate high-quality restoration results with complete semantics and coherent structures. Specifically, the main work of this article is as described below:
A semantic pyramid inpainting network guided by image structure is proposed in this study. The edge inpainting map generated by the residual U-Net is fed into the pyramid content-filling network together with the defective image as a prior condition, and the final inpainted image is obtained through the encoder-decoder process.
The ATM module transfers the similarity of feature blocks inside and outside the missing region of the high-level feature map to the lower-level feature map and fills the missing area at the feature level based on this.
Experiments were conducted on multiple standard datasets, and qualitative and quantitative comparisons showed that when the restoration task involved large areas of missing or complex structures, the proposed method had higher restoration quality than the existing mainstream methods.
2. Approach
The overall structure of this article's model is shown in Figure 1. The model is composed of two parts: an edge inpainting network and a texture generation network. First, the edge inpainting network generates a reasonable edge profile in the defective area based on the greyscale values of the pixels surrounding the defective image and the edge information of the undamaged area. Then, the image inpainting network rebuilds the feature map layer by layer from deep to shallow under the prior condition of the edge inpainting map, transfers the ATM feature maps to the respective decoding layers via skip connections, and performs decoding convolution layer by layer to improve the consistency of the decoded feature maps. Finally, the feature map of each layer in the decoding procedure is converted to an RGB image to obtain the final restored result. The parts of the model are described in detail below.
2.1. Edge Inpainting Network
The edge inpainting model is a residual U-Net that combines convolution and skip connections and includes a generator network and a discriminator network.
Figure 2 shows the generator in the edge prediction model of this paper. The left half of the network is the encoder, which downsamples the input image. Each downsampling step is composed of two 3 × 3 convolutions with ReLU activation functions, followed by a 2 × 2 max-pooling layer. The right half is the decoder, which upsamples the features. Each upsampling step consists of a 2 × 2 convolution layer with a ReLU activation function, a concatenation layer that joins the corresponding encoder feature map with the upsampling result, and two 3 × 3 convolution layers. The last layer reduces the number of channels to one with a 1 × 1 convolution. Up to this point the design differs little from the standard U-Net. On this basis, however, a residual connection is added after each upsampling and downsampling step, and the feature maps before and after convolution are added to achieve an effect similar to that of a residual network.
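The following is a minimal PyTorch sketch of one downsampling stage of such a residual U-Net, assuming the residual connection is realised with a 1 × 1 projection when the channel count changes; the module and parameter names are illustrative and not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class ResidualDownBlock(nn.Module):
    """One encoder stage: two 3x3 convs + ReLU, a residual add, then 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 1x1 projection so the residual add works when channel counts differ (assumption).
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.body(x) + self.skip(x)   # residual connection around the conv pair
        return self.pool(feat), feat         # pooled output for the next stage, feat for the skip connection


x = torch.randn(1, 2, 256, 256)              # e.g. masked grey image + masked edge map
down, skip = ResidualDownBlock(2, 64)(x)
print(down.shape, skip.shape)                # torch.Size([1, 64, 128, 128]) torch.Size([1, 64, 256, 256])
```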
The input of the discriminator is the image edge information, and its overall structure is four consecutive downsampling layers. The final output is a scalar indicating the probability that the input edge information is real, which is used to judge the quality of the result produced by the generator.
2.2. Attention Transfer Model
During feature extraction, high-level features carry distinct semantic information, while shallow low-level features contain more structural information and texture details. To fill in pixel information in the defective feature map while ensuring overall semantic consistency, the ATM is proposed in this paper. Taking the patch similarity inside and outside the missing area of the high-level features as guidance and filling valid pixel content into the missing area of the low-level features, the ATM can accurately reconstruct a feature map with complete semantics and rich texture details. The specific operation is shown in Figure 3.
First, the ATM computes the cosine similarity between 3 × 3 patches inside and outside the defective area of the high-level feature map $\Phi_l$ as the basis for the attention scores. Let $p_i^l$ denote the $i$th feature patch in the background (known) region of $\Phi_l$ and $p_j^l$ denote the $j$th feature patch in the missing region of $\Phi_l$. The corresponding cosine similarity is computed as:

$$ s_{i,j} = \left\langle \frac{p_i^l}{\left\| p_i^l \right\|_2}, \frac{p_j^l}{\left\| p_j^l \right\|_2} \right\rangle $$

After obtaining the cosine similarity for each patch in the missing region, the softmax function maps it to a real number in [0, 1], which serves as the attention score of each patch:

$$ \alpha_{i,j} = \frac{\exp\left( s_{i,j} \right)}{\sum_{i=1}^{N} \exp\left( s_{i,j} \right)} $$

After the attention scores for each patch in the missing area of $\Phi_l$ are obtained, the same similarity relationship is applied to the adjacent lower-level feature map $\Phi_{l-1}$ with a higher resolution; that is, patches are copied with weights given by the attention scores to complete the entire missing region. Specifically, let $q_i^{l-1}$ denote the $i$th patch in the background region of $\Phi_{l-1}$ and $q_j^{l-1}$ denote the $j$th patch to be filled in the missing region. The filling process can then be expressed as:

$$ q_j^{l-1} = \sum_{i=1}^{N} \alpha_{i,j}\, q_i^{l-1} $$
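A compact PyTorch sketch of this attention transfer step is given below. It computes patch similarities on the high-level map with a convolution against normalised background patches, in the style of contextual-attention implementations, and pastes weighted background patches into the lower-level map with a transposed convolution. The patch size, the unfold/fold strategy, and all names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def atm(phi_high, phi_low, mask, patch=3):
    """Cross-layer attention transfer (sketch). phi_high: (1,C,H,W) high-level map,
    phi_low: (1,C2,2H,2W) adjacent lower-level map, mask: (1,1,H,W) with 1 = missing."""
    _, C, H, W = phi_high.shape
    _, C2, _, _ = phi_low.shape

    # 1) Patches of the high-level map act as matching kernels (keys are L2-normalised).
    kern_hi = F.unfold(phi_high, patch, padding=patch // 2)            # (1, C*p*p, H*W)
    kern_hi = kern_hi.transpose(1, 2).reshape(H * W, C, patch, patch)
    kern_hi = kern_hi / (kern_hi.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)

    # 2) Cosine-style similarity of every location to every candidate patch.
    scores = F.conv2d(phi_high, kern_hi, padding=patch // 2)           # (1, H*W, H, W)

    # 3) Candidates whose centre lies inside the missing region are excluded before the softmax.
    invalid = mask.flatten().view(1, H * W, 1, 1)
    attn = torch.softmax(scores - 1e4 * invalid, dim=1)                # attention over background patches

    # 4) The corresponding (2x larger) patches of the low-level map are pasted back,
    #    weighted by the attention scores, via a transposed convolution.
    kern_lo = F.unfold(phi_low, 2 * patch, stride=2, padding=patch - 1)
    kern_lo = kern_lo.transpose(1, 2).reshape(H * W, C2, 2 * patch, 2 * patch)
    filled = F.conv_transpose2d(attn, kern_lo, stride=2, padding=patch - 1)
    overlap = F.conv_transpose2d(attn, torch.ones(H * W, 1, 2 * patch, 2 * patch,
                                                  device=attn.device),
                                 stride=2, padding=patch - 1)
    filled = filled / (overlap + 1e-8)                                 # average overlapping pastes

    # 5) Only the missing region of the low-level map is replaced.
    mask_lo = F.interpolate(mask, scale_factor=2, mode="nearest")
    return phi_low * (1 - mask_lo) + filled * mask_lo


phi_hi, phi_lo = torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64)
hole = torch.zeros(1, 1, 32, 32); hole[..., 12:20, 12:20] = 1.0
print(atm(phi_hi, phi_lo, hole).shape)        # torch.Size([1, 128, 64, 64])
```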
2.3. Texture Generation Network
Similar to the edge inpainting network, the texture generation network is made up of a generator and a discriminator. Combining the structural characteristics of the ATM and U-Net, the generator applies the ATM layer by layer from deep to shallow after encoding is complete and then transfers the ATM-reconstructed feature maps to the multiscale decoder through skip connections, where they are fused with the corresponding latent feature maps for layer-by-layer decoding. After decoding, the features of each decoding layer are converted into RGB images of the same size, and an L1 loss is used to measure their difference from the real image, forcing the generator to improve the context consistency of each decoding feature. Finally, the inpainting results are evaluated through perceptual loss, style loss, and adversarial loss, and the network parameters are optimized.
To produce fine-grained inpainting effects, it is important to reduce the loss of local detail in the output features at all levels of the network. Ordinary convolution with a 3 × 3 kernel and a stride of 1 is used in both the encoding and decoding stages of the image inpainting network instead of dilated convolution with different dilation rates. This is because, in the U-Net, the scale of the feature maps decreases step by step as the network depth increases; in feature maps at 64 × 64 and lower scales, the gridding effect of dilated convolution destroys the continuity between adjacent pixels and ignores important local features, which would reduce the effectiveness of the subsequent ATM feature reconstruction.
After encoding, the generator applies the ATM layer by layer to rebuild the feature maps. Given a 6-layer encoder, the output feature maps of the encoding procedure are denoted, from deep to shallow, as $\Phi_6, \Phi_5, \Phi_4, \Phi_3, \Phi_2, \Phi_1$, and $\mathrm{ATM}(\cdot)$ represents the ATM operation. The reconstructed features at the different levels are then:

$$ \hat{\Phi}_{l} = \mathrm{ATM}\left( \hat{\Phi}_{l+1}, \Phi_{l} \right), \quad l = 5, 4, 3, 2, 1, \qquad \hat{\Phi}_{6} = \Phi_{6} $$
The multiscale decoder takes the feature maps reconstructed by the ATM, delivered through skip connections, together with the deep features from the encoder as input. The output features of each layer of the multiscale decoder are denoted as $d_5, d_4, d_3, d_2, d_1$; $T(\cdot)$ is the transposed convolution operation, and $[\cdot\,,\cdot]$ denotes feature concatenation. The decoding process can then be represented as:

$$ d_{l} = T\left( \left[ \hat{\Phi}_{l+1},\, d_{l+1} \right] \right), \quad l = 5, 4, 3, 2, 1, \qquad d_{6} = \Phi_{6} $$
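The reconstruct-then-decode flow can be sketched as a short loop, as below. The per-scale RGB heads concatenate the decoded map with the corresponding ATM feature, mirroring the description in the next paragraph; the 6-level layout, channel widths, kernel sizes, and toy spatial sizes are placeholders, and the final upsampling to 256 × 256 is omitted.

```python
import torch
import torch.nn as nn

class MultiscaleDecoder(nn.Module):
    """Sketch of decoding from deep to shallow; channels[0] is the deepest encoder width."""
    def __init__(self, channels=(512, 512, 256, 128, 64, 32)):
        super().__init__()
        self.ups = nn.ModuleList([           # T(.) applied to [phi_hat_{l+1}, d_{l+1}]
            nn.ConvTranspose2d(2 * channels[i], channels[i + 1], kernel_size=4, stride=2, padding=1)
            for i in range(len(channels) - 1)
        ])
        self.to_rgb = nn.ModuleList([        # 1x1 heads: [d_l, phi_hat_l] -> RGB, tanh output
            nn.Sequential(nn.Conv2d(2 * channels[i + 1], 3, kernel_size=1), nn.Tanh())
            for i in range(len(channels) - 1)
        ])

    def forward(self, phi_hats):
        # phi_hats: ATM-reconstructed features [phi_hat_6, ..., phi_hat_1], deep to shallow
        d, rgbs = phi_hats[0], []            # d_6 = phi_6
        for up, head, phi_prev, phi_cur in zip(self.ups, self.to_rgb, phi_hats, phi_hats[1:]):
            d = up(torch.cat([phi_prev, d], dim=1))             # d_l = T([phi_hat_{l+1}, d_{l+1}])
            rgbs.append(head(torch.cat([d, phi_cur], dim=1)))   # per-scale RGB output for the L1 loss
        return d, rgbs


# Toy input: 6 reconstructed feature maps from 4x4 up to 128x128 (assumed sizes).
phis = [torch.randn(1, c, 4 * 2 ** i, 4 * 2 ** i)
        for i, c in enumerate((512, 512, 256, 128, 64, 32))]
d1, rgbs = MultiscaleDecoder()(phis)
print(d1.shape, [r.shape for r in rgbs])
```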
On the one hand, the feature maps reconstructed by the ATM encode more low-level information for the missing area. On the other hand, even when relevant pixel content cannot be borrowed from outside the missing region, the feature information obtained from the compact latent features through convolution can synthesize new objects in the missing area. Combining these two effects, the multiscale decoder can synthesize new content with complete semantics and realistic texture by using the context information of the image.
After multiscale decoding is complete, the feature maps output at every level of the decoding layers are concatenated with the corresponding ATM feature maps and converted into RGB images. The last RGB image is upsampled to 256 × 256 to obtain the restored image. The tanh activation function is applied in each convolution layer that converts the decoding feature maps at the various scales into RGB images.
The discriminator D1 of the edge inpainting network uses the same network structure and parameter settings as the discriminator D2 of the content-filling network; both improve their ability to distinguish real from generated images by minimizing the adversarial loss. The specific network structure parameters are shown in Table 1.
2.4. Loss Function
The loss function of the model in this article includes two parts, the loss function of the edge inpainting model and the loss function of the texture generation model, as shown below:

$$ L_{total} = L_{edge} + L_{content} $$

where $L_{edge}$ represents the loss function of the edge inpainting model, which is made up of an adversarial loss and a feature matching loss [25], and $L_{content}$ represents the content-filling inpainting loss, which is composed of four parts: an adversarial loss, a perceptual loss [26], a style loss [27], and a reconstruction loss.
2.4.1. Edge Inpainting Loss
The input of the network consists of three parts: the mask, the greyscale version of the missing image, and the edge information of the missing image. $I_{gt}$ is used to represent the original image, and $C_{gt}$ and $I_{gray}$ are used to indicate its edge map and greyscale image. In the edge generator, we can obtain the following input:

$$ \tilde{I}_{gray} = I_{gray} \odot (1 - M), \qquad \tilde{C}_{gt} = C_{gt} \odot (1 - M) $$
In the above equation, $\tilde{I}_{gray}$ and $\tilde{C}_{gt}$ represent the greyscale image and edge map covered by the mask, and $M$ represents the mask, which takes only two values: a value of 1 means the information is missing, and a value of 0 means it is not. The symbol $\odot$ denotes elementwise multiplication. $G_1$ represents the edge generator, and $D_1$ represents the edge discriminator. The edge map of the mask-covered area predicted by the generator can be expressed as:

$$ C_{pred} = G_1\left( \tilde{I}_{gray}, \tilde{C}_{gt}, M \right) $$
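In code, with the same convention (M = 1 marks missing pixels), building the generator input and prediction is straightforward. The snippet below is a minimal sketch; the tensor shapes, the 3-channel input layout, and the stand-in for G1 are assumptions for illustration only.

```python
import torch
import torch.nn as nn

gray = torch.rand(1, 1, 256, 256)                    # greyscale image I_gray
edge = (torch.rand(1, 1, 256, 256) > 0.9).float()    # edge map C_gt
mask = (torch.rand(1, 1, 256, 256) > 0.75).float()   # mask M, 1 = missing

gray_masked = gray * (1 - mask)                      # I~_gray = I_gray ⊙ (1 - M)
edge_masked = edge * (1 - mask)                      # C~_gt  = C_gt  ⊙ (1 - M)

edge_generator = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder for the residual U-Net G1
edge_pred = torch.sigmoid(edge_generator(torch.cat([gray_masked, edge_masked, mask], dim=1)))
print(edge_pred.shape)                               # torch.Size([1, 1, 256, 256])
```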
$C_{pred}$ and $C_{gt}$ are used as the inputs of the discriminator network to determine the veracity of the edge image. The loss of the network is composed of two parts, and the specific expression is as follows:

$$ L_{edge} = \lambda_{adv,1} L_{adv,1} + \lambda_{FM} L_{FM} $$

In the above equation, $L_{adv,1}$ represents the adversarial loss, $L_{FM}$ represents the feature matching loss, and $\lambda_{adv,1}$ and $\lambda_{FM}$ are constant coefficients. The specific definition of $L_{adv,1}$ is as follows:

$$ L_{adv,1} = \mathbb{E}_{\left( C_{gt}, I_{gray} \right)}\left[ \log D_1\left( C_{gt}, I_{gray} \right) \right] + \mathbb{E}_{I_{gray}}\left[ \log\left( 1 - D_1\left( C_{pred}, I_{gray} \right) \right) \right] $$
The feature matching loss stabilizes the training process by comparing the activation maps of the discriminator's intermediate layers, and it is similar to the perceptual loss. The specific definition of $L_{FM}$ is:

$$ L_{FM} = \mathbb{E}\left[ \sum_{i=1}^{L} \frac{1}{N_i} \left\| D_1^{(i)}\left( C_{gt} \right) - D_1^{(i)}\left( C_{pred} \right) \right\|_1 \right] $$

where $L$ represents the number of convolution layers of $D_1$, $N_i$ represents the number of elements in the $i$th activation layer of $D_1$, and $D_1^{(i)}$ represents the activation output of the $i$th layer of $D_1$.
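A sketch of these two edge losses in PyTorch is given below, assuming a discriminator whose forward pass can also return its intermediate activations; the return_feats flag and all tensor names are illustrative assumptions rather than the paper's interface.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def edge_losses(d1, c_gt, c_pred, i_gray):
    """Adversarial + feature matching losses for the edge network (sketch).
    d1(edge, gray, return_feats=True) is assumed to return (logits, [feat_1, ..., feat_L])."""
    real_logits, real_feats = d1(c_gt, i_gray, return_feats=True)
    fake_logits, fake_feats = d1(c_pred, i_gray, return_feats=True)

    # Generator-side adversarial loss: fool D1 into scoring C_pred as real.
    l_adv_g = bce(fake_logits, torch.ones_like(fake_logits))
    # Discriminator-side loss (used when updating D1).
    l_adv_d = bce(real_logits, torch.ones_like(real_logits)) + \
              bce(fake_logits, torch.zeros_like(fake_logits))

    # Feature matching: per-layer mean absolute difference of intermediate activations.
    l_fm = sum(torch.mean(torch.abs(rf.detach() - ff)) for rf, ff in zip(real_feats, fake_feats))
    return l_adv_g, l_adv_d, l_fm
```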
In this paper, after many independent experiments, the final weight of each loss is set to and .
2.4.2. Content-Filling Loss
Using $I_{gt}$ to denote the original image and $M$ to indicate the mask, the missing image $I_{in}$ is expressed as follows:

$$ I_{in} = I_{gt} \odot (1 - M) $$
Using $G_2$ to denote the content-filling generator and $D_2$ to denote the image discriminator, the generated image is represented as:

$$ I_{pred} = G_2\left( I_{in}, C_{pred} \right) $$

The ultimate output of the entire inpainting network is:

$$ I_{out} = I_{in} + I_{pred} \odot M $$
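A minimal sketch of this forward path under the same M = 1 = missing convention follows; the stand-in for G2 and all shapes are assumptions, the real generator being the ATM-based U-Net described above.

```python
import torch
import torch.nn as nn

i_gt = torch.rand(1, 3, 256, 256)                        # original image I_gt
mask = (torch.rand(1, 1, 256, 256) > 0.75).float()       # mask M, 1 = missing
edge_pred = (torch.rand(1, 1, 256, 256) > 0.9).float()   # C_pred from the edge network

content_generator = nn.Conv2d(4, 3, kernel_size=3, padding=1)   # placeholder for G2

i_in = i_gt * (1 - mask)                                             # I_in = I_gt ⊙ (1 - M)
i_pred = torch.tanh(content_generator(torch.cat([i_in, edge_pred], dim=1)))  # I_pred = G2(I_in, C_pred)
i_out = i_in + i_pred * mask                                         # keep known pixels, fill the hole
print(i_out.shape)                                                   # torch.Size([1, 3, 256, 256])
```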
The adversarial loss $L_{adv,2}$ used to train the content-filling network is introduced as follows:

$$ L_{adv,2} = \mathbb{E}_{I_{gt}}\left[ \log D_2\left( I_{gt} \right) \right] + \mathbb{E}_{I_{pred}}\left[ \log\left( 1 - D_2\left( I_{pred} \right) \right) \right] $$
The perceptual loss compares the feature maps obtained by applying the same convolution operations to the real image and the generated image and minimizes the difference between them to improve the high-level semantic consistency of the two images. Specifically, in this paper, the real image and the inpainted image are compared through the activation feature maps of pool-$i$ ($i$ = 1, 2, 3) of a VGG-16 network trained on ImageNet. $N_i$ indicates the number of elements in the $i$th activation layer, and $\psi_i$ represents the activation map of the corresponding layer. The perceptual loss is calculated as follows:

$$ L_{perc} = \mathbb{E}\left[ \sum_{i=1}^{3} \frac{1}{N_i} \left\| \psi_i\left( I_{gt} \right) - \psi_i\left( I_{out} \right) \right\|_1 \right] $$
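A hedged PyTorch sketch of this perceptual loss using torchvision's pretrained VGG-16 follows; the slice indices (ending at features 4, 9, 16) correspond to pool-1/2/3, and normalisation of the inputs to ImageNet statistics is assumed to happen elsewhere.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """L1 distance between pool-1/2/3 activations of a frozen VGG-16 (sketch)."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in feats.parameters():
            p.requires_grad_(False)
        # Slices ending right after pool-1, pool-2, pool-3 (indices 4, 9, 16).
        self.slices = nn.ModuleList([feats[:5], feats[:10], feats[:17]])

    def forward(self, pred, target):
        loss = 0.0
        for sl in self.slices:
            loss = loss + torch.mean(torch.abs(sl(pred) - sl(target)))  # (1/N_i) * ||.||_1
        return loss
```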
The style loss is defined in terms of the correlation between the activation values of the channels of an activation feature map, measured using the activations of layer $k$. In this article, the same network activation layers as in the perceptual loss are chosen, and the correlation is represented by computing the covariance (Gram matrix) of the activation feature maps at different scales. Specifically, the style loss is defined as follows:

$$ L_{style} = \mathbb{E}_{k}\left[ \left\| G_k^{\psi}\left( I_{gt} \right) - G_k^{\psi}\left( I_{out} \right) \right\|_1 \right] $$

where $G_k^{\psi}$ is a Gram matrix of size $C_k \times C_k$ constructed from the activation $\psi_k$. The introduction of the style loss effectively counteracts the blurring artefacts caused by transposed convolution.
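The Gram-matrix computation can be sketched as below; it reuses the same VGG activations as the perceptual loss, and the normalisation by C·H·W is an assumption rather than the paper's exact scaling.

```python
import torch

def gram_matrix(feat):
    """Channel-by-channel correlation of an activation map: (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # normalised Gram matrix (assumed scaling)

def style_loss(feats_pred, feats_target):
    """L1 distance between Gram matrices over the chosen activation layers."""
    return sum(torch.mean(torch.abs(gram_matrix(p) - gram_matrix(t)))
               for p, t in zip(feats_pred, feats_target))
```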
To preserve semantic information and detailed features throughout the decoding process, an L1 loss is applied to the RGB images output by the decoding layers at all levels. Compared with the L1 norm, the L2 norm can partly increase the convergence speed of the model. However, in the early stage of network training, the L2 norm amplifies the differences between the restoration results and the corresponding pixels of the original image, and applying it at all levels of the network would further enlarge these differences and easily cause gradient explosion. In contrast, the L1 norm has a steady gradient for any input value and does not raise gradient explosion concerns. By scaling the real image to the size of the RGB image output at each decoding layer and computing the L1 distance between them, the reconstruction loss between the decoded output at each layer and the real image can be represented as follows:

$$ L_{rec} = \sum_{l} \left\| \mathrm{down}_{l}\left( I_{gt} \right) - \varphi_{l}\left( d_{l} \right) \right\|_1 $$

where $\mathrm{down}_{l}(\cdot)$ scales the real image to the same size as the decoding feature map $d_{l}$, and $\varphi_{l}(\cdot)$ is a 1 × 1 convolution that converts $d_{l}$ into an RGB image of the same size.
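A minimal sketch of this multiscale L1 term follows, assuming the per-scale RGB outputs come from heads like those in the decoder sketch above and that the ground truth is simply bilinearly downscaled to each output size.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(rgb_outputs, i_gt):
    """Sum of L1 distances between each per-scale RGB output and a resized ground truth."""
    loss = 0.0
    for rgb in rgb_outputs:                      # rgb: (B, 3, h_l, w_l), deep to shallow
        target = F.interpolate(i_gt, size=rgb.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(rgb, target)
    return loss
```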
The overall loss function of the content-filling network can be expressed as:

$$ L_{content} = \lambda_{adv,2} L_{adv,2} + \lambda_{perc} L_{perc} + \lambda_{style} L_{style} + \lambda_{rec} L_{rec} $$
In this paper, after many independent experiments, the weight of each loss is finally set to , , , and .
4. Discussion
Compared with the current mainstream algorithms, the model in this paper achieves better results when inpainting multiple irregularly shaped missing regions; it can accurately reconstruct various kinds of high-frequency information in the missing regions, and the synthesized textures are clearer. The specific contributions are reflected in the following aspects. First, to strengthen the edge generation network's ability to generate the unknown structural information of missing regions, skip connections and residual blocks are added to the network, and a feature matching loss is introduced into the loss function so that the generated results are more similar to the real edges. Second, the attention transfer module (ATM) is added to the content-filling module so that the restoration network focuses more on the region to be restored; a reconstruction loss refines the prediction at each scale, and perceptual and style losses are combined for model training to better reconstruct the contour structure and colour texture of the region to be restored. Finally, qualitative and quantitative comparison experiments are conducted against several classical networks, and the results validate the effectiveness of the network designed in this paper.
However, the method also has some limitations. First, some restoration results show obvious restoration traces and local semantic inconsistency with the overall image. Second, the restoration quality drops for images with very large missing areas or complex missing patterns. Finally, this paper adopts a two-stage image restoration model with a long convergence time, so it is time-consuming.
In summary, the next steps are to optimize the model from the following perspectives. The first is to combine a global consistency constraint with a local detail discrimination condition; a new image discriminator would be introduced to constrain the content inpainting network and improve the overall consistency of the inpainting results. The second is to capture long-range feature information to achieve the restoration of large missing regions. The final step is to simplify the overall network structure and reduce the network fitting time while maintaining the same inpainting quality.