Article

Multi-Step Structure Image Inpainting Model with Attention Mechanism

Cai Ran, Xinfu Li and Fang Yang
1 School of Cyber Security and Computer, Hebei University, Baoding 071002, China
2 Machine Vision Engineering Research Center, Hebei University, Baoding 071002, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(4), 2316; https://doi.org/10.3390/s23042316
Submission received: 22 December 2022 / Revised: 13 February 2023 / Accepted: 17 February 2023 / Published: 19 February 2023
(This article belongs to the Section Sensing and Imaging)

Abstract

The proliferation of deep learning has propelled image inpainting to an important research field. Although current image inpainting models have made remarkable achievements, two-stage methods tend to produce structural errors because the rough inpainting stage receives insufficient treatment. To address this problem, we propose a multi-step structure image inpainting model that incorporates an attention mechanism. Unlike previous two-stage inpainting models, we divide the damaged area into four sub-areas, calculate the priority of each sub-area to specify the inpainting order, and complete the rough inpainting stage over several steps. This multi-step strategy enhances the stability of the model. A structural attention mechanism strengthens the expression of structural features and improves the quality of structure and contour reconstruction. Experimental evaluation on benchmark datasets shows that our method effectively reduces structural errors and improves inpainting quality.

1. Introduction

Image inpainting is a technique that fills the damaged area of an image to produce a complete image that conforms to human visual perception and cognition. With the development of deep learning, image inpainting has become an important research field and is widely used in art restoration, special effects production for film and video, image editing, and other applications. Current approaches fall into two main categories: traditional mathematical methods and deep learning methods based on convolutional neural networks. Traditional methods mostly use mathematical, non-learning computations to extract information from the intact part of the image and fill in the damaged area. Such methods cannot extract the deep information of the image, so the synthesized result lacks semantic information and appears abrupt, deviating from human visual perception. Traditional methods fail particularly often on images with large holes, because the larger the hole, the more complex the texture, structure, and semantic information that must be recovered.
More recently, convolutional neural networks (CNNs) have alleviated these problems. Convolution extracts the deep information of an image more effectively, and the generated content is richer. Ian Goodfellow proposed generative adversarial networks (GANs) [1] in 2014, and fields related to image generation [2,3,4] have since developed significantly. The adversarial training scheme pushes the network to generate more realistic images. A GAN consists mainly of a generator and a discriminator, both built from convolutional neural networks. The generator produces the restored image; the restored image and the raw image are then fed to the discriminator, and the discrimination result is fed back to the generator. During this adversarial training, the images produced by the generator gradually gain richer color and texture detail, while the discriminator gradually improves its ability to distinguish synthetic images from natural ones. In short, GANs are currently the most promising approach to deep inpainting.
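For readers less familiar with this training scheme, the following is a generic sketch of a single adversarial update in PyTorch. It is not the specific architecture or loss used in this paper; `generator`, `discriminator`, and the binary cross-entropy objective are placeholders for any GAN-style inpainting setup.

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, damaged, mask, real):
    """One generic adversarial update: the discriminator learns to score real
    images as 1 and restored images as 0, then the generator learns to fool it."""
    fake = generator(damaged, mask)

    # Discriminator update (generator output detached so only D is trained here).
    d_real = discriminator(real)
    d_fake = discriminator(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: push the discriminator's score for the restoration towards "real".
    g_adv = discriminator(fake)
    g_loss = F.binary_cross_entropy_with_logits(g_adv, torch.ones_like(g_adv))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```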
Presently, GAN-based deep image inpainting methods fall into two directions, single-stage networks and two-stage networks, which differ mainly in the generator. A single-stage network is essentially an end-to-end model whose generator directly outputs the restored image. The Shift-Net proposed by Yan et al. [5] adopts a single-stage model: by combining a shift-connection layer with U-Net, it fills in missing regions of any shape with sharp structures and fine-detailed textures. This approach still does not resolve the blurriness inherent to single-stage models, so texture details are lost in the generated image. In subsequent work, Liu et al. [6] proposed the MED model to balance texture and structural features, aiming to preserve both in the inpainting results.
Most single-stage deep image inpainting methods lack dedicated processing of image details, so the synthesized images are overly smooth and blurred and lack texture and structural information. Two-stage generator networks adopt additional mechanisms to enhance image details. This structure resembles how a painter works: the structural information or sketch of the image is generated first, and the refined image is produced in the second stage. Yu et al. [7] proposed a contextual attention model that roughly restores the entire image based on context in the initial stage and generates more refined results in the subsequent stage. Nazeri et al. [8] introduced the EdgeConnect model, which restores the image's outline in the first stage and completes the colorization in the second stage.
Two-stage generators can usually express more vivid textures and semantic information. However, the coarse-to-fine network depends on the restoration result of the first stage: if the first stage produces a coarse image with deviations, the finely filled result of the second stage also deviates from human visual perception and exhibits obvious structural and semantic errors.
To improve the reliability of coarse-to-fine image inpainting and to address the lack of structural processing in coarse-to-fine models, we propose a coarse-to-fine deep image inpainting model built around a multi-step structure inpainting scheme. Our network focuses on the coarse stage of the restoration, since it is the basis for the final result. The coarse-to-fine generator adopts the U-Net structure and uses skip connections to counter gradient vanishing and to improve the transmission of image information through the deep network. In the first stage, we input the contour information of the image into the coarse network to reconstruct the structural features of the image. Unlike previous deep filling methods, we do not rebuild the damaged area at once but inpaint it in steps: we divide the occluded area into four sub-areas and restore them gradually. Attention mechanisms are now widely used in many fields [9,10,11,12]; they extract the important content of an image and improve the feature expression ability of a network. We therefore introduce a structural attention module to improve the extraction of structural information. In this module, the local binary pattern (LBP) operator and the grayscale image enhance the structural features, gated convolution extracts the features, and the enhanced feature information is passed to the decoder. We add local and global discriminators to the coarse-to-fine network to improve both the details and the overall effect of inpainting. In particular, in the second stage we additionally use perceptual loss and style loss to enhance realism. Experimental results demonstrate that our method effectively enhances the structural reconstruction of the two-stage inpainting model and minimizes the occurrence of incorrect structures.
The main contributions of this paper are as follows:
(1) We propose a coarse-to-fine inpainting network in which the structural information of the image is inpainted in the first stage and the reconstructed area is colored in the second stage. To address the difficulty of recovering the damaged area, we put forward a multi-step structure inpainting scheme.
(2) We introduce a structural information attention module to improve the reconstruction of structural information in the first stage.
(3) Our proposed method outperforms existing methods on the benchmark datasets. It resolves the instability of two-stage models such as the gated convolution model [13] and the EdgeConnect model [8] in image structure reconstruction, resulting in improved restoration. Our method also demonstrates better inpainting results than MED [6], a single-stage restoration model.

2. Related Work

Advances in image inpainting technology have made it possible to perform inpainting with greater precision and accuracy, resulting in more visually appealing and useful images. This has led to a growing demand for image inpainting solutions in a variety of industries, including photography, printing, and multimedia. In addition, image inpainting is becoming an important branch of privacy protection [14,15,16,17,18]. Below, we briefly review the work related to this paper.

2.1. Traditional Mathematical Method

Before deep learning matured, traditional mathematical methods used non-learning techniques to fill in images. These methods are mainly based on diffusion filling and patch matching.
Diffusion-based methods propagate information from the boundary of the intact area into the damaged area to complete the filling. Bertalmio et al. [19] first proposed a diffusion method based on partial differential equations, drawing on ideas from art restoration. Later partial-differential-equation-based methods include the total variation model [20], Euler's elastica model [21], and the Mumford-Shah model [22]. Diffusion works well for restoring small missing areas such as scratches, but it cannot complete images with large damaged regions.
Patch-matching methods search for patches in the intact area of the image to complete the filling. They can fill some large damaged areas more naturally than diffusion-based techniques, but the filled image still lacks semantic information and therefore does not always fit human cognition. Drori et al. [23] iteratively completed the missing content using smooth interpolation and filled the most similar image blocks into the occluded area after adding detail transformations. Criminisi et al. [24] used a priority computation to determine the restoration order and globally matched similar regions for filling, thereby preferentially propagating structural texture. The Criminisi algorithm can restore images with simple textures and structures but still cannot handle complex content. Wilczkowiak et al. [25] allowed the user to guide the repair interactively by searching for similar patch regions. Although these approaches outperform diffusion-based approaches for occlusion inpainting, they still fail to fill in complex textures and structures and, in particular, lack semantic information.

2.2. Image Inpainting Based on Deep Learning

After Goodfellow et al. [1] put forward generative adversarial networks (GANs), Pathak et al. [26] introduced GANs into image inpainting for the first time with the Context Encoders model, whose generator adopts an encoder-decoder structure. Context Encoders uses a convolutional neural network and can inpaint the semantic information of an image and fill large occluded areas, but the fully connected layer between its encoder and decoder makes the result too smooth and fuzzy. To enhance texture and structural details, Iizuka et al. [27] used a fully convolutional generator and added a local discriminator, forming a global-and-local image inpainting model; however, the double-discriminator model still lacks processing of image details. Yang et al. [28] proposed a joint generation structure to constrain image content and texture. Yan et al. [5] constructed a single-stage generator based on the U-Net architecture and introduced a shift-connection layer to fill missing areas of any shape with sharp structure and delicate texture, but the single-stage U-Net still yields overly smooth reconstructions. Yu et al. [7] constructed a coarse-to-fine generator architecture, put forward a contextual attention mechanism to further ensure consistency between the inpainted area and the intact area, and adopted an improved Wasserstein GAN [29] for stable training. Liu et al. [30] proposed partial convolution to improve the inpainting of irregular occlusions. Yu et al. [13] improved partial convolution into a learnable gated convolution and allowed users to edit images interactively. Inspired by artistic creation, Nazeri et al. [8] adopted a coarse-to-fine network that restores the outline of the image in the first stage and finishes the delicate filling in the second stage; however, because the contour is reconstructed all at once in the first stage, incorrect outlines easily arise and ultimately cause the second stage to fail. Liu et al. [31] designed a coherent semantic attention mechanism to strengthen the relationship between the features of missing regions. Liu et al. [6] argued that coarse-to-fine generators often cause serious semantic errors, so they put forward a joint encoder-decoder model with a feature equalization operation; nevertheless, the joint encoder-decoder model still cannot completely solve the blurring produced by single-stage models. More recently, Zhu et al. [32] proposed mask-aware dynamic filtering (MADF) and used a cascaded refinement network to fill arbitrary missing areas. Ren et al. [33] used a stacked network to repair occluded images from coarse to fine, improving the reusability of the extracted features.
Among deep inpainting models, the drawbacks of single-stage models [34,35,36,37,38,39], such as the lack of texture detail and blurred structure in the generated images, still cannot be effectively resolved. Most coarse-to-fine models [40,41,42,43,44,45,46] can produce more realistic textures, so many researchers work on improving the coarse-to-fine framework. However, these models handle the rough repair stage with little attention to detail. Moreover, most coarse-to-fine networks restore all damaged areas at once, so if the first stage produces an inaccurate image, the error is amplified by the subsequent refinement network and ultimately creates an unreasonable result. These problems make inpainting unstable: the same model sometimes produces compelling images and sometimes severe semantic errors.

3. Approach

3.1. Model

To solve these problems in the coarse-to-fine framework, we propose a two-stage image inpainting model that focuses on restoring structure and outline in the first stage, because this is the cornerstone of texture and color filling in the fine inpainting stage. We propose a multi-step structure inpainting scheme to improve the first-stage restoration and reduce the difficulty of structural reconstruction. Specifically, in the coarse inpainting stage we divide the occluded area of the image into four regions, determine the inpainting order by a priority calculation, and then feed the first-stage network step by step to complete the reconstruction gradually. The goal is to restore the easiest part at each step so that, when the remaining parts are rebuilt, the occluded area can sample more surrounding information to complete the filling. GANs are difficult to train, prone to mode collapse, and hard to converge; dividing the occluded area into multiple fills reduces the difficulty of each inpainting step and the pressure on the generator, thereby improving the result. In addition, we propose a structural attention mechanism to enhance the reliability of structural reconstruction. The attention module takes the grayscale and LBP images of the damaged image as input, fills the holes with six consecutive layers of gated convolution, and then sends the feature information to the decoder at each scale to enhance the features. Finally, the contour image reconstructed in the first stage and the original image are input into the fine-stage network, and the color filling of the image is completed through the U-Net to obtain the final restored image.
The proposed network structure is shown in Figure 1. For the convenience of explaining the network structure, this figure omits the gradual inpainting stage and only shows the initial input and final inpainting results.
As shown in Figure 1, let $C_{in} \in \mathbb{R}^{H \times W \times 1}$ be the outline of the damaged image, $M_{in} \in \mathbb{R}^{H \times W \times 1}$ the original mask, $L_{in} \in \mathbb{R}^{H \times W \times 1}$ the LBP feature map of the damaged image, and $G_{in} \in \mathbb{R}^{H \times W \times 1}$ the grayscale of the damaged image. In the backbone network of the rough inpainting stage, we use the contour image $C_{in}$ and the mask $M_{in}$ for reconstruction. The structural attention module takes the LBP feature map $L_{in}$, the grayscale map $G_{in}$, and the mask $M_{in}$ as input to extract feature information. The backbone network of the rough inpainting stage adopts the U-Net structure. The first convolution layer of the encoder downsamples the image $C_{in}$ to a $\frac{H}{2} \times \frac{W}{2} \times 16$ feature map, the second layer downsamples to $\frac{H}{4} \times \frac{W}{4} \times 32$, and the subsequent convolution layers gradually downsample to $\frac{H}{64} \times \frac{W}{64} \times 512$. The decoder is symmetric to the encoder, which ensures that feature information is not lost during decoding, and skip connections are introduced to effectively prevent the gradient vanishing caused by the deep network. At the connection between the encoder and decoder we do not adopt the traditional fully connected operation, because it causes image blur; instead, we use a residual block composed of three residual convolution layers to connect the encoder and the decoder. Compared with a fully connected layer, convolutional layers express image features better, and the residual structure effectively mitigates gradient vanishing in the deep network.
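The following PyTorch-style sketch illustrates this encoder and the residual bottleneck that replaces the fully connected layer. The first two stage widths (16 and 32 channels) follow the text; the remaining widths, the number of stages shown here (six, reaching H/64), the kernel sizes, the ELU activation, and the 2-channel input (contour plus mask) are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Three-layer residual convolution block used between encoder and decoder
    in place of a fully connected layer."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ELU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # residual connection against vanishing gradients

class CoarseEncoder(nn.Module):
    """Stride-2 downsampling path: H/2 x W/2 x 16, H/4 x W/4 x 32, ... down to
    a 512-channel bottleneck (intermediate widths beyond the first two are assumed)."""
    def __init__(self, in_channels=2, widths=(16, 32, 64, 128, 256, 512)):
        super().__init__()
        chans = (in_channels,) + widths
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1), nn.ELU())
            for c_in, c_out in zip(chans[:-1], chans[1:]))
        self.bottleneck = ResidualBlock(widths[-1])

    def forward(self, x):
        skips = []
        for stage in self.stages:   # each feature map is also handed to the
            x = stage(x)            # symmetric decoder through a skip connection
            skips.append(x)
        return self.bottleneck(x), skips
```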
In the proposed structural attention module, we use the LBP feature map. The LBP operator has good feature-extraction ability and is insensitive to illumination changes, which alleviates the difficulty of extracting features from unevenly illuminated images. The LBP image is defined as
$$I_{LBP}(x_c, y_c) = LBP(x_c, y_c),$$
where $(x_c, y_c)$ denotes a pixel location and $LBP(x_c, y_c)$ denotes the feature extracted at that coordinate, expressed as
$$LBP(x_c, y_c) = \sum_{p=0}^{P-1} 2^{p}\, s(i_p - i_c),$$
where $i_p$ is the value of the $p$-th neighborhood pixel, $i_c$ is the value of the center pixel, $P$ is the number of neighbors, and $s$ is the sign function
$$s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0. \end{cases}$$
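A minimal NumPy sketch of this 8-neighbour LBP computation is shown below; the clockwise neighbour ordering and the zeroed border handling are our own simplifications.

```python
import numpy as np

def lbp_map(gray):
    """Basic 8-neighbour LBP of a 2-D grayscale image (equations above):
    each neighbour i_p contributes 2^p when i_p >= i_c."""
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.uint8)
    # offsets of the eight neighbours i_p around the centre pixel i_c
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for p, (dy, dx) in enumerate(offsets):
        neighbour = np.roll(np.roll(gray, -dy, axis=0), -dx, axis=1)
        out |= ((neighbour >= gray).astype(np.uint8) << p)  # s(i_p - i_c) * 2^p
    # np.roll wraps around the image edges, so the border values are not meaningful
    out[0, :] = out[-1, :] = 0
    out[:, 0] = out[:, -1] = 0
    return out
```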
The attention module uses six layers of gated convolution. The gated convolution operation is expressed as
$$G_x = \phi(X \cdot W + b) \odot \sigma(X \cdot V + c),$$
where $W$ and $V$ denote different convolution filters and $\odot$ denotes element-wise multiplication. $\phi$ is the ELU activation function, which processes the feature branch, and $\sigma$ is the sigmoid activation function, which produces the gating values and maps them between 0 and 1. The six convolution layers of the attention module reduce the feature map to $\frac{H}{8} \times \frac{W}{8} \times 64$, which is convenient for upsampling and for combining the structural feature information with the decoder.
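A sketch of one gated convolution layer and the six-layer trunk of the attention module follows. Only the final 64-channel, H/8 × W/8 output is stated in the text; the intermediate widths and strides here are assumptions.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: ELU(X·W + b) ⊙ sigmoid(X·V + c), cf. Yu et al. [13]."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, pad)  # X·W + b
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, pad)     # X·V + c
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))

# Six gated convolutions over the concatenated [LBP, grayscale, mask] input,
# ending at H/8 x W/8 x 64 before the features are merged into the decoder.
attention_trunk = nn.Sequential(
    GatedConv2d(3, 16, stride=2),   # H/2
    GatedConv2d(16, 32, stride=2),  # H/4
    GatedConv2d(32, 64, stride=2),  # H/8
    GatedConv2d(64, 64),
    GatedConv2d(64, 64),
    GatedConv2d(64, 64),
)
```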
In the fine stage, let $I_{in} \in \mathbb{R}^{H \times W \times 3}$ be the damaged color image and $C_{out} \in \mathbb{R}^{H \times W \times 1}$ the outline image output by the coarse stage. Finally, $I_{in}$, $C_{out}$, and $M_{in}$ are input into the fine-stage network to finish the final image coloring. The generator used in the fine inpainting stage has the same structure as that of the coarse inpainting stage.

3.2. Multi-Step Structure Inpainting

The specific process of the coarse inpainting stage is shown in Figure 2. We divide the mask into four parts and determine the inpainting order according to a priority policy. The proposed priority calculation is given in Equation (1). The priority order is determined by the size of the complete area contained in each part: let $pri_i$, $i \in \{1, 2, 3, 4\}$, denote the priority of the $i$-th mask block, $\sum_{j} D_{i,j}$ the total number of damaged pixels in the $i$-th mask area, and $\sum_{k} C_{i,k}$ the total number of complete pixels in the $i$-th mask area. The higher the $pri$ value, the higher the priority. We inpaint high-priority occlusion blocks first because their masked areas are smaller and easier to restore; once the high-priority areas are reconstructed, more reference information is available when restoring the low-priority areas, which leads to better inpainting results and fewer structural errors.
$$pri_{i \in \{1,2,3,4\}} = \frac{\sum_{k} C_{i,k}}{\sum_{j} D_{i,j}}.$$
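The priority computation can be sketched as follows, assuming the four sub-areas are the image quadrants suggested by Figures 2 and 3 and using the convention of Equation (1) that more complete pixels yield a higher priority.

```python
import numpy as np

def inpainting_order(mask, eps=1e-6):
    """Rank the four quadrants of a binary mask (1 = complete, 0 = damaged) by
    the priority of Equation (1). Returns quadrant indices, first-restored first."""
    h, w = mask.shape
    quadrants = [mask[:h // 2, :w // 2], mask[:h // 2, w // 2:],
                 mask[h // 2:, :w // 2], mask[h // 2:, w // 2:]]
    pri = []
    for q in quadrants:
        complete = float(q.sum())            # sum_k C_{i,k}
        damaged = float(q.size - q.sum())    # sum_j D_{i,j}
        pri.append(complete / (damaged + eps))
    return list(np.argsort(pri)[::-1])       # highest priority first
```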
Figure 2 shows the first two steps of the multi-step structure inpainting model. In the first step, let $M_1 \in \mathbb{R}^{H \times W \times 1}$ denote the first block of the occluded area to be restored. We input $C_{in}$ and $M_1$ into the backbone network of the coarse stage and input $L_{in}$, $G_{in}$, and $M_1$ into the structural attention module, obtaining the contour image $C_1$ reconstructed in the first step. In the second step, we input $C_1$ and the second mask block $M_2$ into the backbone network of the coarse stage and input $L_{in}$, $G_{in}$, and $M_2$ into the structural attention module. After four such restoration steps, the final contour image $C_{out}$ is obtained. Finally, the image inpainting is completed through the fine-stage network.
The restoration process for one image is shown in Figure 3. According to the priority calculation, the lower-right corner area has the highest priority because its occluded area is the smallest, so it is restored first. The result of the first step, (b), is then passed to the coarse inpainting model as reference for the second step. Although the last area has the lowest priority and the largest occluded area, the reference information accumulated over the previous three steps reduces the inpainting difficulty, improves the stability of the generator, and enhances the structural reconstruction.
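Putting the pieces together, the multi-step coarse stage can be sketched as a loop over the prioritized sub-masks, where each step reuses the contour produced by the previous one. `backbone` and `attention` stand for the coarse-stage U-Net and the structural attention module; their call signatures here are assumptions, and `inpainting_order` is the helper from the previous sketch.

```python
import numpy as np

def quadrant_masks(mask):
    """Four full-size masks, each exposing only one quadrant's damaged pixels
    (every other region is marked complete)."""
    h, w = mask.shape
    regions = [(slice(0, h // 2), slice(0, w // 2)), (slice(0, h // 2), slice(w // 2, w)),
               (slice(h // 2, h), slice(0, w // 2)), (slice(h // 2, h), slice(w // 2, w))]
    masks = []
    for ys, xs in regions:
        m = np.ones_like(mask)
        m[ys, xs] = mask[ys, xs]
        masks.append(m)
    return masks

def coarse_stage(contour_in, lbp_in, gray_in, mask_in, backbone, attention):
    """Multi-step structural reconstruction (a sketch of Figures 2 and 3)."""
    order = inpainting_order(mask_in)      # highest-priority quadrant first
    sub_masks = quadrant_masks(mask_in)
    contour = contour_in
    for idx in order:
        m = sub_masks[idx]
        # the contour reconstructed so far is fed back in as reference for the next block
        contour = backbone(contour, m, attention(lbp_in, gray_in, m))
    return contour                         # C_out after four steps
```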

3.3. Loss Function

According to the different functions of the coarse inpainting network and fine inpainting network, we use two joint loss functions to complete the training of the network.
The coarse-stage output does not contain color, style, or texture features, so we only use feature-matching loss, pixel-level reconstruction loss, and adversarial loss to constrain the structural reconstruction of the image. In the fine-stage network, we jointly use pixel-level reconstruction loss, adversarial loss, perceptual loss, and style loss.

3.3.1. Feature-Matching Loss

Similar to EdgeConnect [8], we use a feature-matching loss to guarantee the reconstruction of the structure image:
$$\mathcal{L}_{fm} = \mathbb{E}\left[\sum_{i=1}^{L} \frac{1}{N_i} \left\| D_1^{(i)}(C_{gt}) - D_1^{(i)}(C_{pred}) \right\|_1\right],$$
where $L$ is the final convolution layer of the discriminator, $N_i$ is the number of elements in the $i$-th activation layer, and $D_1^{(i)}$ is the activation of the $i$-th layer of the discriminator.
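A sketch of this loss in PyTorch; with the default `reduction='mean'`, `l1_loss` already applies the per-layer 1/N_i normalization.

```python
import torch.nn.functional as F

def feature_matching_loss(feats_gt, feats_pred):
    """L_fm: L1 distance between the discriminator activations of C_gt and C_pred.
    `feats_gt` and `feats_pred` are lists of D1's intermediate feature maps."""
    loss = 0.0
    for f_gt, f_pred in zip(feats_gt, feats_pred):
        loss = loss + F.l1_loss(f_pred, f_gt.detach())  # mean reduction gives the 1/N_i factor
    return loss
```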

3.3.2. Reconstruction Loss

To improve pixel-level refinement, we use the L1 distance as the reconstruction loss to measure the error between the predicted image and the real image:
$$\mathcal{L}_{rec} = \left\| I_{out} \odot (1 - M_{in}) - I_{gt} \odot (1 - M_{in}) \right\|_1,$$
where $I_{out}$ is the restored image, $I_{gt}$ is the real image, and $M_{in} \in \{0, 1\}$ is the input mask, in which 0 marks the masked area and 1 the complete area; $\odot$ denotes element-by-element multiplication.
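A direct translation of this loss, averaging the L1 error over the pixels (the averaging convention is our assumption):

```python
import torch

def reconstruction_loss(i_out, i_gt, m_in):
    """L_rec: L1 distance restricted to the hole region; m_in uses 1 = complete, 0 = masked."""
    hole = 1.0 - m_in
    return torch.mean(torch.abs(i_out * hole - i_gt * hole))
```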

3.3.3. Adversarial Loss

According to the characteristics of the network training, we utilize the relativistic average LS adversarial loss [6] to stabilize the training of the GAN:
$$\mathcal{L}_{adv} = -\mathbb{E}_{I_{gt}}\left[\log\left(1 - D_{ra}(I_{gt}, I_{out})\right)\right] - \mathbb{E}_{I_{out}}\left[\log\left(D_{ra}(I_{out}, I_{gt})\right)\right],$$
where $D_{ra}(\cdot)$ is defined as
$$D_{ra}(I_{gt}, I_{out}) = \mathrm{sigmoid}\left(C(I_{gt}) - \mathbb{E}_{I_{out}}\left[C(I_{out})\right]\right),$$
and $C(\cdot)$ denotes the discriminator without the final sigmoid function.
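A sketch of the generator-side relativistic average adversarial loss; the small epsilon for numerical stability and the batch-mean expectations are our additions.

```python
import torch

def d_ra(c_x, c_y_mean):
    """D_ra(x, y) = sigmoid(C(x) - E_y[C(y)]), with C(.) the pre-sigmoid critic output."""
    return torch.sigmoid(c_x - c_y_mean)

def relativistic_adversarial_loss(critic, i_gt, i_out, eps=1e-8):
    """L_adv = -E[log(1 - D_ra(I_gt, I_out))] - E[log(D_ra(I_out, I_gt))]."""
    c_gt, c_out = critic(i_gt), critic(i_out)
    real_vs_fake = d_ra(c_gt, c_out.mean())
    fake_vs_real = d_ra(c_out, c_gt.mean())
    return -(torch.log(1.0 - real_vs_fake + eps).mean() +
             torch.log(fake_vs_real + eps).mean())
```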

3.3.4. Perceptual Loss

Inspired by human perception, which receives not only the color and texture conveyed by the image surface but also the deep semantic information of the image, we adopt the perceptual loss [47] to guide the generator towards images more in line with human perception:
$$\mathcal{L}_{perc} = \mathbb{E}\left[\sum_{i} \left\| \Phi_i(I_{out}) - \Phi_i(I_{gt}) \right\|_1\right],$$
where $\Phi_i$ is the activation map of the $i$-th layer of a VGG-16 network pre-trained on ImageNet; the layers corresponding to $\Phi_i$ in this paper are relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1.
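A sketch of the feature extractor and loss, assuming the standard torchvision VGG-16 layer ordering for relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1 and ImageNet-normalized inputs.

```python
import torch
import torch.nn.functional as F
import torchvision

class PerceptualFeatures(torch.nn.Module):
    """Frozen VGG-16 activations at (assumed) indices for relu1_1 ... relu5_1."""
    LAYERS = (1, 6, 11, 18, 25)

    def __init__(self):
        super().__init__()
        self.vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, x):  # x: ImageNet-normalized RGB batch
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.LAYERS:
                feats.append(x)
        return feats

def perceptual_loss(extractor, i_out, i_gt):
    """L_perc: L1 distance between VGG activations of the output and the ground truth."""
    return sum(F.l1_loss(a, b) for a, b in zip(extractor(i_out), extractor(i_gt)))
```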

3.3.5. Style Loss

In addition to the perceptual loss, we adopt the style loss [47] to keep the style consistent with that of the original image:
$$\mathcal{L}_{style} = \mathbb{E}\left[\sum_{j} \left\| G_j^{\Phi}(I_{out}) - G_j^{\Phi}(I_{gt}) \right\|_1\right],$$
where $G_j^{\Phi}$ is the $C_j \times C_j$ Gram matrix constructed from the activation map $\Phi_j$, and the activation maps are the same as those used in the perceptual loss above.
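The style loss reuses the same activations through their Gram matrices; the 1/(C·H·W) normalization of the Gram matrix below is a common convention and our assumption here.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """C_j x C_j Gram matrix of an activation map of shape (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(extractor, i_out, i_gt):
    """L_style: L1 distance between Gram matrices of the perceptual-loss activations."""
    return sum(F.l1_loss(gram_matrix(a), gram_matrix(b))
               for a, b in zip(extractor(i_out), extractor(i_gt)))
```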

3.3.6. Joint Loss

In the coarse-stage network, the joint loss function is defined as
$$\mathcal{L}_{coarse} = \lambda_{fm} \mathcal{L}_{fm} + \lambda_{rec} \mathcal{L}_{rec} + \lambda_{adv} \mathcal{L}_{adv}.$$
In the fine-stage network, the joint loss function is defined as
$$\mathcal{L}_{fine} = \lambda_{rec} \mathcal{L}_{rec} + \lambda_{adv} \mathcal{L}_{adv} + \lambda_{perc} \mathcal{L}_{perc} + \lambda_{style} \mathcal{L}_{style},$$
where $\lambda_{fm}$, $\lambda_{rec}$, $\lambda_{adv}$, $\lambda_{perc}$, and $\lambda_{style}$ are hyperparameters. Following the experience in [6], we set $\lambda_{fm} = 1$, $\lambda_{rec} = 1$, $\lambda_{adv} = 0.1$, $\lambda_{perc} = 0.1$, and $\lambda_{style} = 250$.
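With the weights fixed as above, combining the terms is straightforward:

```python
# Loss weights as stated in the text (following [6]).
WEIGHTS = {"fm": 1.0, "rec": 1.0, "adv": 0.1, "perc": 0.1, "style": 250.0}

def coarse_loss(l_fm, l_rec, l_adv, w=WEIGHTS):
    """Joint objective of the coarse-stage network."""
    return w["fm"] * l_fm + w["rec"] * l_rec + w["adv"] * l_adv

def fine_loss(l_rec, l_adv, l_perc, l_style, w=WEIGHTS):
    """Joint objective of the fine-stage network."""
    return (w["rec"] * l_rec + w["adv"] * l_adv +
            w["perc"] * l_perc + w["style"] * l_style)
```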

3.3.7. Comparison of Loss Function

To visually demonstrate the impact of each loss function on image inpainting outcomes, we conducted a comparative experiment for each loss function. As can be seen from Figure 4, the absence of feature-matching loss leads to a deviation in the color and structural features of the object. The removal of reconstruction loss results in a failure of image inpainting. Adversarial loss is crucial for preserving the texture details of the image. Perceptual loss and style loss further enhance the visual effect.

4. Experiments

We validate the proposed method on two benchmark datasets: Places2 [48] and CelebA [49]. Damaged images were generated using the mask dataset of [30], which contains 12,000 randomly generated mask images. In the experiments, we used four ranges of mask area as contrast settings: 10∼20%, 20∼30%, 30∼40%, and 40∼50%. We compare against three strong models: GC [13], EC [8], and MED [6]. We randomly sample several categories of images from Places2 for the comparative experiments, with a 9:1 ratio of training set to test set.

4.1. Visual Evaluations

Figure 5 and Figure 6 show the visual comparison between the proposed method and the three models on the two benchmark datasets. The comparison on the Places2 dataset shows that GC [13] and EC [8], which also use two-stage generators, often produce incorrect semantics and overly abrupt structures, while MED [6], which uses a single-stage generator, exhibits problems such as fuzzy texture and inconsistent color. In Figure 5, GC [13], EC [8], and MED [6] fill a large damaged area at one time, which places excessive pressure on the generator and ultimately leads to unstable restorations, frequent semantic errors, and a lack of delicate structure. Our proposed method effectively alleviates these problems: the multi-step structure inpainting model reduces the pressure on the generator, the details of the reconstructed image are more refined, the texture and color fit reality, and the additional structural attention module fills in structural information better, greatly improving the reliability of the coarse-stage contour reconstruction. In Figure 5, the first and last rows demonstrate that our method can fill in more intricate textures, and the second and third rows show that it reconstructs the structural characteristics of buildings without structural errors. In Figure 6, the first and second rows show that our method restores more delicate facial features, and the third and fourth rows illustrate that it generates mouth details more realistically.
Experiments show that our multi-step inpainting model can effectively improve the stability of the generator and ensure that the generator can generate images with rich details. The attention mechanism based on the LBP operator can better extract the structural information of the image, strengthen the expression ability of the generator on structural features, and further ensure the inpainting effect in the rough inpainting stage.

4.2. Numerical Evaluations

In the numerical evaluation, we used PSNR, SSIM [50], and FID to measure our method against the other three methods. We divided the mask proportion into four groups: 10∼20%, 20∼30%, 30∼40%, and 40∼50%. Table 1 shows the comparison results on Places2 and Table 2 the results on CelebA. Higher PSNR and SSIM values indicate better inpainting results, and a lower FID value indicates higher similarity to the original images.
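For reference, PSNR and SSIM for a single image pair can be computed with scikit-image as sketched below (assuming a recent scikit-image with the `channel_axis` argument); FID is computed over whole image sets with a separate tool such as the pytorch-fid package.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt):
    """PSNR and SSIM for one restored / ground-truth pair of uint8 RGB images."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim
```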
The results in the tables demonstrate the superior performance of our model over the other methods on both the Places2 and CelebA datasets. The high PSNR and SSIM values indicate that the proposed method generates high-quality images whose structural features are more aligned with reality. The low FID values indicate that the multi-step inpainting model effectively stabilizes the generator and produces inpainted images that are more consistent with the originals. In summary, the multi-step inpainting model and the LBP-based attention mechanism effectively enhance the results of the coarse inpainting stage and minimize structural errors in the final inpainted image.

5. Ablation Study

In this section, we further validate the proposed method through ablation experiments. We randomly selected a subset of images from the Places2 dataset for the ablation study and conducted ablations on the multi-step structure inpainting model and the structural attention mechanism to demonstrate the effectiveness of each component.

5.1. Multi-Step Structure Inpainting Model

To isolate the effect of the method, we keep all other structures and parameters of the model fixed so that whether the multi-step inpainting model is enabled is the only variable, and we verify the proposed multi-step inpainting model on the same masks. As shown in Figure 7, when the entire damaged area is reconstructed at one time, it is difficult to guarantee the quality of the reconstructed structure: the resulting image is unstable, and the structure shows fractures and semantic errors. When the multi-step inpainting model is used, the visual effect is improved and the generated image conforms to human visual cognition.

5.2. Structural Attention Mechanism

In the ablation study of the structural attention mechanism, we likewise keep the other model structures and parameters fixed. Figure 8 compares the results with and without the structural attention mechanism. The figure shows that the structural attention mechanism enhances the structural inpainting of the image and improves the extraction of structural features, thereby improving the contour reconstruction.

6. Conclusions

This paper presents a multi-step structure inpainting model to solve the instability of the coarse-to-fine image inpainting model in its first phase. The multi-step structure inpainting model reduces the difficulty of image generation and improves the contour reconstruction in the coarse inpainting stage. In addition, we introduce a structural attention mechanism to extract richer structural information and enhance the expression of image structure. Experimental results on the benchmark datasets show that the proposed method is effective. In future work, we plan to improve the fine inpainting stage to fill in finer textures and more vivid colors.

Author Contributions

Conceptualization, C.R. and X.L.; methodology, C.R. and X.L.; validation, C.R. and X.L.; writing—original draft preparation, C.R.; writing—review and editing, C.R., X.L. and F.Y.; supervision, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of the Hebei Education Department (ZD2019131).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
  2. Yang, J.; Kannan, A.; Batra, D.; Parikh, D. Lr-gan: Layered recursive generative adversarial networks for image generation. arXiv 2017, arXiv:1703.01560. [Google Scholar]
  3. Xu, T.; Zhang, P.; Huang, Q.; Han, Z.; He, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1316–1324. [Google Scholar]
  4. Johnson, J.; Gupta, A.; Li, F.-F. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1219–1228. [Google Scholar]
  5. Yan, Z.; Li, X.; Li, M.; Zuo, W.; Shan, S. Shift-net: Image inpainting via deep feature rearrangement. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 1–17. [Google Scholar]
  6. Liu, H.; Jiang, B.; Song, Y.; Huang, W.; Yang, C. Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 725–741. [Google Scholar]
  7. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Thomas, S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5505–5514. [Google Scholar]
  8. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.Z.; Ebrahimi, M. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
  9. Ribeiro, H.D.M.; Arnold, A.; Howard, J.P.; Shun-Shin, M.J.; Zhang, Y.; Francis, D.P.; Lim, P.B.; Whinnett, Z.; Zolgharni, M. ECG-based real-time arrhythmia monitoring using quantized deep neural networks: A feasibility study. Comput. Biol. Med. 2022, 143, 105249. [Google Scholar] [CrossRef]
  10. Liu, Z.; Chen, Y.; Zhang, Y.; Ran, S.; Cheng, C.; Yang, G. Diagnosis of arrhythmias with few abnormal ECG samples using metric-based meta learning. Comput. Biol. Med. 2023, 153, 106465. [Google Scholar] [CrossRef]
  11. Hu, R.; Chen, J.; Zhou, L. A transformer-based deep neural network for arrhythmia detection using continuous ECG signals. Comput. Biol. Med. 2022, 144, 105325. [Google Scholar] [CrossRef]
  12. Chen, H.; Das, S.; Morgan, J.M.; Maharatna, K. Prediction and classification of ventricular arrhythmia based on phase-space reconstruction and fuzzy c-means clustering. Comput. Biol. Med. 2022, 142, 105180. [Google Scholar] [CrossRef]
  13. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4471–4480. [Google Scholar]
  14. Wu, Z.; Shen, S.; Lian, X.; Su, X.; Chen, E. A dummy-based user privacy protection approach for text information retrieval. Knowl.-Based Syst. 2020, 195, 105679. [Google Scholar] [CrossRef]
  15. Wu, Z.; Xuan, S.; Xie, J.; Lin, C.; Lu, C. How to ensure the confidentiality of electronic medical records on the cloud: A technical perspective. Comput. Biol. Med. 2022, 147, 105726. [Google Scholar] [CrossRef]
  16. Wu, Z.; Shen, S.; Li, H.; Zhou, H.; Lu, C. A basic framework for privacy protection in personalized information retrieval: An effective framework for user privacy protection. J. Organ. End User Comput. (JOEUC) 2021, 33, 1–26. [Google Scholar] [CrossRef]
  17. Wu, Z.; Li, G.; Shen, S.; Lian, X.; Chen, E.; Xu, G. Constructing dummy query sequences to protect location privacy and query privacy in location-based services. World Wide Web 2021, 24, 25–49. [Google Scholar] [CrossRef]
  18. Wu, Z.; Shen, S.; Zhou, H.; Li, H.; Lu, C.; Zou, D. An effective approach for the protection of user commodity viewing privacy in e-commerce website. Knowl.-Based Syst. 2021, 220, 106952. [Google Scholar] [CrossRef]
  19. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424. [Google Scholar]
  20. Shen, J.; Chan, T.F. Mathematical models for local nontexture inpaintings. SIAM J. Appl. Math. 2002, 62, 1019–1043. [Google Scholar] [CrossRef] [Green Version]
  21. Shen, J.; Kang, S.H.; Chan, T.F. Euler’s elastica and curvature-based inpainting. SIAM J. Appl. Math. 2003, 63, 564–592. [Google Scholar] [CrossRef]
  22. Tsai, A.; Yezzi, A.; Willsky, A.S. Curve evolution implementation of the Mumford-Shah functional for image segmentation, denoising, interpolation, and magnification. IEEE Trans. Image Process. 2001, 10, 1169–1186. [Google Scholar] [CrossRef] [Green Version]
  23. Drori, I.; Cohen-Or, D.; Yeshurun, H. Fragment-based image completion. In ACM SIGGRAPH 2003 Papers; Assoc Computing Machinery: New York, NY, USA, 2003; pp. 303–312. [Google Scholar]
  24. Criminisi, A.; Pérez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212. [Google Scholar] [CrossRef]
  25. Wilczkowiak, M.; Brostow, G.J.; Tordoff, B.; Cipolla, R. Hole filling through photomontage. In Proceedings of the British Machine Vision Conference, BMVC 2005, Oxford, UK, 5–8 September 2005. [Google Scholar]
  26. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2536–2544. [Google Scholar]
  27. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. (ToG) 2017, 36, 1–14. [Google Scholar] [CrossRef]
  28. Yang, C.; Lu, X.; Lin, Z.; Shechtman, E.; Wang, O.; Li, H. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6721–6729. [Google Scholar]
  29. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning; PMLR: London, UK, 2017; pp. 214–223. [Google Scholar]
  30. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
  31. Liu, H.; Jiang, B.; Xiao, Y.; Yang, C. Coherent semantic attention for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4170–4179. [Google Scholar]
  32. Zhu, M.; He, D.; Li, X.; Chao, L.; Zhang, Z. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Trans. Image Process. 2021, 30, 4855–4866. [Google Scholar] [CrossRef]
  33. Ren, Y.; Ren, H.; Shi, C.; Zhang, X.; Wu, X.; Li, X.; Mumtaz, I. Multistage semantic-aware image inpainting with stacked generator networks. Int. J. Intell. Syst. 2022, 37, 1599–1617. [Google Scholar] [CrossRef]
  34. Gilbert, A.; Collomosse, J.; Jin, H.; Price, B. Disentangling structure and aesthetics for style-aware image completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1848–1856. [Google Scholar]
  35. Wang, Y.; Tao, X.; Qi, X.; Shen, X.; Jia, J. Image inpainting via generative multi-column convolutional neural networks. arXiv 2018, arXiv:1810.08771. [Google Scholar]
  36. Vo, H.V.; Duong, N.Q.K.; Pérez, P. Structural inpainting. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1948–1956. [Google Scholar]
  37. Abbas Hedjazi, M.; Genc, Y. Learning to inpaint by progressively growing the mask regions. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  38. Yu, T.; Guo, Z.; Jin, X.; Wu, S.; Chen, Z.; Li, W. Region normalization for image inpainting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12733–12740. [Google Scholar]
  39. Wang, Y.; Chen, Y.C.; Tao, X.; Jia, J. Vcnet: A robust approach to blind image inpainting. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 752–768. [Google Scholar]
  40. Sun, Q.; Ma, L.; Oh, S.J.; Van Gool, L. Natural and effective obfuscation by head inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5050–5059. [Google Scholar]
  41. Song, Y.; Yang, C.; Lin, Z.; Liu, X.; Huang, Q.; Li, H.; Kuo, C. Contextual-based image inpainting: Infer, match, and translate. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  42. Wang, T.; Ouyang, H.; Chen, Q. Image inpainting with external-internal learning and monochromic bottleneck. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5120–5129. [Google Scholar]
  43. Zeng, Y.; Lin, Z.; Lu, H.; Patel, V.M. Cr-fill: Generative image inpainting with auxiliary contextual reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 19–25 June 2021; pp. 14164–14173. [Google Scholar]
  44. Guo, X.; Yang, H.; Huang, D. Image inpainting via conditional texture and structure dual generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 19–25 June 2021; pp. 14134–14143. [Google Scholar]
  45. Yu, J.; Li, K.; Peng, J. Reference-guided face inpainting with reference attention network. Neural Comput. Appl. 2022, 34, 9717–9731. [Google Scholar] [CrossRef]
  46. Li, L.; Chen, M.; Shi, H.; Duan, Z.; Xiong, X. Multiscale Structure and Texture Feature Fusion for Image Inpainting. IEEE Access 2022, 10, 82668–82679. [Google Scholar] [CrossRef]
  47. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar]
  48. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1452–1464. [Google Scholar] [CrossRef] [Green Version]
  49. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  50. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Generator structure diagram. The upper part is the coarse inpainting stage and the structural attention module, and the lower part is the fine inpainting stage. Note that the final reconstructed image of the coarse inpainting stage is the result of four rounds of reconstruction.
Figure 2. Multi-step structure model flowchart. The rough inpainting stage prioritizes structural reconstruction, and therefore gray-scale images are used as input. The occlusion area is restored in four steps, with the inpainting order determined by the priority calculation. The inpainting process begins with the lower right corner area, which has the highest priority, followed by the upper left corner area with the second highest priority. The final two steps follow the same pattern. The result of the rough inpainting stage is denoted $C_{out}$.
Figure 3. Multi-step structural reconstruction process. (a) is the masked image, (b–d) show the inpainting results of the first to third steps, and (e) shows the final structural reconstruction result.
Figure 4. Visual comparison of loss functions. (a) Input images. (b) W/O feature-matching loss. (c) W/O reconstruction loss. (d) W/O adversarial loss. (e) W/O perceptual loss. (f) W/O style loss. (g) Result w/ joint loss.
Figure 5. Comparison on the Places2 dataset. (a) Input images. (b) GC [13]. (c) EC [8]. (d) MED [6]. (e) Ours. (f) Ground truth.
Figure 6. Comparison on the CelebA dataset. (a) Input images. (b) GC [13]. (c) EC [8]. (d) MED [6]. (e) Ours. (f) Ground truth.
Figure 7. Ablation study on the multi-step structure inpainting model. (a) Input image. (b) Ours w/o multi-step structure inpainting. (c) Ours. (d) Ground truth.
Figure 8. Ablation study on the structural attention mechanism. (a) Input image. (b) Ours w/o structural attention. (c) Ours. (d) Ground truth.
Table 1. Comparison on the Places2 dataset.

Metric   Mask     GC       EC       MED      Ours
PSNR↑    10∼20    26.580   28.230   28.909   29.269
         20∼30    22.551   25.115   25.235   26.314
         30∼40    16.485   18.520   19.936   20.260
         40∼50    15.130   17.433   17.852   18.312
SSIM↑    10∼20    0.909    0.934    0.941    0.946
         20∼30    0.754    0.881    0.893    0.903
         30∼40    0.652    0.706    0.719    0.731
         40∼50    0.560    0.679    0.698    0.708
FID↓     10∼20    24.751   22.947   22.022   18.465
         20∼30    32.259   31.518   29.063   26.973
         30∼40    46.207   45.170   42.570   38.585
         40∼50    62.524   59.960   59.179   54.080
Table 2. Comparison on the CelebA dataset.

Metric   Mask     GC       EC       MED      Ours
PSNR↑    10∼20    27.979   30.473   30.585   30.946
         20∼30    20.459   23.179   27.579   27.614
         30∼40    17.658   19.973   20.467   21.056
         40∼50    16.209   18.216   19.243   19.420
SSIM↑    10∼20    0.742    0.907    0.926    0.934
         20∼30    0.694    0.862    0.883    0.892
         30∼40    0.607    0.744    0.765    0.774
         40∼50    0.554    0.628    0.733    0.739
FID↓     10∼20    20.580   19.159   19.491   16.264
         20∼30    31.472   29.738   27.071   26.207
         30∼40    50.485   47.652   43.357   43.067
         40∼50    62.975   62.721   58.946   57.700