TMCrack-Net: A U-Shaped Network with a Feature Pyramid and Transformer for Mural Crack Segmentation

Wu, Meng; Jia, Min; Wang, Jia

doi:10.3390/app122110940

Open AccessArticle

TMCrack-Net: A U-Shaped Network with a Feature Pyramid and Transformer for Mural Crack Segmentation

by

Meng Wu

^1,*

,

Min Jia

¹ and

Jia Wang

²

¹

School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China

²

Shaanxi History Museum, Xi’an 710061, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(21), 10940; https://doi.org/10.3390/app122110940

Submission received: 18 September 2022 / Revised: 18 October 2022 / Accepted: 24 October 2022 / Published: 28 October 2022

Download

Browse Figures

Versions Notes

Abstract

:

The detection of crack information is very important in mural conservation. In practice, the number of ancient murals is scarce, and the difficulty of collecting digital information about murals leads to minimal data being collected. Crack information appears in pictures of paintings, which resembles painting traces and is easy to misidentify. However, the current mainstream semantic segmentation networks directly use the features of the backbone network for prediction, which do not fully use the features at different scales and ignore the differences between the decoder and encoder features. This paper proposes a new U-shaped convolutional neural network with feature pyramids and a transformer called TMCrack-net. Instead of U-Net’s jump-join, an AG-BiFPN network is used, which consists of two modules: a channel cross-fusion (CCT) module with a transformer and a bidirectional feature pyramid network. While fully using the information in different network dimensions, the channel cross-fusion module optimizes the final features of each layer to reduce the confounding effect caused by the fusion of features. For the fusion of multi-scale channel information with decoder features, we designed a fusion module based on a channel attention (called FCA) to guide the fusion of enhanced encoder features with decoder features and reduce the ambiguity between the two feature sets. To demonstrate the effectiveness and generalization of the model, TMCrack-Net was evaluated on the Tang Dynasty tomb chamber mural dataset and Crack500. MIou values of 0.7731 and 0.7944 were achieved, respectively, which are better than those of other advanced crack detection methods. The method yields accurate segmentation performance and is advantageous for mural painting crack segmentation tasks.

Keywords:

murals; crack segmentation; U-Net; BiFPN; FCA

1. Introduction

Ancient murals are regarded as valuable cultural heritage, reflecting the social landscape of the times they are from. They provide a valuable basis for studying ancient cultures because of their scientific, historical, and humanistic value. However, under the influence of natural and artificial factors, many murals [1] have suffered from problems such as cracks, puckering, and scratches, which weaken the information expression of frescoes and are not conducive to preserving history and culture, or the transmission of the art.

Screen cleaning, reinforcement, paper, cloth stickers, disease analysis, expert evaluation, and disease labeling [2] are all involved in the manual disease labeling of ancient murals. A manual method will not only cause secondary damage to the murals but will also be very inefficient. Therefore, image processing technology must be introduced to help heritage conservationists protect murals.

Accurate segmentation of cracks in mural images is challenging, and mural paintings have complex background structures. The painting of the polo picture in the tomb of Prince Zhang Huai in Tang Dynasty is rich in content; the painting has different backgrounds of people, horses, trees among mountains and rocks; and the complex image content increases the difficulty of crack recognition. Cracked images are shown in Figure 1.

Among the crack detection algorithms of traditional image processing, Hou proposed a K-means Sobel algorithm to extract the disease edges of murals. Two evaluation [3] criteria were given for the specificity of murals: disease recognition rate and edge continuity. Subsequently, his team [4] used geographic information system technology (GIS) to produce high-precision digital orthophoto maps (DOM) of murals, establish a spatial database of diseases, obtain disease maps and trend maps, obtain the locations and severity of diseases through spatial analysis and spatial statistics, and use decision trees to classify disease classes. Gancarczyk [5] applied data mining techniques to image segmentation techniques to extract crack information with unsatisfactory results. Cornelis [6] proposed a multi-scale high-hat transform and K-SVD method and a post-processing method based on semi-supervised clustering to eliminate mislabeling, and weighted fusion of the three methods was performed to calibrate crack information of different sizes and brightnesses. Two types of methods are described below: (1) manual depiction using GIS software; (2) image processing using edge detection and multivariate filtering. Though both methods involve non-contact extraction, they are mostly human–computer interactions and semi-automatic methods requiring improved accuracy.

With the development of computational vision and deep learning, different neural networks are used for mural painting disease detection. Lin [7] used hyperspectral images for disease region identification, using minimum noise fraction (MNF) to focus on different bands and reduce the effect of noise on the data. The mural images were classified into several types of damaged areas and normal areas by back propagation (BP) neural network. Yu [8] used a U-Net network with multi-scale details to detect paint peeling disease in mural paintings of Fengguo Temple of Liao Dynasty in Yi County, Jinzhou City, China. The shallow features of the wall paintings were injected into different stages of the encoder to obtain detailed information about the paint peeling disease, and attention mechanisms were added to suppress invalid features in the encoding stage. Wu [9] proposed the Ghost-C3SE YOLOv5 network to detect the damage to cave murals. The YOLOv5 network structure was adjusted to reduce the dimensionality of the convolutional layers. An attention mechanism was added to the backbone network to adjust the importance of different feature channels.

This paper proposes the TMCrack-Net model using the ConvNext network as the encoder skeleton. Bottom-up and top-down fusion of the extracted features take place using feature fusion networks with attention. Circulation of semantic and detailed information through the feature network. Each layer’s final features are optimized using multi-headed self-attention to reduce the confounding effect of feature fusion. The enhanced features are guided to fuse with the decoder’s features using the FCA module to reduce the ambiguity between the two features:

1.: We created a Tang Dynasty tomb chamber mural dataset, the TMCrack database, to implement the segmentation model, which includes images of the real tomb chamber mural and the corresponding crack label images.
2.: Based on the U-Net algorithm, we proposed TMCRack-Net with BiFPNet and a transformer for fresco crack segmentation. Instead of jumping joins in U-Net, a feature fusion network is used to fully use features at different scales to enhance the crack detail information. An attention mechanism was added after the feature fusion network to enhance the weights of useful features and reduce the influence of irrelevant features.
3.: To better fuse encoder and decoder features, we designed a fusion module based on channel attention (called FCA) to calculate the similarity matrix between encoder and decoder features, enhance each feature with a strong correlation with at least one other, suppress the features with weak correlations, and reduce the ambiguity between the two features.

2. Related Work

Different CNN models are used in crack segmentation methods. The effectiveness of CNNs in crack segmentation has been demonstrated in a large number of experiments.

2.1. Feature Pyramid

Multi-scale images are one of the most challenging aspects of computer vision, and most networks directly use the features extracted from the backbone network to predict the target. This pioneering work [10] proposed a top-down approach to multi-scale features using the feature pyramid network (FPN). Following this idea, PANet [11] adds a bottom-up path aggregation network on top of FPN to further exploit the different information between the bottom and top-level features. NAS-FPN [12] uses neural architecture searching to design feature network topologies automatically but requires thousands of hours of GPU time and makes it difficult to interpret the network structure. BiFPN [13] adds a bi-directional scale join based on removing nodes with only one input edge and adding a residual structure to join the original junction and the output node. Each of the four network structures is shown in Figure 2.

The fusion of features at different scales achieved good results using all of the methods above, but feature fusion leads to confounding effects [14], along with bias towards local interactions. By combining the CCT module, we optimize features across multiple scales after feature fusion to reduce confounding effects and establish long-range dependencies.

2.2. Vision Transformer

The transformer [15] program has achieved great success in the natural language processing domain, and many researchers have applied the program to tasks related to computer vision. A pure transformer vision model (Vision Transformer, ViT) was proposed by Dosovitskiy [16], which divides, convolves, and spreads images into two-dimensional vectors to solve the transformer’s input problem in the image domain. Liu [17] proposed the Swin-Transformer, which uses a hierarchical construction method similar to that used in convolutional neural networks: chunking the feature maps to reduce computational effort and using spatially shifted bit windows to facilitate information exchange between different windows, thus improving the model’s ability to extract local information. The state-of-the-art level of ImageNet classification is achieved. Inspired by Swin-Transformer, Cao [18] proposed Swin-Unet, the first purely transformer-based U-shaped structure, which uses Swin-Transformer to replace the convolution module in U-Net. Wang [19] proposed UCTransNet, using a CTrans (channel transformer) module instead of skip connections in U-Net. The CTrans module consists of multi-scale channel cross-fusion (called CCT) and channel cross-attention (called CCA). The CCT module performs cross-fusion of multi-scale features through the transformer. The CCA module directs the fusion of the fused multi-scale features with the decoded features through attention.

The Swin-Unet algorithm requires a large amount of data to obtain better performance. It lags behind the convolutional network when small amounts of data are available. UCTransNet directly uses the features of the backbone network as input to the transformer. It does not take full advantage of the underlying detailed information. This paper uses the pure convolutional network ConvNext as the backbone to extract features that better fit small datasets. The information at different scales is better utilized using a feature fusion network.

3. Proposed Methods

3.1. Model Architecture

TMCrack-Net consists of three main modules: (1) Top-down extraction of crack features using the ConvNext network as the backbone of the U-net network. (2) The AG-BiFPN module includes BiFPN and CCT, a feature fusion network, to exploit the multi-scale features of the image better. (3) An attention fusion module to better enhance useful features and suppress invalid ones. Figure 3 shows the TMCrack-Net with BiFPNet, Transforme, and FCA.

In this paper, we consider the crack segmentation task of mural images as a pixel-level semantic segmentation problem, where 0 represents “background pixels” and 1 represents “crack pixels”. Compared with natural images, the number of mural images is small, and the background information is complex, so the feature extraction ability of the backbone network directly affects the segmentation effect. Our backbone is the ConvNext [20] network proposed by Facebook AI Research and the University of California, Berkeley, which optimizes ResNet to reach the performance limits of pure convolutional models. As a pure convolutional network, it achieves 87.8% TOP-1 accuracy on the ImageNet dataset while maintaining the simplicity and efficiency of standard convolutional networks.

ConvNext consists of downsample blocks and ConvNext blocks, repeatedly stacked. A ConvNext block consists of depthwise convolution, and ordinary Conv is combined, using the GELU function as the activation function. A downsample block consists of layer norm and 2 × 2 convolution. Figure 4 shows the structures of downsample and ConvNeXt blocks.

With the small number of mural images and complex background information, low-level information and high-level semantic information are particularly important for crack segmentation. Using only the backbone network to extract features from the images does not make good use of the channel information at different scales, and a feature fusion network needs to be added to recycle the multi-scale features, but feature fusion confuse the localization of cracks in the images. Attention mechanisms need to be used to optimize the feature pyramid, and using attention computation at each layer of the pyramid is costly. Therefore, we introduced the CCT [19] module. By analyzing the information provided by each element layer, CCT can automatically determine the importance of different channels, increase the weights of channels that contribute to segmentation, and suppress channels that do not. Due to the ambiguity between the features of the codecs, in this paper, we design an FCA module based on a self-attentive model to filter the channel information and better fuse the features between them.

3.2. AG-BiFPN Module

UCTransNet demonstrates that not all jump-joins are beneficial for segmentation, and simple replication of features at the encoder stage can be detrimental to feature fusion, so more suitable feature fusion methods are needed to join the encoder and decoder.

In order to better utilize the features extracted from the backbone network, we invoke a combination of BiFPN and CCT modules instead of the original hop-join in the U-net. By fusing feature information at different scales through the BiFPN network, the input image retains a richer level of detail contained in the shallow features, better utilizes local information, and focuses on extracting crack details.

BiFPN was proposed by Google Research, based on the bi-directional scale join. It removes the nodes with only one input edge, which do not perform feature fusion and contribute little to the network, and add the residual structure to join the original junction with the output node, which can fuse more features. The BiFPN infrastructure is shown in Figure 2d.

The single-use feature fusion network with only a convolutional structure is biased toward local interaction and brings some confounding effect, which interferes with recognizing cracks. By adding an attention mechanism after BiFPN, as shown in Figure 5, the CCT module can better emphasize the global information and alleviate the confounding effect at each layer of the pyramid.

CCT computes the weights of different feature layers through a multi-headed self-attentive mechanism and sums the weights of each layer with the corresponding feature layer to establish global relationships and suppress invalid features. The implementation of CCT can be expressed as follows:

1.: Five features $E_{i_{-} out}, (i = 1, 2, 3, 4, 5)$ are obtained by feature fusion. Firstly, the features are divided into a series of patches, and the number of feature channels is kept constant. The patches are flattened and given location information to obtain tokens of different scales as $T_{i} (1, 2, 3, 4, 5)$ , and the four tokens are combined to obtain $T_{C} = C o n c a t (T_{1}, T_{2}, T_{3} T_{4}, T_{5})$ .
2.: Tokens are fed into the multi-headed self-attentive mechanism, with five $T_{i}$ as the query and $T_{C}$ as the key and value:

$Q_{i} = T_{i} M_{Q_{i}}, K = T_{c} M_{K}, V = T_{c} M_{V}$

(1)

where $M_{Q}$ , $M_{K}$ , and $M_{V}$ are the weights of the different input features, and the weights $W_{v i}$ for each layer of features are obtained by $Q_{i}$ and K. The formula for $W_{v i}$ is:

$\begin{matrix} W_{v i} & = softmax [σ (\frac{Q_{\tilde{i}}^{T} K}{\sqrt{C_{Σ}}})] \\ = softmax [σ (\frac{{(T_{i} M_{Q})}^{T} T_{C} M_{K}}{\sqrt{C_{Σ}}})] \end{matrix}$

(2)

where $σ$ represents the instance normalization [21], $C_{Σ} = \sum_{i = 1}^{N} C_{i}, (i = 1, 2, 3, 4, 5)$ denotes the sum of the number of channels of the input features. In our implementation, $C_{1} = 96$ , $C_{2} = 192$ , $C_{3} = 384$ , $C_{4} = 768$ , and $C_{5} = 768$ , values weighted by $W_{v i}$ :

$A t_{i} = W_{v i} V^{T}$

(3)

In the case of N head attention, the output after multi-head cross-attention is calculated as follows:

$M A t_{i} = (A t_{i}^{1} + A t_{i}^{2} + A t_{i}^{3} +, \dots, A t_{i}^{N}) / N$

(4)

where N is the number of heads. A series of experiments on layers 1, 2, 3, 4, and 5 of CCT in the implementation found that the best performance can be achieved on the tomb mural dataset with one layer and six heads of CCT. The following output is obtained by using MLP and residual join:

$O_{i} = M A t_{i} + M L P (Q_{i} + M A t_{i})$

(5)

Finally, the five outputs $O_{1}$ , $O_{2}$ , $O_{3}$ , $O_{4}$ , $O_{5}$ are reconstructed by up-sampling through convolution, and the corresponding decoder features are concatenated.

3.3. Decoder with FCA Module

We designed a feature fusion module with attention guidance to better fuse the two features. Figure 6 illustrates the FCA module. This module does not perform dimensionality reduction operations. ECA-Net [22] found that dimensionality reduction adversely affects the attention paid to the learning channel.

After encoder feature enhancement, the FCA module first compresses the encoded features

O_{i}

and decoded features

D_{i}

using the global average pooling layer to obtain global spatial information. Secondly, global attention is obtained by the spreading operation with a linear layer. Again, the encoder

O_{i}^{'}

and decoder

D_{i}^{'}

features with attention are obtained using the spreading operation. Finally,

O_{i}^{'}

and

D_{i}^{'}

are multiplied to get the similarity matrix

S_{i}^{'}

, which is normalized by the softmax function to get the weights of different features, enhance the features with strong correlations, suppress the features with weak correlations, and eliminate the ambiguity existing between the encoder and decoder to some extent.

As shown in Figure 6, in FCA, we obtain the global spatial information by spatially compressing the features

O_{i}

and

D_{i}

. Space compression is performed by space compression by the global average pool (GAP) layer:

G (O_{i}) = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} O_{i}^{k} (i, j)

(6)

G (D_{i}) = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} D_{i}^{k} (i, j)

(7)

where k is the table k-th layer channel, and i is the i-th layer. Then, to improve the model fitting ability, the features are linearly varied to obtain trainable vectors

l (o_{i})

and

l (d_{i})

. The linear variation is implemented through the linear layer:

L (G (O_{i})) = G (O_{i}) A_{1}^{T} + B_{1}

(8)

L (G (D_{i})) = G (D_{i}) A_{2}^{T} + B_{2}

(9)

After restoring the original dimensions of

l (o_{i})

and

l (d_{i})

, the two dimensions of H and W are flattened to obtain

O_{i}^{'}

and

D_{i}^{'}

, and the similarity between them is calculated to obtain the weight matrix

S_{i}^{'}

. The specific operation is:

O_{i}^{'} = F latten (unsqueeze (l_{o_{i}}))

(10)

D_{i}^{'} = F latten (unsqueeze (l_{d_{i}}))

(11)

S_{i}^{'} = O_{i}^{'} \cdot D_{i}^{' T}

(12)

The attention mask is generated by normalizing these weights using the softmax function for the similarity matrix

S_{i}^{'}

. After decompressing

S_{i}

, it is multiplied by

O_{i}

and

{\hat{O}}_{i}

is obtained by the RELU activation function as follows:

S_{i} = φ (unsqueeze (S_{i}^{'}))

(13)

{\hat{O}}_{i} = ψ (O_{i} \cdot S_{i})

(14)

where

φ (\cdot)

represents the softmax function and

ψ (\cdot)

represents the Relu function. Finally, merge

F \hat{_{i}}

with the corresponding upsampling features.

4. Experiments and Results

4.1. Implementations Details

We implemented our model using PyTorch on an NVIDIA GeForce RTX 3090 GPU card with 24 GB of RAM. To avoid overfitting, we performed data enhancement, including horizontal flipping, vertical flipping, and random rotation. Pre-trained weights trained on ImageNet using ConvNext-T were used to initialize the backbone of the model. The input resolution of the dataset was set to 224 × 224. In this study, we used an Adam optimizer to train our model with batch size set to 8, initial learning rate set to 1 × 10

^{- 4}

, and cross-entropy loss as our loss function to train our network.

4.2. Evaluation Criteria

Six evaluation metrics were used to verify the performance of the model, including precision rate (P), regression rate (R), F-score (

F_{1}

), mean intersection over union (

M I o u

), DiceScore (

D i c e

), and Jaccard Distance (

J a c c a r d

). Equations are as follows:

P = \frac{T P}{T P + F P}

(15)

R = \frac{T P}{T P + F N}

(16)

F_{1} = \frac{2 P R}{P + R}

(17)

M I o u = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{T P}{F N + F P + T P}

(18)

D i c e = \frac{2 T P}{F P + 2 T P + F N}

(19)

J a c c a r d = \frac{T P}{T P + F P + F N}

(20)

where K represents the number of categories,

T P

represents positive samples which have been correctly classified as positive, and

F P

represents false positive samples which have been incorrectly classified as positive samples.

F N

represents false negative samples which have been incorrectly classified as negative samples. Precision rate (P) is the probability that all predicted positive samples are positive samples. Recall that (R) is the probability that the original sample is predicted to be a positive sample among positive samples. F-Score is a balanced value when both accuracy and recall are considered simultaneously, and the highest value is taken when both are reached simultaneously. In the problem of semantic segmentation, the intersection and merge ratio is the ratio of the intersection and merging of the true labels and predicted values of the class. The dice coefficient and Jaccard distance represent the similarity between the true and predicted samples.

4.3. Dataset

Mural data for the Tang Zhang Huai Prince tomb polo: the whole picture is sparse, staggered, and very successful at presenting harmonious rhythmic beauty. The detailed portrayal of people and horses and the rude outline of ancient trees in the mountains and rocks in the picture can give people a sense of ancient and elegant beauty. This picture truly reflects an actual polo match in the Tang Dynasty and provides informative physical data for the study of polo in the Tang Dynasty, which is of great historical value. The polo picture from the tomb of Prince Zhang Huai of the Tang Dynasty was selected for crack disease extraction experiments to provide a reference for the conservation and restoration of the mural.

The field data were acquired using a Swiss Sinar P2 large format technology camera and a Sinar 75LV digital back, with a color gamut model of Adobe RGB. Contact type split shot parallel acquisition. The frescoes were divided into fresco chunks with a pixel size of

224 \times 224

format, and 700 fresco samples were selected and named TMCrack. See Figure 7.

The images were labeled using the LabelMe tool, and the corresponding image label mapping was obtained after subsequent conversion processing. The images in the dataset were randomly segmented, including 60% as the training set, 10% as the validation set, and 30% as the test set. Table 1 shows the settings of the dataset.

To verify the generalization of the model, we used Crack500 [23] for training and testing. Crack500 was taken by Yang et al. using a cell phone at the main campus of Temple University, USA. It has a total of 500 crack images, and it includes labeled images corresponding to semantic segmentation. The dataset is divided into training images, validation images, and test images, and the ratio is 5:1:4.

4.4. Performance Comparison of the ConvNext with Different Version

ConvNeXt was designed in different variants, ConvNeXtT/S/B/L. The variants differ only in the number of channels and the number of block repetitions in each stage. To better select the backbone network, we used different variants as the backbone of the u-net network and tested the data in the same configuration. The results are shown in Table 2.

The channel represents the number of input channels in the four stages, and the stage represents the number of times each stage is repeatedly stacked with blocks.

4.5. Performance Comparison of the CCT with Different Numbers of Attention Heads and Layers

A comparative experiment was conducted with the other unchanged modules to determine the number of attention heads in the CCT module. As shown in Table 3, as the number of attention heads increased, the MIou value increased, and as the number of attention heads increased to six, the performance did not increase and appeared to decrease, and as it continued to increase, the performance then decreased.

Keeping the other modules and parameters constant, the effect of the number of CCT modules in the model was tested, as shown in Table 4, and degradation in performance occurred when two more layers were added. MIou decreased as the number of layers increased. The final choice was to use one layer in the network.

4.6. Other Methods

1.: U-Net: U-Net uses an encoding–decoding structure to build a U-shaped network. It uses convolutional and pooling layers to extract features and contextual information and can maintain a certain performance level on a small amount of data by linking contextual information through hopping.
2.: PSPnet: PSPnet is designed with a pyramid pooling module (PPM) to obtain global information at different scales by using global pooling at different scales and fusing the obtained global features with the original input features. The feature map carries both local and global contextual information.
3.: DeepLab v3+ obtains feature information of different perceptual fields by atrous convolutions, uses spatial pyramid pooling (SPP) to fuse feature information under different perceptual fields, and uses an encoder–decoder structure to recover spatial information to obtain target boundaries.
4.: UNet++ [24]: U-Net++ is an improved jump-join method for U-Net networks. It uses nested and dense jump-joins to flexibly fuse features at different scales, compensate for semantic loss between the encoder and decoder, and enhance segmentation capabilities.
5.: Swin-Unet: Swin-Unet is the first pure transformer-based u-architecture; both encoder and decoder are built based on the Swin-transformer block. The extracted contextual features are upsampled by the decoder with the patch extension layer, and the spatial resolution of the feature map is recovered for segmentation prediction by fusing with the multi-scale features of the encoder through jump connections.
6.: UCTransNet: UCTransNet proves that not all jump-joins are powerful for the segmentation task and enhances the effective feature layer by computing the importance of features at different scales via the transformer. An attention mechanism is used to guide the feature fusion between encoder and decoder, reducing the semantic gap between them.

4.7. Experimental Results

Results on TMCrack: We compare the proposed models with the models in Section 4.6 using MPA, recall,

F_{1}

, MIou, dice score, and Jaccard on input images of size 224 × 224. All models were trained and tested on the TMCrack dataset, and the dataset settings were kept consistent. The results are listed in Table 5.

Table 5 shows that our network achieved an accuracy of 0.8649, a recall of 0.8526, an

F_{1}

value of 0.8587, an MIou of 0.7731, a dice value of 0.7376, and a Jaccard value of 0.5842 on the TMCrack test set. All six of these metrics are better than those of the other methods.

Figure 8 shows the crack prediction results of different methods on the TMCrack test dataset. We tested the segmentation performance of various networks on simple and complex crack images. For cracks with simple or obvious precise shapes, most of them were detected in almost all tested models. For images with complex shapes and obscure cracks, TMCRack-Net had better performance. The PR curves of TMCrack-Net and other methods are shown in Figure 9a, which were used to evaluate the generalization level and implementation of each model from a qualitative point of view, and it can be seen from the figure that TMCrack-Net outperformed the other methods, segmenting more cracks and maintaining the accuracy of the locations.

Results on Crack500: To verify the robustness of the model and the effectiveness of the improved method, the Crack500 dataset, widely used for crack detection, was selected. The improved model was retrained on Crack500 and compared with the model in Section 4.6. The results are shown in Table 6.

Figure 9b shows that TMCrack-Net achieved the highest precision–recall curve, and had better generalization level and performance, as shown in Table 5. Its precision rate was 0.8809, recall rate was 0.8685,

F_{1}

value was 0.8747, MIou was 0.7944, and Dice value was 0.7641 on the Crack500 test dataset.

Figure 10 illustrates the crack image prediction results from the test set. Figure 10 shows the images where the crack shapes vary from simple to complex to reflect the effect of crack shape on the detection of different models. It can be seen from the results that TMCRack-Ne has a better global grasp of the segmented cracks due to having more complete shapes compared to other segmentation networks.

4.8. Model Complexity

We used the number of floating point operations (FLOPs) and frames per second (FPS) to measure the time complexity of the different models; the smaller the FLOPS, the smaller the computational demand of the model, and the FPS indicates the speed of detecting cracked images per second. Params are used to measure the spatial complexity of different models, and the parameter size indicates the memory size occupied by the model, and the results are shown in Table 7.

Table 7 presents the FLOPS, FPS, and PARAMS of all methods based on the TMCracks dataset. The detection speed of the pure convolutional model was better than that of the model with a transformer, and DeepLab V3+ had the fastest detection speed among all methods, at 50 FPS. The detection speed of this model was 25 FPS, meaning it can process 25 crack images per second, and it has a smaller computational volume with less memory usage than the model with a pure transformer. TMCrack-Net has more parameters than most of these methods, so the model proposed in this paper also has some limitations.

4.9. Ablation Experiments

To investigate the effectiveness of BiFPN, CCT, and FCA in TMCRack-Net networks, we assumed that the U-net with ConvNeXt as the backbone was the baseline, and all parameters were the same except for the combination of each module and ConvNeXt.

As shown in the Table 8, the best performance of our approach was achieved with the addition of the three modules, which indicates that the improvements are cumulative.

By explaining which regions the model focuses on and visualizing them through Grad-CAM [25], Figure 11 shows the changes in the model’s focus regions after the ablation experiment, from which it can be seen that the TMCrack-Net network with the addition of BiFPN, CCT, and FCA obtained the best detection results.

Figure 11 shows the effect of different modules in the model. The simple U-Net network did not sufficiently focus on the area where the cracks were located for complex and inconspicuous crack images, and the addition of CCT and FCA modules increased the effective features.

5. Conclusions

In this work, we compared the capability of different semantic segmentation models in mural crack segmentation. To improve the model’s performance for mural crack segmentation, we proposed TMCrack-Net with feature fusion and a transformer. In addition, we built a mural crack library with 700 mural images annotated as a dataset. Our results showed that our model has some advantages for recognizing mural cracks. We use BiFPN to fuse the detailed information and semantic information of cracks, the CCT module to optimize the feature fusion to enhance the effective feature layer, and the FCA module to guide the fusion of encoded features and decoded features through an attention mechanism to reduce the ambiguity between them and segment more cracks. The main work and conclusions of this study are as follows:

•: We constructed a dataset of Tang Dynasty tomb chamber murals. The murals genuinely reflect the actual polo matches in the Tang Dynasty, providing informative physical data for the study of polo in the Tang Dynasty, which is of great historical value. The mural background is complex, having people, horses, and ancient mountains and rocks. Complex backgrounds are suitable for reflecting the model’s applicability in complex backgrounds.
•: Different layers of convolutional layers have different image features. In the shallow network, the features contain detailed information about the cracks, and in the higher layers, the features are represented as the semantic features of the cracks. The fusion pre-screening of information at different scales facilitates the extraction of crack information.
•: In order to obtain the number of CCT modules and self-attentive heads, the experimental comparison results in Section 4.8 show that the results are better when the number of transformer modules is one and the number of self-attentive heads is six. P, F₁, MIou, dice score, and Jaccard improved by 0.0181, 0.0056, 0.0067, 0.0087, and 0.0106 compared to the model using only convolution.
•: The FCA module is designed to integrate the encoding features with the decoding features. In Section 4.8, the results showed that by adding the FCA module, the MIou value of the model was improved by 0.0275, the p value by 0.015, recall by 0.0278, $F_{1}$ -score by 0.0216, dice score by 0.0414, and Jaccard by 0.0499.
•: The proposed model TMCrack-Net was compared with semantic segmentation networks of different structures. The evaluation metrics and visualization results in Section 4.7 show that TMCrack-Net achieved the highest values of P, R, $F_{1}$ , MIou, dice score, and Jaccard with 0.8649, 0.8526, 0.8587, 0.7731, 0.7376, and 0.5842, respectively.

In the future, we will further address the limitations of our model. We will improve the model’s accuracy and enhance the model for segmenting cracks in more complex mural images. We will investigate better fusion methods with attention mechanisms for improvement. As the number of tomb murals is scarce and precious, we will continue to collect more murals images to build a larger dataset so that the model can better learn the crack features.

Author Contributions

Conceptualization and methodology, M.W. and M.J.; validation, M.J.; writing—original draft preparation, M.J.; writing—review and editing, M.J. and M.W.; supervision, M.W.; funding acquisition and resources, M.W.; visualization, M.J.; data formatting, J.W. All authors have read and agree to the published version of this manuscript.

Funding

This work is supported by National Natural Science Foundation of China (number 61701388).

Data Availability Statement

The Mural crack dataset presented in this study is available from the Shaanxi History Museum.

Conflicts of Interest

The authors declare no conflict of interest.

References

Yang, J. A Study on the Display Design of Chinese Monastery Murals-Taking the Frescoes of Yongle Palace as an Example. China Natl. Exhib. 2020, 151–153. [Google Scholar] [CrossRef]
Zhao, B. Shanxi Yu County Song and Jin Burial Mural Relocation Protection and Restoration of Shallow. Identif. Apprec. Cult. Relics 2020, 54–57. [Google Scholar]
Hou, M.; Tian, S.; Guo, H.; Cheng, Q. K-Means Sobel Algorithm in Edge Extracting of Mural Diseases. In Proceedings of the 2010 2nd International Conference on Information Engineering and Computer Science, Wuhan, China, 25–26 December 2010; pp. 1–4. [Google Scholar]
Hou, M.L.; Wang, Y.M.; Fang, M.Z.; Hong, G. The Collection Mural Protection Application of the Lidar and GIS Technology. In Proceedings of the 2010 Second International Conference on Computer Modeling and Simulation, Washington, DC, USA, 22–24 January 2010. [Google Scholar]
Gancarczyk, J.; Sobczyk, J. Data Mining Approach to Image Feature Extraction in Old Painting Restoration. Found. Comput. Decis. Sci. 2013, 38, 159–174. [Google Scholar] [CrossRef] [Green Version]
Cornelis, B.; Ruzic, T.; Gezels, E.; Dooms, A.; Pizurica, A.; Platisa, L.; Cornells, J.; Martens, M.; Mey, M.D.; Daubechies, I. Crack Detection and Inpainting for Virtual Restoration of Paintings: The Case of the Ghent Altarpiece. Signal Process. 2013, 93, 605–619. [Google Scholar] [CrossRef]
Lin, Y.; Xu, C.; Lyu, S. Disease Regions Recognition on Mural Hyperspectral Images Combined by MNF and BP Neural Network. J. Phys. Conf. Ser. 2019, 1325, 012095. [Google Scholar] [CrossRef]
Yu, K.; Li, Y.; Yan, J.; Xie, R.; Zhang, E.; Liu, C.; Wang, J. Intelligent Labeling of Areas of Wall Painting with Paint Loss Disease Based on Multi-Scale Detail Injection U-Net. In Optics for Arts, Architecture, and Archaeology VIII; SPIE: Bellingham, WA, USA, 8 July 2021; Volume 11784, pp. 37–44. [Google Scholar]
Wu, L.; Zhang, L.; Shi, J.; Zhang, Y.; Wan, J. Damage Detection of Grotto Murals Based on Lightweight Neural Network. Comput. Electr. Eng. 2022, 102, 108237. [Google Scholar] [CrossRef]
Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection|IEEE Conference Publication|IEEE Xplore. Available online: https://ieeexplore.ieee.org/document/8954436 (accessed on 31 August 2022).
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Cao, J.; Chen, Q.; Guo, J.; Shi, R. Attention-Guided Context Feature Pyramid Network for Object Detection. arXiv 2020, arXiv:2005.11475. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transform-er Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv 2021, arXiv:2105.05537. [Google Scholar]
Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. Uctransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, virtually, 22 February–1 March 2022; pp. 2441–2449. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–23 June 2022; pp. 11966–11976. [Google Scholar]
Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv 2016, arXiv:1607.08022. [Google Scholar]
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks|IEEE Conference Publication|IEEE Xplore. Available online: https://ieeexplore.ieee.org/document/9156697 (accessed on 4 September 2022).
Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1525–1535. [Google Scholar] [CrossRef]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Im-age Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]

Figure 1. Feature network design: (a) person and horse; (b) stone; (c) trees.

Figure 2. Feature network design. (a) FPN builds top-down networks with lateral connections to fuse detailed and semantic information (P3-P7); (b) PANET adds bottom-up paths to FPN to build a bidirectional fusion backbone network; (c) NAS-FPN uses neural architecture searching to find the most suitable feature network structure for the backbone network; (d) BiFPN, based on bidirectional scale linking, removes nodes with only one input edge and adds a residual structure to link the original junction and the output node.

Figure 3. Overall model framework of TMCrack-Net.

Figure 4. Feature network design. The * stands for multiplication (a) ConvNeXt block’s structure; (b) downsample structure, where k represents the convolution kernel size, s represents the step size, and p represents the padding.

Figure 5. Structure of the CCT module.

Figure 6. Structure of the FCA module.

Figure 7. (a) Mural images; (b) images clipped from mural images.

Figure 8. Comparison of the results of different methods for six selected images from the test set.

Figure 9. (a) Precision–recall curves for different models on TMCrack. (b) Precision–recall curves for different models on Crack500.

Figure 10. Comparison of the results of different methods for selected crack images from the test set.

Figure 11. (a) Raw, (b) U-net, (c) U-Net+CCT, (d) U-Net+BiFPN, (e) U-Net+FCA, (f) U-Net+CCT +BiFPN, (g) U-Net+CCT+FCA, (h) U-Net+BiFPN+CCT+FCA.

Table 1. Dataset settings.

Dataset	Dataset Settings
TMCrack	Train set	Validation Set	Test Set
TMCrack	420	70	210
Crack500	Train set	Validation Set	Test Set
Crack500	250	50	200

Table 2. Performances of different versions of ConvNext.

Vesrison	Channel	Stage	MIou%	Parame (M)
ConvNeXt-T	[96 192 384 768]	[3 9 3 3]	74.22	29
ConvNeXt-S	[96 192 384 768]	[3 3 27 3]	73.18	50
ConvNeXt-B	[128 256 512 1024]	[3 3 27 3]	74.02	87
ConvNeXt-L	[192 384 768 1536]	[3 3 27 3]	74.10	198

Table 3. Performances with different numbers of attention heads.

Number of Head	$F_{1}$ %	MIou%	Dice%
2	85.53	76.67	73.19
3	85.67	76.87	73.46
4	85.78	77.03	73.67
5	85.92	77.19	73.91
6	86.04	77.31	73.40
7	85.77	76.92	73.60
8	85.53	76.78	73.18
9	85.57	76.61	73.20

Table 4. Performance with different numbers of layers.

Layers of CCT	$F_{1}$ %	MIou%	Dice%
1	86.04	77.31	74.11
2	85.80	77.05	73.71
3	85.82	76.95	73.67
4	85.79	76.88	73.59
5	85.70	76.77	73.42

Table 5. Performances of different methods on TMCrack.

Method	P%	R%	$F_{1}$ %	MIou%	Dice%	Jaccard%
U-Net	85.08	82.02	83.52	74.22	69.24	52.96
PSPnet	85.18	77.14	80.96	70.86	65.39	46.61
DeepLab V3+	85.98	80.06	82.91	73.33	67.72	51.19
UNet++	85.26	85.03	85.14	76.28	72.47	56.82
Swin-Unet	82.53	76.97	79.65	69.59	61.61	44.52
UCTransNet	82.79	83.48	83.13	73.87	68.93	52.69
TMCrack-Net	86.49	85.26	85.87	77.31	73.76	58.42

Table 6. Performances of different methods on Crack500.

Method	P%	R%	$F_{1}$ %	MIou%	Dice%	Jaccard%
U-Net	85.40	88.90	87.11	78.86	75.82	61.06
UNet++	88.04	85.28	86.38	78.28	74.73	59.65
PSPnet	87.37	77.78	82.30	72.43	65.43	48.62
DeepLabV3+	88.03	84.43	86.20	77.70	73.86	58.55
Swin-Unet	86.56	82.59	84.52	75.57	70.66	54.63
UCTransNet	88.42	83.49	85.88	77.23	73.13	57.64
TMCrack-Net	88.09	86.85	87.47	79.44	76.41	61.82

Table 7. The FLOPs, Params, and FPS of all methods on the TMCrack dataset.

Method	Params (M)	Flops (G)	FPS
U-Net	43.93	17.62	46
PSPnet	49.07	11.82	47
DeepLab V3+	5.81	5.05	50
UNet++	47.2	153.25	20
Swin-Unet	154.24	119.37	17
UCTransNet	66.43	32.94	18
TMCrack-Net	89.27	19.15	25

Table 8. Ablation study on TMCrack for the each module.

Method	P%	R%	$F_{1}$ %	MIou%	Dice%	Jaccard%
U-Net	85.08	82.02	83.52	74.22	69.24	52.96
U-Net+CCT	87.03	83.55	85.25	76.37	72.44	56.79
U-Net+BiFPN	85.22	84.16	84.69	75.70	71.57	55.73
U-Net+FCA	86.58	84.80	85.68	76.97	73.38	57.95
U-Net+CCT +BiFPN	85.08	84.63	84.58	75.91	71.91	51.64
U-Net+CCT+FCA	85.58	85.64	85.60	76.88	73.34	57.90
TMCrack-Net	86.49	85.26	85.87	77.31	73.76	58.42

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wu, M.; Jia, M.; Wang, J. TMCrack-Net: A U-Shaped Network with a Feature Pyramid and Transformer for Mural Crack Segmentation. Appl. Sci. 2022, 12, 10940. https://doi.org/10.3390/app122110940

AMA Style

Wu M, Jia M, Wang J. TMCrack-Net: A U-Shaped Network with a Feature Pyramid and Transformer for Mural Crack Segmentation. Applied Sciences. 2022; 12(21):10940. https://doi.org/10.3390/app122110940

Chicago/Turabian Style

Wu, Meng, Min Jia, and Jia Wang. 2022. "TMCrack-Net: A U-Shaped Network with a Feature Pyramid and Transformer for Mural Crack Segmentation" Applied Sciences 12, no. 21: 10940. https://doi.org/10.3390/app122110940

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

TMCrack-Net: A U-Shaped Network with a Feature Pyramid and Transformer for Mural Crack Segmentation

Abstract

1. Introduction

2. Related Work

2.1. Feature Pyramid

2.2. Vision Transformer

3. Proposed Methods

3.1. Model Architecture

3.2. AG-BiFPN Module

3.3. Decoder with FCA Module

4. Experiments and Results

4.1. Implementations Details

4.2. Evaluation Criteria

4.3. Dataset

4.4. Performance Comparison of the ConvNext with Different Version

4.5. Performance Comparison of the CCT with Different Numbers of Attention Heads and Layers

4.6. Other Methods

4.7. Experimental Results

4.8. Model Complexity

4.9. Ablation Experiments

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI