Article

Improvement of Concrete Crack Segmentation Performance Using Stacking Ensemble Learning

Taehee Lee, Jung-Ho Kim, Sung-Jin Lee, Seung-Ki Ryu and Bong-Chul Joo
1 Infrastructure Engineering Team, R&D Institute, LOTTE Engineering & Construction, Seoul 06515, Republic of Korea
2 Korea Institute of Civil Engineering and Building Technology, Goyang-si 10223, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(4), 2367; https://doi.org/10.3390/app13042367
Submission received: 29 January 2023 / Revised: 10 February 2023 / Accepted: 10 February 2023 / Published: 12 February 2023
(This article belongs to the Section Civil Engineering)

Abstract

Signs of functional loss due to the deterioration of structures are primarily identified from cracks occurring on their surfaces, and continuous monitoring of structural cracks is essential for socially important structures. Recently, many structural crack monitoring technologies have been developed alongside deep-learning artificial intelligence (AI). In this study, stacking ensemble learning was applied to predict structural cracks more precisely. A semantic segmentation model was primarily used for crack detection with a deep-learning AI model. We studied crack-detection performance by training UNet, DeepLabV3, DeepLabV3+, DANet, and FCN-8s. Owing to the unsuitable crack segmentation performance of FCN-8s, stacking ensemble learning was conducted with the remaining four models. The individual models yielded intersection over union (IoU) scores ranging from approximately 0.4 to 0.6 on the test dataset. However, the metamodel completed with stacking ensemble learning achieved an IoU score of 0.74, a marked improvement. A total of 1235 test images were acquired with drones on a marine bridge, and the stacking ensemble model showed an IoU of 0.5 or higher for 64.4% of these images.

1. Introduction

For aging social infrastructure, damaged areas must be identified and continuously monitored so that structures maintain their intended functions. The propagation of cracks on the surface of social infrastructure is primarily used as an index to estimate the degree of damage to a structure. Maintenance of infrastructure typically begins with visual inspection of surface cracks. Depending on the severity of the cracks (crack width over 0.3 mm), it is determined whether to reinforce them or inspect them in detail. However, it is challenging to constantly check the state of surface cracks by human visual inspection when structures are located in places that are not easy to access, such as marine bridges, or when the area to be managed is vast, such as road surfaces. In addition, owing to the lack of professional human resources for visual inspection, structural diagnosis may be delayed.
The causes of structural cracks include drying shrinkage of concrete, insufficient cover thickness, an insufficient amount of reinforcing bars at the cross section, the influence of various loads acting within the design load, and a lack of compaction during construction. Worldwide, strengthening and monitoring structures is quite important, as it is a cost-effective alternative to demolishing and rebuilding them. Recently, rapidly developing information and communication technologies based on deep-learning artificial intelligence (AI) inference models have been used in various fields, and their adoption is actively being attempted in the construction sector [1,2,3,4,5,6,7]. In particular, in the field of visual recognition, deep-learning AI technology is expected to substantially improve on existing structure observation methods that rely on the human eye. In construction, deep-learning AI technology for vision is primarily used to detect problematic parts in images. An object detection model [8,9,10,11,12] that detects cracks or damage in a structure and draws a bounding box around the area where the abnormality occurs, or a classification model that incorporates the sliding-window technique [13,14,15,16], is typically used. Segmentation models [17,18,19,20] that classify states in pixel units to provide detailed shape information, such as crack width and length, are also utilized.
Cracks in structures occur mainly on the surface of asphalt concrete (ascon) or concrete, and even the same material appears as different visual information under environmental changes, such as aging of materials, differences in components, and variations in illumination and humidity. Such heterogeneous data acquired from various local sources complicate the convergence of a deep neural network. In particular, segmenting crack areas using only a single deep-learning inference model is very challenging, and a new technique for improving model performance is needed. This study introduces a stacking ensemble technique using heterogeneous models, departing from the existing approach of using a single model.
In this study, we compared the performances of five representative segmentation models using a heterogeneous dataset aggregated from five source datasets. In addition, to improve the performance of deep neural networks on locally acquired heterogeneous data, the stacking ensemble technique was introduced, and the model inference results were compared. Based on the comparison results, we quantitatively show the performance improvement obtained when using a stacking ensemble network instead of a single semantic segmentation model.

2. Literature Review

2.1. Representative Semantic Segmentation Architecture

Recently, AI models in the field of vision have developed rapidly with the emergence of convolutional neural networks (CNNs). CNN models are used in various fields, including the detection of anomalies in construction for continuous management. In particular, for microcracks on the surface of structures made of concrete or ascon, geometric shape information, such as crack width and length, is required to determine the current state of structural damage, to continuously observe its progress, and to support decision-making on repair and reinforcement. Segmentation models, which analyze images in pixel units and can closely delineate the areas occupied by cracks, are mainly used because they readily capture the shape of microcracks. Therefore, this study focused on semantic segmentation models that can detect microcracks on the surfaces of concrete or ascon structures.
Semantic segmentation models are based on fully convolutional networks (FCNs). All hidden layers are convolutional, and a pre-trained classification model is used as the feature extractor. Pixel-wise predictions are produced through an upsampling process based on the features obtained by the feature extractor. If a concrete crack image is input, the locations where cracks occur are provided in pixel units. The FCN [17] is the simplest basic form of a semantic segmentation model: the output of the feature extractor is simply upsampled to the desired size. According to the detailed upsampling steps, it is divided into FCN-8s, FCN-16s, and FCN-32s, in which the information from each pooling layer of the feature extractor is summed with the upsampled information.
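As an illustration, the following is a minimal Keras-style sketch of this FCN-8s fusion scheme, with class scores upsampled in steps and summed with scores from earlier pooling stages. The fcn8s_head helper, the filter choices, and the bilinear upsampling are assumptions; the original FCN uses learned transposed convolutions.

import tensorflow as tf
from tensorflow.keras import layers

# Sketch of the FCN-8s fusion idea: coarse class scores are upsampled
# in steps and summed with scores from earlier pooling stages.
# n_classes=2 matches the crack/no-crack setting of this study.
def fcn8s_head(pool3, pool4, pool5, n_classes=2):
    score5 = layers.Conv2D(n_classes, 1)(pool5)             # 1/32 resolution
    up5 = layers.UpSampling2D(2, interpolation="bilinear")(score5)
    score4 = layers.Conv2D(n_classes, 1)(pool4)             # 1/16 resolution
    fuse4 = layers.Add()([up5, score4])                     # first skip sum
    up4 = layers.UpSampling2D(2, interpolation="bilinear")(fuse4)
    score3 = layers.Conv2D(n_classes, 1)(pool3)             # 1/8 resolution
    fuse3 = layers.Add()([up4, score3])                     # second skip sum
    # one final 8x upsampling back to the input resolution
    out = layers.UpSampling2D(8, interpolation="bilinear")(fuse3)
    return layers.Softmax()(out)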
A semantic segmentation model based on an FCN has been developed into two types of architectures. One is the encoder–decoder architecture, which maximizes the ensemble effect by connecting tensors in the encoder and decoder with skip connections. Because of the shape of the network, it was named U-Net [21], and by varying the upsampling, max-pooling, and skip connections, variants such as J-Net, M-Net, UNet++, UNet3+, and CU-Net [22,23,24,25] were developed. This type of model prevents the loss of contextual information caused by rapid upsampling and exhibits high performance in distinguishing fine boundaries between objects.
The other type makes the feature extractor more sophisticated. Pooling layers with different receptive field sizes are arranged to extract more diverse features, and upsampling is then performed to generate an output of the desired size. The networks from DeepLabV1 to V3 belong to this category [26,27,28]. They use fewer parameters than UNet and can obtain higher segmentation quality than the FCN. In addition, a decoder module was added to the DeepLabV3 model to develop DeepLabV3+, which incorporates encoder–decoder features and encodes richer contextual information to distinguish object boundaries [29].
In addition, recent deep-learning architectures using attention modules show high performance in natural language processing, and their use is expanding to the field of vision, for example, the dual attention network (DANet) [30] and OCRNet [31]. DANet effectively extracts long-range contextual information between pixels of the same class that are far apart by feeding the extracted image features into position and channel attention modules and then combining the results.
In this study, the performances of the FCN-8s model, the most basic semantic segmentation model, and of DeepLabV3, DeepLabV3+, UNet, and DANet were compared, and the segmentation performance improvement obtained with the stacking ensemble learning technique was reviewed.

2.2. Ensemble Learning Techniques

In deep learning, a single model is generally used to predict a value from data. However, more accurate predictions can be obtained by ensemble learning, which combines multiple models [32]. The key to ensemble learning is combining multiple models to create a robust model. Depending on the combination method, it is divided into bootstrap aggregation (bagging), boosting, and stacking.
Bagging is a method of aggregating results by drawing samples multiple times and training models connected in parallel. The outputs of the models are aggregated by voting for categorical data and by averaging for continuous data. Bagging prevents overfitting and improves the stability of the algorithm because it averages the results obtained from each sample.
Boosting is a method of creating a robust predictive model by weighting the learning error of each serially connected model and sequentially reflecting it in the next model. Initially, all raw data are assigned the same weight; thereafter, a high weight is assigned to incorrectly predicted data and a low weight to correctly predicted data, so that subsequent models focus on correcting the earlier errors.
Bagging and boosting arrange multiple identical models; stacking, unlike these two methods, combines different models to create a model with the best performance, as illustrated in the sketch below. Although a large amount of computation is required, high performance can be obtained by taking the strengths of each model and supplementing their weaknesses.
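As a toy illustration of stacking (unrelated to the segmentation models used later), the following sketch combines heterogeneous scikit-learn base models through a meta-learner trained on their cross-validated predictions; the dataset and the model choices here are arbitrary assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Heterogeneous base models combined by a meta-learner trained on
# their out-of-fold predictions.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # the metamodel
    cv=4,  # out-of-fold predictions, analogous to Section 3.2.3
)
stack.fit(X_tr, y_tr)
print("stacked accuracy:", stack.score(X_te, y_te))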

3. Experiments

3.1. Used Dataset

For AI model training and testing, 10,789 open images and their labeling datasets were used, as shown in Figure 1. The data are published on Kaggle [33] and are often used to analyze the performance of crack segmentation models. Each sample comprises a pair consisting of a three-channel color image of road surface damage and a grayscale image in which damaged pixels are assigned a value of 255 and undamaged pixels a value of 0 (a minimal loading sketch is given after the dataset list below). The dataset includes various microcracks and deep cracks in asphalt and concrete, making it suitable for this review and for obtaining consistent detection performance on crack images acquired from multiple locations.
-
Crack Forest Dataset [3]: 118 pairs of RGB color images of asphalt road surface cracks taken in Beijing, China, with an iPhone 5, together with manually labeled masks; the images were reshaped from their original resolution of 480 × 320 to 224 × 224 pixels.
-
Crack500 [34]: Crack500 comprises 500 RGB color images of cracks on asphalt road surfaces, acquired at a resolution of 2560 × 1440. The 500 images were cropped into 3368 patches of 640 × 360, which were reshaped to 224 × 224 and used in this study.
-
CrackTree200 [35]: CrackTree200 provides 800 × 600 resolution images and label values for cracks on asphalt surfaces. It contains many images of cracks on asphalt with tree shadows cast over them.
-
Deep Crack [36]: a crack segmentation dataset comprising 537 RGB color images of 554 × 384 pixels. It is characterized by images at various scales of cracks that can occur on the surfaces of various materials and includes images of cracks on concrete surfaces and asphalt.
-
GAPs384 [37]: the German Asphalt Pavement Distress dataset provides road surface images and labeling data, including cracks, potholes, and inlaid patches, at a resolution of 1920 × 1080. Each image was divided into six non-overlapping pieces of 640 × 540 resolution, and 509 images containing cracks of 1000 pixels or more were provided as a dataset.
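As referenced above, the following is a minimal sketch of how such image–mask pairs might be loaded and converted to the two-channel one-hot targets used in this study; the file layout and the load_pair helper are assumptions.

import tensorflow as tf

# Hypothetical loader: each RGB image is paired with a grayscale mask
# in which crack pixels are 255 and background pixels are 0.
def load_pair(image_path, mask_path):
    img = tf.io.decode_image(tf.io.read_file(image_path),
                             channels=3, expand_animations=False)
    msk = tf.io.decode_image(tf.io.read_file(mask_path),
                             channels=1, expand_animations=False)
    img = tf.image.resize(img, (224, 224)) / 255.0            # normalize RGB
    msk = tf.image.resize(msk, (224, 224), method="nearest")
    crack = tf.cast(msk > 127, tf.float32)                    # 255 -> 1
    # two-channel one-hot target: (background, crack)
    target = tf.concat([1.0 - crack, crack], axis=-1)         # 224 x 224 x 2
    return img, target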

3.2. Model Architectures

3.2.1. Backbone Network

A backbone network is used to improve the learning efficiency of the semantic segmentation models. MobileNetV1 [38], which offers efficient feature extraction with relatively few parameters, was selected as the backbone for the stacking ensemble learning of the multiple models. In MobileNetV1, weights pretrained on the ImageNet dataset were assigned as initial values.
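In Keras, such a backbone might be instantiated as below; the tap point for the 28 × 28 feature map (layer name conv_pw_5_relu) is an assumption about how the intermediate features were exposed.

import tensorflow as tf

# MobileNetV1 feature extractor initialized with ImageNet weights.
backbone = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

# Expose an intermediate 28 x 28 feature map (assumed tap point) for
# the ASPP and skip connections used by the segmentation heads.
feat_28 = backbone.get_layer("conv_pw_5_relu").output
extractor = tf.keras.Model(backbone.input, feat_28)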

3.2.2. Semantic Segmentation Models

Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 show the five semantic segmentation models implemented in this study to compare crack segmentation performance. The FCN of Long et al. [17] is the simplest basic form of the semantic segmentation model, and Figure 2 shows the FCN-8s model implemented in this study. MobileNetV1, pretrained on ImageNet, was used as the feature extractor, and the model is characterized by two upsampling steps, each summed with features of the corresponding size.
Figure 3 shows the UNet architecture used in this study. Features from different stages of MobileNet are connected by skip connections and convolved to produce the output at the original size.
Figure 4 shows the DeepLabV3 architecture used in this study. Atrous spatial pyramid pooling (ASPP) with different dilation rates is performed on the 28 × 28 features extracted by MobileNetV1; all results are concatenated and convolved, and the output at the original size is calculated by upsampling.
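A minimal sketch of such an ASPP head is given below; the filter counts, dilation rates, and the global-average-pooling branch are assumptions, as they are not stated in the text.

import tensorflow as tf
from tensorflow.keras import layers

# Sketch of an ASPP head over the 28 x 28 backbone features: parallel
# atrous convolutions with different dilation rates, concatenation,
# 1x1 convolution, and upsampling to the input size.
def aspp_head(feat_28, n_classes=2, rates=(1, 6, 12, 18)):
    branches = [layers.Conv2D(256, 3, padding="same", dilation_rate=r,
                              activation="relu")(feat_28) for r in rates]
    # global-average-pooling branch, common in DeepLabV3 (assumed here)
    gap = layers.GlobalAveragePooling2D(keepdims=True)(feat_28)
    gap = layers.Conv2D(256, 1, activation="relu")(gap)
    gap = layers.UpSampling2D(28, interpolation="bilinear")(gap)  # back to 28 x 28
    x = layers.Concatenate()(branches + [gap])
    x = layers.Conv2D(256, 1, activation="relu")(x)
    x = layers.Conv2D(n_classes, 1)(x)
    return layers.UpSampling2D(8, interpolation="bilinear")(x)    # 28 -> 224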
Figure 5 shows the architecture of DeepLabV3+. It is characterized by upsampling the ASPP result of the previously implemented DeepLabV3 to 56 × 56 and connecting it with the feature map of the corresponding size through a skip connection.
Figure 6 shows the DANet implemented in this study. The 28 × 28 features are input to the position attention module and the channel attention module, and the final output at the original size is calculated through a series of convolutions applied to the combined results of the two modules.
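The position attention module at the core of DANet can be sketched as follows; the reduction factor is an assumption, and the learnable scale on the attention output in the original DANet is omitted for brevity.

import tensorflow as tf
from tensorflow.keras import layers

# Sketch of DANet-style position attention: every spatial position
# attends to every other position, capturing long-range context.
def position_attention(x, reduction=8):
    _, h, w, c = x.shape                       # e.g., (None, 28, 28, C)
    q = layers.Conv2D(c // reduction, 1)(x)    # queries
    k = layers.Conv2D(c // reduction, 1)(x)    # keys
    v = layers.Conv2D(c, 1)(x)                 # values
    q = tf.reshape(q, (-1, h * w, c // reduction))
    k = tf.reshape(k, (-1, h * w, c // reduction))
    v = tf.reshape(v, (-1, h * w, c))
    attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True))  # (B, HW, HW)
    out = tf.reshape(tf.matmul(attn, v), (-1, h, w, c))
    return layers.Add()([out, x])              # residual connection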

3.2.3. Meta Model Architecture

Figure 7 shows the architecture of the metamodel used to produce the final output from the results of the segmentation models. Four individual models were created for each segmentation architecture, and four-fold cross-validation for ensemble learning was performed using 9704 pairs of training data. For each architecture, pixel-wise averaging was performed on the outputs of its four models, yielding four outputs of size 224 × 224 × 2, one per architecture. These were concatenated to create a tensor of 224 × 224 × 8, which was used as the input of the metamodel. The metamodel was based on the UNet architecture. Since the metamodel primarily serves to synthesize the results derived from the four architectures, a complicated network configuration is unnecessary. Therefore, pretrained weights were not used, and two skip connections were configured to derive a final output of 224 × 224 × 2.
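A small UNet-style metamodel consistent with this description might look as follows; only the 224 × 224 × 8 input, the two skip connections, and the 224 × 224 × 2 output are taken from the text, while the filter counts and layer arrangement are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

# Sketch of a lightweight UNet-style metamodel over the concatenated
# 224 x 224 x 8 stack of the four base-architecture outputs.
def build_metamodel():
    inp = layers.Input((224, 224, 8))
    e1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(e1)                       # 112 x 112
    e2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(e2)                       # 56 x 56
    b = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)
    u2 = layers.UpSampling2D()(b)                        # 112 x 112
    d2 = layers.Concatenate()([u2, e2])                  # skip connection 1
    d2 = layers.Conv2D(32, 3, padding="same", activation="relu")(d2)
    u1 = layers.UpSampling2D()(d2)                       # 224 x 224
    d1 = layers.Concatenate()([u1, e1])                  # skip connection 2
    d1 = layers.Conv2D(16, 3, padding="same", activation="relu")(d1)
    out = layers.Conv2D(2, 1, activation="softmax")(d1)  # 224 x 224 x 2
    return tf.keras.Model(inp, out)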
Figure 8 shows the process of constructing the training dataset for the metamodel. Four copies of each segmentation model were prepared for four-fold cross-validation, and the training dataset was divided into four parts so that each copy was trained and validated on a different validation set. After training the individual models, their outputs on the respective validation sets were collected, yielding an output of 9704 × 224 × 224 × 2 for each segmentation architecture over all training data. A tensor of 9704 × 224 × 224 × 8 was constructed by concatenating the outputs of the four architectures and was paired with the 224 × 224 × 2 labels of the 9704 images as the metamodel training set.
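The out-of-fold scheme in Figure 8 can be sketched as below; build_fn and train are hypothetical placeholders standing in for model construction and fitting.

import numpy as np
from sklearn.model_selection import KFold

# For each architecture, four copies are trained on different 3/4
# splits and predict their held-out quarter, yielding one
# 9704 x 224 x 224 x 2 output per architecture.
def out_of_fold_predictions(build_fn, train, X, y, n_splits=4):
    oof = np.zeros((len(X), 224, 224, 2), dtype=np.float32)
    for tr_idx, va_idx in KFold(n_splits=n_splits).split(X):
        model = build_fn()
        train(model, X[tr_idx], y[tr_idx])      # fit on three folds
        oof[va_idx] = model.predict(X[va_idx])  # predict held-out fold
    return oof

# Concatenating the four architectures' outputs gives the 224 x 224 x 8
# metamodel input paired with the original labels:
# meta_X = np.concatenate([oof_unet, oof_dlv3, oof_dlv3p, oof_danet], axis=-1)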
As shown in Figure 9, for validation of the metamodel, the 1085-image test dataset was input to the 16 segmentation models built on the 9704-image training set, and a result of 1085 × 224 × 224 × 2 was obtained for each model. Pixel-wise averaging was performed on the results of models of the same type, and by concatenating the results of each architecture, input data of 1085 × 224 × 224 × 8 were paired with the corresponding output data of 1085 × 224 × 224 × 2. As Figure 8 and Figure 9 show, the metamodel was thus trained and evaluated using all 10,789 image pairs.

3.3. Model Training

The output of each segmentation model was built with two channels, one-hot encoded per pixel, and categorical cross-entropy was used as the loss function. The model weights were optimized using Adam [39] with a learning rate of 0.001, beta_1 = 0.9, and beta_2 = 0.999. As an evaluation index, the intersection over union (IoU) score, which is commonly used for training segmentation models, was computed and recorded on the validation set throughout the entire training process of 100 epochs.
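This configuration might be expressed in Keras as follows; the crack_iou helper and the one-layer stand-in model are illustrative assumptions.

import tensorflow as tf

# IoU on the crack channel, computed from the argmax of the one-hot
# outputs (an assumed implementation of the IoU score used here).
def crack_iou(y_true, y_pred):
    t = tf.cast(tf.argmax(y_true, -1), tf.float32)
    p = tf.cast(tf.argmax(y_pred, -1), tf.float32)
    inter = tf.reduce_sum(t * p)
    union = tf.reduce_sum(t) + tf.reduce_sum(p) - inter
    return inter / (union + 1e-7)

# Stand-in model; in practice, any of the segmentation models above.
model = tf.keras.Sequential(
    [tf.keras.layers.Conv2D(2, 1, activation="softmax",
                            input_shape=(224, 224, 3))])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                       beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy",
    metrics=[crack_iou])
# model.fit(train_ds, validation_data=val_ds, epochs=100)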

4. Results and Discussion

4.1. IoU Scores of Five Segmentation Models

Figure 10 shows the change in the IoU score with epochs for a representative model of each architecture. Although there were differences depending on the model, once the number of epochs exceeded 20, training had largely saturated and the IoU score did not change significantly. Because training was performed under the same conditions for all models, some architectures exhibited overfitting, as in FCN or DeepLabV3, while others, such as UNet, could have improved with further training, but no additional tuning was performed. The weights from the epochs with the best validation IoU scores were used as the final models.

4.2. Predicted Images of Five Segmentation Models

Figure 11 shows the target image and the crack-prediction results of the four models trained for each of the five architectures. The target image shows cracks on a heavily discolored concrete surface. The discolored area makes segmentation difficult, and large differences in the prediction results can be observed depending on the architecture.
DeepLabV3+ showed the results most similar to the target. Its combination of upsampling and skip connections predicts the cracks at an appropriate level of detail, neither excessively fine nor too rough. In addition, its predictions are the most consistent across the four models, so it is judged to be the architecture best suited to this dataset.
In the case of DANet, regions identified as cracks were rendered thicker than the actual cracks, and narrow cracks were often ignored and omitted. The inference results were consistent across the four models.
DeepLabV3 often incorrectly infers discolored concrete areas as cracks; in particular, it struggles to detect fine cracks. The location and degree of noise differed significantly among the four models.
The FCN does not capture crack information. Because a large amount of information is decoded with a small number of parameters, it cannot produce a correctly interpreted result. In some cases, none of the four models found any crack pixels, and even when cracks were found, they were in the wrong location or only roughly predicted. Because the FCN cannot produce detailed crack predictions, it was excluded from the subsequent stacking ensemble learning.
UNet delineates cracks most thinly, but only wide cracks are identified. This suggests that training was not yet complete; with additional data augmentation and epochs, cracks could likely be inferred at a more detailed level.

4.3. Comparison Results between Predicted and Labeled Images

Figure 12 shows the crack-prediction results of the metamodel trained on the results of the four segmentation architectures. The prediction includes the prominent crack locations found by DeepLabV3+, which showed the best individual performance; it provides segmentation that follows the crack width in detail, as UNet does, and it does not include the noise seen in DeepLabV3 or DANet. In other words, detailed segmentation results for cracks are obtained while hardly including noise that can be mistaken for cracks, such as discoloration or shadows on concrete. Even when a base model overfits or its training has not fully progressed, the results improve after stacking ensemble learning.
Figure 13 compares the highest IoU scores calculated by each split model for the test and validation sets. For the individual models, the IoU score on the validation set was approximately 0.6 under the same training and dataset conditions, whereas the IoU score on the test set ranged from approximately 0.4 to 0.6.
This difference in IoU is attributable to the upsampling stages of the networks. FCN and DeepLabV3 showed low IoU scores because the final segmentation image was derived through a one-step, high-magnification upsampling of the feature extractor result. During one-step upsampling at high magnification, significant microcrack information may be lost, or tiny crack information may be overestimated. In contrast, DANet and DeepLabV3+, which perform roughly two upsampling steps, showed the highest IoU of about 0.6 among the four architectures with little overfitting, making them appropriate for crack detection. UNet undergoes a four-step upsampling process and achieved an IoU of 0.6 on the validation set but a somewhat lower IoU on the test set, indicating overfitting. In other words, the many upsampling steps increase the number of parameters to be learned, and it appears difficult to train UNet sufficiently with the dataset used in this study. By contrast, the IoU score of the metamodel, which reanalyzes the results of all models, was 0.74 on the test set, a substantial improvement.
Table 1 shows the scores of the individual models and the metamodel used for stacking ensemble learning in this study. A total of 1085 test images were used to calculate the scores; 795 of them contained cracks, and the remaining 290 contained no cracks. Among the single models, the highest score was achieved by DeepLabV3+ with an F1-score of 0.899, followed by DANet with an F1-score of 0.844. UNet and DeepLabV3 ranked third and fourth, respectively. In particular, all four DeepLabV3+ models showed high scores, confirming that it is the network to consider first when using a single model. Finally, the metamodel created with the stacking ensemble learning technique achieved an F1-score above 0.95, quantitatively confirming its superior performance.

4.4. New Data for Testing Performance of Metamodel

To examine the performance of the stacking ensemble model developed in this study, photographs of cracks on the concrete surface of a marine bridge were acquired and the crack locations were labeled, as shown in Figure 14. A total of 230 original images of UHD (3840 × 2160) resolution were acquired, and 1235 small images of 224 × 224 pixels were prepared for performance verification of the metamodel by cropping only the regions containing cracks.
Figure 15 compares the results predicted by the metamodel with the labeled images prepared in advance. Using an IoU score greater than 0.5 as the criterion, the metamodel was found to predict the labeling data well for 64.4% of the images. These are inference results for crack images not used in training, and the model estimated the labeled crack locations closely.
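This acceptance criterion might be computed as in the sketch below; the helper names are assumptions.

import numpy as np

# Per-image crack IoU between a binary prediction and its label.
def image_iou(pred, label):
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return inter / union if union else 1.0  # empty vs. empty counts as a match

# Fraction of test images whose IoU reaches the threshold (64.4% here).
def fraction_above(preds, labels, threshold=0.5):
    ious = [image_iou(p, l) for p, l in zip(preds, labels)]
    return np.mean(np.array(ious) >= threshold)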

5. Conclusions

In this study, the performance of stacking ensemble learning comprising four structurally different models was analyzed against the individual models for detecting cracks on concrete surfaces. For training, a crack segmentation dataset published on Kaggle was used, and UNet, DeepLabV3, DeepLabV3+, DANet, and FCN-8s were trained under identical conditions without additional data augmentation. The trained models showed IoU scores ranging from approximately 0.4 to 0.6 on the test dataset, and a comparison of the inference images showed that the FCN-8s model was unsuitable for predicting cracks. Excluding FCN-8s, the remaining four models were used for stacking ensemble learning, which was conducted to train the metamodel. Inference on the test dataset with the metamodel yielded an IoU score of 0.74, confirming a substantial performance improvement. To analyze inference performance on data not used for training, crack images of the concrete surface of a marine bridge were acquired and processed to a size of 224 × 224. The metamodel predicted the cracks with a satisfactory result, reaching an IoU of 0.5 or higher for 64.4% of the images.

Author Contributions

Writing—original draft preparation and methodology, T.L.; supervision and project administration, J.-H.K.; formal analysis, S.-J.L.; methodology, S.-K.R.; writing—review and editing, B.-C.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Institute of Civil Engineering and Building Technology (project number: 20230165-001), granted financial resource from the Ministry of Science and ICT, Republic of Korea.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lee, T.; Chun, C.; Ryu, S.K. Detection of road-surface anomalies using a smartphone camera and accelerometer. Sensors 2021, 21, 561. [Google Scholar] [CrossRef] [PubMed]
  2. Lee, T.; Yoon, Y.; Chun, C.; Ryu, S. CNN-Based Road-Surface crack detection model that responds to brightness changes. Electronics 2021, 10, 1402. [Google Scholar] [CrossRef]
  3. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  4. Hanandeh, S. Evaluation circular failure of soil slopes using classification and predictive gene expression programming schemes. Front. Built Environ. 2022, 8, 858020. [Google Scholar] [CrossRef]
  5. Lanning, A.; Zaghi, A.E.; Zhang, T. Applicability of convolutional neural networks for calibration of nonlinear dynamic models of structures. Front. Built Environ. 2022, 8, 873546. [Google Scholar] [CrossRef]
  6. Slaton, T.; Hernandez, C.; Akhavian, R. Construction activity recognition with convolutional recurrent networks. Autom. Constr. 2020, 113, 103138. [Google Scholar] [CrossRef]
  7. Tsai, Y.-L.; Chang, H.-C.; Lin, S.-N.; Chiou, A.-H.; Lee, T.-L. Using convolutional neural networks in the development of a water pipe leakage and location identification system. Appl. Sci. 2022, 12, 8034. [Google Scholar] [CrossRef]
  8. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  9. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  10. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  11. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Araucano Park, Las Condes, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  12. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  13. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  14. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  16. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  17. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  19. Kirillov, A.; He, K.; Girshick, R.B.; Rother, C.; Dollár, P. Panoptic Segmentation. arXiv 2018, arXiv:1801.00868. [Google Scholar] [CrossRef]
  20. Xiong, Y.; Liao, R.; Zhao, H.; Hu, R.; Bai, M.; Yumer, E.; Urtasun, R. UPSNet: A Unified Panoptic Segmentation Network. arXiv 2019, arXiv:1901.03784. [Google Scholar] [CrossRef]
  21. Ronneberger, O.; Philipp, F.; Thomas, B. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015. [Google Scholar]
  22. Chen, B.-W.; Hsu, Y.-M.; Lee, H.-Y. J-Net: Randomly Weighted U-Net for Audio Source Separation. arXiv 2019, arXiv:1911.12926. [Google Scholar]
  23. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. arXiv 2018, arXiv:1807.10165. [Google Scholar]
  24. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. arXiv 2020, arXiv:2004.08790. [Google Scholar]
  25. Liu, H.; Shen, X.; Shang, F.; Wang, F. CU-Net: Cascaded U-Net with Loss Weighted Sampling for Brain Tumor Segmentation. arXiv 2019, arXiv:1907.07677. [Google Scholar]
  26. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2016, arXiv:1412.7062. [Google Scholar]
  27. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  28. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  29. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv 2018, arXiv:1802.02611. [Google Scholar]
  30. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. arXiv 2019, arXiv:1809.02983. [Google Scholar]
  31. Yuan, Y.; Chen, X.; Chen, X.; Wang, J. Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation. arXiv 2021, arXiv:1909.11065. [Google Scholar]
  32. Pintelas, P.; Livieris, I.E. Special issue on ensemble learning and applications. Algorithms 2020, 13, 140. [Google Scholar] [CrossRef]
  33. Middha, L. Crack Segmentation Dataset. 2021. Available online: https://www.kaggle.com/lakshaymiddha/crack-segmentation-dataset (accessed on 12 February 2023).
  34. Zhang, L.; Yang, F.; Daniel Zhang, Y.D.; Zhu, Y.J. Road crack detection using deep convolutional neural network. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; Volume 2016, pp. 3708–3712. [Google Scholar] [CrossRef]
  35. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett. 2012, 33, 227–238. [Google Scholar] [CrossRef]
  36. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  37. Eisenbach, M.; Stricker, R.; Seichter, D.; Amende, K.; Debes, K.; Sesselmann, M.; Ebersbach, D.; Stoeckert, U.; Gross, H.M. How to get pavement distress detection ready for deep learning? A systematic approach. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2039–2047. [Google Scholar] [CrossRef]
  38. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Crack segmentation datasets used in this study.
Figure 2. Model architecture of FCN-8s.
Figure 3. Model architecture of UNet.
Figure 4. Model architecture of DeepLabV3.
Figure 5. Model architecture of DeepLabV3+.
Figure 6. Model architecture of DANet.
Figure 7. Metamodel architecture.
Figure 8. Strategy of training data generation for metamodel by stacking ensemble: preparing training dataset.
Figure 9. Strategy of training data generation for metamodel by stacking ensemble: preparing test dataset.
Figure 10. IoU score versus epochs: (a) FCN, (b) UNet, (c) DeepLabV3, (d) DeepLabV3+, and (e) DANet.
Figure 11. Typical images predicted by segmentation models.
Figure 12. Images predicted by metamodel trained by the stacking ensemble technique.
Figure 13. Comparison of IoU scores with segmentation models.
Figure 14. Images of concrete surface and corresponding labeled cracks.
Figure 15. Comparison of segmentation results between labeled and predicted images.
Table 1. Summary of crack segmentation model performance.

Model                           Precision  Recall  Accuracy  F1-Score
DANet        #1                 0.842      0.846   0.731     0.844
             #2                 0.699      0.818   0.605     0.754
             #3                 0.874      0.849   0.757     0.861
             #4                 0.432      0.935   0.482     0.591
DeepLabV3    #1                 0.044      0.223   0.038     0.074
             #2                 0.687      0.815   0.595     0.746
             #3                 0.331      0.680   0.286     0.445
             #4                 0.583      0.789   0.505     0.671
DeepLabV3+   #1                 0.880      0.854   0.766     0.867
             #2                 0.843      0.868   0.753     0.855
             #3                 0.913      0.854   0.790     0.883
             #4                 0.942      0.859   0.816     0.899
UNet         #1                 0.192      0.560   0.170     0.286
             #2                 0.239      0.637   0.223     0.348
             #3                 0.099      0.424   0.104     0.160
             #4                 0.757      0.860   0.683     0.805
Ours (stacking ensemble model)  0.982      0.920   0.911     0.951