Article
Peer-Review Record

Concrete Bridge Defects Identification and Localization Based on Classification Deep Convolutional Neural Networks and Transfer Learning

Remote Sens. 2022, 14(19), 4882; https://doi.org/10.3390/rs14194882
by Hajar Zoubir 1, Mustapha Rguig 1, Mohamed El Aroussi 1, Abdellah Chehri 2, Rachid Saadane 1 and Gwanggil Jeon 3,*
Submission received: 31 May 2022 / Revised: 20 September 2022 / Accepted: 26 September 2022 / Published: 30 September 2022
(This article belongs to the Special Issue Artificial Intelligence-Based Learning Approaches for Remote Sensing)

Round 1

Reviewer 1 Report

Structural health monitoring (SHM) has attracted much attention. This paper presents a dataset of more than 6900 images featuring three common concrete bridge defects: cracks, efflorescence, and spalling. Three transfer learning approaches to fine-tuning the VGG16 model are explored, and the best classification method for these defects is proposed. Furthermore, three gradient-based back-propagation interpretation techniques are implemented to localize defects on randomly selected images.

1. The structure of this paper needs to be further optimized. Section 1 Introduction is too long. More importantly, there is a lack of overall introduction to the proposed method and specific description for key steps of this method.

2. There are some typographical, spelling, and grammatical mistakes in this paper. Please check Lines 22, 56, 228, 274, and so on. For "transfer learning", which occurs many times, the capitalization of the first letter of each word is inconsistent.

3. Line 395. This paper only compares the results when the last three layers are retrained. Can the parameters of more layers be retrained? How does that affect the performance? Please report this in the paper.

4. The image quality of Figure 4 needs to be improved. The y-coordinate values of all the left subfigures are in percentage, so the range of the y-coordinate values should be 0-100 instead of 0.0-1.0.

5. The best way to detect concrete cracks is to use semantic segmentation for pixel-level localization rather than bounding-box-level localization. How large is the difference in localization accuracy between this method and semantic segmentation-based methods?

6. Overall, the literature review should be improved. Without discussing the previous works, the novelty of the paper cannot be assessed properly. It is suggested that the related papers should be cited. For example, DOIs: 10.1007/s12243-019-00731-9; 10.1080/15732479.2018.1550519; 10.1016/j.jksuci.2022.02.004; 10.1061/(ASCE)ST.1943-541X.0002666.

 

Author Response

We gratefully acknowledge the anonymous referees for careful reading and constructive suggestions that helped us improve the quality of our manuscript significantly. Following the reviewers’ comments, we have thoroughly revised the manuscript and tried our best to address all of their comments. In the following, we provide our responses to the reviewers.

 

REVIEWER #1

Dear Reviewer, thank you very much for such motivational and encouraging comments. The suggestions provided by the reviewer are really helpful in improving the overall quality of the manuscript.

Comment 1: The structure of this paper needs to be further optimized. Section 1 Introduction is too long. More importantly, there is a lack of overall introduction to the proposed method and specific description for key steps of this method.

Authors response: We are very grateful for the Reviewer’s comment and we have made the necessary changes in the manuscript.  

Authors actions: We have updated the introduction section and eliminated some paragraphs. We have also added a paragraph describing clearly the contributions of our paper, the proposed method and its key steps. 

This work covers three main aspects related to defect detection in concrete bridges. First, we constructed a labeled multi-class dataset of defects. Then we established an efficient methodology for defect classification based on DCNNs and Transfer Learning. Finally, we proposed an AI-based approach for defect localization using classification models and image-level annotations.

For this purpose, the authors constructed a dataset featuring three common defects in concrete bridges (i.e., cracks, concrete spalling, and efflorescence) with more than 6900 images. The performance of three classification settings leveraging the well-known Visual Geometry Group (VGG) network with its 16 learning layers (i.e., VGG16 [26]) and Transfer Learning methods was compared and evaluated based on classification metrics. The VGG16 model can capture high-level features [12] and has a good generalization ability to other datasets [31]. Therefore, it is chosen as a base model for our learning approach. Finally, the best classification scheme was employed to implement three gradient-based backpropagation interpretation techniques (i.e., Saliency maps [46], Gradient-weighted Class Activation Mapping (Grad-CAM) [47], and Grad-CAM++ [48]) to localize defects on a sample of test bridge concrete images.
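Of the interpretation techniques mentioned, Grad-CAM has a particularly compact formulation: each feature map of the last convolutional layer is weighted by the global-average-pooled gradient of the class score, and a ReLU keeps only the evidence that supports the class. A minimal NumPy sketch of that computation (shapes and inputs are toy values, not the paper's model):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from last-conv feature maps and class-score gradients.

    feature_maps, gradients: arrays of shape (K, H, W), K channels.
    """
    alpha = gradients.mean(axis=(1, 2))              # global-average-pooled weights, shape (K,)
    cam = np.tensordot(alpha, feature_maps, axes=1)  # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0)                         # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam /= cam.max()                             # normalize to [0, 1] for overlaying
    return cam

# toy example: 4 channels of 7x7 activations and matching gradients
rng = np.random.default_rng(0)
A = rng.random((4, 7, 7))
dY = rng.random((4, 7, 7))
heatmap = grad_cam(A, dY)
```

In practice the heatmap is upsampled to the input image size and overlaid on it; Saliency maps and Grad-CAM++ differ only in how the per-pixel weights are derived from the gradients.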

 

Comment 2: There are some typographical, spelling, and grammatical mistakes in this paper. Please check Lines 22, 56, 228, 274, and so on. For transfer learning that occurs many times, the case of the first letter of each word is inconsistent.

Authors response: we very much appreciate the Reviewer’s comment, and we have made the necessary changes in the manuscript.

Authors actions: We have revised the text of the manuscript and checked it thoroughly. We have fixed the highlighted issues and reviewed the article to correct all typographical errors. Furthermore, the corrections have been incorporated into the revised manuscript, which has undergone English proofreading.

 

Comment 3: Line 395. This paper only compares the results when the last three layers are retrained. Can the parameters of more layers be retrained? How does that affect the performance? Please report this in the paper.

Authors response: We thank the Reviewer for this comment and we added more explanations in the manuscript.

Authors actions: We updated a paragraph in the results and discussions section where we explained that training more layers will potentially lead to overfitting.

The training and validation accuracies in the three learning settings generally increase over time, reaching a plateau around the last three epochs. The loss curves in (b) and (c) show a slight tendency toward over-fitting due to the larger number of parameters to update in their corresponding learning schemes. Thus, training more than two convolutional layers leads to a higher risk of overfitting and consequently decreases the model's generalization ability.

Table 2 lists the best training, validation, and testing accuracies and the RMSE of the three training settings. The results show that the model in scheme (c) performs better in training than the models in schemes (a) and (b): the achieved training accuracy was 98.34% in (c), versus 94.62% and 91.10% in (b) and (a), respectively. The model in setting (c) also yielded a higher testing accuracy (97.13%, versus 94.61% and 91.10% in (b) and (a), respectively), showing that approach (c) gives the model better generalizability, extending the learning from the training subset to unseen test data.
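The accuracy and per-class F1-scores reported for these settings can be derived from a multi-class confusion matrix. The sketch below illustrates the computation; the counts are invented for illustration and are not the paper's results:

```python
# Toy illustration of accuracy and per-class F1 from a 3-class confusion matrix.
# The counts below are made up; they are NOT the paper's results.
classes = ["crack", "efflorescence", "spalling"]
cm = [  # rows = true class, columns = predicted class
    [95, 3, 2],
    [4, 90, 6],
    [1, 5, 94],
]
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total  # 279 / 300 = 0.93

def f1(i):
    tp = cm[i][i]
    fp = sum(cm[r][i] for r in range(3)) - tp  # predicted as i, truly another class
    fn = sum(cm[i]) - tp                       # truly i, predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

for i, name in enumerate(classes):
    print(f"{name}: F1 = {f1(i):.4f}")
```

Per-class F1 is informative here because it penalizes both false positives and false negatives for each defect type, which overall accuracy can hide when classes are imbalanced.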

 

Comment 4: The image quality of Figure 4 needs to be improved. The y coordinate values of all the left subfigures are in percentage, so the range of the y-coordinate values should be 0-100 instead of 0.0-1.0.

Authors response: We appreciate the Reviewer’s comment and we have made the necessary changes in the manuscript.

Authors actions: We have updated the figures of this manuscript and revised Figure 4. The font size of the figures has been adjusted to make them easier to read. Preprocessing has been applied, and the image resolution has been improved. In addition, we have maintained consistency in the positioning of figures.

 

Comment 5: The best way to detect concrete cracks is to use semantic segmentation for pixel level location rather than bounding-box level location. How far is the difference between this method and the method based on semantic segmentation in localization accuracy?

 Authors response: Thank you for your rigorous and valuable suggestion on our manuscript.

Indeed, due to the large number of images tested, there are significant differences between images of different modalities. The recorded variance in the experimental results intuitively shows the fluctuation of the results across the datasets. Therefore, we reported all experimental data in the form of "mean ± standard deviation" to present our results more rigorously.

In addition, in this work we explore the possibility of localizing defects in concrete images using only classification models and image-level annotations. Semantic segmentation requires pixel-level annotations, which the authors have not yet prepared; this will be addressed in future work.

Comment 6: Overall, the literature review should be improved. Without discussing the previous works, the novelty of the paper cannot be assessed properly. It is suggested that the related papers should be cited. For example, DOIs: 10.1007/s12243-019-00731-9; 10.1080/15732479.2018.1550519; 10.1016/j.jksuci.2022.02.004; 10.1061/(ASCE)ST.1943-541X.0002666.

Authors response: We very much appreciate the Reviewer’s suggestion, and we have made the necessary changes in the manuscript.

Authors actions: We have updated the manuscript and discussed the previous works in more detail. We also added a paragraph explaining the limitations of these works compared to our proposed method. We adjusted the content and logic of the introduction and highlighted our motivation and contributions. Some related technologies have been introduced in "Background and Related Work".

The Visual Geometry Group introduced VGG16 [26] in 2014. The algorithm is very efficient and won first place in object localization and second place in image classification in the ImageNet Large Scale Visual Recognition Challenge. This model trained on the ImageNet dataset achieved a top-1 accuracy of 71.5% and a top-5 accuracy of 90.1% in image classification.

The network contains 13 convolutional layers with 3x3 filters (i.e., convolution kernels), five max-pooling layers, three fully connected layers, and a Softmax layer.

On the other hand, the Rectified Linear Unit (ReLU) activation function is used to preserve only the positive values of a filtered image. The model takes 224 x 224 RGB images as inputs and has more than 138 million learning parameters. Figure 1 presents the architecture of the VGG16 model.
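As a sanity check, the "more than 138 million learning parameters" figure can be reproduced directly from the layer configuration just described. This is an illustrative back-of-the-envelope calculation, not code from the paper:

```python
# Parameter count of VGG16 from its architecture: 13 conv layers with 3x3
# kernels, then three fully connected layers ending in a 1000-class head.
conv_channels = [3, 64, 64, 128, 128, 256, 256, 256,
                 512, 512, 512, 512, 512, 512]
conv_params = sum(c_in * c_out * 3 * 3 + c_out        # 3x3 weights + biases
                  for c_in, c_out in zip(conv_channels, conv_channels[1:]))
fc_dims = [512 * 7 * 7, 4096, 4096, 1000]             # 7x7x512 map flattened
fc_params = sum(d_in * d_out + d_out
                for d_in, d_out in zip(fc_dims, fc_dims[1:]))
total = conv_params + fc_params
print(total)  # 138,357,544 learnable parameters
```

Note that the three fully connected layers account for roughly 124 million of the 138 million parameters, which is why fine-tuning strategies focus on which convolutional layers to unfreeze.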

For image classification, VGG16 and other state-of-the-art DCNNs are usually trained on the ImageNet dataset that contains millions of images belonging to thousands of classes. However, since the size of domain-specific datasets (e.g., concrete defects datasets in our case) is limited, Transfer Learning techniques are applied to overcome the scarcity of labeled data.

In a Transfer Learning approach, pre-trained models on large datasets (e.g., ImageNet) are fine-tuned and partially retrained on the small target dataset. In this learning framework, the weights of the lower-level layers are generally maintained since they represent generic features. In contrast, the high-level layers are more sensitive to the target dataset and must be retrained to update their learning parameters [49].
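This partial-retraining scheme can be sketched as a parameter bookkeeping exercise. The lumped counts below follow the standard VGG16 architecture; the choice of which layers to unfreeze is a hypothetical example of a setting that retrains the last two convolutional layers plus the classifier, not necessarily the authors' exact configuration:

```python
# Framework-agnostic sketch: lower-level layers keep their pre-trained
# (frozen) weights; only the last conv layers and the classifier head
# are retrained. Counts follow VGG16; the split is a hypothetical example.
frozen = {
    "conv_blocks_1_to_4": 7_635_264,      # generic low-level features, kept fixed
    "block5_conv1": 2_359_808,
}
trainable = {
    "block5_conv2": 2_359_808,            # high-level, task-specific features
    "block5_conv3": 2_359_808,
    "fully_connected_head": 123_642_856,  # always retrained or replaced
}
n_trainable = sum(trainable.values())
n_total = n_trainable + sum(frozen.values())
print(f"trainable: {n_trainable:,} of {n_total:,}")
```

Because the fully connected head dominates the count, unfreezing two extra convolutional layers adds only about 4.7 million trainable parameters; the main costs of unfreezing deeper into the network are extra training time and, as discussed later in the responses, a higher risk of overfitting on small datasets.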

2.2. DCNNs and Concrete Damage Classification

Due to their robust feature extraction and learning capabilities, DCNNs have been widely investigated in concrete damage classification studies. For example, the superiority of these deep models over traditional Image Processing Techniques has been demonstrated in [49], where the authors compared the performance of the AlexNet network and six standard edge detectors in classifying concrete crack images of the SDNET dataset [20].

Yang et al. [12] developed a low-cost automated inspection approach based on UAVs and deep learning. They constructed the CSSC database and used a fine-tuned VGG16 model to classify cracks and spalling in concrete bridge elements, achieving a mean accuracy of 93.36% on the CSSC dataset. Mundt et al. [19] proposed the CODEBRIM dataset, which features five non-exclusive damage classes in bridges (i.e., crack, spallation, exposed reinforcement bar, efflorescence, and corrosion). In addition, they investigated reinforcement learning approaches to build a DCNN model for the multi-target classification task, and their best meta-learned models yielded a testing accuracy of 72%.

Hüthwohl et al. [50] used a pre-trained inception-V3 network to define a hierarchical multi-classifier for reinforced concrete bridge defects (i.e., cracks, efflorescence, spalling, exposed reinforcement, and rust staining). Experimental results showed that the multi-classifier could assign class labels with an average F1 score of 83.5%. The authors concluded that the observed misclassifications are related to the training dataset where some defect classes are underrepresented compared to others.

Yang et al. [32] proposed an end-to-end Transfer Learning method for crack detection using three knowledge transfer approaches (i.e., sample, model, and parameter knowledge transfer), a fine-tuned VGG16 model, and three crack datasets. Their experiments show that by training 13 convolutional and two fully connected layers of the pre-trained VGG16 model on the three datasets, crack detection was improved, achieving a testing accuracy of 97.07% on the SDNET crack dataset.

Bukhsh et al. [30] investigated cross-domain and in-domain Transfer Learning approaches. They compared the performance of the VGG16, InceptionV3, and ResNet50 [27] models under different Transfer Learning strategies to detect damage in six binary and multi-label concrete damage datasets. Their experiments demonstrated that combined representations of in-domain and cross-domain transfer provide a considerable performance gain, particularly with tiny datasets. For example, the proposed combined learning framework achieved accuracies of 95.5% and 88.5% on the SDNET and CODEBRIM datasets, respectively.

Zhu et al. [51] built a robust classifier to detect four defects, including cracks, pockmarks, spalling, and exposed rebar. They used the pre-trained inceptionV3 model to extract features from input images and a fully connected network to classify defects. The proposed model was trained on 1180 images with arbitrary sizes and resolutions for 374.1s and recorded a testing accuracy of 97.8%. On the other hand, Gao and Mosalam [31] proposed the concept of structural ImageNet and manually labeled 2000 images for four recognition tasks: component type identification (binary), spalling condition check (binary), damage level evaluation (three classes), and damage type determination (four classes). They applied two different strategies of Transfer Learning based on the pre-trained VGG16 model. For the damage type multi-classification task, a 68.8% accuracy with 23% overfitting is obtained by retraining the last two convolutional blocks of the network. 

The scope of the presented works covers binary and multi-defect classification tasks. In the Transfer learning-based studies, the performance of the proposed methods varies according to the size and complexity of the training and testing datasets and the implemented Transfer Learning strategy. For example, most studies retrain more than two or all convolutional layers and update a high number of the network parameters to achieve a higher detection accuracy. However, this approach is computationally expensive, requires more training time, and is also subject to overfitting in heavily parameterized networks and small datasets.

Author Response File: Author Response.pdf

Reviewer 2 Report

I have some comments and recommendations for corrections which should be considered as a contribution to increasing the quality of the manuscript before its publication.

1) Avoid acronyms in the abstract

2) Line 56 – change “A close” with “a close”

3) Line 139 – Each acronym must be spelled out (defined) the first time it is used in the body of the paper. See, for example, "VGG16".

4) Lines 173 – 176. The sentence should be improved…

5) Lines 290 – 291. The sentences should be improved…

6) Lines 299 – 301. The sentences should be improved… (one of the two sentences should be deleted)

7) Line 454. The sentence should be improved…

8) The figures 4 and 5 should be improved.

9) In a measurement the measured value and its uncertainty must always have the same number of digits after the decimal place. Thus, the authors have to provide the same degree of uncertainty (for example see Figure 5… in some cases the value is limited to 2 decimal places, whereas in others the value of the same parameter is limited to 4 decimal places, etc., etc.)

10) The position of some figures should be redefined. For example, the Figure 2 should go to line 308… etc, etc…

11) Improve the manuscript (text and figures) to make it more understandable, if possible

Author Response

We gratefully acknowledge the anonymous referees for careful reading and constructive suggestions that helped us improve the quality of our manuscript significantly. Following the reviewers’ comments, we have thoroughly revised the manuscript and tried our best to address all of their comments. In the following, we provide our responses to the reviewers.

 

REVIEWER #2:

Dear Reviewer, thank you very much for such motivational and encouraging comments. The suggestions provided by the reviewer are really helpful in improving the overall quality of the manuscript.

Comment 1:  Avoid acronyms in the abstract

Authors response: We thank the reviewer for this comment. We agree, and we have made the necessary changes in the manuscript.

Authors actions: Thank you for the suggestion. We have updated the abstract and now avoid using acronyms in it. In the revised version of the paper, we reduced the use of acronyms; those that remain are well defined.

New Abstract:

Conventional bridge visual inspection practices present several limitations, including a tedious process of manually analyzing images to identify potential damage. Vision-based techniques, particularly Deep Convolutional Neural Networks, have been widely investigated to automate the identification, localization, and quantification of defects in bridge images. However, these deep models require massive, comprehensive training datasets with different annotation levels. This paper presents a dataset of more than 6900 images featuring three common concrete bridge defects (i.e., cracks, efflorescence, and spalling). The authors explored three Transfer Learning approaches to fine-tuning the state-of-the-art Visual Geometry Group network to propose the best classification method for these defects. Furthermore, using a hybrid Transfer Learning method, the authors implemented three gradient-based back-propagation interpretation techniques to localize defects on randomly selected images. The visualization results showcase the potential of these techniques to provide helpful localization information leveraging classification models. The proposed approach achieved a high testing accuracy (97.13%), combined with high F1-scores of 97.38%, 95.01%, and 97.35% for cracks, efflorescence, and spalling, respectively.

 

Comment 2:  Line 56 – change “A close” with “a close”

Authors response: Dear reviewer, thanks for highlighting our mistake. We have made the necessary changes in the manuscript.  

Authors actions: We have updated the introduction section, fixed this, and reviewed the article to correct all typographical errors. Furthermore, the corrections have been incorporated into the revised manuscript, which has undergone English proofreading.

Comment 3: Line 139 – Each acronym must be spelled out (defined) the first time it is used in the body of the paper. See, for example, "VGG16".

Authors response: We very much appreciate the Reviewer’s comment.

Authors actions: We have made the necessary changes in the manuscript. In the revised version of the paper, we reduced the use of acronyms; those that remain are well defined.

Dear Reviewer, VGG stands for Visual Geometry Group.

The Visual Geometry Group introduced VGG16 [26] in 2014. The algorithm is very efficient and won first place in object localization and second place in image classification in the ImageNet Large Scale Visual Recognition Challenge. This model trained on the ImageNet dataset achieved a top-1 accuracy of 71.5% and a top-5 accuracy of 90.1% in image classification.

 

Comment 4:   Lines 173 – 176. The sentence should be improved

Authors response: We are grateful for the Reviewer’s suggestion.

Authors actions: the changes have been incorporated into the revised manuscript. Kindly refer to the Section 2.

Comment 5: Lines 290 – 291. The sentences should be improved…

Authors response and actions: the changes have been incorporated into the revised manuscript. Kindly refer to the Section 3.


Comment 6: Lines 299 – 301. The sentences should be improved… (one of the two sentences should be deleted)

Authors response and actions: We thank the Reviewer for this suggestion; the sentences have been improved. The English has been improved, and all corrections have been incorporated.


Comment 7: Line 454. The sentence should be improved…

Authors response and actions: We are grateful for the Reviewer’s suggestion and the sentence has been improved.

Authors actions: Thanks for the valuable suggestion; however, processing large image datasets with m-fold cross-validation is computationally costly, so we split the dataset with a single random split.

We randomly selected 12 images for our testing set. The images with defects were retrieved from the CODEBRIM dataset and then were tested to visualize the implementation results. Figure 7 shows sample examples of the obtained results.

 

Comment 8: The figures 4 and 5 should be improved.

Authors response and actions: We thank the Reviewer for this suggestion, and we have updated all the figures of the manuscript.


Comment 9: In a measurement the measured value and its uncertainty must always have the same number of digits after the decimal place. Thus, the authors have to provide the same degree of uncertainty (for example see Figure 5… in some cases the value is limited to 2 decimal places, whereas in others the value of the same parameter is limited to 4 decimal places, etc., etc.)

Authors response and actions:  We thank the Reviewer for this suggestion, and we have made the necessary changes in the manuscript.

Comment 10: The position of some figures should be redefined. For example, the Figure 2 should go to line 308… etc.

Authors response and actions: Dear Reviewer, thanks for the suggestion; it has been incorporated in the revised manuscript. We moved Figure 2 to line 307.

 

Comment 11: Improve the manuscript (text and figures) to make it more understandable, if possible

Authors response:  Dear reviewers, thank you for the valuable suggestion.

 

Authors actions:  We have updated the text and figures of the paper. A complete proofreading has been made for the entire manuscript and each suggestion made by the Reviewer has been fixed accordingly.

Author Response File: Author Response.pdf

Reviewer 3 Report

This study investigated three transfer learning approaches using a dataset of more than 6900 images of concrete damage such as cracks, efflorescence, and spalling. Although the study presented in this manuscript should draw the interest of the bridge engineering and research communities in the field of health monitoring, some concerns must be addressed.

 

1- The reviewer believes that the Introduction section needs to be improved. A few statements and paragraphs at the start of the Introduction are open statements that are not referenced. For example, Line 59 needs to be referenced based on the techniques used. Some references are provided below:

Sensors, and non-destructive testing: https://doi.org/10.3390/rs13122291

Visual inspection: https://doi.org/10.3390/rs14051148

Please also check the references for the statement presented in Lines 69-70. Moreover, please reference your statement presented in Lines 74 to 77. Moreover, please consider the references for statements presented in Lines 88 to 90. Another updated reference is provided as https://doi.org/10.1177/14759217211053546. Please discuss this in your manuscript.

 

2- In contrast to the assertion stated in Lines 83–84, the reviewer believes that new technologies such as UAV photogrammetry should be covered in the Introduction section that can reduce the time of inspection and capturing images. However, data processing with such large amounts of image data is always a challenge that your solution can alleviate.

 

3- VGG16 model needs to be discussed in more detail. Please clarify the abbreviation in the Introduction, its benefit over other approaches, and your advantage over references [26] and [32]. This abbreviation is explained in Section 2, which will confuse the reader, and therefore needs a brief discussion in the Introduction.

 

4- Reviewer believes that the dataset needs to be discussed in more detail. Please specify the distance that images taken from the objected bridge or other specifications of device or light or etc.

 

5- Authors need to clarify their statement that “some background images contain concrete joints that are likely to be misclassified as cracks.” and clarify the method used to overcome this challenge.

 

 

6- The reviewer believes that statements such as "We computed the precision" need to be in passive voice. Please double-check the manuscript in this regard.

Author Response

We gratefully acknowledge the anonymous referees for careful reading and constructive suggestions that helped us improve the quality of our manuscript significantly. Following the reviewers’ comments, we have thoroughly revised the manuscript and tried our best to address all of their comments. In the following, we provide our responses to the reviewers.

 

REVIEWER #3:

Thanks a lot for such motivational and encouraging comments. The suggestions provided by the reviewer are really helpful in improving the overall quality of the manuscript.

Comment 1:  The reviewer believes that the Introduction section needs to be improved. A few statements and paragraphs at the start of the Introduction are open statements that are not referenced. For example, Line 59, need to be referenced based on techniques used. Some references are provided below: Sensors, and non-destructive testing: https://doi.org/10.3390/rs13122291 Visual inspection: https://doi.org/10.3390/rs14051148

Please also check the references for the statement presented in Lines 69-70. Moreover, please reference your statement presented in Lines 74 to 77. Moreover, please consider the references for statements presented in Lines 88 to 90. Another updated reference is provided as https://doi.org/10.1177/14759217211053546. Please discuss this in your manuscript.

Authors response: We thank the Reviewer for this suggestion, and we have made the necessary changes in the manuscript.

Authors actions: We have updated the introduction section and added the necessary references. In addition, an English expert has significantly improved the revised manuscript's writing. Inappropriate wording and sentences have been rephrased, as suggested.

 

Comment 2:  In contrast to the assertion stated in Lines 83–84, the reviewer believes that new technologies such as UAV photogrammetry should be covered in the Introduction section that can reduce the time of inspection and capturing images. However, data processing with such large amounts of image data is always a challenge that your solution can alleviate.

Authors response: We appreciate the Reviewer’s comment and we have made the necessary changes in the manuscript.

Authors actions: We have added a paragraph in the introduction section covering UAVs in the context of the work presented in the paper.

In recent years, technological advances in civil engineering and related disciplines have promoted the emergence of innovative tools to manage civil infrastructures. Within this context, bridge owners and managers have shown increasing interest in Unmanned Aerial Vehicles (UAVs) as an assistive, efficient, and cost-effective means offering great potential for inspection automation [2], [44]. However, one of the major challenges of this inspection scheme lies in deploying an efficient method to process the large amount of image data collected by the UAV sensors to identify and locate possible damage in the acquired images. To this end, several vision-based techniques have been extensively explored to automate defect detection in different civil engineering structures. These methods include traditional Image Processing Techniques (IPTs) [45], Machine Learning algorithms [57], and Deep Convolutional Neural Networks (DCNNs) [58].


Comment 3: VGG16 model needs to be discussed in more detail. Please clarify the abbreviation in the Introduction, its benefit over other approaches, and your advantage over references [26] and [32]. This abbreviation is explained in Section 2, which will confuse the reader, and therefore needs a brief discussion in the Introduction.

Authors response: We very much appreciate the Reviewer’s suggestion, and we have made the necessary changes in the manuscript.

Authors actions: We have updated the introduction section and clarified the abbreviation. We have also explained the benefit of using the VGG16 model.

This work covers three main aspects related to defect detection in concrete bridges. First, we constructed a labeled multi-class dataset of defects. Then we established an efficient methodology for defect classification based on DCNNs and Transfer Learning. Finally, we proposed an AI-based approach for defect localization using classification models and image-level annotations.

For this purpose, the authors constructed a dataset featuring three common defects in concrete bridges (i.e., cracks, concrete spalling, and efflorescence) with more than 6900 images. The performance of three classification settings leveraging the well-known Visual Geometry Group (VGG) network with its 16 learning layers (i.e., VGG16 [26]) and Transfer Learning methods was compared and evaluated based on classification metrics. The VGG16 model can capture high-level features [12] and has a good generalization ability to other datasets [31]. Therefore, it is chosen as a base model for our learning approach. Finally, the best classification scheme was employed to implement three gradient-based backpropagation interpretation techniques (i.e., Saliency maps [46], Gradient-weighted Class Activation Mapping (Grad-CAM) [47], and Grad-CAM++ [48]) to localize defects on a sample of test bridge concrete images.

 

Furthermore, our approach's advantage lies in the reduced number of retrained convolution layers, which reduces the computational time while achieving high accuracy in the multi-classification task. This advantage is highlighted in the "DCNNs and concrete damage classification" subsection.

Zhu et al. [51] built a robust classifier to detect four defects, including cracks, pockmarks, spalling, and exposed rebar. They used the pre-trained InceptionV3 model to extract features from input images and a fully connected network to classify defects. The proposed model was trained on 1180 images with arbitrary sizes and resolutions for 374.1 s and recorded a testing accuracy of 97.8%. On the other hand, Gao and Mosalam [31] proposed the concept of structural ImageNet and manually labeled 2000 images for four recognition tasks: component type identification (binary), spalling condition check (binary), damage level evaluation (three classes), and damage type determination (four classes). They applied two different strategies of Transfer Learning based on the pre-trained VGG16 model. For the damage type multi-classification task, a 68.8% accuracy with 23% overfitting was obtained by retraining the last two convolutional blocks of the network.

The scope of the presented works covers binary and multi-defect classification tasks. In the Transfer Learning-based studies, the performance of the proposed methods varies according to the size and complexity of the training and testing datasets and the implemented Transfer Learning strategy. For example, most studies retrain more than two, or even all, convolutional layers and update a large number of the network parameters to achieve higher detection accuracy. However, this approach is computationally expensive, requires more training time, and is prone to overfitting with heavily parameterized networks and small datasets.
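To illustrate why retraining fewer convolutional blocks reduces the computational load, the trainable-parameter count of VGG16's convolutional stack can be sketched from its well-known layer configuration. This is a simplified back-of-the-envelope estimate, not the authors' code; the dense classifier head (which is also retrained in practice) is omitted for brevity.

```python
# Rough sketch: trainable parameters in VGG16's convolutional blocks
# when only the last block is fine-tuned vs. retraining all of them.
# Channel widths follow the standard VGG16 configuration.

VGG16_BLOCKS = [  # each block: list of (in_channels, out_channels) per 3x3 conv
    [(3, 64), (64, 64)],
    [(64, 128), (128, 128)],
    [(128, 256), (256, 256), (256, 256)],
    [(256, 512), (512, 512), (512, 512)],
    [(512, 512), (512, 512), (512, 512)],
]

def conv_params(c_in, c_out, k=3):
    """3x3 convolution: k*k*c_in weights per output channel, plus one bias."""
    return k * k * c_in * c_out + c_out

def trainable_params(retrain_last_n_blocks):
    """Parameters updated when only the last n convolutional blocks are retrained."""
    retrained = VGG16_BLOCKS[len(VGG16_BLOCKS) - retrain_last_n_blocks:]
    return sum(conv_params(ci, co) for block in retrained for ci, co in block)

print(f"all conv blocks retrained: {trainable_params(5):,} parameters")
print(f"last block only:           {trainable_params(1):,} parameters")
```

Retraining only the final block updates roughly half of the ~14.7M convolutional parameters, which is consistent with the reduced training time discussed above.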

 

 

 

 

Comment 4: The Reviewer believes that the dataset needs to be discussed in more detail. Please specify the distance at which the images were taken from the inspected bridge, as well as other specifications such as the device, lighting, etc.

Authors response: We are grateful for the Reviewer’s suggestion and have made the necessary changes in the manuscript.

Authors actions: Dear Reviewer, as discussed in the literature section, most existing methods are trained and tested on limited datasets. We combined different online datasets and trained the deep learning model on them to extend this work. We have also updated the paragraph explaining the details of the image acquisition conditions.

 

More than 1200 images of Moroccan bridges representing decks and piers with concrete spalling and efflorescence were collected and processed according to the same experimental setup and procedure as in [52]. The images were captured using two 20-MP consumer digital cameras with a focal length of 5 mm, a sensitivity of 100 ISO, and a maximum resolution of 5152 × 3864 pixels. They were gathered at varying distances from the bridges, depending on the accessibility of the bridge elements of interest, and a maximum 8× optical zoom was applied. Moreover, the images were taken under different weather and lighting conditions, and a flash was used to illuminate dark bridge areas containing defects. It is noteworthy that the original images did not undergo any processing operations other than manual cropping using the inbac tool [53].

This processing chain is based on the repeated use of robust estimates of parametric models to extend an initial reconstruction. These repeated estimates require the definition of thresholds that irrevocably decide which data are validated or rejected as images are added.

The dataset in [52] is expanded with the concrete spalling and efflorescence classes, and the resulting dataset is publicly available at [54] for academic purposes.

 

Comment 5: Authors need to clarify their statement that “some background images contain concrete joints that are likely to be misclassified as cracks.” and clarify the method used to overcome this challenge.

Authors response: We are grateful for the Reviewer’s comment and we have made the necessary changes in the manuscript.

Authors actions: We added an explanation of this statement to the manuscript. The observed confusion is low (3%, including the confusion between background and efflorescence) but remains a limitation of this work; it can be reduced by adding more examples to the training dataset.

The observed confusion is mainly related to the complexity of the concrete surface in terms of colors and textures. In addition, some surface alterations in the training dataset (e.g., stains, markings, and minor defects such as scaling and segregation) make feature learning more challenging. For example, some background images contain concrete joints, which appear as straight lines on the concrete surface and are therefore likely to be misclassified as cracks. Generally, this confusion can be further reduced by adding more labeled samples to the training dataset. Figure 6 illustrates some misclassification examples; these classification errors correspond to the learning setting used in (c).

 

 

Comment 6: Reviewer believes that the statements such as “We computed the precision”, need to be in a passive format. Please double check the manuscript in this regard.

Authors response: We thank the Reviewer for this comment, and we have made the necessary changes in the manuscript.

Authors actions: We have revised and thoroughly checked the text of the manuscript. To make the expression of the article more accurate and logical, we asked an English-language expert to revise the paper.

Author Response File: Author Response.pdf

Reviewer 4 Report

The article is well written and the problem that the authors are investigating is relevant. In this manuscript, pre-trained VGG16 models with three transfer learning approaches are proposed to overcome the challenge of limited samples. The best-proposed approach achieved high accuracy in concrete bridge defects identification and a weakly supervised framework is investigated to localize the defect regions.

1.The authors may want to expand their review of the literature on the application of weakly supervised semantic segmentation.

2.In section 3.1, the authors may want to explain the value of some parameters, like why the SGD optimizer is chosen and the epoch is set as 25 in the DCNN training?

3.The format(indent) of references in this paper is inconsistent. Please revise it.

Author Response

Dear reviewer, thank you very much for your valuable comments. The suggestions provided are very helpful in improving the overall quality of the manuscript.

Comment 1: The authors may want to expand their review of the literature on the application of weakly supervised semantic segmentation.

Authors’ response: We very much appreciate the Reviewer’s comment and we have made the necessary changes in the manuscript.

Authors’ actions: Dear reviewer, we are very grateful for your suggestion. We added a paragraph in the introduction section to discuss studies related to weakly supervised semantic segmentation in the context of damage detection. 

To alleviate the heavy workload associated with data annotation in a fully supervised learning framework, weakly supervised segmentation methods consider different weak annotations (e.g., image-level and bounding box labels) as the supervision condition [46]. Within the context of damage detection, Dong et al. [47] designed a patch-based weakly supervised semantic segmentation network to detect cracks in construction materials. In their proposed method, an input image is cropped, and the resulting patches are annotated at the image level. Class activation maps of cracks are obtained for each patch and fed to a fully connected conditional random field to generate the corresponding synthetic labels, which are used to train a segmentation network.

König et al. [48] presented a weakly supervised segmentation approach leveraging classification labels to detect surface cracks. To obtain pixel-level segmentation pseudo labels, the authors combined a patch-threshold segmentation with coarse localization maps generated by a Convolutional Neural Network trained on images with classification annotations. The generated pseudo labels are then used to train a standard semantic segmentation network to perform crack segmentation.

Zhu and Song [49] developed a weakly supervised network for crack segmentation in asphalt concrete bridge decks. Their method uses an autoencoder on the original data to generate a weakly supervised starting point for convergence; image feature extraction and segmentation are then performed under weak supervision.

This paper investigates a weakly supervised framework based on interpretation techniques and leveraging image-level annotations to generate pixel-level maps. The goal is to provide a coarse localization of three distinct types of damage in concrete bridge images.
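The interpretation techniques referenced above combine convolutional feature maps with gradient-derived weights. A minimal numpy sketch of the Grad-CAM core computation is given below; the random arrays merely stand in for the activations and gradients that a real classifier would produce, so this is an illustration of the technique, not the authors' implementation.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM core step: weight each feature map by its globally
    averaged gradient, sum over channels, apply ReLU, and normalize
    the result to [0, 1].

    activations, gradients: arrays of shape (K, H, W) taken from the
    last convolutional layer with respect to the target class score.
    """
    weights = gradients.mean(axis=(1, 2))           # alpha_k, shape (K,)
    cam = (weights[:, None, None] * activations).sum(axis=0)
    cam = np.maximum(cam, 0.0)                      # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                       # normalize to [0, 1]
    return cam

# Stand-ins for a real network's layer outputs and backpropagated gradients.
rng = np.random.default_rng(0)
acts = rng.random((8, 14, 14))
grads = rng.standard_normal((8, 14, 14))
heatmap = grad_cam(acts, grads)                     # shape (14, 14), values in [0, 1]
```

Grad-CAM++ differs only in how the per-channel weights are computed (pixel-wise weighting of the gradients instead of a plain global average).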

Comment 2: In section 3.1, the authors may want to explain the value of some parameters, like why the SGD optimizer is chosen and the epoch is set as 25 in the DCNN training?

Authors’ response: We thank the reviewer for this comment. We have made the necessary changes in the manuscript.

Authors’ actions: Thank you for the suggestion.  We have updated the paragraph explaining the training setting followed in our work.

Updated paragraph:

The optimization method recommended in [32], based on Stochastic Gradient Descent (SGD) with momentum and a small learning rate, is used in this work's experiments. The SGD-with-momentum optimizer reduces the computational load and accelerates training convergence. The training was conducted for 25 epochs, at which point the results had converged. In addition, low training and validation errors were achieved while mitigating overfitting. The cross-entropy loss function was optimized using SGD with a learning rate of 0.001, a momentum of 0.9, and a mini-batch size of 32. All the experiments were carried out using PyTorch in Google Colaboratory (Colab) with the 12 GB NVIDIA Tesla K80 GPU provided by the platform.
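The SGD-with-momentum update described above follows the standard rule. A minimal numpy sketch with the paper's stated hyperparameters (learning rate 0.001, momentum 0.9) is shown below; the toy quadratic loss is purely illustrative and is not the cross-entropy objective used in the work.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update (the form used by torch.optim.SGD
    with zero dampening): v <- momentum*v + grad;  w <- w - lr*v."""
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity

# Toy example: minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = sgd_momentum_step(w, grad=w, velocity=v)
# w shrinks toward the minimizer at the origin.
```

The momentum term accumulates past gradients, which smooths the updates and speeds up convergence along consistent descent directions, matching the motivation given in the paragraph above.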

Comment 3: The format(indent) of references in this paper is inconsistent. Please revise it

Authors’ response: We thank the reviewer for this comment. We have made the necessary changes in the manuscript.

Authors’ actions: Thank you for your comment. The format of the references has been updated according to the MDPI predefined format.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

In the revised version, the author's modification is not satisfactory. At present, the writing and innovation of this manuscript cannot meet the average level published in the journal Remote Sensing.

Therefore, I strongly recommend the authors to follow the previous comments 5 and 6, and to cite the related references. Please describe the advantages/disadvantages and performance comparison between the detection-based method of this paper and other semantic segmentation-based methods.

Author Response

REVIEWER #1

Dear reviewer, thank you very much for your comment. The suggestion provided is very helpful in improving the overall quality of the manuscript.

Comment 1: In the revised version, the author's modification is not satisfactory. At present, the writing and innovation of this manuscript cannot meet the average level published in the journal Remote Sensing.

Therefore, I strongly recommend the authors to follow the previous comments 5 and 6, and to cite the related references. Please describe the advantages/disadvantages and performance comparison between the detection-based method of this paper and other semantic segmentation-based methods.

   Authors response: We very much appreciate the Reviewer’s comment, and we have made the necessary changes to the manuscript. 

 

Authors actions: Dear Reviewer, we are very grateful for your suggestion. We have updated the introduction section and discussed previous works related to semantic segmentation. We have also explained that our proposed method is based on weakly supervised semantic segmentation using image-level annotations.  

 

In a bridge condition assessment framework, defect localization is crucial to evaluating the impact of damage on the bridge's structural integrity. For this purpose, deep learning-based semantic segmentation algorithms have been deployed to provide pixel-level classification results and improve damage detection accuracy.

Zhang et al. [42] designed a fully convolutional model to detect and group image pixels for three types of concrete surface defects (i.e., crack, spalling, and exposed rebar). The authors prepared a dataset with mask labeling of 1443 images to train and test the model. Their proposed method achieved a semantic segmentation accuracy of 75.5%.  

Fu et al. [43] introduced a crack detection method based on an improved DeepLabv3+ semantic segmentation algorithm. They established a concrete bridge crack segmentation dataset to train and test the proposed model. The experimental results proved the effectiveness of the trained algorithm that reached an average intersection over union ratio of 82.37%. 

Wang et al. [44] constructed a crack dataset of 2446 manually labeled images to train and evaluate the performance of five deep networks for semantic segmentation. The best model achieved an F1 score of 77.32% and an intersection over union ratio of 62.98%. The authors also discussed the influence of dataset choice and image noise on the detection performance.

Dung and Anh [45] developed a fully convolutional network-based method and annotated 600 crack-labeled images for semantic segmentation. The proposed model reached approximately 90% for the average precision score. The authors demonstrated their method's effectiveness by accurately identifying and capturing crack path and density variation in a crack opening video. 

The above studies have shown very promising results in detecting damage. However, fully supervised semantic segmentation networks are complex and face a common major challenge: data scarcity. These models require training images with pixel-level annotations, which are expensive to produce and necessitate the empirical knowledge of field experts. Furthermore, most publicly available concrete damage datasets provide only image-level annotations. Within this context, a weakly supervised framework leveraging image-level annotations is investigated in this paper to generate pixel-level maps and provide a coarse localization of concrete damage.

We further discussed the qualitative results of our proposed method in the Results and Discussions section.

As intended, the resulting heatmaps highlight the discriminative image regions that contributed to image classification. These heatmaps show the probability of the target class at each pixel. An analysis of the qualitative results in Figure 8 shows that the active regions are primarily consistent with the defect areas. Grad-CAM++ provided better visualization results than Grad-CAM for the crack and efflorescence examples.

The pixel-level maps generated after applying a threshold of 0.5 provide a coarse localization of the concrete defects and offer semantically meaningful discrimination at the pixel level between defects and background. Therefore, it is believed that in the context of weakly supervised semantic segmentation, interpretation methods can provide relevant pixel-level maps by using only image annotations as the supervision condition. The proposed method has reasonably captured a coarse localization of defects while avoiding the annotation workload of fully supervised semantic segmentation frameworks.
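The thresholding step described above reduces to a simple comparison on the normalized heatmap. A minimal numpy sketch follows; the 0.5 threshold comes from the text, while the tiny example heatmap is illustrative only.

```python
import numpy as np

def heatmap_to_mask(heatmap, threshold=0.5):
    """Turn a normalized class-activation heatmap (values in [0, 1])
    into a binary defect/background mask: 1 = defect, 0 = background."""
    return (heatmap >= threshold).astype(np.uint8)

# Illustrative 2x2 heatmap (a real one would match the image resolution
# after upsampling).
heatmap = np.array([[0.9, 0.4],
                    [0.6, 0.1]])
mask = heatmap_to_mask(heatmap)
print(mask)  # [[1 0]
             #  [1 0]]
```

Because the mask inherits any errors in the heatmap, the localization remains coarse, which is consistent with the limitations discussed below.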

However, since the visualization results of these interpretation techniques depend on the feature space learned by the classifier, some highlighted areas do not represent the target classes in the test images, and other regions representing damage were not captured.

As a result, it would be challenging to localize and quantify the damage precisely (e.g., crack path and density). This can be attributed to the underlying complexity of the training dataset, its limited size, and the limited learning capabilities of the pre-trained network due to the difference between the source domain (ImageNet dataset) and the target domain (the proposed concrete damage dataset). Thus, to further examine the potential of interpretation techniques in weakly supervised semantic segmentation, more customized networks tailored to the damage classification task and trained on more comprehensive datasets should be explored.

 

 

Furthermore, additional simulations and experiments have been performed.

We have also improved the design and text of the manuscript. Kindly refer to the revised manuscript.  

 

Author Response File: Author Response.pdf

Reviewer 3 Report


1- The Abstract, Line 22, need to be revised. “The authors” need to be revised as “The paper”. Please check for all the abstract.

 

2- The arrangement of references is not in order, please check them. Moreover, the updated reference, https://doi.org/10.1177/14759217211053546 was not discussed regarding the crack detection of concrete components. please discuss it in Line 64 to improve the introduction.

 

3- The reviewer believes that the manuscript needs to be in a passive format. Please double-check. The statement presented in Line 100 “First, we constructed …”, “Then we established…” and “ we explained”,.. etc.

 

4- The statement in Line 109 “has a good generalization ability 109 to other datasets” needs to be revised. What do you mean by good? The reviewer believes that Line 99 to Line 114, discussing the aims of the research study needs to be majorly revised.

 

5- The Section title “Background and Related Work” confuse the readers as the background better to be discussed in the Introduction section, so please change this title.

 

 

6- The reviewer believes that the conclusion section needs to contain only the written results of the research, not graphical results. Please add another section named “Discussion” for showing your evaluations and graphical results.

7- In overall, extensive revision in both English and design is recommended. Please double-check the manuscript with a professional native English speaker in this field. In this regard, I would like to ask Editor for a longer revision time to make everything up.

Author Response

REVIEWER #3:

Dear reviewer, thank you very much for the encouraging comments. The suggestions provided are very helpful in improving the overall quality of the manuscript.

Comment 1- The Abstract, Line 22, need to be revised. “The authors” need to be revised as “The paper”. Please check for all the abstract.

Authors response: We thank the reviewer for this comment. We agree, and we have made the necessary changes in the manuscript.

Authors actions: Thank you for the suggestion.  We have updated the abstract.

New Abstract:

Conventional practices of bridge visual inspection present several limitations, including a tedious process of analyzing images manually to identify potential damage. Vision-based techniques, particularly Deep Convolutional Neural Networks, have been widely investigated to automatically identify, localize, and quantify defects in bridge images. However, massive datasets with different annotation levels are required to train these deep models. This paper presents a dataset of more than 6900 images featuring three common defects of concrete bridges (i.e., cracks, efflorescence, and spalling). To overcome the challenge of limited training samples, three Transfer Learning approaches to fine-tuning the state-of-the-art Visual Geometry Group network were studied and compared to classify the three defects. The best-proposed approach achieved a high testing accuracy (97.13%), combined with high F1-scores of 97.38%, 95.01%, and 97.35% for cracks, efflorescence, and spalling, respectively. Furthermore, the effectiveness of interpretable networks was explored in the context of weakly supervised semantic segmentation using image-level annotations. Two gradient-based backpropagation interpretation techniques were used to generate pixel-level heatmaps and localize defects in test images. Qualitative results showcase the potential use of interpretation maps to provide relevant information on defect localization in a weak supervision framework.

 

Comment 2- The arrangement of references is not in order, please check them. Moreover, the updated reference, https://doi.org/10.1177/14759217211053546 was not discussed regarding the crack detection of concrete components. please discuss it in Line 64 to improve the introduction.

Authors response: We thank the reviewer for this comment. We have made the necessary changes in the manuscript.

Authors actions: Thank you for the suggestion. We have checked the order of references and we have discussed the suggested reference in the introduction section.

On the other hand, DCNNs extract features from a set of training images through the convolution operation and classify them within one learning framework. Owing to their robust feature extraction and learning capabilities, DCNNs have been widely examined in concrete damage classification studies.

For example, Dorafshan et al. [23] demonstrated the superiority of the AlexNet network [24] over six standard edge detectors in classifying concrete crack images of the SDNET dataset [25].  

 

Kim et al. [26] trained and optimized the LeNet-5 network [27] to detect cracks in concrete surfaces using a dataset of 40,000 images. The proposed model achieved an accuracy of 99.8% and can be implemented on low-power computational devices.

Yu et al. [28] developed a method based on DCNNs to detect cracks in image patches of damaged concrete. The authors proposed an architecture consisting of six convolutional layers, two pooling layers, and three fully connected layers and employed the enhanced chicken swarm algorithm to optimize the meta-parameters of the DCNN model. 

Mundt et al. [29] proposed the CODEBRIM dataset that features five non-exclusive damage classes in bridges (i.e., crack, spallation, exposed reinforcement bar, efflorescence, and corrosion). In addition, they investigated reinforcement learning approaches to build a DCNN model for the multi-target classification task, and their best meta-learned models yielded a testing accuracy of 72%.

Comment 3 : The reviewer believes that the manuscript needs to be in a passive format. Please double-check. The statement presented in Line 100 “First, we constructed …”, “Then we established…” and “ we explained”,.. etc.

Authors response and actions: We thank the reviewer for this comment. We agree with your suggestion, as passive voice is often preferred in scientific research papers. We have therefore incorporated the necessary changes and updated the manuscript (the whole manuscript has been converted to passive voice).

Comment 4 : The statement in Line 109 “has a good generalization ability 109 to other datasets” needs to be revised. What do you mean by good? The reviewer believes that Line 99 to Line 114, discussing the aims of the research study needs to be majorly revised.

Authors response: We thank the reviewer for this comment. We have made the necessary changes in the manuscript.

Authors actions: Thank you for the suggestion. We have revised the statement and we have also incorporated it in the Experimental setup section.

The VGG16 model can capture high-level features [21] and has the ability to generalize to other datasets [32]. Moreover, it showed an excellent performance in many studies on damage classification in concrete surfaces [31]-[33]. Therefore, it is chosen as a base model for the learning approach proposed in this paper.

 

We have also revised the paragraph discussing the aims of the research in the introduction section

The main contributions of this work are the following: 

  • A multi-class labeled dataset with more than 6900 images was constructed. The dataset features three common types of defects in concrete bridges (i.e., cracks, spalling, and efflorescence) and covers their diverse representation in the real world of bridge inspection.
  • Three classification schemes using the pretrained Visual Geometry Group (VGG) network with its 16 learning layers (i.e., VGG16 [34]), Transfer Learning, and the proposed dataset were compared. The experiments investigated the effect of the number of layers to be retrained on the model’s performance in terms of classification measures (i.e., accuracy, precision, recall, and F1 score), computational time, and generalization ability.
  • Based on the best classification scheme, the effectiveness of interpretable neural networks was explored in the context of weakly supervised semantic segmentation (i.e., image-level supervision). Two gradient-based backpropagation interpretation techniques (i.e., Gradient-weighted Class Activation Mapping (Grad-CAM)[46] and Grad-CAM++[47]) were used to generate pixel-level heatmaps and localize defects. Qualitative results of test images showcase the potential of interpretation heatmaps to provide localization information in a weak supervision framework.
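The classification measures named in the second contribution follow their standard definitions. A minimal pure-Python sketch is given below; the confusion counts are illustrative only and are not the paper's reported figures.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion counts:
    precision = TP/(TP+FP), recall = TP/(TP+FN),
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts for one defect class (not the paper's results).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Reporting F1 per class alongside overall accuracy, as done in the experiments, guards against the metrics being dominated by the largest class.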

 

Furthermore, additional simulations and experiments have been performed.

 

We have also improved the design and text of the manuscript. Kindly refer to the revised manuscript. 

Comment 5: The Section title “Background and Related Work” confuse the readers as the background better to be discussed in the Introduction section, so please change this title.

 Authors response: We thank the reviewer for this comment. We have made the necessary changes in the manuscript.

Authors actions: Thank you for the suggestion. We have updated the design of the manuscript. Background and related work are discussed in the introduction section. Kindly refer to the revised manuscript.

Comment 6: The reviewer believes that the conclusion section needs to contain only the written results of the research, not graphical results. Please add another section named “Discussion” for showing your evaluations and graphical results.

 Authors response: We thank the reviewer for this comment. We have made the necessary changes in the manuscript.

Authors actions: Thank you for the suggestion. We have updated the design of the manuscript. In the revised manuscript, the conclusion section contains only written results and all graphical results are presented in the “Results and Discussions” section.

Comment 7- In overall, extensive revision in both English and design is recommended. Please double-check the manuscript with a professional native English speaker in this field. In this regard, I would like to ask Editor for a longer revision time to make everything up.

Authors response: We very much appreciate the reviewer’s comment. We have made the necessary changes in the manuscript.

Authors actions: Dear reviewer, we are grateful for your suggestion. We have updated the design of the manuscript. The revised manuscript contains five sections: Introduction, Methodology and materials, Experimental setup, Results and Discussions, and Conclusions and Perspectives. In addition, the manuscript has been revised by an English expert in the field.

 

 

Author Response File: Author Response.pdf
