*Article* **A Comparative Study on Crack Detection in Concrete Walls Using Transfer Learning Techniques**

**Remya Elizabeth Philip <sup>1</sup>, A. Diana Andrushia <sup>1,</sup>\*, Anand Nammalvar <sup>2</sup>, Beulah Gnana Ananthi Gurupatham <sup>3</sup> and Krishanu Roy <sup>4,</sup>\***


**Abstract:** Structural cracks have serious repercussions on the safety, adaptability, and longevity of structures. Therefore, assessing cracks is an important part of evaluating the quality of concrete construction. As numerous cutting-edge automated inspection systems that detect cracks have been developed, the need for personal onsite inspection has reduced considerably. However, these methods need to be improved in terms of cost efficiency and accuracy. Deep-learning-based assessment approaches for structural systems have seen significant development within the structural health monitoring (SHM) community. Convolutional neural networks (CNNs) are central to these deep learning methods and hold promise for precise and accurate condition evaluation. Moreover, transfer learning enables users to use CNNs without needing a comprehensive grasp of algorithms or the capability to modify pre-trained networks for particular purposes. In this study, a thorough analysis of well-known pre-trained networks for classifying cracks in concrete buildings is conducted. The classification performance of convolutional neural network designs such as VGG16, VGG19, ResNet50, MobileNet, and Xception is compared on a concrete crack image dataset. The ResNet50-based classifier provided accuracy scores of 99.91% for training and 99.88% for testing. The Xception architecture delivered the lowest performance, with training and test accuracies of 99.64% and 98.82%, respectively.

**Keywords:** transfer learning; crack detection; concrete wall; convolutional neural network; structural health monitoring

**Citation:** Philip, R.E.; Andrushia, A.D.; Nammalvar, A.; Gurupatham, B.G.A.; Roy, K. A Comparative Study on Crack Detection in Concrete Walls Using Transfer Learning Techniques. *J. Compos. Sci.* **2023**, *7*, 169. https://doi.org/10.3390/jcs7040169

Academic Editor: Francesco Tornabene

Received: 4 March 2023 Revised: 7 April 2023 Accepted: 13 April 2023 Published: 18 April 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Many buildings have reached their design life expectancy; therefore, it is critical to safeguard the facilities/amenities through routine maintenance. Extensive research has been conducted in order to improve the performance of concrete and thus the health of structures [1,2]. The concrete's structural integrity is severely affected by the development of cracks, increasing the risk of failure or collapse in buildings and structures. Crack inspection is an important but tedious maintenance task for buildings and other infrastructures. Cracks reduce the load-bearing capacity of the structural elements and accelerate the damage level. Cracks in concrete adversely affect durability by reducing the lifespan of the buildings. Cracks can create distress for the occupants and impair the building's appearance.

When the crack inspection is executed manually, the work becomes time-consuming and labor-intensive, and it requires skilled personnel. Crack monitoring and digital image processing are viable alternatives for visual inspections [3]. Although digital image processing-based technologies have largely been successful, "false positive" results occur occasionally. Therefore, a comprehensive automated crack monitoring system must be able to identify cracks from surface images that contain natural cracks or non-cracks that appear to be natural cracks [4,5].

Studies have been conducted on masonry structures, along with crack modelling and crack pattern identification using numerical models and finite element analysis [6–8]. Tan et al. [9] studied the possibility of using distributed sensing technology to detect, locate, trace, quantify, and visualize cracks using fiber optic sensors. Kim et al. [10] proposed a crack identification strategy combining RGB-D and sensors that measure cracks regardless of the angle of view. The authors deployed high-resolution digital cameras as sensors.

Many technologies based on computer vision (CV) and artificial intelligence (AI) have evolved to help automate the process [11,12]. Machine learning (ML) is a subset of AI-based techniques that many researchers have used to detect cracks on concrete surfaces. For crack detection, various ML methods such as support vector machines (SVM), Bayesian decision trees (BDT), and random forests (RF) are used.

Traditional ML methods include SVM, artificial neural networks (ANN) [13–15], and RF [16]. Even though these algorithms reduce false positives, their accuracy is still hugely dependent on crack features obtained through specific image processing steps. A predefined feature extraction stage in all these methods necessitates an additional image processing stage to make the patterns clearer to the learning algorithms. It also has a negative impact on the model's performance. Another disadvantage of ML methods is that the learning algorithms cannot learn higher-order features with complex information in the dataset.

Deep learning (DL) is a promising technology for addressing the issues associated with handcrafted feature extraction. Deep learning is a branch of ML that uses neural networks as a framework for its algorithms. DL techniques include auto encoders (AEs), deep belief networks (DBNs), deep Boltzmann machines (DBMs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs) [17]. CNNs are critical among DL methods, which are primarily used to analyze image-based data [18]. Many datasets are utilized to train CNNs for different types of damage and fault diagnoses. Non-contact sensors' capabilities are enhanced by using trained networks to build autonomous structural health monitoring (SHM) systems [19].

Concrete crack detection DL models can be administered in various health monitoring scenarios to identify and locate cracks in concrete structures. The practical and feasible applications of concrete crack detection DL models are building inspection, infrastructure maintenance, construction quality control, historical preservation, etc. Therefore, concrete crack detection DL models yield an effective and efficient way to identify cracks in concrete structures, helping to prevent accidents, improve safety, and ensure the prolonged endurance of critical infrastructure.

Laxman et al. [20] developed a binary-class convolutional neural network (CNN) model that automatically detects cracks on concrete surfaces. They also interfaced the CNN model, combining the convolutional feature extraction layers with regression models such as Random Forest and XGBoost, resulting in automatic predictions of the depth of cracks. This experimental study was validated on reinforced concrete slabs. Apart from concrete cracks, many studies have been carried out on road surface damage. Xu et al. [21] demonstrated the advantage of combining Faster R-CNN and Mask R-CNN in detecting road pavement cracks. A limitation of the architecture was a degradation in the effectiveness of the bounding box detected by Mask R-CNN. Huyan et al. [22] proposed a new architecture called CrackU-net that achieved pixel-wise crack detection. It was found that the proposed model outperformed the traditional U-Net and fully convolutional networks (FCN).

The primary design concept of CNN architectures is to apply successive convolutional layers to the input, resample the spatial dimensions while increasing the number of feature maps, and then repeat this process. These architectures serve as rich feature extractors for image classification, object recognition, image segmentation, and other more laborious tasks. AlexNet, VGG16, ResNet, MobileNet, Inception, and Xception are examples of CNN architectures that have been widely used as classifiers and segmenters. Many CNN-based deep learning models rely on these networks [23–25].

Contemporary studies make use of transfer learning mechanisms. The term "transfer learning" (TL) commonly refers to a procedure in which a model is developed primarily for one particular problem and then utilized in some capacity for secondary problems. Because it directly integrates pre-trained models into feature extraction preprocessing and comprehensive new models, this method is adaptable. It has the advantage of reducing neural network model training time, resulting in fewer generalization errors [26,27].

Figure 1 compares traditional DL models with models based on TL. The basic idea behind TL models is that the architecture can be reused because it is pre-trained on similar datasets. TL can be used in two different ways: one is reusing only the network structure, and the other is reusing both the network structure and the weights by retraining only a few layers, retraining all layers, or adding a few layers on top. The idea behind TL techniques for image classification is that if a model is trained and tested on a large and diverse dataset, it will successfully learn a general representation of distinct visual features or attributes. This method has been widely used in the semantic segmentation and image classification stages of crack detection [28]. The significance of TL is that it substantially reduces the need for huge training datasets, as pre-trained models show promising results because they have already been trained on huge datasets such as ImageNet.

**Figure 1.** (**a**) Traditional deep learning model (**b**) Transfer learning-based deep learning model.

TL models have been used in various types of research. Su and Wang [29] compared the performances of the architectures MobileNetV2, DenseNet201, EfficientNetB0, and InceptionV3 for crack detection on concrete. It was discovered that EfficientNetB0 was efficient in terms of performance and generalization. Dung and Anh [30] used VGG-16 as the foundation of a fully convolutional neural network for crack classification. Zhong et al. [31] used an improvised variant of VGG16 for concrete pavement crack detection. TL techniques have been combined with fine-tuning procedures to achieve high accuracy in pre-trained models. Sun et al. [5] used the Xception architecture's pre-trained weights and biases to detect cracks and holes in concrete surfaces. Joshi et al. [32] used the ResNet50 architecture and a segmentation-based approach to detect cracks. Doğan and Ergan [33] used the MobileNet architecture as a backbone for pixel-wise crack detection in lightweight mobile applications.

The backbone framework in most architectures is VGG16, ResNet, MobileNet, Inception, or Xception, with some fine-tuning [24]. This study compares the most commonly used CNN architectures in order to determine the accuracy of these networks in classifying crack and non-crack concrete surface images. For TL, the pre-trained weights of the VGG16, ResNet50, MobileNet, and Xception architectures are used. The same dataset is used for classification with all architectures. TL takes less time than building a network from the ground up. The need for large datasets is also greatly reduced because all the networks have already been trained to classify 1000 different object categories that we encounter on a day-to-day basis.

In this study, different types of DL models were compared for classifying and identifying cracks in concrete structures by implementing pre-trained architectures such as VGG16, VGG19, ResNet50, MobileNet, and Xception. Sizable concrete image datasets were used for training and validating these frameworks.

Conducting research such as this contributes immensely to the progressive development of DL methods for structures in multitudinous ways. Primarily, the study proposes an effective approach for the automatic detection and classification of cracks in concrete surfaces using transfer learning methods, which saves time and reduces the need for manual inspection. Moreover, as all the architectures used in this study were already trained on thousands of image datasets, the need for new image datasets is also reduced. Secondly, it provides a critical comparison of different transfer learning frameworks that can be used as feature extraction or backbone architecture, depending on the availability of memory and time, which will be beneficial for future research in this area.

This analysis demonstrates the viability and potential of DL methods in scrutinizing the varied and complex structural data, resulting in progressive development, and extending the application of these models in other structural assessment areas. These models could help in determining the efficacy of categorizing and localizing different types of cracks, spalls, and other flaws in concrete buildings in natural settings by potentially automating the damage detection process.

The findings of this study will assist researchers in developing new technologies for efficiently maintaining the service life of infrastructures using unmanned aerial systems, automation systems for infrastructure monitoring by deploying various sensors such as high-resolution cameras, LIDAR systems, etc., and in the enhancement of SHM systems for constant monitoring of the serviceability and maintainability of infrastructures, thereby reducing costs [34–37].

This paper emphasizes the suitability of existing DL convolutional models for TL strategies. Previous research conducted by various researchers primarily focused on various topologies for classification and segmentation tasks based on these backbone models. This study compares and contrasts various backbone architectures to provide a comprehensive picture of the best backbone model, or TL model, to use when developing new systems for detecting and analyzing the formation of concrete cracks.

#### **2. Methodology**

The framework for comparing different CNN models is depicted in Figure 2. The datasets consist of raw images taken from residential buildings and datasets available online in data repositories. In this study, CNN models employ TL techniques to detect concrete surface cracks. The models considered in this study are VGG16, VGG19, ResNet50, MobileNet, and Xception. The weights of each model are learned on ImageNet, saved, and then loaded into the model. As a result, the model has a better starting point, which substantially cuts training time and improves performance. To be suitable for crack classification, the pre-trained CNN model must be retrained to find concrete surface cracks. The pre-trained model has already been trained on a sizeable dataset and saved; the final fully connected layer of the original model is then replaced by a new, fully connected layer. The steps of the experiment are outlined below.

TensorFlow is used to resize the images in the datasets before training the model. The datasets are then loaded batch-wise and in random order for further operations. The structure of the crack detection model is then defined by loading and refining the pre-trained model. The final fully connected layer is replaced with a custom layer. The number of classes in the custom layer is set to 2, as required by this investigation. The weight values of the other layers are not changed. Before training, the network's structure-related hyperparameters are specified, and the optimization technique is chosen. The model is then compiled and trained using the datasets. In this study, model training is guided by the adaptive-learning-rate optimization method Adam and the cross-entropy loss function. After completing the training procedure, the model's performance is validated. The test dataset is then used to evaluate the model.
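The procedure described above can be sketched in Keras as follows. This is a minimal illustration under stated assumptions, not the authors' exact code: the function name `build_crack_classifier`, the global-average-pooling layer in front of the new head, and the sparse categorical cross-entropy loss are assumptions. Only the frozen pre-trained backbone, the two-class output layer, the 100 × 100 input size, and Adam with a learning rate of 0.001 are taken from the text.

```python
import tensorflow as tf


def build_crack_classifier(weights="imagenet", input_shape=(100, 100, 3)):
    """Frozen ResNet50 backbone with a new two-class head (illustrative sketch)."""
    # Load the convolutional backbone without its original 1000-class top layer.
    base = tf.keras.applications.ResNet50(
        weights=weights, include_top=False, input_shape=input_shape
    )
    base.trainable = False  # keep the pre-trained weights unchanged

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)  # batch-norm layers run in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)  # assumption: pooling before the head
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)  # crack / non-crack

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",  # assumption: integer labels 0 and 1
        metrics=["accuracy"],
    )
    return model
```

A call such as `model.fit(train_ds, validation_data=val_ds, epochs=20)` would then train only the new head, where `train_ds` and `val_ds` are hypothetical datasets of resized images and labels. Swapping `ResNet50` for another entry in `tf.keras.applications` gives the other backbones compared in this study.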

**Figure 2.** Framework for comparing different CNN models for classification of cracks.

#### *2.1. VGG16*

VGG16 is a convolutional neural network for classification and object detection purposes [31,38,39]. It is widely used for classifying images and is uncomplicated to employ with transfer learning. It has 16 weight layers (13 convolutional and three fully connected). The first and second convolutional layers each consist of 64 feature kernel filters measuring 3 × 3 pixels. The dimensions of the input image (an RGB image with a depth of 3) change to 224 × 224 × 64 as it passes through the first and second convolutional layers. The output is then sent to a maximum-pooling layer with a stride of 2. The third and fourth convolutional layers have 128 feature kernel filters of size 3 × 3. A maximum-pooling layer with stride 2 follows these two layers, and the resulting output is reduced to 56 × 56 × 128. The fifth, sixth, and seventh layers are convolutional layers with a kernel size of 3 × 3. All three layers use 256 feature maps. A maximum-pooling layer with stride 2 follows these layers. Layers eight to thirteen are two sets of three convolutional layers with a kernel size of 3 × 3, each with 512 kernel filters. A maximum-pooling layer with a stride of 2 follows each set. Layers fourteen and fifteen are fully connected hidden layers of 4096 units, followed by a softmax output layer (sixteenth layer) of 1000 units. Figure 3 shows the schematic diagram of the VGG16 architecture.
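The way the feature-map dimensions evolve through the network can be traced with a few lines of Python. The sketch below assumes the standard VGG16 configuration: five blocks of 3 × 3 convolutions with padding that preserves the spatial size, each followed by 2 × 2 max pooling with stride 2. The names `VGG16_BLOCKS` and `vgg16_feature_shapes` are illustrative.

```python
# Standard VGG16 convolutional blocks: (number of 3x3 conv layers, output channels).
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_feature_shapes(input_size=224):
    """Return (shape after the convolutions, shape after pooling) for each block."""
    shapes = []
    size = input_size
    for _, channels in VGG16_BLOCKS:
        # 3x3 convolutions with padding 1 keep the spatial size unchanged.
        conv_shape = (size, size, channels)
        size //= 2  # 2x2 max pooling with stride 2 halves height and width
        shapes.append((conv_shape, (size, size, channels)))
    return shapes


for conv_shape, pooled_shape in vgg16_feature_shapes():
    print(conv_shape, "->", pooled_shape)

n_conv = sum(n for n, _ in VGG16_BLOCKS)
print("conv layers:", n_conv, "| weight layers incl. 3 dense:", n_conv + 3)
```

Under these assumptions the 224 × 224 input shrinks to 7 × 7 × 512 before the fully connected layers, and the 13 convolutional layers plus three dense layers give the 16 weight layers in the network's name.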

#### *2.2. ResNet-50*

The residual neural network (ResNet) proposed by He et al. [40] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2015). ResNet implemented residual connections between layers, which aids in mitigating the loss, preserving knowledge gain, and enhancing performance during the training phase. A residual link indicates that a block's output is the sum of its input and the output of its convolutional layers. Figure 4 depicts a block schematic of the architecture of the ResNet model. A convolutional layer with a 7 × 7 kernel size and 64 different kernels makes up the first layer. The following layer is the maximum pooling layer. The following convolution layers consist of 1 × 1-sized kernels with 64 kernels, 3 × 3-sized kernels with 64 kernels, and 1 × 1-sized kernels with 256 kernels. Repeating this group three times gives nine layers. The following are convolution layers with sizes of 1 × 1, 3 × 3, and 1 × 1 with 128, 128, and 512 kernels, respectively. This is repeated four times to give a total of 12 layers. A convolution layer of 1 × 1 size with 256 kernels follows, with two additional layers of 3 × 3, 256, and 1 × 1, 1024, repeated six times for a total of 18 layers. Finally, a layer of 1 × 1, 512 kernels, with two additional layers of 3 × 3, 512, and 1 × 1, 2048, is repeated three times for a total of nine layers. Following that, average pooling is applied, and the network concludes with a fully connected layer with 1000 nodes and a softmax function. In total, there are 50 layers.
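The layer count can be checked with a short calculation. The sketch below assumes the standard ResNet-50 bottleneck configuration from He et al. [40] and counts only the weight layers on the main path (pooling layers and shortcut projections are not counted). The names `RESNET50_STAGES` and `count_resnet50_layers` are illustrative.

```python
# Bottleneck stages of ResNet-50: (number of blocks, kernel counts of the
# 1x1, 3x3 and 1x1 convolutions in each block).
RESNET50_STAGES = [
    (3, (64, 64, 256)),
    (4, (128, 128, 512)),
    (6, (256, 256, 1024)),
    (3, (512, 512, 2048)),
]


def count_resnet50_layers():
    """Count weight layers: stem conv + bottleneck convs + final dense layer."""
    stem = 1  # 7x7 convolution with 64 kernels
    bottleneck = sum(blocks * len(kernels) for blocks, kernels in RESNET50_STAGES)
    head = 1  # fully connected layer with 1000 nodes
    return stem + bottleneck + head


print(count_resnet50_layers())  # 1 + (9 + 12 + 18 + 9) + 1
```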

**Figure 4.** Block diagram of the ResNet-50 architecture.

#### *2.3. MobileNet*

MobileNet is a model primarily developed for use in mobile apps. MobileNet uses depth-wise separable convolutions, which dramatically decrease the number of parameters compared to networks of similar depth with standard convolutions. Thus, lightweight deep neural networks are produced that are suitable for embedded applications. The depth-wise separable convolution is accomplished by depth-wise and point-wise operations: the depth-wise convolution applies a single filter to each input channel, whereas the point-wise convolution linearly combines the depth-wise convolution outputs with 1 × 1 convolutions. MobileNet's architecture is illustrated in Figure 5 [41]. In the figure, the depth-wise separable convolution, which consists of depth-wise and point-wise layers, is highlighted in blue. The average pooling layer, colored in red, feeds into a fully convolutional layer. Depth-wise and point-wise convolutions are advantageous in terms of computational speed [42].
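The parameter saving can be illustrated by counting weights for a single layer. The sketch below is a simplified comparison that ignores biases and batch normalization; the layer size (3 × 3 kernel, 256 input and output channels) is a hypothetical example, not a value taken from this study.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise k x k convolution plus a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


k, c_in, c_out = 3, 256, 256  # hypothetical layer size
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(std, sep, sep / std)
```

The ratio of the two counts equals 1/c_out + 1/k², so for 3 × 3 kernels a depth-wise separable layer needs roughly one ninth of the weights of a standard convolution when the number of output channels is large.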

**Figure 5.** MobileNet architecture.

#### *2.4. Xception*

The Xception architecture is a variation of the Inception architecture that only employs depth-wise separable convolutional layers. Figure 6 displays Xception's architecture, which may be seen as a linear stack of depth-wise separable convolutional layers. The Xception architecture comprises 36 convolutional layers organized into 14 modules with residual connections. The data first passes through the entry flow, then through the middle flow, which is repeated eight times, and finally through the exit flow. Batch normalization, which is not shown in the diagram, follows all convolutional and separable convolution layers. A depth multiplier of one is used for all separable convolution layers [43].

**Figure 6.** Architecture of Xception.

#### **3. Details of the Experiment**

The details of the experiment to classify cracked and non-cracked images are presented herewith. The experiments were conducted in two different stages: in the first stage, the models VGG16 and VGG19 were compared to analyze their performances. In the second stage, the VGG16 model was compared with the pre-trained models ResNet50, MobileNet, and Xception. All the pre-trained models were tested to determine which would generalize and perform better regarding the crack information contained in the image dataset.

#### *3.1. Datasets*

The database combines images from data repositories such as SDNET 2018 [23] and Chundata [44–46] with data captured from residential buildings. The dataset was divided into two sets: positive images (cracked) and negative images (non-cracked). They were kept in separate folders named cracked and non-cracked, each containing 11,000 images in RGB format. Sample images of cracked and non-cracked concrete walls are shown in Figure 7. The dataset was separated into training (80%) and testing (20%) subsets; in total, 4400 images were used exclusively for testing and not for training. The training portion was then further divided into training and validation datasets, as shown in Figure 8. The training dataset is the sample of data utilized to fit the model. The validation dataset is the sample used to offer an impartial evaluation of a model's fit to the training dataset while setting the hyperparameters of the model. As skill on the validation dataset is incorporated into the model design, this evaluation becomes more biased. The test dataset is the data sample used to provide an unbiased evaluation of the final model's fit to the training dataset.
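A split of this kind can be sketched in plain Python as follows. The 20% test fraction and the 11,000 images per class come from the text; the validation fraction of 0.2, the fixed random seed, the per-class splitting, the file names, and the function name `split_dataset` are assumptions made for illustration, since the exact training/validation proportions are given only in Figure 8.

```python
import random


def split_dataset(items, test_fraction=0.2, val_fraction=0.2, seed=0):
    """Shuffle items, hold out a test set, then split the rest into train/validation."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test, remainder = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(remainder) * val_fraction)  # assumed validation share
    val, train = remainder[:n_val], remainder[n_val:]
    return train, val, test


# 11,000 cracked and 11,000 non-cracked file names (placeholders, not real paths).
cracked = [f"cracked/{i}.jpg" for i in range(11_000)]
non_cracked = [f"non-cracked/{i}.jpg" for i in range(11_000)]

# Split each class separately so that every subset stays balanced.
splits = [split_dataset(cracked), split_dataset(non_cracked)]
train, val, test = (a + b for a, b in zip(*splits))
print(len(train), len(val), len(test))
```

With these settings the held-out test set contains 4400 images, matching the number reported above, and no image appears in more than one subset.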

**Figure 7.** (**a**) Cracked concrete wall images (**b**) Non-cracked concrete wall images.

**Figure 8.** Dataset distribution.

#### *3.2. Implementation Details*

Python 3, the sklearn module, and the Keras neural network library, which contains the architectures and weights of VGG16, ResNet50, MobileNet, and Xception, were used to build the convolutional neural networks. The experiments were performed in Google Colab. The target size was 100 × 100, the class mode was binary, the batch size was 64, and Adam, with an initial learning rate of 0.001, was used as the optimizer. The Adam optimization method is a stochastic gradient descent method based on adaptive estimation of first- and second-order moments. It is suitable for problems with large amounts of data and many parameters since it is computationally efficient, memory-light, and invariant to the diagonal rescaling of gradients [47]. The maximum number of epochs was set to 20. Table 1 shows the hyperparameter settings for each model.
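To make the Adam update concrete, the sketch below implements a single step for one scalar parameter. It follows the update rule of the Adam paper [47] with its suggested defaults (β1 = 0.9, β2 = 0.999, ε = 1e-8) and the learning rate of 0.001 used here; the function name `adam_step` and the toy objective are assumptions for illustration, and Keras performs the equivalent update internally for every weight.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction for step t >= 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem (assumption): minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
    print(t, theta)
```

Because the step divides the first-moment estimate by the square root of the second-moment estimate, multiplying the gradient by a constant leaves the first update almost unchanged, which is the invariance to gradient rescaling mentioned above.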

**Table 1.** Hyperparameters used for training.


#### *3.3. Performance Metrics*

To examine the model's performance, evaluation metrics were needed. The model was assessed using the accuracy, precision, recall, and F1 measures, given in Equations (1)–(4), respectively. Accuracy refers to the ratio of correct predictions to the total number of input images. Precision is the ratio of correct positive predictions to the total number of positive predictions. Recall is the ratio of correct positive predictions to the total number of actual positive samples. The F1 measure is the harmonic mean of precision and recall [48].

$$\text{Accuracy} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{true negative} + \text{false positive} + \text{false negative}} \tag{1}$$

$$\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \tag{2}$$

$$\text{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}} \tag{3}$$

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$

The classification report visualizer displays the model's precision, recall, F1, and support scores. A classification report is then used to evaluate the accuracy of a classification algorithm's predictions. The metrics of a classification report are assessed using true positives, false positives, true negatives, and false negatives [49].
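Equations (1)–(4) translate directly into code. The sketch below computes the four measures from confusion-matrix counts; the function name `classification_metrics` and the example counts are hypothetical and are not results from this study. It assumes that at least one positive prediction and one positive sample exist, so no division by zero occurs.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 4400-image test set (not results from this study).
print(classification_metrics(tp=2190, tn=2180, fp=20, fn=10))
```

In practice, a library routine such as `sklearn.metrics.classification_report` produces the same per-class measures together with the support.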

#### **4. Results and Discussion**

This section summarizes the results of the trained networks used to categorize images using TL. The pre-trained models are assessed to determine which one would generalize and provide optimum results in the dataset images.

#### *4.1. Comparison on VGG16 and VGG19 Architecture*

In the first stage, the VGG16 architecture is compared with its counterpart, VGG19. All the hyper-parameters are set as given in Table 1, and it is found that the VGG16 architecture gave a test accuracy of 99.61%, whereas VGG19 provided a test accuracy of 99.57%. Furthermore, the training duration was comparatively longer for the VGG19 architecture, at about 2.07 h. Sample classification results on test images by the VGG16 and VGG19 architectures are shown in Figures 9 and 10, respectively. The model accuracy and loss of VGG19 are shown in Figure 11.

**Figure 9.** Sample classification result on test images by the VGG16 architecture.

**Figure 10.** Sample classification result on test images by the VGG19 architecture.

**Figure 11.** Accuracy and loss curves of VGG19 architecture.

The comparison of the VGG16 and VGG19 models on training losses and training accuracies is provided in Figure 12a,b. Similarly, the comparison of the VGG16 and VGG19 models on validation loss and validation accuracy is provided in Figure 12c,d. It can be seen that the VGG19 architecture attained stability in validation accuracy a few epochs before the VGG16 architecture, which is an advantage of the network. The VGG16 model takes a few more epochs to settle down. However, considering the results of test accuracy and training duration, one can choose VGG16.

**Figure 12.** (**a**) Training loss of VGG16 and VGG19 (**b**) Training accuracy of VGG16 and VGG19 (**c**) Validation loss of VGG16 and VGG19 (**d**) Validation accuracy of VGG16 and VGG19.

#### *4.2. Comparison on VGG16, ResNet50, MobileNet and Xception Architectures*

The second comparison was made on the VGG16, ResNet50, MobileNet, and Xception architectures. The hyper-parameter details are the same as those mentioned in Table 1. A classification matrix is used to evaluate the models. For a balanced dataset, accuracy can be a good measure. In the case of imbalanced datasets, however, the precision, recall, and F1 measures also need to be examined to assess the performance of the models. In applications where it is not critical to identify all positive samples, a higher precision than recall is acceptable; if precision exceeds recall, "false negatives" outnumber "false positives". Recall measures the proportion of actual positives that are identified as "true positives"; when recall exceeds precision, there are fewer "false negatives" than "false positives". The F1 metric is the harmonic mean of precision and recall. For each of these metrics, a value of one indicates optimal performance [50]. Support is the number of images belonging to a particular class that is used to compute the metrics.
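The relation between precision, recall, and the two error types follows directly from Equations (2) and (3). As a short derivation, writing $TP$, $FP$, and $FN$ for the numbers of true positives, false positives, and false negatives, and assuming $TP > 0$:

$$\text{Precision} > \text{Recall} \iff \frac{TP}{TP + FP} > \frac{TP}{TP + FN} \iff TP + FN > TP + FP \iff FN > FP$$

The same steps with the inequality reversed show that recall exceeds precision exactly when false positives outnumber false negatives.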

**Table 2.** Details and training accuracy of networks.


The training time taken for VGG16 was high compared to that of other architectures [38]. All the models gave more than 99% training accuracy results, as it was only a binary classification problem [51]. It was found that out of the four models, ResNet50, MobileNet, and VGG16 produced optimum results, 99.88%, 99.68%, and 99.61%, respectively, in terms of accuracy, on testing data. It was also found that the test accuracy was high for the ResNet50 architecture, followed by the MobileNet, VGG16, and Xception architectures.

The details and training accuracy of the models are given in Table 2, which shows that the test loss was lowest for ResNet50, followed by the VGG16, MobileNet, and Xception architectures; this indicates how well the architectures behave after each optimization iteration. Even though the training accuracy of the MobileNet architecture was 99.72%, the classification report shows that its precision, recall, and F1 metrics are lower than those of the VGG16 and ResNet50 architectures.

The statistical outcomes of the networks are displayed in Table 3. The measures indicate that the pre-trained architectures VGG16, VGG19, and ResNet50 accurately classified the crack images; only MobileNet and Xception gave some false negatives and false positives. From the statistical results in Table 3, it is clear that VGG16, VGG19, ResNet50, and MobileNet have a recall higher than the precision, showing that these models produce fewer "false negatives" than "false positives", which is important for identifying concrete cracks. In the Xception model, precision exceeded recall, indicating that the number of "false negatives" is greater than the number of "false positives".


**Table 3.** Statistical outcomes of the concrete cracks (crack or intact).

The sample classification results from the test data are provided in Figure 13. The classification results for MobileNet and Xception in Figure 13c,d show that both architectures performed comparatively worse than the others. It was noticed that the MobileNet and Xception architectures could not correctly handle small cracks and images with crack-like features or background irregularities. The architectures misclassified crack-like features as cracks, which are false positives, and images with hairline cracks were not identified as cracks, which are false negatives. Figure 14 is a breakdown of the training duration for each model. The ResNet50, MobileNet, Xception, VGG16, and VGG19 architectures have training times of 34 min, 35 min 38 s, 43 min 44 s, 1.98 h, and 2.077 h, respectively. Due to the larger number of layers in the design, the training time for VGG19 was longer than that for VGG16, and it was much longer than that for ResNet50, as illustrated in Figure 14 [52].

Model accuracy and loss are depicted in Figures 15–18 for the VGG16, ResNet50, Xception, and MobileNet architectures, respectively. These graphs show that all models, except for the Xception architecture, could appropriately learn the features. In Xception, it was discovered that validation accuracy and loss fluctuate, demonstrating the network's incapacity to easily fit the model during hyperparameter tuning. The response can be enhanced by fine-tuning the hyperparameters, which may increase the model's performance.

The training loss and accuracy, as well as the validation loss and accuracy, of all the architectures are depicted in Figure 19, which brings together the curves of Figures 15–18. The models also performed well on the test data sets, with classification accuracy above 98% in every case. Deep learning classifiers built with transfer learning could therefore detect and categorize cracks, resulting in strong and dependable models. In addition to accuracy, the other evaluation metrics associated with classification tasks are given in a classification report, and Table 3 displays the report's key points. The models accurately categorize photos of cracked and uncracked concrete, and the classification report demonstrates that the models evaluated the validation set effectively.
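The metrics in a classification report follow directly from the counts of true and false positives and negatives. The sketch below shows the standard definitions of precision, recall, F1 score, and accuracy for the "crack" class; the function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1 and accuracy for the positive (crack) class.

    tp: cracked images predicted as cracked
    fp: uncracked images predicted as cracked (e.g. crack-like features)
    fn: cracked images predicted as uncracked (e.g. missed hairline cracks)
    tn: uncracked images predicted as uncracked
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=90, fp=10, fn=30, tn=70)
for metric, value in example.items():
    print(f"{metric:<10} {value:.4f}")
```

In this illustration, many false negatives (missed cracks) lower recall while precision stays high, which is why accuracy alone can hide the hairline-crack failures described above.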

**Figure 13.** Classification result of test images (**a**) VGG16 (**b**) ResNet50 (**c**) MobileNet (**d**) Xception.

**Figure 14.** Time taken for training.

**Figure 15.** Accuracy and loss curves of VGG16 architecture.

**Figure 16.** Accuracy and loss curves of ResNet50 architecture.

**Figure 17.** Accuracy and loss curves of Xception architecture.

**Figure 18.** Accuracy and loss curves of MobileNet architecture.

Figure 20 compares the training and testing accuracy of all models considered in this study. The small gap between training and test accuracy suggests that the networks learned the features without substantial overfitting. Because it has the highest training and testing accuracy, ResNet50 is the most suitable of the models considered for identifying cracked images. Furthermore, its training time was much shorter than that of VGG16 and VGG19, despite nearly identical precision, recall, and F1 scores. VGG16 and MobileNet have comparable training and testing accuracy. When time and complexity are not constraints, VGG16 remains a state-of-the-art option for classifying cracks in concrete wall images. However, for mobile applications where time and complexity are constraints, the MobileNet architecture provides nearly equal performance to VGG16 [48]. The Xception architecture performed worst of the architectures compared.
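The gap between training and test accuracy can be quantified with the two pairs of values stated in the abstract (ResNet50 and Xception). This is a small illustrative calculation; the variable names are assumptions, and the other three architectures are omitted because their exact values are not restated here.

```python
# (training accuracy %, test accuracy %) as stated in the abstract.
reported_accuracy = {
    "ResNet50": (99.91, 99.88),
    "Xception": (99.64, 98.82),
}

# Gap in percentage points; a larger gap hints at weaker generalization.
accuracy_gap = {
    name: round(train - test, 2)
    for name, (train, test) in reported_accuracy.items()
}

for name, gap in accuracy_gap.items():
    print(f"{name:<10} train-test gap: {gap:.2f} percentage points")
```

The gap for Xception is more than an order of magnitude larger than that for ResNet50, consistent with the fluctuating validation curves noted for Xception.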

Deep learning models have several advantages over traditional machine learning models, but they still require large data sets. Transfer-learning approaches, in which the weights of models pre-trained on other datasets are reused to transfer their knowledge to a new dataset, can reduce this requirement considerably. This research will help researchers select the backbone network for classification networks and other hybrid architectures [30,39,53]. Five pre-trained models were tested and analyzed in this study. Model performance was assessed using the classification report metrics. The ResNet50 model outperformed the other four pre-trained models.
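The transfer-learning idea described above can be sketched as a frozen pre-trained backbone with a small, newly trained classification head. The example assumes a TensorFlow/Keras setup with ImageNet weights, a 224 × 224 RGB input, and a single sigmoid output for cracked versus uncracked; none of these choices is confirmed as the configuration used in this study, and `build_crack_classifier` is a hypothetical name.

```python
import tensorflow as tf


def build_crack_classifier(weights="imagenet", input_shape=(224, 224, 3)):
    """Binary crack classifier on top of a frozen ResNet50 backbone.

    Assumptions (not taken from the paper): ImageNet weights, 224x224 RGB
    inputs already passed through
    tf.keras.applications.resnet50.preprocess_input, and a sigmoid head.
    """
    # Pre-trained convolutional backbone without its original classifier.
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # freeze: reuse the learned features as-is

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)  # keep batch-norm layers in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # new head

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
    )
    return model
```

Calling `build_crack_classifier()` downloads the ImageNet weights on first use; only the final dense layer is trained, which is why far fewer labeled crack images are needed than when training the whole network from scratch.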

**Figure 19.** (**a**) Model accuracy (**b**) Model loss (**c**) Validation loss (**d**) Validation accuracy.

The distance at which an image is captured and its resolution become critical considerations in real-time approaches that use an autonomous vehicle to collect the images. In this investigation, raw photos were taken with a high-resolution camera, and the cracks were visible to the naked eye.

#### *4.3. Challenges in Deep Learning-Based Crack Classification of Concrete Walls*

In this study, deep learning models are employed to classify cracks in concrete walls. However, the limited availability of datasets and the time-consuming process of labeling them can affect the accuracy and generalization of the models. To enhance the models' generalization capability, the models need to be trained on a wider range of datasets that reflect real-world environments with different lighting, moisture, and other conditions. Additionally, hyperparameter tuning is another challenge that requires time and effort, and there is no universal solution for it.

**Figure 20.** Comparison of training and test accuracies of all five architectures.

#### **5. Conclusions**

Traditional methods of detecting and classifying cracks in concrete structures are time-consuming, labor-intensive, and expensive when conducted manually. Therefore, automated assessment of damage in concrete structures is necessary for early diagnosis of the structures and for extending their service life. This study investigates the application of pre-trained deep learning networks for crack detection and classification.

In this paper, different deep learning backbone architectures were evaluated for automatically classifying concrete cracks. The architectures considered are VGG16, VGG19, ResNet50, MobileNet, and Xception.


Because cracks share low-level features with other objects, pre-trained networks proved highly applicable to the identification of cracks, even though they were trained on wholly different datasets. The features acquired through pre-training were found to remain highly accurate when applied to other materials. Pre-trained networks are a good choice for deploying CNNs for the crack detection task, since they require fewer training samples and converge faster.

Future work will focus on determining the efficacy of these architectures in categorizing and localizing different types of cracks, spalls, and other flaws in concrete buildings in natural settings, which could automate the damage detection process.

**Author Contributions:** R.E.P. and A.D.A., Conceptualization, methodology, validation, and formal analysis; A.D.A. and A.N., Data curation, visualization, supervising the research as well as the analysis of results; A.N., B.G.A.G. and K.R., Writing, review, submission, collaborating in and coordinating the research. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** This manuscript has not been submitted to, nor is it under review by, another journal or other publishing venue. The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript. The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
