*4.1. Training and Testing Results*

Figure 5 plots the model's learning and loss curves under the three transfer-learning settings. In all three training approaches, the model converged quickly within the first few epochs; this is mainly attributed to the small number of layers to retrain and the fixed parameters of the frozen, non-trainable layers.

The training and validation accuracies in the three learning settings generally increase over time, reaching a plateau around the last three epochs. The loss curves in (b) and (c) show a slight tendency toward overfitting due to the larger number of parameters updated in those learning schemes. It is therefore believed that retraining more than two convolutional layers would further increase the risk of overfitting and, consequently, degrade the model's generalization ability.
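The three settings differ only in which layers are unfrozen. The helper below is a minimal illustrative sketch of that selection logic, not the authors' actual training code; the layer names loosely follow VGG16's standard block naming, and the single `"classifier"` entry stands in for the replaced classification head, which is retrained in all three schemes.

```python
# Illustrative sketch: which layers are unfrozen under settings (a), (b), (c).
# Layer names follow VGG16's block naming; "classifier" is a stand-in for the
# replaced classification head (always retrained).

VGG16_LAYERS = [
    "block1_conv1", "block1_conv2",
    "block2_conv1", "block2_conv2",
    "block3_conv1", "block3_conv2", "block3_conv3",
    "block4_conv1", "block4_conv2", "block4_conv3",
    "block5_conv1", "block5_conv2", "block5_conv3",
    "classifier",
]

def trainable_layers(scheme: str) -> list[str]:
    """Return the layers retrained under scheme 'a', 'b', or 'c'."""
    n_conv = {"a": 0, "b": 1, "c": 2}[scheme]  # last n conv layers to unfreeze
    conv = [name for name in VGG16_LAYERS if "conv" in name]
    unfrozen = set(conv[len(conv) - n_conv:]) if n_conv else set()
    unfrozen.add("classifier")  # classification head is always retrained
    return [name for name in VGG16_LAYERS if name in unfrozen]

# trainable_layers("c") -> ["block5_conv2", "block5_conv3", "classifier"]
```

In a real Keras implementation the same effect is obtained by setting `layer.trainable = False` on the frozen layers before compiling the model.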

Table 2 lists the best training, validation, and testing accuracies and the RMSE of the three learning schemes. The results show that the model in scheme (c) outperforms settings (a) and (b) in training: it achieved a training accuracy of 98.34%, compared with 94.62% and 91.10% in (b) and (a), respectively. The model in setting (c) also yielded a higher testing accuracy of 97.13%, compared with 94.61% and 91.10% in (b) and (a), respectively. Furthermore, these results were obtained with only 1.21% overfitting (calculated as the difference between training and testing accuracies). These results show that approach (c) gives the model better generalizability, extending what was learned on the training subset to unseen test data. They also indicate that the last two convolutional layers of the pre-trained model learned features more representative of the target dataset.

**Figure 5.** Learning and loss curves: (**a**) retraining only the classification layers, (**b**) retraining the classification layers and the last convolutional layers, (**c**) retraining the classification layers and the last two convolutional layers.

In addition, the root mean squared error (RMSE) decreased as more layers were retrained (0.15 in (c) versus 0.20 and 0.27 in (b) and (a), respectively), indicating that the predictive capacity of the model improves as more learning parameters are updated.
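The two scalar metrics used above are straightforward to recompute. The sketch below is illustrative only: the accuracy values are the ones reported in the text, while the vectors passed to `rmse` would in practice be the model's target and predicted values.

```python
import math

# Illustrative recomputation of the overfitting gap and RMSE metrics
# discussed above; not the authors' evaluation code.

def overfitting_gap(train_acc: float, test_acc: float) -> float:
    """Overfitting measured as training minus testing accuracy (in %)."""
    return round(train_acc - test_acc, 2)

def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error between target and predicted values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

print(overfitting_gap(98.34, 97.13))  # scheme (c): 1.21
```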

It is also worth mentioning that the training times of the three classification schemes are reasonable. Moreover, the difference in training time between the three approaches is small (e.g., 8 s between (b) and (c)), while a considerable gain in performance was observed (e.g., a 1.61% gain in training accuracy between (b) and (c)). Therefore, fine-tuning the classification layers and the last two convolutional layers of the pre-trained VGG16 is efficient, as it balances prediction performance, training time, and overfitting.

**Table 2.** Best training, validation, and testing accuracies and training times of the three training settings.


To further visualize the performance of the trained DCNN on the test subset, normalized confusion matrices for the three learning schemes are presented in Figure 6.

The results reveal confusion between background and efflorescence images and between background and crack images. This confusion was particularly pronounced in schemes (a) and (b): for example, 7% and 6% of background images were classified as cracks or efflorescence in methods (a) and (b), respectively. However, this confusion was notably reduced in scheme (c), where fewer than 3% of background images were predicted as cracks or efflorescence.

The observed confusion is mainly related to the complexity of the concrete surface in terms of colors and textures. In addition, some surface alterations in the training dataset (e.g., stains, markings, and minor defects such as scaling and segregation) act as noisy, defect-like features in concrete images and, as a result, make feature learning more challenging. For example, some background images contain concrete joints, which appear as straight lines on the concrete surface and are therefore likely to be misclassified as cracks. Generally, this confusion between classes can be further reduced by adding more labeled samples to the training dataset and by applying an additional denoising step to the image data.
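The per-class percentages quoted above come from row-normalizing the raw confusion matrices, i.e., dividing each row by its total so that entries become per-true-class proportions. A minimal sketch, using made-up counts rather than the paper's actual matrix:

```python
# Sketch of the row normalization behind a normalized confusion matrix.
# Rows = true class, columns = predicted class; the counts are hypothetical.

def normalize_rows(cm: list[list[int]]) -> list[list[float]]:
    """Convert raw counts to per-true-class proportions."""
    return [[c / sum(row) for c in row] for row in cm]

# Hypothetical counts for (background, crack, efflorescence, spalling):
counts = [
    [97, 2, 1, 0],
    [3, 96, 0, 1],
    [4, 0, 95, 1],
    [1, 1, 1, 97],
]
print(normalize_rows(counts)[0])  # background row: [0.97, 0.02, 0.01, 0.0]
```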

Figure 7 illustrates some misclassification examples corresponding to learning setting (c).

The precision, recall, and F1-scores of each defect class were computed using the confusion matrices. The results are summarized in Table 3.

The model achieved higher precision, recall, and F1-scores in learning scheme (c) than in the other learning settings. For example, the F1-scores for the crack class were 86.64%, 94.03%, and 97.38% in learning settings (a), (b), and (c), respectively.

Comparing the F1-scores of the three classes in learning scheme (c), the efflorescence class yielded a lower score than the other defects, which showed nearly identical performance (95.01% for efflorescence versus 97.38% and 97.35% for cracks and spalling, respectively). This can be attributed to the wide variety of appearances of efflorescence on concrete surfaces and the complexity of the features required to describe this defect class. This issue could be addressed by adding more training data that extensively covers the diverse appearances of this type of damage.
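Deriving per-class precision, recall, and F1 from a confusion matrix follows the standard definitions: true positives sit on the diagonal, false positives in the class's column, and false negatives in its row. A minimal sketch (the matrix in the test usage is illustrative, not the paper's data):

```python
# Per-class precision, recall, and F1 from a confusion matrix
# (rows = true class, columns = predicted class). Illustrative sketch.

def per_class_scores(cm: list[list[int]], cls: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for class index `cls`."""
    tp = cm[cls][cls]                                        # diagonal entry
    fp = sum(cm[r][cls] for r in range(len(cm)) if r != cls)  # column off-diagonal
    fn = sum(cm[cls][c] for c in range(len(cm)) if c != cls)  # row off-diagonal
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

In practice the same numbers are obtained with `sklearn.metrics.precision_recall_fscore_support` applied to the raw label vectors.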

**Figure 6.** Confusion matrices of the three learning settings: (**a**) retraining only the classification layers, (**b**) retraining the classification layers and the last convolutional layers, (**c**) retraining the classification layers and the last two convolutional layers.

**Figure 7.** Misclassification examples (row 1: background images containing concrete joints misclassified as cracks, row 2: background images with surface alteration and different concrete colors misclassified as efflorescence).


**Table 3.** Precision, recall, and F1-scores of the three models.
