*2.3. Defect Detection in Infrastructures*

Before deep learning techniques matured, defect detection methods were mainly developed using image processing techniques. In [10,11], the authors surveyed newly developed robotic tunnel inspection systems and showed that they overcome the disadvantages of manual inspection and achieve high-quality inspection results. Additionally, Huang et al. reported a method for analyzing the morphological and distribution characteristics of structural damage based on an intelligent analysis of visible tunnel images [13]. Furthermore, Koch et al. reviewed computer vision-based distress detection and condition assessment approaches for concrete and asphalt civil infrastructure [49]. In addition, several methods for automatic detection based on computer vision techniques have been proposed [21,22]. Khoa et al. proposed automatic crack detection and classification methods using morphological image processing and feature extraction based on distance histogram-based shape descriptors [21]. Furthermore, Zhang et al. proposed a method called online CP-ALS that incrementally updates tensor component matrices, followed by a self-tuning one-class support vector machine [24], for online damage identification [22].

In recent years, deep learning techniques have been successfully applied to defect detection tasks based on real-world datasets. For instance, Kim et al. [50] used Mask R-CNN to detect and segment defects in multiple kinds of civil infrastructure. Bai et al. [51] used Robust Mask R-CNN for the task of crack detection. Specifically, they proposed a two-step method, called a cascaded network, in which ResNet is used to classify defects and then state-of-the-art segmentation networks are applied. Huang et al. [52] proposed an integrated method, which combines a deep learning algorithm and Mobile Laser Scanning (MLS) technology, achieving automated three-dimensional inspection of water leakages in shield tunnel linings. Choi et al. [53] proposed a semantic damage detection network (SDDNet) for crack segmentation, which achieves real-time segmentation while effectively negating a wide range of complex backgrounds and crack-like features. Chen et al. [54] presented a switch module to improve the efficiency of the encoder–decoder model, demonstrating it with U-Net and DeepCrack as examples. In this way, deep learning-based defect detection methods have shown promising results for classification and segmentation tasks, benefiting from their high representation ability.

#### **3. Dataset**

In this section, we explain the inspection data used in our study. Figure 1 shows examples of the subway tunnel image data. We can see that the tunnel image data have characteristics different from those of natural image data. The size of the images is approximately 12,088 × 10,000 pixels or 12,588 × 10,000 pixels at a resolution of 1 mm/pixel, and so they can be considered high-resolution images. Typically, analyzing high-resolution images requires enormous computational resources, and such image sizes are not used as input to deep learning models. On the other hand, resizing results in the loss of fine-scale defects. We solve this problem by patch division.
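The patch division mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a smaller array stands in for the ~12,088 × 10,000-pixel originals, and the patch size of 256 × 256 matches the patches shown in Figures 2 and 3; the handling of edge remainders here (dropping them) is an assumption.

```python
import numpy as np

def divide_into_patches(image: np.ndarray, patch_size: int = 256) -> list:
    """Split a large image into non-overlapping patch_size x patch_size tiles.

    Edge remainders smaller than one patch are dropped (an assumption for
    this sketch; other strategies, e.g. padding, are equally possible).
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

# A stand-in "tunnel image"; the real images are ~12,088 x 10,000 pixels.
image = np.zeros((1024, 768, 3), dtype=np.uint8)
patches = divide_into_patches(image)
print(len(patches))  # 4 rows x 3 columns = 12 patches
```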

The subway tunnel image data consist of defect and background images. Figure 2 shows defect patch examples divided from the original images shown in Figure 1: (a) cracks, (b) cold joint, (c) construction repair, (d) deposition, (e) peeling, and (f) trace of water leakage. As shown in Figure 2, each type of defect has its own characteristics, such as different textures, edges, and color features. For a two-class segmentation task, this intraclass variance can cause false alarms. For instance, the size and color of cracks (Figure 2a) are different from those of traces of water leakage (Figure 2f).

Next, we show divided patch examples of background images that contain no defects in Figure 3: (a) cable, (b) concrete joint, (c) connection component of overhead conductor rail, (d) passage tunnels, (e) overhead conductor rail, and (f) lighter. As shown in Figure 3, some backgrounds have characteristics similar to those of defect images, which can also cause a serious false alarm problem.

**Figure 1.** Examples of subway tunnel images used in this study. (**a**,**b**) are sample images taken from a visible camera for inspection. (Resolution: 1 mm/pixel, Image size: 12,088 × 10,000 pixels).

**Figure 2.** Example of defect images. (**a**–**f**) represent cracks, cold joint, construction repair, deposition, peeling, and trace of water leakage, respectively. (Resolution: 1 mm/pixel, Image size: 256 × 256 pixels).

**Figure 3.** Example of background images. (**a**–**f**) show cable, concrete joint, connection component of overhead conductor rail, passage tunnels, overhead conductor rail, and lighter, respectively. (Resolution: 1 mm/pixel, Image size: 256 × 256 pixels).

#### **4. Methodology**

Inspired by Inception-v4, ASPP module, and U-Net, we propose a new model for defect detection. The proposed network combines the advantages of all three existing models. We explain data augmentation in Section 4.1 and introduce the architecture of our network in Section 4.2.

#### *4.1. Data Augmentation*

In this subsection, we propose our data augmentation strategy and patch selection method. First, we divide high-resolution subway tunnel images into multiple patches as shown in Figures 2 and 3. Let *Pi*(*i* = 1, 2, 3, ..., *I*) denote the patches derived from the original images shown in Figure 1, where *I* represents the number of patches. Because of the imbalanced distribution and multi-scale defects, we used an overlap strategy to extract defect patches exhaustively, which extends the patch dataset. In addition, to construct the dataset via patch selection, we experimentally obtained a large-scale dataset containing background patches *Bn* (*n* = 1, 2, ..., *N*) and defect patches *Dm* (*m* = 1, 2, ..., *M*). Note that the ratio between *M* and *N* is approximately 7:3 and *N* + *M* = *I*.
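The overlap strategy can be illustrated by extracting patches with a stride smaller than the patch size. The stride of half the patch size below is an assumption for illustration; the paper does not specify the overlap amount.

```python
import numpy as np

def extract_patches(image: np.ndarray, patch: int = 256, stride: int = 128) -> list:
    """Extract patch x patch tiles with a given stride.

    A stride smaller than the patch size yields overlapping patches, so
    defects near tile borders appear fully inside at least one patch.
    """
    h, w = image.shape[:2]
    out = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            out.append(image[y:y + patch, x:x + patch])
    return out

img = np.zeros((1024, 1024), dtype=np.uint8)
n_plain = len(extract_patches(img, stride=256))   # non-overlapping
n_overlap = len(extract_patches(img, stride=128))  # 50% overlap (assumed)
print(n_plain, n_overlap)  # 16 49
```

With 50% overlap, the same image yields roughly three times as many patches, which is how the overlap strategy extends the patch dataset.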

For the training phase, since the dataset includes superfluous patches and approximately half of them are background patches, a data imbalance problem can arise. Under this condition, we randomly excluded some background patches to balance the number of patch samples. It should be noted that this strategy does not influence the detection accuracy. Finally, the ratio between defect and background patches reaches 1:1.
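The random exclusion of background patches can be sketched as a simple subsampling step. The counts and the fixed seed below are illustrative assumptions, not values from the paper.

```python
import random

def balance_patches(defect_ids: list, background_ids: list, seed: int = 0):
    """Randomly subsample background patches so the two classes are 1:1.

    Only the background set is reduced, matching the strategy described
    in the text; the seed is fixed here purely for reproducibility.
    """
    rng = random.Random(seed)
    if len(background_ids) > len(defect_ids):
        background_ids = rng.sample(background_ids, k=len(defect_ids))
    return defect_ids, background_ids

# Hypothetical counts: slightly more background than defect patches.
defects = list(range(480))
backgrounds = list(range(520))
d, b = balance_patches(defects, backgrounds)
print(len(d), len(b))  # 480 480
```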

The advantage of data augmentation is that gaps between data distributions can be bridged by pseudo-data generation. The model acquires a high degree of generality by learning to identify transformed images as input. In recent years, this idea has been incorporated into self-supervised learning, in which transformations similar to data augmentation are applied and learning is performed without labels. It has been reported that this approach can dramatically improve the representational capability of the model itself. In this paper, we focus on data augmentation because we are interested in supervised learning.
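As a concrete example of such pseudo-data generation, the eight symmetries of a square patch (four rotations, each optionally flipped) are a common label-preserving augmentation for overhead or surface imagery. This particular set of transforms is an assumption for illustration; the paper does not enumerate its transforms here.

```python
import numpy as np

def augment(patch: np.ndarray) -> list:
    """Generate the 8 dihedral symmetries of a square patch:
    4 rotations x optional horizontal flip."""
    out = []
    for k in range(4):
        rot = np.rot90(patch, k)
        out.append(rot)
        out.append(np.fliplr(rot))
    return out

patch = np.arange(9).reshape(3, 3)
views = augment(patch)
print(len(views))  # 8 augmented views per patch
```

For segmentation, the same transform must of course be applied to the ground-truth mask so that patch and label stay aligned.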

#### *4.2. Network Architecture*

In this subsection, we explain the network architecture used in our method. Figure 4 depicts the model architecture of the proposed method, and Table 1 gives the details of our network. We chose U-Net as our backbone model to achieve high performance in this specialized segmentation task. To increase the detection rate of multi-scale defects in subway tunnel data, first, we replaced the convolution blocks of the U-Net architecture with inception blocks modified from Inception-v3, as shown in Table 1. Inception blocks extend the feature capture area to increase accuracy and mitigate over-fitting. Second, we added the ASPP module to our model, imitating the usage of ASPP in DeepLab-v3+ by placing it after the last layer of the encoder (the bridge layer, in the middle of the network), as shown in Figure 5a. In such a shallow architecture, the feature map at the encoder's last layer is no smaller than 16 × 16. We adjusted the parameter settings of the multiple parallel atrous convolutions in the ASPP module to adapt to our task. In the following, we explain the details of our model.


**Table 1.** Architecture of the proposed model.

Our network consists of stacked layers of the modified inception blocks shown in Figure 5b within the U-Net-based encoder–decoder network. The inception blocks consist of four parallel branches. Three of them have convolution layers with different kernel sizes, and the last one has a max-pooling layer. We replaced the 5 × 5 convolution layer with 5 × 1 and 1 × 5 convolution layers to decrease the number of training parameters. In the original U-Net architecture, the encoder part contains 8 convolution blocks, and the output of every 2 convolution blocks is down-sampled by a max-pooling layer. To construct a deeper network, we add one inception block before each max-pooling layer, increasing the total number of convolution operations in the encoder from 8 to 12.
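The parameter saving from factorizing the 5 × 5 convolution into 5 × 1 and 1 × 5 convolutions can be checked with a short calculation. The channel count of 64 below is an illustrative assumption.

```python
def conv_params(k_h: int, k_w: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Number of trainable parameters in a k_h x k_w convolution layer."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)

c = 64  # hypothetical channel count for illustration
full = conv_params(5, 5, c, c)
factorized = conv_params(5, 1, c, c) + conv_params(1, 5, c, c)
print(full, factorized)  # 102464 41088: the factorized pair needs ~2.5x fewer parameters
```

The factorized pair covers the same 5 × 5 receptive field while roughly halving the parameter count, which is the motivation for the replacement.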

**Figure 4.** Overview of our defect detection network architecture.

At the end of the encoder part, we replaced the bridge's first convolution layer with the ASPP module, which is shown in Figure 5a; the input is split into 5 equal partitions. In the original ASPP module, the atrous rates of the three 3 × 3 convolutions were set to 6, 12, and 18 (with 256 filters and batch normalization) to suit an input size of over 37 × 37. When the rate value is close to the feature map size, the 3 × 3 filter degenerates to a 1 × 1 filter, and the atrous convolution loses its effectiveness. In our task, the input size is limited to 256 × 256 pixels, and after 4 max-pooling operations, the input size of the ASPP module becomes 16 × 16, which is less than the required 37 × 37. Therefore, we changed the atrous rates from 6, 12, and 18 to 2, 4, and 6 to adapt to the input size. After the ASPP module, a 1 × 1 convolution operation (with 1024 channels) was added to merge the outputs of the bridge layer.
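The reasoning behind the smaller atrous rates can be verified numerically: a dilated k × k filter spans k + (k − 1)(rate − 1) pixels, so the span must stay below the feature map size for the filter to remain effective. A sketch of this calculation:

```python
def effective_kernel(k: int, rate: int) -> int:
    """Spatial span of a k x k atrous (dilated) convolution filter."""
    return k + (k - 1) * (rate - 1)

# Feature map size at the bridge: 256 x 256 input after four 2x2 max-pools.
feature = 256
for _ in range(4):
    feature //= 2
print(feature)  # 16

for rate in (2, 4, 6, 18):
    print(rate, effective_kernel(3, rate))
# rate 2 -> span 5, rate 4 -> span 9, rate 6 -> span 13: all fit inside 16 x 16.
# rate 18 -> span 37: wider than the 16 x 16 map, so the 3 x 3 filter
# degenerates toward a 1 x 1 filter, as described in the text.
```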

In the decoder part, we used a transposed convolution layer (with a kernel size of 3 × 3 and a stride of 2) to perform the up-sampling operation. Instead of adopting the deeper architecture used in the encoder, we simply replaced all basic convolution layers with inception blocks.
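The 3 × 3, stride-2 transposed convolution doubles the spatial size at each decoder stage, which can be confirmed with the standard output-size formula. The padding and output-padding values below are assumptions chosen so that each stage exactly doubles the resolution (the paper does not state them).

```python
def conv_transpose_out(size: int, kernel: int = 3, stride: int = 2,
                       padding: int = 1, output_padding: int = 1) -> int:
    """Output size of a transposed convolution (PyTorch-style formula):
    out = (in - 1) * stride - 2 * padding + kernel + output_padding."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# Walking back up from the 16 x 16 bridge to the 256 x 256 input resolution.
size = 16
sizes = []
for _ in range(4):
    size = conv_transpose_out(size)
    sizes.append(size)
print(sizes)  # [32, 64, 128, 256]
```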

**Figure 5.** Modules introduced in our method. (**a**) represents the architecture of ASPP module and (**b**) represents the inception module.
