**5. Experiments and Results**

This section presents quantitative and qualitative evaluations that confirm the effectiveness of our network for detecting defects in subway tunnel images. The experimental settings are described in Section 5.1, and the results and discussion are presented in Sections 5.2 and 5.3, respectively. The experimental data were provided by Tokyo Metro Co., Ltd., a Japanese subway company.

#### *5.1. Settings*

The subway tunnel image dataset used in our experiments consisted of 47 images. The images were captured by visible-light cameras at high resolutions (e.g., 12,088 × 10,000 pixels or 12,588 × 10,000 pixels), and we divided each image into 256 × 256-pixel patches with a sliding interval of 64 pixels.
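To make the division concrete, the following is a minimal sketch of the sliding-window patch extraction, assuming NumPy arrays; the function name `extract_patches` and the dropping of border remainders are our illustrative choices, not part of the original implementation.

```python
import numpy as np

def extract_patches(image: np.ndarray, patch_size: int = 256, stride: int = 64):
    """Slide a patch_size x patch_size window over the image with the given stride.

    Returns the patches together with their top-left coordinates so that
    patch-level predictions can later be mapped back onto the full image.
    Border pixels that do not fill a whole window are dropped in this sketch.
    """
    h, w = image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
            coords.append((y, x))
    return np.stack(patches), coords
```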

In the training phase, we filtered the patches using the strategy introduced in Section 4.1. The pixel-level ground truth of defects was annotated by inspectors. We selected 280,000 patches from 29 images as our training dataset, in which the ratio of background to defect patches was set to 1:1. In the validation phase, seven images were divided using the same strategy as in the training phase, yielding 71,818 patches. The remaining 11 images were used in the test phase; here, we applied the same dividing strategy but did not discard background patches. Consequently, the number of patches used in the test phase was 326,172, significantly larger than in the training phase. After the test phase, we generated estimation images by recombining the patch-level results and averaging the predicted probability of each pixel over the overlapping patches.
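A minimal sketch of this recombination step is shown below, assuming the patch coordinates produced by the extraction sketch above; the accumulate-and-count scheme is one straightforward way to average overlapping predictions and is not taken from the authors' code.

```python
import numpy as np

def recombine(prob_patches, coords, image_shape, patch_size: int = 256):
    """Average per-pixel defect probabilities over overlapping patches.

    prob_patches: (N, patch_size, patch_size) predicted probability maps.
    coords: list of (y, x) top-left positions from extract_patches.
    image_shape: (height, width) of the original full-resolution image.
    """
    acc = np.zeros(image_shape, dtype=np.float64)  # summed probabilities
    cnt = np.zeros(image_shape, dtype=np.float64)  # patches covering each pixel
    for prob, (y, x) in zip(prob_patches, coords):
        acc[y:y + patch_size, x:x + patch_size] += prob
        cnt[y:y + patch_size, x:x + patch_size] += 1
    return acc / np.maximum(cnt, 1)  # guard against uncovered border pixels
```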

For the semantic segmentation task, Recall, Precision, F-measure, and Intersection over Union (IoU) were used as metrics to evaluate the binary classification performance. They are calculated as follows:

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}},\tag{1}$$

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}},\tag{2}$$

$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}},\tag{3}$$

$$\text{IoU} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}},\tag{4}$$

where TP, TN, FP, and FN represent the numbers of true-positive, true-negative, false-positive, and false-negative samples, respectively.
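The metrics in Eqs. (1)–(4) can be computed directly from binary masks; the sketch below is one way to do so, with a small `eps` added to the denominators (our choice, not from the paper) to avoid division by zero on empty masks.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Compute Recall, Precision, F-measure, and IoU (Eqs. (1)-(4))
    from binary prediction and ground-truth masks of equal shape."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    f_measure = 2 * recall * precision / (recall + precision + eps)
    iou = tp / (tp + fp + fn + eps)
    return recall, precision, f_measure, iou
```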

We compared our method with classic segmentation methods, including DeepLab-v3+ (CM1) [55], FCN (CM2) [56], and SegNet (CM3) [57]. Since the network input was set to 256 × 256 pixels, the output size of the encoder in DeepLab-v3+ was 16 × 16. To match this feature-map size, we adjusted the dilation rates of the multiple parallel atrous convolutions in the ASPP module using the same strategy as introduced in Section 4.2; an illustrative sketch is given below. In addition, since our network is based on the U-Net architecture, we included several previous U-Net variants as comparative methods (CM4–CM7). The design of each method is shown in Table 2. Among them, CM5 [58] adds additional down-sampling blocks to both the encoder and the decoder, increasing the overall down-sampling factor from 16 to 32.
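The sketch below shows a minimal PyTorch ASPP block with parallel atrous convolutions of different dilation rates. The rates `(1, 2, 4, 6)` and the layer layout are assumptions for illustration on a 16 × 16 feature map; the actual rates used in our experiments follow the strategy of Section 4.2, and the full DeepLab-v3+ ASPP additionally contains 1 × 1 convolution and image-pooling branches omitted here.

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Parallel atrous (dilated) convolutions over a small encoder output.

    Dilation rates are illustrative placeholders for a 16 x 16 feature map;
    large default rates (e.g., 6/12/18) would exceed such a map's extent.
    """
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 2, 4, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        # Fuse the concatenated branch outputs back to out_ch channels.
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```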

**Table 2.** Differences between the proposed method (PM) and the U-Net-based comparative methods (CM4–CM7) used in the experiment.

