#### **Problem 1:**

Subway tunnel images are high-resolution, but defects occupy only a small fraction of each image. Hence, the imbalance between background and foreground is pronounced in semantic segmentation.
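To make the imbalance concrete, the share of defect pixels in a label mask can be measured directly. The following is a minimal sketch, not the procedure used in this paper; the function name `foreground_fraction` and the toy mask are hypothetical, and the mask is assumed to use 0 for background and non-zero values for defects.

```python
def foreground_fraction(mask):
    """Return the share of non-zero (defect) pixels in a 2-D label mask."""
    total = sum(len(row) for row in mask)
    defect = sum(1 for row in mask for px in row if px != 0)
    return defect / total if total else 0.0


# Toy 4x8 mask containing a thin, crack-like defect (assumed labels: 0 = background).
mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

print(foreground_fraction(mask))  # 2 defect pixels out of 32 -> 0.0625
```

When this fraction is very small across a dataset, background pixels dominate the loss and its gradients unless the data or the loss is rebalanced.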

#### **Problem 2:**

Defects in subway tunnels vary widely in scale. It is also necessary to distinguish between defect types, since the repair action depends on the type of defect.

#### **Problem 3:**

Subway tunnel images contain complex backgrounds. Although the background area contains no defects, it often includes structures that resemble defects because of the construction conditions.

Hence, it is desirable to devise more effective network architectures that can recover the details of defects in subway tunnel images and improve the detection accuracy of multi-scale defects.

To solve the above problems, we focus on the U-Net architecture [28], one of the most widely used methods in biomedical image segmentation. U-Net's skip connections concatenate up-sampled feature maps with the corresponding feature maps from the encoder, which makes it possible to capture object details and location information effectively. U-Net and its variants have achieved impressive segmentation results in computer vision tasks, especially in detecting multi-scale targets [29–32]. Because the cracks in our task are long and thin, the network must be able to preserve features at high resolution, and U-Net is a suitable choice for this. Specifically, crack features (small targets) are mainly captured by the high-resolution layers, whereas water leakage features are mostly captured by the low-resolution layers. Because the architecture is succinct, it is also easy to add extra modules or modify the architecture to improve the detection capacity for the different kinds of segmentation targets in our task. The U-Net architecture is, therefore, suitable for our task.
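The effect of the skip connections on feature-map shapes can be traced with a short sketch. This is an illustration under stated assumptions rather than the configuration used in this paper: the function `unet_shapes` is hypothetical, and it assumes 64 base channels, four down-sampling stages, size-preserving convolutions, 2×2 pooling, and an up-sampling step that halves the channel count.

```python
def unet_shapes(height, width, base_channels=64, depth=4):
    """Trace (channels, height, width) through a U-Net-like encoder and decoder."""
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    encoder = []
    c, h, w = base_channels, height, width
    for _ in range(depth):
        encoder.append((c, h, w))            # kept for the skip connection
        c, h, w = c * 2, h // 2, w // 2      # pooling halves resolution, channels double
    bridge = (c, h, w)

    decoder = []
    for skip_c, skip_h, skip_w in reversed(encoder):
        # Assumed: up-sampling doubles resolution and halves channels,
        # so the up-sampled map matches the skipped map in every dimension.
        upsampled = (skip_c, skip_h, skip_w)
        # Concatenation along the channel axis adds the two channel counts.
        concat = (upsampled[0] + skip_c, skip_h, skip_w)
        decoder.append({"concat": concat, "out": (skip_c, skip_h, skip_w)})
    return encoder, bridge, decoder


encoder, bridge, decoder = unet_shapes(512, 512)
for skip, stage in zip(reversed(encoder), decoder):
    print("skip", skip, "-> concat", stage["concat"], "-> out", stage["out"])
```

In this trace, the last decoder stage concatenates a full-resolution encoder map, which is where thin structures such as cracks remain best preserved, while the stages near the bridge operate on coarse maps that cover larger regions.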

In this paper, we propose an improved version of the U-Net architecture to solve the above problems. As a network design for the multi-scale target segmentation of a particular image dataset, the U-Net architecture is a suitable foundation network for our task. To solve Problem 1, we adjust the image dataset to balance background and foreground images, which prevents background examples from dominating the gradients. To solve Problems 2 and 3, we optimize the network architecture using the following strategies. First, we replace all convolution blocks of the U-Net architecture with inception blocks [33]. Since the inception module consists of four branches with different kernel sizes and enlarges the network's receptive field, it improves the network's adaptation to features at different scales. For our task, this improvement increases the capacity to detect multi-scale defects. In addition, for the same purpose, we replace the first convolution layer of the bridge layer with an atrous spatial pyramid pooling (ASPP) module from DeepLab-v2 [34]. Combining these two kinds of structures results in more precise detection and mitigates the over-fitting problem.
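The way parallel branches widen the range of scales a layer responds to can be illustrated with the effective kernel extent of a convolution, which for kernel size k and dilation rate r is k + (k − 1)(r − 1). The sketch below is only illustrative: the helper `effective_kernel` is hypothetical, the inception kernel sizes (1, 3, 5) are assumed, and the dilation rates (6, 12, 18, 24) are those of the ASPP-L variant described for DeepLab-v2, which may differ from the rates used in this work.

```python
def effective_kernel(kernel_size, dilation=1):
    """Spatial extent covered by one (possibly dilated) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Assumed kernel sizes for the convolutional branches of an inception block.
inception_kernels = (1, 3, 5)
# Assumed dilation rates for 3x3 ASPP branches (DeepLab-v2 ASPP-L setting).
aspp_rates = (6, 12, 18, 24)

for k in inception_kernels:
    print(f"inception branch {k}x{k}: extent {effective_kernel(k)}")

for r in aspp_rates:
    print(f"ASPP branch 3x3, rate {r}: extent {effective_kernel(3, r)}")
```

Under these assumptions, the inception branches cover small neighbourhoods suited to thin targets, while the dilated branches in the bridge span much larger extents at the same parameter cost as an ordinary 3×3 kernel.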

Our contributions are summarized as follows:


This paper is organized as follows: Summaries of related works on defect detection and classification are presented in Section 2. Next, Section 3 shows the data characteristics, and Section 4 shows the proposed method and the adopted network architectures. The experimental results are shown in Section 5. Finally, our conclusion is presented in Section 6.

#### **2. Related Works**

In this section, we discuss related work on applied computer vision tasks, the U-Net family, and defect detection. Recent computer vision application tasks are covered in Section 2.1, more specific architectures based on U-Net are explained in Section 2.2, and methods for defect detection are presented in Section 2.3.
