#### *2.1. Computer Vision Task for Application*

Computer vision has made great progress with the rise of deep learning [35]. Earlier work focused mainly on recognizing objects in images; with deep learning, accuracy has approached the level required for real-world applications [25,36]. Recognition of general objects exceeded human accuracy in a competition held in 2015, and various methods for more advanced tasks such as object detection and pixel-level segmentation have been proposed [37,38]. In parallel, the technology has been applied not only in computer science but also in various other fields. Transfer learning has shown that feature representations acquired through general image recognition can be useful for tasks in other domains [39,40]. In addition, a number of methods have been proposed for tasks where the amount of data is insufficient [41].

After general images, medical images are the next area in which the technology is expected to be applied in society [42–44]. Medical images are highly specialized owing to well-defined imaging standards, and the quality of the captured images is high. Therefore, supervised learning, a strength of deep learning, has succeeded in building relatively accurate models [45].

#### *2.2. Deep Learning with U-Net and Its Variants*

U-Net, a well-known biomedical image segmentation network proposed in 2015, has a completely symmetric encoder–decoder structure in which features extracted by the convolutional layers of the encoder are concatenated with the up-sampling layers of the same size; thus, both high- and low-level feature maps are preserved and passed to the decoder to obtain more precise segmentation. Variants were proposed in the following years and are still applied to real-world segmentation tasks.
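To make the skip concatenation concrete, the following minimal sketch traces feature-map shapes through a U-Net-style encoder–decoder. It is shape bookkeeping only, not a trainable network, and it rests on assumptions that are not taken from the text: size-preserving ("same"-padded) convolutions, 2x2 pooling, up-sampling that halves the channel count, and a base width of 64 channels. The function name `unet_shapes` and its parameters are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def unet_shapes(in_shape: Shape, base_channels: int = 64,
                depth: int = 4) -> List[Tuple[str, Shape]]:
    """Trace feature-map shapes through a symmetric U-Net-style network.

    Assumptions (illustrative only): convolutions keep the spatial size,
    each encoder level doubles the channels and halves the resolution,
    and each decoder level does the reverse.
    """
    _, h, w = in_shape
    if h % (2 ** depth) or w % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    trace: List[Tuple[str, Shape]] = []
    skips: List[Shape] = []
    ch = base_channels

    # Encoder: record each level's output so the decoder can reuse it.
    for level in range(depth):
        shape = (ch, h, w)
        trace.append((f"enc{level}", shape))
        skips.append(shape)
        h, w = h // 2, w // 2  # 2x2 pooling
        ch *= 2

    trace.append(("bottleneck", (ch, h, w)))

    # Decoder: up-sample, then concatenate the same-size encoder features.
    for level in reversed(range(depth)):
        h, w = h * 2, w * 2
        up_ch = ch // 2
        skip_ch, skip_h, skip_w = skips[level]
        assert (skip_h, skip_w) == (h, w), "skip and decoder sizes must match"
        trace.append((f"concat{level}", (up_ch + skip_ch, h, w)))
        ch = up_ch  # convolutions after the concat reduce the channels again
        trace.append((f"dec{level}", (ch, h, w)))

    return trace


if __name__ == "__main__":
    for name, shape in unet_shapes((1, 256, 256)):
        print(f"{name:>10}: {shape}")
```

Under these assumptions, each `concat` entry has twice the channels of the decoder output that follows it, which reflects the encoder features being stacked onto the up-sampled features before the next convolutions.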

Common improved variants of U-Net redesign the convolutional modules and modify the down- and up-sampling paths. Many such methods have been proposed, including TernausNet [46], Res-UNet [47], Dense U-Net [48], and R2U-Net [31]. For example, TernausNet replaces the encoder with VGG11, Res-UNet and Dense U-Net replace all submodules with residual and dense connections, respectively, and R2U-Net combines recurrent convolution and residual connections in a submodule. U-Net++ [29] and U-Net 3+ [30] aim to increase the capacity for multi-scale target detection. The main advantage of these variants is that they can capture features at different levels and integrate them through feature superposition.
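The difference between the residual and dense connections mentioned above can be illustrated with a toy sketch. It assumes that a feature map is reduced to a plain list of numbers (one entry per channel) and that a layer is any function on such lists; the names `residual_block` and `dense_block` are hypothetical and do not reproduce the submodules of the cited networks.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def residual_block(x: Vector, f: Layer) -> Vector:
    """Residual connection: add the block input to the block output."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("residual addition needs matching sizes")
    return [a + b for a, b in zip(x, fx)]


def dense_block(x: Vector, layers: List[Layer]) -> Vector:
    """Dense connection: each layer receives all earlier features,
    and its output is concatenated onto them."""
    features = list(x)
    for layer in layers:
        features = features + layer(features)
    return features


if __name__ == "__main__":
    double: Layer = lambda v: [2.0 * e for e in v]
    total: Layer = lambda v: [sum(v)]
    print(residual_block([1.0, 2.0], double))  # size is unchanged
    print(dense_block([1.0], [total, total]))  # size grows with each layer
```

In this simplified picture, a residual connection keeps the feature size fixed and sums the input with the output, whereas a dense connection lets the feature size grow because every earlier output is retained and concatenated.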
