### 2.3. DenseUNet

#### 2.3.1. Network Architecture

We chose U-Net as the primary network architecture. In semantic segmentation, achieving good results requires retaining low-level details while acquiring high-level semantic information. Copying low-level features to the corresponding high levels creates information transmission paths that allow signals to propagate naturally between the lower and higher levels; this not only helps backpropagation during training but also compensates the high-level semantic features with low-level details. We show that using dense units instead of plain units can further improve the performance of U-Net. In this paper, the dense block is used as the sub-module for feature extraction. By design, DenseUNet allows each layer to access all of its preceding feature maps. Through feature reuse throughout the network, DenseUNet exploits the potential of the architecture to form efficient, compact models.

To restore the spatial resolution, FCN introduces an up-sampling path that includes convolution, up-sampling operations (transposed convolution or linear interpolation), and skip connections. In DenseUNet, this up-sampling operation is carried out by a transition up module. The transition up module consists of a transposed convolution that upsamples the preceding feature maps. The up-sampled feature maps are then concatenated with the input from the encoder skip connection to form a new input, as sketched below. We utilize an 11-level deep neural network architecture to extract road areas, as shown in Figure 2.
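As an illustration of this module, here is a minimal sketch in PyTorch (the framework, class name, and kernel/stride settings are our own illustrative assumptions, not the exact configuration of the paper):

```python
import torch
import torch.nn as nn

class TransitionUp(nn.Module):
    """Transposed-convolution transition up: upsamples the decoder feature maps
    and concatenates them with the encoder feature maps from the skip connection."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # stride-2 transposed convolution doubles the spatial resolution
        self.up = nn.ConvTranspose2d(in_channels, out_channels,
                                     kernel_size=3, stride=2,
                                     padding=1, output_padding=1)

    def forward(self, x, skip):
        x = self.up(x)                       # restore spatial resolution
        return torch.cat([x, skip], dim=1)   # new input for the next dense block
```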

**Figure 2.** The architecture of the proposed deep DenseUNet. The dense block exploits the potential of the network to form efficient, compact models through feature reuse.

#### 2.3.2. Dense Block

Deep neural networks extract multi-level features of remote sensing images, from low to high, through convolution and pooling operations. The first few layers of a convolutional neural network mainly extract low-level features such as road edges and textures, while the deeper layers extract more complete features, including road contours and location information. Increasing depth can improve the performance of multi-layer neural networks and extract higher-level semantic information; however, it may hinder training and cause degradation problems, an issue rooted in backpropagation [48]. He et al. [49] proposed residual neural networks to speed up training and solve the degradation problem. The residual neural network consists of a series of residual units. Each unit can be represented in the following form:

$$Z\_l = H\_l(Z\_{l-1}) + Z\_{l-1} \tag{1}$$

Among them, *Z*<sub>*l*−1</sub> and *Z*<sub>*l*</sub> are the input and output of the *l*th residual unit, and *H*<sub>*l*</sub>(·) is the residual function. Therefore, for the ResNet model, the output of the *l*th layer is the sum of the identity mapping of the (*l*−1)th layer and a nonlinear transformation of it. The connection between the low levels and the high levels of the network facilitates the propagation of information without degradation. However, this kind of summation destroys, to a certain extent, the information flow between the layers of the network [50]. Here, we present DenseUNet, a semantic segmentation neural network that combines the advantages of densely connected convolutional networks and U-Net. This architecture can be considered an extension of ResNet, which iteratively sums up the previous feature maps; replacing that summation with concatenation is a small change, but it has some important implications: (1) feature reuse: all layers can easily access their preceding layers, so the information in previously computed feature maps is easily reused; (2) parameter efficiency: DenseUNet is more efficient in its parameter usage; (3) implicit deep supervision: because of the short paths to all feature maps in the architecture, DenseUNet provides deep supervision. Figure 3 shows the basic dense network unit used in this paper.
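For concreteness, a minimal sketch of a residual unit implementing Equation (1), assuming PyTorch; the layers placed inside *H*<sub>*l*</sub>(·) here (BN, ReLU, 3 × 3 convolution) are an illustrative assumption:

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Eq. (1): Z_l = H_l(Z_{l-1}) + Z_{l-1}; features are combined by summation."""
    def __init__(self, channels):
        super().__init__()
        self.h = nn.Sequential(              # residual function H_l
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, z):
        return self.h(z) + z                 # identity mapping plus transformation
```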

**Figure 3.** Dense network unit. Fractal structures have statistical or exact self-similar forms.

Dense network units are fractal architectures. **Dense block layers are connected to each other so that each layer in the network accepts the feature maps of all its preceding layers as input.** Left: a simple expansion rule generates fractal architectures with *C* intertwined columns. The base case *H*<sub>1</sub>(*Z*) has a single layer of the selected type (e.g., convolution) between input and output. The joining layers compute the element-wise average. Right: a deep convolutional neural network periodically reduces the spatial resolution by pooling. A fractal version uses *H*<sub>1</sub>(*Z*) as the building block between pooling layers. Stacking *B* such blocks produces a network whose total depth, measured in convolutional layers, is *B* × 2<sup>*C*−1</sup>. Dense units consist of three parts: dense connections, growth rate, and bottleneck layers.

**Dense connections**. In order to further enhance the transmission of information among network layers, this paper constructs a different connection mode: direct connections are introduced from any layer to all subsequent layers. Figure 3 shows this layout in DenseUNet. Consequently, the *Z*<sub>*l*</sub> layer receives the feature maps of all preceding layers, *Z*<sub>0</sub>, *Z*<sub>1</sub>, ··· , *Z*<sub>*l*−1</sub>, as input:


$$Z\_l = H\_l([Z\_0, Z\_1, \dots, Z\_{l-1}]) \tag{2}$$

Among them, [*Z*<sub>0</sub>, *Z*<sub>1</sub>, ··· , *Z*<sub>*l*−1</sub>] refers to the concatenation of the feature maps generated in layers 0, ... , *l* − 1. To simplify the implementation, the multiple inputs of *H*<sub>*l*</sub>(·) in Equation (2) are concatenated into a single tensor. We define *H*<sub>*l*</sub>(·) as a composite function of three consecutive operations: batch normalization, followed by a 3 × 3 convolution and a rectified linear unit.
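A minimal sketch of a dense block built from the composite function defined above (BN, 3 × 3 convolution, ReLU), assuming PyTorch; unlike the residual unit, each layer concatenates rather than sums the preceding feature maps, as in Equation (2). Class names and layer counts are illustrative:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Composite function H_l: batch normalization, 3x3 convolution, ReLU."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.h(x)

class DenseBlock(nn.Module):
    """Eq. (2): each layer takes the concatenation [Z_0, Z_1, ..., Z_{l-1}] as input."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all previous feature maps
            features.append(out)
        return torch.cat(features, dim=1)
```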

**Growth rate**. *H*<sub>*l*</sub> generates *G* feature maps, so the *l*th layer has *G*<sub>0</sub> + *G* · (*l* − 1) input feature maps, where *G*<sub>0</sub> is the number of channels in the input layer. The difference between DenseUNet and existing network architectures is that DenseUNet can have very narrow layers. The hyper-parameter *G* is called the growth rate of the network.
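For illustration (these numbers are not the settings reported in this paper): with *G*<sub>0</sub> = 48 input channels and a growth rate of *G* = 16, the 4th layer of a dense block receives 48 + 16 · (4 − 1) = 96 feature maps as input while contributing only 16 new ones.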

**Bottleneck layers**. Although each layer generates only *G* output feature maps, it usually has many more inputs. It has been noted in [51] that a 1 × 1 convolution can be introduced as a bottleneck layer before each 3 × 3 convolution to reduce the number of input feature maps and improve computational efficiency. We adopt such a bottleneck layer in our network, i.e., the BN-Conv-ReLU version of *H*<sub>*l*</sub>. Figure 4 shows the operations of the dense block layers, transition down, and transition up.
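A sketch of the bottleneck variant of *H*<sub>*l*</sub>, again assuming PyTorch; the intermediate width of 4 · *G* channels follows common DenseNet practice and is an assumption here, not a value stated in this paper:

```python
import torch.nn as nn

class BottleneckDenseLayer(nn.Module):
    """H_l with a bottleneck: a 1x1 convolution first reduces the number of
    input feature maps before the 3x3 convolution produces G new feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter_channels = 4 * growth_rate      # assumed bottleneck width
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.Conv2d(in_channels, inter_channels, kernel_size=1),   # bottleneck
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(inter_channels),
            nn.Conv2d(inter_channels, growth_rate, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.h(x)
```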

**Figure 4.** Basic layers of the dense block, Transition Down, and Transition Up. (**a**) The dense block layer consists of BN, followed by ReLU and dropout; (**b**) Transition Down consists of BN followed by ReLU, dropout, and a max-pooling of size 2 × 2; (**c**) Transition Up consists of a convolution with nearest-neighbor interpolation, which compensates for the spatial information lost in the pooling process.
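Following the description in Figure 4, a minimal sketch of Transition Down and of the nearest-neighbor variant of Transition Up, assuming PyTorch; the dropout rate and kernel size are illustrative assumptions:

```python
import torch.nn as nn

class TransitionDown(nn.Module):
    """Figure 4b: BN, ReLU, dropout, and 2x2 max-pooling halve the resolution."""
    def __init__(self, channels, drop_rate=0.2):  # drop_rate is an assumed value
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Dropout2d(drop_rate),
            nn.MaxPool2d(kernel_size=2),
        )

    def forward(self, x):
        return self.block(x)

class TransitionUpInterp(nn.Module):
    """Figure 4c: nearest-neighbor up-sampling followed by a convolution,
    compensating for spatial information lost during pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))
```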

In our experiments on the Conghua roads dataset and the Massachusetts roads dataset, we used a DenseUNet structure with five dense blocks on 256 × 256 input images. The number of feature maps in the other layers also follows the growth-rate setting *G*. In the present study, we used the Adam optimizer to minimize the cross-entropy classification loss. Let *Y* be a reference foreground segmentation with values *y*<sub>*i*</sub>, and *X* be a predicted probability map of the foreground markers over the *N* image elements, with values *x*<sub>*i*</sub>, so that the probability of the background class is 1 − *x*<sub>*i*</sub>. The cross-entropy measures the dissimilarity between the approximate output distribution produced by the trained model and the true distribution of the labels. The binary cross-entropy loss function is defined as:

$$loss = -\frac{1}{N} \sum\_{i}^{N} \left( y\_i \cdot \log \mathbf{x}\_i + (1 - y\_i) \cdot \log(1 - \mathbf{x}\_i) \right) \tag{3}$$
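A direct translation of Equation (3), assuming PyTorch; the small constant `eps` is an implementation detail added to avoid log(0):

```python
import torch

def binary_cross_entropy(x, y, eps=1e-7):
    """Eq. (3): mean binary cross-entropy between predicted foreground
    probabilities x and reference labels y (tensors with N elements each)."""
    x = x.clamp(eps, 1.0 - eps)   # keep probabilities away from 0 and 1
    return -(y * torch.log(x) + (1.0 - y) * torch.log(1.0 - x)).mean()
```

In training, such a loss would typically be minimized with `torch.optim.Adam`, consistent with the Adam optimizer mentioned above; the learning-rate schedule is not specified in this section.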

For binary classification tasks, a reasonable ratio of positive to negative samples is about 1:1. However, we find that a serious class imbalance between foreground and background is a central difficulty in training semantic segmentation models on high-resolution remote sensing images.

When the loss function gives equal weight to positive and negative samples, the category with many samples dominates the training process, and the trained model is biased toward that category, which reduces the generalization ability of the model. We suggest reshaping the standard cross-entropy loss to address the class imbalance problem, reducing the relative loss assigned to the over-represented class. The weighted two-class cross-entropy can be expressed as:

$$\text{loss} = -\frac{1}{N} \sum\_{i}^{N} \left( \theta\_1 \cdot y\_i \cdot \log \mathbf{x}\_i + (1 - y\_i) \cdot \log(1 - \mathbf{x}\_i) \right) \tag{4}$$

where θ<sub>1</sub> is the weight attributed to the foreground class, defined here as:

$$\theta\_1 = \frac{N - \sum\_{i}^{N} \mathbf{x}\_i}{\sum\_{i}^{N} \mathbf{x}\_i} \tag{5}$$
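Equations (4) and (5) translate into code in the same way; the sketch below again assumes PyTorch, and the guard against division by zero is an added implementation detail:

```python
import torch

def weighted_binary_cross_entropy(x, y, eps=1e-7):
    """Eqs. (4)-(5): the foreground term is weighted by
    theta_1 = (N - sum(x)) / sum(x), computed from the prediction map x."""
    x = x.clamp(eps, 1.0 - eps)
    n = x.numel()                                      # number of image elements N
    theta_1 = (n - x.sum()) / x.sum().clamp(min=eps)   # Eq. (5)
    return -(theta_1 * y * torch.log(x)
             + (1.0 - y) * torch.log(1.0 - x)).mean()  # Eq. (4)
```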

By appropriately increasing the loss caused by misclassified positive samples, the problem of the vast difference between the numbers of positive and negative samples is alleviated to some extent.
