**3. Experiments**

#### *3.1. Model Preprocessing*

#### 3.1.1. Software and Hardware Environment

To examine the proposed method, we construct a system platform composed of two parts: the software and the hardware environment. Training and testing deep neural networks requires high-performance machines and consumes a large amount of video memory. TensorFlow offers high efficiency, strong scalability, and a highly flexible design, and its efficiency continues to improve with the support of the TensorFlow research community. For these reasons, this paper selects the TensorFlow framework for network training. The basic configuration is shown in Table 1:



#### 3.1.2. Data Augmentation

A deep learning model must be trained with sufficient data, and as the input size of a deep neural network increases, the number of training parameters after the convolution operations also increases. To make efficient use of video memory and improve training efficiency, we use a 256 × 256 window to crop image blocks. One of the main problems in training such models is the low number of training samples. Although transfer learning is effective in other domains, remote sensing images differ essentially from conventional images in their rich spectral settings, wide range of image values, and different color and texture distributions. We therefore introduce an image enhancement method, known as data augmentation, which adds more data to the training dataset and improves the generalization ability of the model. Data augmentation has already been shown to bring many benefits to convolutional neural networks (CNNs) [52]. For example, it acts as a regularizer to prevent overfitting in neural networks [53] and improves performance on unbalanced class problems [54]. As shown in Figure 5, the training set is expanded six-fold.

**Figure 5.** Data augmentation. The method mainly includes rotation, flipping (horizontally and vertically), and cropping operations.
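The six-fold expansion described above can be sketched as follows. This is a minimal illustration assuming the six variants are the original tile plus three rotations and two flips; the exact combination used in the paper is not fully specified, so the function name and choice of transforms are illustrative.

```python
import numpy as np

def augment_six_fold(tile):
    """Return six variants of a cropped image tile (e.g. a 256 x 256 window):
    the original, three rotations, and horizontal/vertical flips.

    Assumed combination; the paper only names rotation, flipping,
    and cropping as the augmentation operations.
    """
    return [
        tile,                 # original
        np.rot90(tile, 1),    # 90-degree rotation
        np.rot90(tile, 2),    # 180-degree rotation
        np.rot90(tile, 3),    # 270-degree rotation
        np.fliplr(tile),      # horizontal flip
        np.flipud(tile),      # vertical flip
    ]
```

Applied to every cropped block, this multiplies the training set size by six without collecting new imagery.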

#### 3.1.3. Hyper-Parameters Selection

Searching for the optimal model requires training multiple models in parallel. The choice of batch size, learning rate, and optimization algorithm makes each model unique and distinct from the others, so selecting the best model requires optimizing these hyper-parameters. We use TensorFlow to train many models in parallel: three hyper-parameters (batch size, learning rate, and number of epochs) define the candidate models, and accuracy on the test dataset determines the best one. We studied various methods for learning deep learning models from the training dataset; hyper-parameters control the training process. Adam is an adaptive learning-rate method that requires little tuning, is computationally efficient, and outperforms other stochastic optimization methods. We chose Adam as the optimizer because it converges faster than standard stochastic gradient descent with momentum, and we fix its parameters as recommended in Reference [55]: β1 = 0.9 and β2 = 0.999. The network hyper-parameter settings are shown in Table 2.
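For reference, a single Adam update with the fixed moment decays β1 = 0.9 and β2 = 0.999 can be sketched in plain NumPy as below. This is an illustrative re-implementation of the standard Adam rule from Reference [55], not the paper's actual TensorFlow training code; the learning rate and epsilon values are assumptions.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (step index t starts at 1).

    beta1 and beta2 match the values fixed in the paper; lr and eps
    are illustrative defaults.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for m
    v_hat = v / (1 - beta2 ** t)              # bias correction for v
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

The bias-corrected moments give Adam its near step-size-invariant behavior early in training, which is one reason it needs less learning-rate tuning than plain SGD with momentum.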



We evaluated the proposed method against the classical U-Net, SegNet, GL-Dense-U-Net, and FRRN-B networks on two urban-scene datasets: the Conghua road dataset and the Massachusetts road dataset. To quantitatively estimate the performance of the semantic segmentation methods, we report precision, recall, F1-score, intersection over union (IoU), and kappa. The recall is defined as the ratio of correctly detected road pixels to the sum of correct detections and false negatives, and is used to assess road completeness. The precision is the proportion of the classifier's positive predictions that are correct, and reflects the correctness of the extracted roads. The F1-score is the harmonic mean of precision and recall. The IoU is the ratio of the overlap between the ground-truth and predicted regions of interest to the area of their union. The kappa coefficient is a statistic that measures inter-rater agreement for categorical items, and it is commonly used to assess the accuracy of remote sensing image classifications.
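The five metrics above can be computed directly from the pixel-wise confusion counts. The sketch below assumes binary road/background masks and ignores degenerate cases (e.g. an empty prediction); the per-image averaging scheme used in the paper is not specified, so only the per-mask formulas are shown.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    """Precision, recall, F1, IoU, and Cohen's kappa for binary masks.

    Illustrative implementation of the metric definitions in the text;
    assumes both masks contain at least one positive and one negative pixel.
    """
    y_true = np.asarray(y_true).astype(bool).ravel()
    y_pred = np.asarray(y_pred).astype(bool).ravel()

    tp = np.sum(y_true & y_pred)    # true positives
    fp = np.sum(~y_true & y_pred)   # false positives
    fn = np.sum(y_true & ~y_pred)   # false negatives
    tn = np.sum(~y_true & ~y_pred)  # true negatives
    n = tp + fp + fn + tn

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)

    # Cohen's kappa: observed agreement vs. chance agreement.
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (po - pe) / (1 - pe)

    return {"precision": precision, "recall": recall,
            "f1": f1, "iou": iou, "kappa": kappa}
```

Note that kappa discounts agreement expected by chance, which is why it is preferred over plain pixel accuracy for class-imbalanced remote sensing scenes.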

#### *3.2. Massachusetts Dataset*

The Massachusetts dataset [56] has an image resolution of 1 m, and each image contains 3 × 1500 × 1500 pixels. This open road dataset contains 1711 aerial images covering a total area of more than 2600 square kilometers, divided into 1108 training images, 14 validation images, and 49 test images. Figure 6 shows that the U-Net, SegNet, and FRRN-B models can correctly identify most of the roads. Although these models eliminate the effects of shadows and buildings to some extent, their extraction results show lower correctness on dense road networks than in other regions; the results lack continuity, and the road edges are not distinct enough. U-Net and SegNet perform poorly and lack the necessary connectivity on dense roads. From the sixth and seventh columns, GL-Dense-U-Net performs comparably to DenseUNet; both models show good results on both single-lane and dual-lane roads.

**Figure 6.** The original true-color composite images and classification results for three regions using the deep learning methods. True positives (TP), false negatives (FN), and false positives (FP) are marked in green, blue, and red, respectively.

#### *3.3. Conghua Dataset*

The image resolution of the Conghua dataset is 0.2 m, and it consists of three bands: red, green, and blue (RGB). The dataset contains 47 aerial images, each of 3 × 6000 × 6000 pixels. Of these, 80% of the data is used for training, and the remaining 20% for model validation. Figure 7 shows that the white dotted-line area is covered with thick trees; road occlusion by trees is frequent, especially in urban environments, making model performance more challenging there than in other areas. The proposed method is hardly affected by shadow occlusion, and its average performance is better than that of the other three classical CNN-based semantic segmentation algorithms. The performance of the GL-Dense-U-Net model on this dataset is comparable to that of DenseUNet; the extracted road edge information is relatively complete and maintains connectivity. Our method extracts the local feature information of the image accurately and effectively. Figure 8 shows the details of the shaded area.

(a) Input image (b) Ground truth (c) U-Net (d) SegNet (e) FRRN-B (f) GL-Dense-U-Net (g) DenseUNet

**Figure 7.** The original true-color composite images and classification results for three regions using the deep learning methods. True positives (TP), false negatives (FN), and false positives (FP) are marked in green, blue, and red, respectively. The white dotted-line areas are enlarged for close-up inspection in Figure 8.

(a) Input image (b) Ground truth (c) U-Net (d) SegNet (e) FRRN-B (f) GL-Dense-U-Net (g) DenseUNet

**Figure 8.** Close-up views of the original true-color composite images and classification results for three regions using the deep learning methods. The images are subsets of the white dotted-line areas marked in Figure 7. True positives (TP), false negatives (FN), and false positives (FP) are marked in green, blue, and red, respectively.

#### *3.4. Accuracy Evaluation*

Table 3 compares the accuracy of automatic classification. The proposed method achieves the highest accuracy, and both its F1-score and kappa are significantly higher than those of the three classical semantic segmentation methods on both datasets; the kappa values of its classification results are 0.703 and 0.801, respectively. The proposed method also obtains the highest F1-score, which combines the recall and precision metrics. The experimental results show that the average performance of the method in recall, precision, and F1-score is better than that of the other three classical semantic segmentation methods. In addition, the method produces relatively high average IoU and kappa over all images in the test set, which is consistent with the prediction results shown in Figures 6 and 8.


**Table 3.** The experimental results of road extraction.

Figures 6 and 8 illustrate three example results of U-Net, SegNet, FRRN-B, GL-Dense-U-Net, and the proposed DenseUNet. Compared with the other four methods, our method offers higher accuracy and lower noise. Especially in the case of dense roads and shadows, our method can delineate each lane with high reliability and cope with prominent shadows, as shown in the third row of Figures 6 and 8.

#### *3.5. Model Analysis*

Road context information is essential when analyzing objects with complex structures. Our network takes the information around a road into account, which helps distinguish roads from similar objects such as building roofs and dense trees; this contextual information remains robust when the road is occluded. In the first row of Figure 7, some of the roads in the circled area are covered by trees. The three classical semantic segmentation methods cannot detect the road under the trees, whereas our method marks them successfully to some extent. A failure case is shown within the gold dotted line of Figure 8: the proposed method has a distinct false detection rate on impervious surfaces. This is mainly because most roads within the urban impervious surface are not labeled; since our network treats them as contextual information for the foreground, these surfaces share the same characteristics as normal roads. To provide better insight into the performance of the proposed method, Figure 9 shows the loss curves during training. The losses of the models decrease slowly as training proceeds and eventually stabilize. Although the U-Net model fluctuates strongly in the initial stage of training, it finally reaches a convergence state, and the improved model ultimately achieves good convergence. The connections within dense units and the skip connections between the lower and higher levels of the network help information propagate without degradation, so a neural network with fewer parameters can be designed while achieving comparable or better semantic segmentation performance.

**Figure 9.** Training loss. (**a**) The five curves in blue, yellow, green, red, and purple represent the losses of U-Net, SegNet, FRRN-B, GL-Dense-U-Net, and DenseUNet, respectively; (**b**) the four curves represent models with different growth rates and modified weights.

DenseUNet extracts multi-level features from different stages of the dense block, which strengthens the fusion of features at different scales. We trained DenseUNet with different growth rates G; the main results on the two datasets are shown in Table 4. The accuracy shows that the model performs best when G is equal to 24. Moreover, Table 4 shows that relatively small growth rates are sufficient to achieve excellent results on the test datasets. The growth rate defines the amount of new information each layer contributes to the global state, which can be accessed from anywhere in the network and, unlike in traditional network architectures, does not need to be replicated from layer to layer.
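The effect of the growth rate on model size can be made concrete with the standard DenseNet bookkeeping: each layer in a dense block concatenates G new feature maps onto the shared state. The helper below is a hypothetical illustration (function and parameter names are ours, and the layer count per block is an assumption, since the paper does not restate it here).

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Channel count of the concatenated state after a dense block.

    Each of the `num_layers` layers appends `growth_rate` (G) feature
    maps to the running concatenation, so the output has
    in_channels + G * num_layers channels. Illustrative helper;
    names and the example layer count are assumptions.
    """
    return in_channels + growth_rate * num_layers
```

For example, a block with 4 layers and G = 24 starting from 64 input channels ends with 64 + 4 × 24 = 160 channels, which is why small growth rates keep the parameter count low while the dense connections still expose every earlier feature map to later layers.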


**Table 4.** Results of different growth factors.
