**4. Discussion**

Table 5 shows the statistics of the tested deep learning models and the DenseUNet variants. The average running time was calculated over 50 iterations. U-Net adopts a shallow encoder–decoder structure, so it requires fewer computational resources and less inference time than the other models; however, the roads it extracts from both datasets are fragmented. DenseUNet adopts a custom encoder–decoder architecture that balances computational cost against inference time, consuming fewer resources and less inference time than most of the other models.
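The average running time reported in Table 5 is the mean over 50 repeated forward passes. A minimal sketch of such a benchmark, assuming a hypothetical `model` callable and dummy input (the names and warm-up policy are illustrative, not from the paper):

```python
import time

def average_inference_time(model, batch, n_iters=50, warmup=5):
    """Average wall-clock time of model(batch) over n_iters runs.

    A few warm-up iterations are discarded first so one-off setup
    costs (allocation, caching) do not skew the mean.
    """
    for _ in range(warmup):
        model(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    return (time.perf_counter() - start) / n_iters

# Toy stand-in for a segmentation network: identity over a pixel list.
dummy_model = lambda x: [p for p in x]
t = average_inference_time(dummy_model, list(range(1000)))
print(f"{t * 1000:.3f} ms per run")
```

Averaging over many iterations matters because single-run timings are dominated by scheduler and cache noise.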

**Table 5.** Comparison of network efficiency between the tested deep learning models and DenseUNet.


GL-Dense-U-Net is comparable to DenseUNet on most indicators. GL-Dense-U-Net consists of Local Attention Units (LAU) and Global Attention Units (GAU). The LAU applies 1 × 1, 3 × 3, 5 × 5, and 7 × 7 kernels in parallel and fuses the resulting feature maps step by step from the bottom to the top, realizing pixel-level attention across scales. The GAU introduces global average pooling (GAP) into the unit to extract comprehensive road information, combining low- and high-level feature maps so that global information guides feature restoration. However, because both the encoding and decoding layers of GL-Dense-U-Net are composed of the dense unit blocks of DenseNet, with LAU units added in the encoding stage and GAU units connected in the decoding stage, GL-Dense-U-Net is the largest model and has the longest inference time. DenseUNet adopts dense unit modules only in the encoding stage, while the upsampling stage of the decoder adopts the skip connections characteristic of U-Net. DenseUNet therefore requires a shorter inference time (316 ms) and a smaller model size (118 MB) than the other models. In general, DenseUNet is more efficient than most of the models tested. In addition, each convolutional layer in a dense block outputs G feature maps. The model uses a small growth rate (G = 12) to obtain good results, as shown in Table 4: the overall accuracy and mIoU of DenseUNet-G-12 on the Massachusetts dataset reached 92.22% and 73.24%, respectively.
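The growth rate G directly controls model size: every layer in a dense block emits G feature maps, which are concatenated with the block input and all earlier outputs, so the channel count grows linearly with depth. A sketch of that bookkeeping (pure arithmetic; the channel numbers are illustrative, not taken from the paper's configuration):

```python
def dense_block_channels(c_in, n_layers, growth_rate):
    """Output channel count of a DenseNet-style dense block.

    Each layer sees the concatenation of the block input and all
    previous layer outputs, and contributes growth_rate new maps.
    """
    return c_in + n_layers * growth_rate

# With a small growth rate (G = 12), channel counts stay modest:
print(dense_block_channels(64, 4, 12))   # -> 112
# A larger growth rate inflates every downstream layer:
print(dense_block_channels(64, 4, 32))   # -> 192
```

This linear growth is why a small G keeps both the parameter count and the 118 MB model size manageable without sacrificing the feature reuse that dense connectivity provides.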

To further verify the reliability of the proposed model, two sets of remote sensing images with different resolutions were selected and compared against four classical image segmentation models. The overall accuracy and mIoU of DenseUNet reached 93.93% and 74.47%, respectively, on the Massachusetts dataset, and 95.02% and 80.89%, respectively, on the Conghua dataset. In particular, the classification result on the Massachusetts dataset is better than that of GL-Dense-U-Net [41]. Overall, DenseUNet performs better on the Conghua dataset than on the Massachusetts dataset, which may be attributable to the higher spatial resolution of the Conghua imagery.
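Both metrics quoted above derive from the per-class confusion matrix: overall accuracy is the diagonal fraction, and mIoU averages TP / (TP + FP + FN) over classes. A minimal numpy sketch (the toy two-class matrix is illustrative, not the paper's data):

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of pixels on the confusion-matrix diagonal."""
    return np.trace(cm) / cm.sum()

def mean_iou(cm):
    """Mean over classes of TP / (TP + FP + FN)."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp   # true class c but missed
    return np.mean(tp / (tp + fp + fn))

# Toy background/road confusion matrix, rows = ground truth.
cm = np.array([[90, 10],
               [ 5, 45]])
print(overall_accuracy(cm))   # -> 0.9
print(mean_iou(cm))
```

Note that mIoU weights the thin road class equally with the dominant background class, which is why it is typically well below the overall accuracy for road extraction.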

The proposed DenseUNet still has room for improvement. First, the smoothness of road contours is a key factor that affects the accuracy of road extraction. In both sets of prediction results, we found that, compared with the ground truth, the predicted roads lost edge and contour information; obtaining accurate road profile information remains a challenging task. Second, different network models suit different scenarios: PSPNet [57], DeepLabv3+ [58], and BiSeNet [59], for example, are suited to real-time segmentation of street views. It is usually necessary to design the network for the specific task to obtain the best performance. Neural Architecture Search (NAS) is an automated neural network design technique that algorithmically derives high-performance network structures from a sample set, which can effectively reduce the cost of designing and deploying neural networks. Third, during the experiments we only examined the performance of different deep learning models. Traditional methods, such as threshold-based and object-based methods [60], were not compared, and a more comprehensive comparison with these methods is needed in the future.
