1. Introduction
Research on extracting road information from remote sensing images has been carried out for many years. However, roads of different grades, such as national, provincial, village, and mountain roads, differ in width and shape; roads built from different materials, such as cement, asphalt, and earth, differ in color and texture; and road areas are frequently occluded by buildings, trees, central green belts, and many other factors. Accurate extraction of road information therefore remains a research frontier and a technical difficulty in the field of remote sensing information extraction.
Road extraction can be described as a pixel-level binary classification problem that distinguishes whether each pixel belongs to a road or not [1]. Recently, deep convolutional neural networks (DCNNs) have brought significant improvements to typical computer vision tasks such as semantic segmentation [2]. Road semantic segmentation has applications in many fields, such as autonomous driving [3,4], traffic management [5], and smart city construction [6]. Semantic segmentation requires pixel-level classification [7,8,9], and it must combine pixel-level accuracy with multi-scale contextual reasoning [7,8,9,10]. In general, the simplest way to aggregate multi-scale context is to feed multi-scale inputs into the network and merge the features from all scales, and researchers have made considerable progress along this line. Farabet et al. [11] obtained images at different scales by transforming the input image through a Laplacian pyramid. References [12,13] applied multi-scale inputs sequentially from coarse to fine. References [7,14,15] directly resized the input image to several scales. Another way to aggregate multi-scale context is to adopt an encoder-decoder structure; SegNet [16], U-Net [17], RefineNet [18], and other networks [19,20,21] have demonstrated the effectiveness of encoder-decoder models. In addition, context modules are an effective way to aggregate multi-scale context information, for example by merging DenseCRF [9] into DCNNs [22,23]. The spatial pyramid pooling structure is also a common method to aggregate multi-scale context, as in the Pyramid Scene Parsing Network (PSPNet) [24,25].
A larger receptive field is critical for networks because it captures more global context from the input images. For a standard convolutional neural network (CNN), the traditional way to expand the receptive field is to stack more convolutional layers with larger kernel sizes, but this causes rapid growth in the number of training parameters, which makes networks hard to train. An alternative is to stack more pooling layers, which expand the receptive field by reducing the dimensions of the feature maps while maintaining the salient characteristics. Although pooling operations add no training parameters, much information is lost because of the decrease in spatial resolution.
Reference [23] developed a convolutional network module, dilated convolution, which aggregates multi-scale contextual information without increasing the training parameters or decreasing resolution. The module can aggregate multi-scale contextual information by varying the expansion rate of the dilated convolution kernel, and it can be plugged into existing architectures at any image resolution, which makes it well suited for dense prediction. Accordingly, DeepLab v2 [26], DeepLab v3 [27], DeepLab v3+ [28], and D-LinkNet [1], which adopted dilated convolution for semantic segmentation, achieved better performance.
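To make the receptive-field advantage concrete, the growth of the receptive field for a stack of stride-1 3 × 3 convolutions can be computed directly (an illustrative sketch, not code from the cited works):

```python
def receptive_field(dilation_rates, kernel_size=3):
    """Receptive field of a stack of stride-1 (dilated) convolutions.

    Each layer enlarges the field by (kernel_size - 1) * rate pixels."""
    rf = 1
    for rate in dilation_rates:
        rf += (kernel_size - 1) * rate
    return rf

# Four plain 3x3 layers vs. four dilated layers with rates 1, 2, 4, 8:
plain = receptive_field([1, 1, 1, 1])    # 9 pixels
dilated = receptive_field([1, 2, 4, 8])  # 31 pixels, same parameter count
```

The dilated stack more than triples the receptive field without adding a single weight, which is the property the DeepLab family and D-LinkNet exploit.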
Another effective strategy to increase the ability to capture global features is to introduce an attention mechanism. Reference [29] first introduced an attention mechanism into computer vision tasks, and it has proven reliable. DANet [30] adopts spatial and channel attention modules to obtain more global context information. CBAM [31] introduced a lightweight spatial and channel attention module. DA-RoadNet [32] constructed a novel attention mechanism module to improve the network's ability to explore and integrate road features.
The network structure for semantic segmentation is divided into several parts, and the networks in [1,26,27,28] adopted dilated convolutions in only one part. In fact, the encoder and decoder of existing semantic segmentation architectures are built by stacking residual or dense blocks, so dilated convolution layers can be added after each block to capture more global context information. In this research, a new structure, the D-Dense block, which combines traditional convolution layers with dilated convolution layers, is proposed. Further, a network is built with D-Dense blocks and the center part of D-LinkNet for road extraction from satellite images. To increase the ability to capture global features, the dual attention (DA) mechanism [30] is also introduced into the network. With this design, dilated convolution runs through the whole network and integrates effectively with the attention mechanism to obtain more global features and information. The presented network was evaluated through controlled experiments on the Massachusetts Roads dataset. The experiments demonstrate that the D-Dense block with attention mechanism reliably increases pixel-level accuracy for semantic segmentation.
Since SAR offers all-weather imaging and strong penetration, SAR images have irreplaceable advantages in remote sensing road extraction and can further improve the accuracy of road information extraction. Many traditional road segmentation methods for SAR images have been proposed and proven effective. Methods based on human-computer interaction are called semi-automatic methods. Bentabet et al. [33] were the first to use the snake model for SAR image road extraction. Their experiments show that straight or curved roads can be accurately extracted by this model, but it requires a large amount of human-computer interaction [34]. Some automatic methods have also proven useful. Cheng Jianghua et al. [35] proposed a method based on the Markov random field (MRF); to maximize computational efficiency, it performs GPU-accelerated road extraction. There are also deep learning methods for road extraction from SAR images. Wei X et al. [36] used ordinal regression and introduced a road-topology loss, which improves the baseline by up to 11.98% in the IoU metric on their own dataset.
Focusing on the problems of low-grade roads in remote sensing images, we study how to improve the accuracy of road extraction in complex scenes using the powerful feature expression ability of deep learning and the penetrating capability of SAR images.
In this paper, we propose a novel deep learning network model called SDG-DenseNet to improve the accuracy of low-grade road extraction from optical remote sensing images. We fuse the extraction results from the SAR image into those of the optical image at the decision level, which improves the accuracy of low-grade road extraction in practical application scenarios. The main contributions of this study can be summarized as follows:
- (1)
A novel SDG-DenseNet network for low-grade road extraction in optical images is proposed. The stem block is taken as the starting module to expand the receptive field and preserve image information while reducing the number of parameters. A novel D-Dense block is introduced to construct the encoder and decoder of the network, applying dilated convolution in every part from the encoder to the decoder to enlarge the receptive field. Moreover, to make the dilated convolution run through the entire network, this paper introduces a GIRM module combining dilated convolution with a dual self-attention mechanism, aiming to enhance the network's ability to obtain global information. The segmentation performance of the novel network is better than that of many existing networks;
- (2)
A decision-level fusion method is proposed for the low-grade road extraction based on optical images and SAR images, which repairs some interrupted roads in the optical image extraction results. The extraction accuracy of decision-level fusion methods is higher than that of optical image-based deep learning methods in practical application scenarios.
2. Methods
In order to improve image semantic segmentation accuracy, a novel SDG-DenseNet network for low-grade road extraction in optical images is proposed. SDG-DenseNet is composed of three parts: an encoder path, a decoder path, and the center part, the global information recovery module. The encoder path takes RGB images as input and extracts features by stacking convolutional and pooling layers. The decoder path restores detailed information and expands the spatial dimensions of the feature maps with deconvolutional layers. The center part is responsible for enlarging the receptive field, integrating multi-scale features, and simultaneously maintaining detailed information. Skip connections encourage the reuse of feature maps to help the decoder path recover spatially detailed information. Besides, a decision-level fusion method is introduced to fuse the results of the optical image and the SAR image; it mainly contains six steps: data preparation, pretreatment, image registration, road extraction, road segmentation, and decision-level integration.
Figure 1 shows the overall structure of the proposed method.
2.1. Architecture of SDG-DenseNet Network
Because low-grade roads are easily blocked by vegetation or buildings, fracture and discontinuity problems often arise when extracting them from optical images. At the same time, due to the low construction standards of low-grade roads, their materials are often consistent with the surrounding environment, so they blend into the background in the optical orthographic projection, resulting in a poor extraction effect. Given these problems, it is imperative to design a specialized network and improve its ability to extract global information.
Based on D-LinkNet, SDG-DenseNet is proposed. To improve the extraction of global information, the global information recovery module is introduced into the new network for semantic segmentation. Furthermore, the novel network takes DenseNet as its backbone instead of ResNet and replaces the initial block with the stem block. Additionally, the attention mechanism is introduced to improve the ability to obtain global information. The overall structure of the SDG-DenseNet network is shown in Figure 2.
2.2. Improved D-Dense Block and Stem Block
The construction of the D-Dense block is shown in Figure 3. In contrast to the original Dense block, we add three consecutive dilated convolution layers, with expansion rates of 2, 4, and 8, respectively, after the original Dense block. The structure of each dilated convolution layer is BN-ReLU-Conv (1 × 1)-BN-ReLU-D_Conv (3 × 3, rate = 2, 4, or 8). The same computation process as in the original Dense block is repeated (n + 3) times, making the D-Dense block generate feature maps with (n + 3) × k channels.
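The channel bookkeeping described above can be sketched in a few lines (an illustration of the (n + 3) × k growth, not the authors' implementation):

```python
def d_dense_output_channels(n, k, in_channels):
    """Channel count after a D-Dense block: n standard dense layers plus
    three appended dilated layers (rates 2, 4, 8), each emitting k maps."""
    dilation_rates = [1] * n + [2, 4, 8]
    channels = in_channels
    for rate in dilation_rates:
        # BN-ReLU-Conv(1x1) -> BN-ReLU-D_Conv(3x3, rate): k new feature
        # maps are concatenated onto everything produced so far
        channels += k
    return channels

# A block with n = 6 layers and growth rate k = 32 on a 64-channel input
# yields 64 + (6 + 3) * 32 = 352 channels.
```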
The encoder starts with an initial block that performs convolution on the input image with a 7 × 7 kernel and a stride of 2, followed by 3 × 3 max pooling; the output of the initial block has 64 channels. Inspired by Inception v3 [37] and v4 [38], References [39,40] replaced the initial block [41] (a 7 × 7, stride-2 convolution layer followed by a 3 × 3 max pooling layer) with the stem block. The stem block is composed of three 3 × 3 convolution layers and one 2 × 2 mean pooling layer; the stride of the first convolution layer is 2 and that of the others is 1, and all three convolution layers output 64 channels. The experimental results in Reference [40] showed that the initial block loses much information through its two consecutive down-sampling operations, making it hard to recover the marginal information of objects in the decoder phase. The stem block is helpful for object detection, especially for small objects, so this research also adopts the stem block at the beginning of the encoder phase.
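The spatial bookkeeping of the stem block can be sketched as follows (an illustration assuming 'same' padding, not the authors' implementation); like the initial block, it downsamples by a factor of four overall, but through two gentler stages separated by stride-1 convolutions:

```python
def stem_block_output_shape(height, width):
    """Output shape of the stem block for an RGB input:
    conv3x3 stride 2 -> conv3x3 stride 1 -> conv3x3 stride 1 -> 2x2 mean pool."""
    height, width = -(-height // 2), -(-width // 2)  # stride-2 conv, 'same' padding
    # the two stride-1 convolutions keep the spatial size unchanged
    height, width = height // 2, width // 2          # 2x2 mean pooling, stride 2
    return height, width, 64                         # all conv layers emit 64 channels
```

For example, a 1024 × 1024 input leaves the stem block as a 256 × 256 × 64 feature map.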
2.3. Global Information Recovery Module (GIRM) Based on D-Blockplus and Attention Mechanism
In order to alleviate the problems of road fracture and low recall in low-grade road extraction, this paper proposes a global information recovery module composed of a dual attention mechanism and D-blockplus. The module aims to further improve the network's ability to obtain global information and thereby ensure the integrity of the extraction results.
As shown in Figure 4, the global information recovery module is mainly composed of two parts. The dual attention mechanism works along the two directions of spatial attention and channel attention, extracting and integrating global spatial and channel information and improving attention to road targets. D-blockplus introduces multi-layer dilated convolution to enlarge the receptive field, thereby improving the network's ability to maintain the integrity of road extraction.
In the center part of SDG-DenseNet, in addition to D-blockplus, the position attention module (PAM) and the channel attention module (CAM) are introduced. PAM and CAM are two reliable self-attention modules, which improve the network's ability to obtain global information in the spatial and channel dimensions, respectively.
Figure 5 shows the structure of PAM. In PAM, the input feature maps pass through two branches: one branch is used as Q and K to generate an (H × W) × (H × W) attention probability map, and the other branch is used as V. Here, V, Q, and K represent the value, query, and key features, respectively, and C, H, and W represent the channel, height, and width of the feature map, respectively. The overall structure of PAM is shown in Equation (1):
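As a concrete illustration of this computation, the following NumPy sketch applies position attention with random matrices standing in for the learned 1 × 1 convolutions that produce Q, K, and V (the residual addition at the end follows the common DANet formulation and is an assumption here):

```python
import numpy as np

def position_attention(x, wq, wk, wv):
    """Position attention over feature maps x of shape (C, H, W).

    wq, wk: (Cq, C) and wv: (C, C) stand in for learned 1x1-conv weights."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                   # flatten positions: (C, N)
    q, k, v = wq @ flat, wk @ flat, wv @ flat    # query, key, value features
    energy = q.T @ k                             # (H*W) x (H*W) energies
    energy -= energy.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(energy)
    attn /= attn.sum(axis=1, keepdims=True)      # attention probability map
    out = v @ attn.T                             # aggregate values over positions
    return (flat + out).reshape(C, H, W)         # residual add (assumed)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4))
refined = position_attention(feats,
                             rng.standard_normal((2, 8)),  # Cq < C, as is common
                             rng.standard_normal((2, 8)),
                             rng.standard_normal((8, 8)))
```

The (H × W) × (H × W) map lets every spatial position attend to every other, which is what gives the module its global view.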
Figure 6 shows the structure of CAM. The structure of CAM is basically similar to that of PAM, but CAM pays more attention to information along the channels. In this network, the size of the probability map generated by CAM is C × C, which helps boost feature discrimination. The overall structure of CAM is shown in Equation (2):
D-block has four paths that contain dilated convolutions, two in cascade mode and two in parallel mode. In each path, dilated convolutions are stacked with different expansion rates; consequently, the receptive field of each path differs, and the network can aggregate multi-scale context information. Inspired by MobileNetV2 [42], the bottleneck block is introduced into D-block to save network parameters and improve performance, producing D-blockplus. Figure 7 shows the structure of D-blockplus.
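The parameter saving from the bottleneck can be illustrated with a simple weight count (a generic bottleneck sketch; the channel reduction ratio of 1/4 is an assumption for illustration, not a value taken from the paper):

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution (biases ignored)."""
    return k * k * in_ch * out_ch

def plain_3x3(ch):
    """A single full-width 3x3 convolution."""
    return conv_params(ch, ch, 3)

def bottleneck_3x3(ch, reduction=4):
    """1x1 reduce -> 3x3 (dilation adds no parameters) -> 1x1 expand."""
    mid = ch // reduction
    return (conv_params(ch, mid, 1)
            + conv_params(mid, mid, 3)
            + conv_params(mid, ch, 1))

# For 64 channels: 36,864 weights plain vs. 4,352 with the bottleneck.
```

Because dilation changes only the sampling pattern, not the kernel size, each dilated path benefits from the same reduction.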
2.4. Decision-Level Fusion Algorithm for Low-Grade Roads
In optical images, low-grade roads are often blocked by buildings, vegetation, shadows, and so on. Because the appearance of buildings and vegetation differs greatly from that of the road, the blocked part is often not judged as road by the deep learning model, which directly leads to fractures or missed detections in the extraction results for low-grade roads.
Figure 8 shows some examples of blocked roads. In these pictures, the roads in the red boxes appear fractured in the optical image because they are obscured by vegetation, buildings, or shadows.
For the complex scenes described above, the imaging mechanism of the optical image means that the SDG-DenseNet model alone cannot resolve the problem of poor road continuity. In this paper, decision-level fusion is performed between the optical image extraction results of the SDG-DenseNet model and the SAR image extraction results based on the Duda and path operators [43].
The Duda operator is a linear feature extraction operator that divides an N × N window into three parallel linear parts. The specific structure of the Duda operator is shown in Figure 9, where A, B, C, C1, and C2 represent the mean gray values of the three parts. The operator shown in Figure 9a is relatively strong at extracting roads in the horizontal direction, and the operator shown in Figure 9b is relatively strong at extracting roads with a certain inclination angle. The other two types of Duda operators are 90-degree rotations of these two. The function that determines the new value of a pixel can be expressed as follows:
Path operators refer to path openings and closings, morphological filters used to analyze oriented linear structures in images. The filter defines adjacency graphs as structuring elements; four adjacency graphs are defined along the horizontal, vertical, and two diagonal directions. Applying these four adjacency graphs to a binary image yields the maximum path length at each pixel, and the pixels whose maximum path lengths are larger than a threshold Lmin are retained in the image.
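As a simplified sketch of this path-length computation (restricted to the strictly horizontal adjacency graph; real path openings also allow diagonal steps between adjacent columns, so this illustrates the thresholding idea rather than the full operator):

```python
import numpy as np

def horizontal_path_lengths(binary):
    """For each foreground pixel, the length of the longest horizontal path
    (here simply the horizontal run) passing through it."""
    h, w = binary.shape
    left = np.zeros((h, w), dtype=int)   # run length ending at each column
    right = np.zeros((h, w), dtype=int)  # run length starting at each column
    for j in range(w):
        prev = left[:, j - 1] if j > 0 else 0
        left[:, j] = np.where(binary[:, j] > 0, prev + 1, 0)
    for j in range(w - 1, -1, -1):
        nxt = right[:, j + 1] if j < w - 1 else 0
        right[:, j] = np.where(binary[:, j] > 0, nxt + 1, 0)
    return np.where(binary > 0, left + right - 1, 0)

def path_filter(binary, l_min):
    """Retain pixels whose maximum path length reaches the threshold Lmin."""
    return (horizontal_path_lengths(binary) >= l_min).astype(binary.dtype)
```

Running the same filter on the three remaining adjacency graphs and combining the results recovers elongated structures in any of the four orientations.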
The specific algorithm flow of the decision-level fusion method for low-grade roads is shown in Figure 10, which gives the overall technical process of the road extraction algorithm based on decision-level fusion of high-resolution optical and SAR remote sensing images. The specific steps of the algorithm are as follows.
Step 1: Data preparation. Obtain optical remote sensing images and SAR images in the same area, and their imaging time should be as close as possible;
Step 2: Pretreatment. The optical remote sensing image and SAR image are preprocessed, respectively, including radiometric correction, geometric correction, geocoding, and so on;
Step 3: Image registration. The optical remote sensing image and SAR image are matched and transformed into the same pixel coordinate system;
Step 4: Road extraction. Roads in optical remote sensing images are extracted by SDG-DenseNet, and those in SAR images are extracted by the method in Reference [43], which is based on the Duda and path operators;
Step 5: Road segmentation. The road extraction results of the optical remote sensing image and the SAR image are decomposed into road segments, and the attributes of each segment are recorded;
Step 6: Decision level fusion. Taking the line segment as the basic unit, the final road distribution map is obtained by decision-making level fusion of the roads extracted from the optical remote sensing image and SAR image.
3. Experiments
Our network experiments are performed on the Massachusetts Roads Dataset, and we test the fusion method on our own dataset built from WorldView-2, WorldView-4, and TerraSAR-X imagery. TensorFlow was selected as the deep learning framework to train and test all networks. All models are trained on one NVIDIA RTX 2080 Ti GPU.
3.1. Dataset and Data Augmentation
Three sets of satellite images were applied to evaluate the low-grade road extraction method. To verify the effectiveness of the proposed SDG-DenseNet network on public data, we tested it on the Massachusetts dataset. In addition, we conducted low-grade road extraction experiments on the self-built Chongzhou–Wuzhen dataset. Finally, we conducted decision-level fusion experiments on two sets of large-scale images from the Chongzhou and Wuzhen regions, including both optical and SAR images.
We trained and tested our SDG-DenseNet model on the Massachusetts Roads Dataset [44], which consists of 1108 training images, 14 validation images, and 49 test images, each of size 1500 × 1500. We cut each 1500 × 1500 image into four 1024 × 1024 images, obtaining 4432 training images, 56 validation images, and 196 test images. We then performed data augmentation on the training set, including rotation, flipping, cropping, and color jittering, to prevent overfitting. After augmentation, we obtained 22,160 training images in total, giving a final split of 22,160 training, 56 validation, and 196 test images.
To test the proposed SDG-DenseNet network on low-grade road extraction, this paper also evaluates it on the self-built Chongzhou–Wuzhen dataset. Table 1 lists the three source images of this dataset. We cut the three source images into 13,004 images of size 512 × 512, obtaining 11,788 training images, 204 validation images, and 1012 test images. After data augmentation of the training set, we obtained 47,152 training images, giving a final split of 47,152 training, 204 validation, and 1012 test images.
We also test our decision-level fusion method on two sets of large-scale images from the Chongzhou and Wuzhen regions, including optical and SAR images. The optical images came from WorldView-2 and WorldView-4, while the SAR images came from TerraSAR-X. As shown in Table 2, to test the effect under application conditions, the decision-level fusion experiment is mainly conducted on these two large-scale images.
3.2. Hybrid Loss Function and Implementation Details
In previous work, most networks were trained only with the cross-entropy loss [45], which is defined as Equation (3):
where the index indicates the categories, and the two vectors denote the label and the prediction, respectively. Since an image consists of pixels, for road-area segmentation the imbalance of sample points (roads cover only a small part of the whole image) drives the gradient descent direction toward the back corner of the loss surface (Figure 11a), which leads to a local optimum, especially in the early stage of training [46]. The Jaccard loss function is defined as:
Its surface is shown in Figure 11b. As the figure shows, the Jaccard loss can address this problem when summed with the cross-entropy loss, so the whole loss function is defined as:
where λ is the weight of the Jaccard loss in the whole loss. Furthermore, the red, green, and blue points in Figure 11 represent the local maxima, saddle points, and local minima on the loss surface, respectively.
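A minimal NumPy sketch of the resulting hybrid loss for a binary road mask (the soft Jaccard form and the smoothing constant eps are assumptions; the paper's exact equations are not reproduced here):

```python
import numpy as np

def hybrid_loss(y_true, y_pred, lam=1.0, eps=1e-7):
    """Cross-entropy plus lam * (soft) Jaccard loss for binary road masks."""
    p = np.clip(y_pred, eps, 1.0 - eps)          # guard the logarithms
    cross_entropy = -np.mean(y_true * np.log(p)
                             + (1 - y_true) * np.log(1 - p))
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    jaccard = 1.0 - (intersection + eps) / (union + eps)
    return cross_entropy + lam * jaccard
```

With lam = 1, as in the training setup below, the Jaccard term penalizes poor region overlap directly, which counteracts the class imbalance that the pure cross-entropy suffers from.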
In the training phase, we chose Adam as our optimizer with an initial learning rate of 0.0001, dividing the learning rate by 10 whenever the loss value decreased slowly. The loss weight λ is set to 1, and the batch size during training is set to 1.
3.3. Decision-Level Fusion Experiment
To verify the effect of every step in the decision-level fusion method for low-grade roads, we apply the fusion method to the road extraction results of the network and of the method based on the Duda and path operators, using the large-scale images listed in Table 2. The detailed workflow of the decision-level fusion in Step 6 is shown in Figure 12.
As shown in Figure 12, the main process is divided into five steps:
Step 1: Extract the road binary maps from the input optical image and SAR image (not yet segmented);
Step 2: Segment the road binary map extracted from the SAR image, including extracting the road feature direction map, decomposing the binary map according to the direction feature, thinning the decomposed layer based on the curve fitting algorithm, and optimizing the line segment overlap, continuity and intersection to obtain the road segment set extracted from the SAR image;
Step 3: Segment the road binary map extracted from the optical image, optimize the overlap and continuity of segments, and record the updated segments of continuity optimization;
Step 4: For each road segment extracted from the SAR image, we judge whether it meets the fusion conditions with optical image road extraction results according to the overlap ratio in the corresponding optical extraction road binary layer, and record the qualified SAR road segments;
Step 5: After morphological dilation according to the width feature, the continuity-optimized segments and the SAR road segments meeting the fusion conditions are combined pixel-wise with the original optically extracted road binary map to obtain the fused road distribution binary map.
The specific method of searching for line segments satisfying the fusion conditions is shown in Figure 13. Assume that Am represents the road area on layer m after decomposition of the optical image extraction results, and Lmn is line segment n on layer m from the SAR image road extraction results; they belong to the same layer m, that is, the roads have similar directional features. We then count the numbers of pixels ln1 and ln2 where Lmn falls inside and outside the Am region, respectively, and calculate the overlap rate r = ln1/(ln1 + ln2). If r is greater than the threshold Tr, Lmn is recorded as a road segment meeting the fusion conditions. In practical applications, the threshold Tr takes an empirical value of 0.3. We traverse all SAR-extracted road segments until every segment meeting the above fusion conditions has been recorded.
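The overlap-rate test above can be sketched directly (representing the segment as a list of pixel coordinates is a simplification for illustration):

```python
def meets_fusion_condition(segment_pixels, optical_road_mask, t_r=0.3):
    """Decide whether a SAR-extracted segment Lmn qualifies for fusion.

    segment_pixels: iterable of (row, col) pixels along the segment.
    optical_road_mask: binary map containing the layer-m road area Am."""
    ln1 = sum(1 for r, c in segment_pixels if optical_road_mask[r][c] > 0)
    ln2 = sum(1 for r, c in segment_pixels if optical_road_mask[r][c] <= 0)
    overlap_rate = ln1 / (ln1 + ln2)   # r = ln1 / (ln1 + ln2)
    return overlap_rate >= t_r         # empirical threshold Tr = 0.3
```

Segments that pass this test extend the optical extraction across occluded gaps while segments with little support in the optical mask are discarded.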
3.4. Evaluation Metrics
In order to evaluate the performance of the different road segmentation models, four evaluation metrics are used: intersection-over-union (IoU), completeness (COM), correctness (COR), and F1-score [47], which are defined as:

IoU = TP/(TP + FP + FN),  COM = TP/(TP + FN),  COR = TP/(TP + FP),  F1 = 2 × COM × COR/(COM + COR).
TP (true positive) indicates pixels judged as road that actually belong to the road; FP (false positive) indicates pixels judged as road that do not actually belong to the road; FN (false negative) indicates pixels judged as non-road that actually belong to the road. The COM score reflects a model's ability to maintain the completeness of the segmented roads: the higher the score, the better the continuity of the extracted roads. The COR score reflects a model's ability to reduce false detections: the higher the score, the fewer the falsely detected areas. IoU and F1 are overall metrics that synthesize COM and COR and evaluate the overall quality of the segmentation results.
Based on these evaluation metrics, we can obtain the performance of model road extraction results in different aspects from COM and COR scores, and obtain the overall performance judgment from F1 and IoU scores.
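The four metrics can be computed directly from the confusion counts defined above:

```python
def road_metrics(tp, fp, fn):
    """IoU, completeness, correctness, and F1 from pixel confusion counts."""
    com = tp / (tp + fn)         # completeness (recall)
    cor = tp / (tp + fp)         # correctness (precision)
    iou = tp / (tp + fp + fn)    # intersection-over-union
    f1 = 2 * com * cor / (com + cor)
    return {"IoU": iou, "COM": com, "COR": cor, "F1": f1}

# Example: tp = 80, fp = 20, fn = 20 gives COM = COR = F1 = 0.8 and IoU = 2/3.
```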
5. Conclusions
In this research, a D-Dense block module was proposed, which combines traditional convolution and dilated convolution on the basis of a dense connection structure. A new semantic segmentation network, SDG-DenseNet, was then built with the D-Dense block, adopting the center part of D-LinkNet for road extraction from high-resolution satellite imagery. Since the network also replaces the initial block with the stem block to retain more detailed information, the marginal information of objects is easier to recover in the decoder phase. In addition, the introduction of an attention mechanism improves the network's ability to obtain global information. Finally, to improve the accuracy of road extraction from large-scale images in practical applications, a decision-level fusion method was proposed that fuses the information in optical and SAR images.
Three sets of satellite images were applied to evaluate the network. The extraction results on the Massachusetts Roads dataset show that SDG-DenseNet not only achieves the highest IoU and F1 scores but is also suitable for extracting roads in complicated scenes: the IoU and F1 scores of SDG-DenseNet, based on the D-Dense block and GIRM modules, were 3.61% and 2.75% higher, respectively, than those of the baseline D-LinkNet, and the stem block further improves road extraction accuracy. Furthermore, the Chongzhou–Wuzhen dataset, based on three large-scale optical images, was applied to evaluate the models' ability to extract low-grade roads. The results show that SDG-DenseNet performs best among the four networks, with an IoU score 6.65% higher than that of D-LinkNet, while its model size is about 600 MB smaller. Finally, two pairs of large-scale optical and SAR images were applied to evaluate the decision-level fusion method. The results show that the fusion method accurately extracts roads: after decision-level fusion of the road binary maps from the SAR and optical images on the two test datasets, F1 improves by about 8.4–11.5%, COR by about 7.4–7.7%, and COM by about 9.3–13.7%.
SDG-DenseNet improves D-block into D-blockplus and combines it with an attention mechanism, which not only ensures road completeness in the segmentation task but also greatly improves the correctness of the segmentation results; the network thus maintains a good balance between correctness and completeness. In addition, the proposed decision-level fusion method improves the extraction of low-grade roads, and the presentation quality is better after fusion. In future research, the contribution of each part of the network and of every hyperparameter in the training phase should be investigated.