Article

Global–Local Information Fusion Network for Road Extraction: Bridging the Gap in Accurate Road Segmentation in China

1 School of Computer Science, China University of Geosciences, Wuhan 430078, China
2 Hubei Key Laboratory of Geological Survey and Evaluation of Ministry of Education, China University of Geosciences, Wuhan 430078, China
3 State Key Laboratory of Geological Processes and Mineral Resources, China University of Geosciences, Wuhan 430078, China
4 Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences (Wuhan), Wuhan 430078, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2023, 15(19), 4686; https://doi.org/10.3390/rs15194686
Submission received: 27 August 2023 / Revised: 18 September 2023 / Accepted: 18 September 2023 / Published: 25 September 2023

Abstract

Road extraction is crucial in urban planning, rescue operations, and military applications. Compared to traditional methods, using deep learning for road extraction from remote sensing images has demonstrated unique advantages. However, previous convolutional neural network (CNN)-based road extraction methods have limited receptive fields and fail to effectively capture long-distance road features. On the other hand, transformer-based methods have good global information-capturing capabilities but face challenges in extracting road edge information. Additionally, many existing high-performing road extraction methods have not been validated for the Chinese region. To address these issues, this paper proposes a novel road extraction model called the global–local information fusion network (GLNet). In this model, the global information extraction (GIE) module effectively integrates global contextual relationships, the local information extraction (LIE) module accurately captures road edge information, and the information fusion (IF) module combines the output features from both the global and local branches to generate the final extraction results. Further, a series of experiments on two different Chinese road datasets with geographic robustness demonstrates that our model outperforms state-of-the-art deep learning models for road extraction tasks in China. On the CHN6-CUG dataset, the overall accuracy (OA) and intersection over union (IoU) reach 97.49% and 63.27%, respectively, while on the RDCME dataset, OA and IoU reach 98.73% and 84.97%, respectively. These research results hold significant implications for road traffic, humanitarian rescue, and environmental monitoring, particularly in the context of the Chinese region.


1. Introduction

Road information constitutes highly important geospatial data, and the rapid and accurate extraction of road information plays a significant role in urban development [1], autonomous driving [2,3], humanitarian aid [4,5], and geospatial information upgradation [6,7,8,9]. With the significant improvement in the resolution of remote sensing imagery and the availability of a large number of high-resolution remote sensing images, road extraction from remote sensing images has gradually become mainstream [10,11]. However, road extraction from remote sensing images poses several challenges, as shown in Figure 1:
1. High intra-class differences in road types: Roads of the same class can vary considerably even within the same geographic domain. Rural roads are often made of gravel or cement, with relatively simple road structures, while urban roads are typically made of asphalt with more complex configurations. Moreover, different geographic locations exhibit distinct spatial and geographical characteristics due to varying levels of urbanization, scale, and development.
2. Shadows obstructing road features: Road features in remote sensing images can be easily concealed by shadows, thereby introducing irrelevant noise interference and increasing the difficulty of road extraction.
3. Complex image backgrounds and similarities between roads and certain objects in the image: The presence of cluttered backgrounds in the images, along with high similarity between road features and other objects, poses significant challenges in road extraction tasks.
Road extraction methods can be divided into two categories: traditional methods [12] and methods developed based on various deep learning frameworks [13]. Traditional methods heavily rely on manually designed features, thereby making it challenging to accurately identify road shape features in images [14,15,16]. Moreover, early manual extraction methods required significant time and human resources and were prone to subjective interference [17,18]. The rapid development of deep learning in recent years has propelled the progress of road extraction tasks, gradually replacing traditional methods as the mainstream approach.

1.1. CNN-Based Method for Road Extraction

The fully convolutional network (FCN) [19]—as the pioneering semantic segmentation model based on the encoder–decoder (upsampling/deconvolution) structure of CNN—replaces the fully connected layers in traditional CNN networks with convolutional layers while addressing the problem of image size reduction caused by convolution and pooling using upsampling. In early research, FCN-based road extraction methods successfully extracted roads from remote sensing images while preserving road continuity and integrity [20]. SegNet [21] employs a one-to-one correspondence between the encoder and decoder, using index values retained during encoding to restore road edge positions during decoding. In addition, road extraction methods based on the Deeplabv3+ framework [22] utilize pyramid pooling and dilated convolutions to preserve more spatial information and capture the multi-scale contextual relationships of roads, thereby achieving good road segmentation results. Further, a patch-based convolutional neural network (CNN) model [23] has been proposed for road extraction, thus significantly improving performance compared to traditional methods. RoadNet [24] was introduced to address the extraction of roads, centerlines, and edges in multiple tasks. Lian et al. [25] introduced a CNN-based road extraction model called DeepWindow, which tracks roads directly from remote sensing images using a sliding window approach. A recently developed CNN-based convolutional network [26] preserves more spatial features, thereby extracting more complete road information from remote sensing images. Li et al. [27] introduced an end-to-end road extraction network based on CNN, which utilizes geographic features of roads in images to extract urban roads and incorporates a direction-aware attention block to maintain road connectivity. However, CNN-based methods inherently face a significant limitation. The convolutional kernels in CNNs cannot be too large, thereby limiting the ability of models to leverage global information and impacting the distinguishability of extracted features.

1.2. Transformer-Based Method for Road Extraction

In recent years, the transformer [28], initially proposed for natural language processing (NLP), has garnered significant attention and has been applied in various computer vision domains [29,30]. The transformer introduces self-attention mechanisms to replace convolutional operators, thereby enabling models to better learn global contextual information from images. With the success of the vision transformer (ViT) [31] across different computer vision tasks, numerous researchers have begun exploring the use of transformer-based models in remote sensing image segmentation. Swin transformer [32] restricts self-attention computations to non-overlapping local windows while enabling cross-window connections to improve efficiency, thereby generating hierarchical feature representations. Furthermore, the pyramid vision transformer (PVT) [33] generates multi-scale features by introducing a progressively shrinking pyramid and a spatial reduction attention layer, thereby achieving impressive performance in object detection and segmentation tasks. Twins [34] replaced the fixed positional encodings in PVT with conditional encodings, thereby enabling the model to flexibly handle features at different scales. Given that the long-distance geometric characteristics of roads often exhibit a global nature within images, an increasing number of researchers have started to explore the integration of transformers into road extraction tasks. Road network graph detection by transformer (RNGDet) [35], for instance, emerged as a solution for extracting road network information from aerial images. By incorporating transformer structures and deep queries, it achieved enhanced results in capturing road segments around complex intersections. The road shape-aware network (RSANet) [36] takes into account the shape features of roads and employs the efficient strip transformer module (ESTM) to effectively model the long-distance dependencies of roads. To effectively capture global representations, Luo et al. proposed the bi-directional transformer network (BDTNet) to enhance the extraction capabilities of both global and local information in remote sensing images [37]. This approach yielded more detailed segmentation results by enhancing the semantic information in feature maps. To address the challenges posed by occlusion from trees and shadows, as well as complex topological structures, Wang et al. proposed a topology-enhanced road network extraction method called TERNformer [38], which strengthens road network topologies by exploring local connectivity. Chen et al., capitalizing on the global context modeling capability of the Swin transformer and the local feature extraction prowess of ResNet, devised a novel dual-branch encoding block named CoSwin [39]. This innovation led to an enhancement in the completeness and connectivity of road extraction. Compared to CNN-based methods, the aforementioned transformer-based methods effectively leverage their global information extraction capabilities, thereby resulting in improved road segmentation results [40,41]. However, although these methods capture global information from the entire image, their performance in extracting details at road edges is not entirely satisfactory.

1.3. Gap in Road Extraction Research in the Chinese Region

Since deep-learning-based road extraction models rely on road datasets for training, their performance is, to a certain extent, influenced by the training dataset. Due to the diversity in the levels of development and cultural differences among different countries, the road construction conditions vary significantly across countries. Consequently, most available datasets for road extraction are heavily biased toward specific regions [4]. The introduction of well-known public road datasets—such as Massachusetts [42] and DeepGlobe [43]—along with their corresponding challenges, has spurred the development of numerous excellent road extraction models. An end-to-end road segmentation network [44], benefiting from the introduction of attention mechanisms and considering road strip features, achieves more accurate segmentation results on the DeepGlobe and Massachusetts road datasets. Further, a spatial attention network based on ResNet [45] enhances road extraction capabilities by considering the dependencies between different locations when determining the class of each pixel; its effectiveness has been validated on the Massachusetts road dataset. However, China’s vast geographical expanse encompasses diverse terrains and urban environments, ranging from large cities to rural areas, from mountainous regions to plains. These areas feature various types of roads. Therefore, road extraction tasks in China necessitate addressing the diversity and complexity of different terrains. Additionally, China is home to some of the world’s largest cities, with very high population densities in some urban areas. This implies that road networks are often more densely interconnected, and road extraction models need to cope with the challenge arising from high traffic and urban planning intricacies. Due to the cross-domain differences in road information, applying these road extraction methods that perform well on these datasets for road extraction tasks in the Chinese region may not necessarily yield sufficiently high-quality results. In recent years, various models have been proposed that utilize Chinese regional road datasets for road extraction and demonstrate their superiority, to a certain extent, in addressing the lack of models for accurately extracting roads in the Chinese region. To overcome the limitations of the image receptive field, the split depth-wise separable graph-convolution network (SGCN) [46] employs feature separation and graph-convolution modules, thereby achieving the best road extraction performance on a self-built mountain road dataset by the authors. The partial-to-complete network (P2CNet) [47] introduces a gated self-attention module (GSAM) to capture long-range semantic information of roads, filters out irrelevant information, and uses missing part (MP) loss to focus more on missing road pixels, thereby achieving the best results on the SpaceNet [48] dataset. However, the datasets used by these models have limited coverage of Chinese regional areas and lack geographical robustness. For example, the mountain road dataset used by SGCN mainly comprises mountainous areas in Gansu Province and does not cover roads in other regions and urban areas of China. In addition, the SpaceNet dataset used by P2CNet only includes the Shanghai region, which cannot fully represent the complex road environments in China. In other words, the effectiveness of these road extraction models still needs to be validated in the face of the high intra-class differences in Chinese roads.

1.4. Contributions and Structure

To address the challenges in road extraction from remote sensing images, such as poor extraction performance, fragmented results, blurry edges, and the lack of validation in the Chinese region for existing methods, this paper proposes a road extraction network model called “global–local information fusion network (GLNet)”. The global information extraction (GIE) module adopts the transformer structure to focus on the global contextual information of images, thereby enabling the capture of long-distance feature dependencies in road detection. The local information extraction (LIE) module addresses the shortcomings of the transformer in capturing multi-scale and fine-grained edge features by introducing multi-scale feature (MSF) and spatial–channel dual attention (SC-Att) modules. The following are the main contributions of this paper:
(1) The proposed road extraction network model, called GLNet, effectively captures global and local road features from remote sensing images for road extraction. To solve the problem of shadow occlusion, the LIE module enhances the detailed features of road edges. In addition, the GIE module effectively perceives long-distance relationships of roads and takes into account the continuity features of roads in the global space, thereby alleviating the interference of complex backgrounds and ground features.
(2) Extensive comparative experiments on two geographically robust public road datasets in China demonstrate significant performance improvements of the proposed model in road extraction tasks. Compared to state-of-the-art models like the dual attention network (DANet) [49], GLNet achieves an increase of 0.84% in IoU and 0.64% in F1 Score on the CHN6-CUG road dataset, and an improvement of 1.53% in IoU and 0.91% in F1 Score on the RDCME dataset. These results highlight the superiority of GLNet in Chinese road extraction tasks, thereby providing a more effective approach for road research in China.
The remainder of this paper is structured in the following manner: In Section 2, detailed information regarding the two datasets used in this study is presented, along with an explanation of the rationale underlying the selection of these two datasets. Section 3 provides an overview of the overall architecture of the network and presents the details of each module within the network. Section 4 showcases the details of the comparative experiments and ablation experiments, followed by a comparison and analysis of the results. Finally, Section 5 presents the conclusions of the study.

2. Data

2.1. CHN6-CUG Road Dataset

The CHN6-CUG road dataset [50] is a large-scale, pixel-level, very high-resolution (VHR) satellite image dataset representing major cities in China. The imagery is sourced from Google Earth (Google Inc., San Jose, CA, USA) and covers various regions, encompassing the Chaoyang District in Beijing, the Yangpu District in Shanghai, the central area of Wuhan, the Nanshan District in Shenzhen, the Shatin District in Hong Kong, and Macau. Each image in the dataset has a resolution of 50 cm/pixel and a size of 512 × 512 pixels. The dataset contains 4511 labeled images, with 3608 images used as the training set and 903 images used as the test set.
The CHN6-CUG road dataset has the following characteristics:
1. A large-scale, pixel-level VHR satellite image dataset representing major cities in China: Unlike most publicly available road datasets, CHN6-CUG addresses the issues of limited data volume and the lack of publicly available datasets for Chinese roads.
2. Road diversity: The CHN6-CUG dataset includes roads in various complex scenarios, as shown in Figure 2, such as urban roads, forest roads, rural roads, and elevated bridges. Therefore, the dataset exhibits a high degree of heterogeneity and variability in road extraction tasks.
3. Geographical variation: The data in the CHN6-CUG dataset are sourced from six different cities, all with varying urbanization levels, sizes, development stages, structures, histories, and geometric features. These differences contribute to variations in spectral characteristics, texture features, and geometric properties of roads in different cities.
Based on these characteristics, we believe that the diversity of the CHN6-CUG dataset can effectively verify the generalizability of our proposed model. Moreover, it can validate the performance of the model in extracting roads with high intra-class differences in Chinese regions and help build road extraction models that can be applied to different regions in China.

2.2. Road Datasets in Complex Mountain Environments (RDCME)

RDCME [51] is a multi-spectral image dataset collected from the northwest region of China. The images in this dataset have a resolution of 0.61 m/pixel. RDCME comprises a total of 775 image samples, each with a size of 256 × 256 pixels. For the purpose of this study, 497 image samples were used for training, 155 samples for testing, and 124 samples for validation.
The road situation in RDCME is shown in Figure 3. The following are the characteristics of the roads in the RDCME:
(1) The roads cover a wide area, but their edges are blurry. Due to the complex environment and image quality limitations, the road surfaces and edges in the images appear blurry. Additionally, mountain roads are typically only four to five meters wide in reality, so road objects in the images occupy only approximately 11–14 pixels in width.
(2) The study region encompasses extensive mountain ranges, and due to the overshadowing caused by mountains, roads might appear considerably dim, rendering them challenging to discern from their surrounding environment, as illustrated in Figure 3d.
(3) The mountain roads share similar characteristics with other terrain features. The RDCME mountain road dataset includes various terrain objects, such as mountains and streams. From a geological perspective, features of mountain ranges, rivers, and other objects exhibit linear characteristics. Therefore, compared to other datasets, the roads in the RDCME mountain road dataset have similar morphological features to other objects in the images. As shown in Figure 3a,b, the road in the image bears resemblance to the mountains. Similarly, as depicted in Figure 3c, the road in the image is similar to the rivers.

3. Method

3.1. Overall Architecture

To address the limitations of existing road extraction models in terms of their structural design and the lack of accurate road extraction models for the Chinese region, this paper proposes GLNet for road extraction from remote sensing images. The architecture of GLNet is illustrated in Figure 4. GLNet consists of three components: the GIE module, the LIE module, and the information fusion (IF) module. The GIE module adopts a typical encoder–decoder structure to capture the long-range dependencies of roads in the images. On the other hand, the LIE module is designed to extract the edge details of roads in order to compensate for the blurring effect that occurs during road extraction with the GIE module.
The GIE module mainly comprises four transformer blocks. In road extraction tasks, it is desirable for the network model to extract long-range information on roads and focus on the salient features of roads. The GIE module achieves this by extracting multi-scale road features through multiple transformer blocks, preserving the local continuity of roads through overlapped patch embedding, and leveraging self-attention mechanisms to focus on the global contextual features of roads. Therefore, the GIE module can effectively identify the long-range interrelationships of roads.
The LIE module introduces the MSF module and the SC-Att module. The MSF module obtains multi-scale feature information of roads through different dilation rates. The SC-Att module amplifies the cross-dimensional channel–spatial dependencies in the feature map through channel attention mechanisms and then preserves the road feature mappings through spatial attention mechanisms. Therefore, the LIE module effectively extracts the edge features of roads.
Further, the IF module is used to fully integrate the features extracted by the GIE module and the LIE module. It consists of two 3 × 3 convolutional layers. The output feature maps obtained from the GIE and LIE modules are concatenated along the channel dimension, and then the channel dimension is reduced through two 3 × 3 convolutional kernels for information fusion. Finally, the road segmentation results of the images are obtained through upsampling operations.
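To make the fusion step concrete, the following is a minimal PyTorch sketch of such an IF module, written from the description above. The intermediate channel width, the batch-normalization/ReLU placement, and the number of output classes are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationFusion(nn.Module):
    """Sketch of the IF module: concatenate the GIE and LIE feature maps along
    the channel axis, fuse them with two 3x3 convolutions, and upsample to the
    target resolution. Assumes both feature maps share the same spatial size."""

    def __init__(self, global_ch, local_ch, mid_ch=128, num_classes=2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(global_ch + local_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, num_classes, kernel_size=3, padding=1),
        )

    def forward(self, f_global, f_local, out_size):
        x = torch.cat([f_global, f_local], dim=1)   # channel-wise concatenation
        x = self.fuse(x)                            # two 3x3 convs reduce channels
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
```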

3.2. Global Information Extraction Module

The GIE module structure is illustrated in Figure 5. Given an input image, the image passes through four transformer blocks. Subsequently, the output of each block is processed by a 1 × 1 convolutional layer to ensure consistency in the channel dimension. Then, a resize operation is applied to maintain the same size for these four output feature maps. Finally, the four output feature maps are concatenated along the channel dimension and further processed by a 1 × 1 convolutional layer.
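The aggregation described here can be sketched in PyTorch as follows. The per-stage channel widths and the shared embedding dimension are assumptions; only the wiring (1 × 1 projection, resize, channel-wise concatenation, final 1 × 1 convolution) follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GIEAggregation(nn.Module):
    """Sketch of the GIE feature aggregation: each transformer stage output is
    projected to a common width by a 1x1 convolution, resized to a shared
    spatial size, concatenated, and fused by a final 1x1 convolution."""

    def __init__(self, stage_channels=(64, 128, 320, 512), embed_dim=256):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, embed_dim, kernel_size=1) for c in stage_channels]
        )
        self.fuse = nn.Conv2d(len(stage_channels) * embed_dim, embed_dim, kernel_size=1)

    def forward(self, stage_feats):                 # list of four feature maps
        target = stage_feats[0].shape[-2:]          # resize to the first (largest) map
        aligned = [
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, stage_feats)
        ]
        return self.fuse(torch.cat(aligned, dim=1))
```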

3.2.1. Overlapped Patch Embedding

In the ViT network, the input image is divided into N non-overlapping patches, where there is no overlap between adjacent patches. However, this approach fails to preserve local continuity between each patch. To address this limitation, we propose an improved method called “overlapped patch embedding”, which retains local continuity by merging overlapping patches. This method is implemented using zero-padding convolution. Unlike non-overlapping patch embedding, we enlarge the patch window so that each patch overlaps with its adjacent patch by half of its area. Specifically, given an input of size H×W×C, we use a convolutional operation with a stride of S, a kernel size of 2S-1, and padding of S-1, thereby resulting in an output size of H/S×W/S×C. To achieve this, we set three parameters—K, S, and P, where K stands for the kernel size, S stands for the stride between adjacent patches, and P stands for the padding size. In our experiments, we used two sets of parameters ([K = 7, S = 4, P = 3] and [K = 3, S = 2, P = 1]) for this operation, thereby producing patches of the same size.
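As a concrete illustration, the zero-padded convolution described above can be written as follows; the LayerNorm on the flattened tokens and the embedding width are assumptions beyond the text.

```python
import torch
import torch.nn as nn

class OverlappedPatchEmbed(nn.Module):
    """Sketch of overlapped patch embedding via a zero-padded convolution.
    With stride S, kernel 2S-1, and padding S-1, adjacent patches overlap and
    the spatial size shrinks from H x W to H/S x W/S."""

    def __init__(self, in_ch=3, embed_dim=64, stride=4):
        super().__init__()
        k, p = 2 * stride - 1, stride - 1           # e.g. S=4 -> K=7, P=3; S=2 -> K=3, P=1
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=k, stride=stride, padding=p)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                            # B x C x H/S x W/S
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # B x N x C token sequence
        return self.norm(tokens), (H, W)
```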

3.2.2. Self-Attention Mechanism

In essence, the attention mechanism enables the model to focus on the correlations among different parts of the input. In the original transformer network, the computational complexity mainly lies in the self-attention layer. In the multi-head self-attention mechanism of ViT, the dimensions of each head's Q, K, and V matrices are $N \times C$. Therefore, the formula for the self-attention mechanism is given below:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{\mathrm{head}}}}\right)V \quad (1)$$
The aforementioned process has a computational complexity of $O(N^{2})$, thereby making it challenging to apply to segmentation tasks with large image resolutions or deploy in real-time applications. To reduce resource consumption and successfully apply it to the training process of high-resolution images, we introduce a sequence reduction process [33]. This process involves adding a new parameter called the scaling factor $R$ to the self-attention mechanism, thereby resulting in the following modifications:
$$\hat{K} = \mathrm{Reshape}\left(\frac{N}{R},\ C \cdot R\right)(K) \quad (2)$$

$$K = \mathrm{Linear}(C \cdot R,\ C)(\hat{K}) \quad (3)$$
Here, $K$ represents the input sequence to be reduced. The term $\mathrm{Reshape}\left(\frac{N}{R},\ C \cdot R\right)(K)$ in Equation (2) denotes reshaping the sequence $K$ into a new shape of $\frac{N}{R} \times (C \cdot R)$. The term $\mathrm{Linear}(C \cdot R,\ C)$ in Equation (3) represents a linear output layer that takes an input vector of dimension $C \cdot R$ and produces an output vector of dimension $C$. Consequently, the new sequence $K$ has dimensions of $\frac{N}{R} \times C$, which reduces the computational complexity of the self-attention mechanism from $O(N^{2})$ to $O\!\left(\frac{N^{2}}{R}\right)$.
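A minimal PyTorch sketch of this sequence-reduced attention, following Equations (2) and (3), is shown below. The head count and reduction ratio are illustrative, and the sketch assumes the sequence length N is divisible by R.

```python
import torch
import torch.nn as nn

class ReducedSelfAttention(nn.Module):
    """Self-attention with the sequence reduction of Equations (2)-(3): the key
    and value sequence is reshaped from N x C to N/R x (C*R) and projected back
    to C, lowering the attention cost from O(N^2) to O(N^2 / R)."""

    def __init__(self, dim, num_heads=8, sr_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sr_ratio = sr_ratio
        self.reduce = nn.Linear(dim * sr_ratio, dim)     # Linear(C*R, C)

    def forward(self, x):                                # x: B x N x C
        B, N, C = x.shape
        R = self.sr_ratio                                # assumes N % R == 0
        kv = x.reshape(B, N // R, C * R)                 # Reshape(N/R, C*R)
        kv = self.reduce(kv)                             # back to B x N/R x C
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out
```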

3.2.3. Mix-FFN

The ViT model uses positional encoding to introduce positional information and determine the position of each patch. However, a problem arises where the positional encoding is fixed and cannot be altered during training. This leads to an inevitable decrease in accuracy if the resolution of the test and training differs and interpolation is applied to the positional encoding. To address this issue, conditional positional encodings for vision transformers (CPVT) [52] proposes a position encoding generator (PEG) based on conditional position encoding (CPE), which makes the transformer more flexible and generalizable. In our approach, we believe that positional encoding is not essential for road extraction. Therefore, we introduce mix-FFN, which dynamically expresses the positional relationship between patches using only 3 × 3 convolutions in the feed-forward network (FFN) [53]. The expression for mix-FFN is as given below:
$$X_{\mathrm{out}} = \mathrm{MLP}\left(\mathrm{GELU}\left(\mathrm{Conv}_{3\times3}\left(\mathrm{MLP}\left(X_{\mathrm{in}}\right)\right)\right)\right) + X_{\mathrm{in}}$$
In mix-FFN, $X_{\mathrm{in}}$ represents the features from the attention module. In each FFN, a 3 × 3 convolution and a multi-layer perceptron (MLP) are combined. It is worth noting that the activation function used is the Gaussian error linear unit (GELU) [54] instead of the commonly used ReLU. Non-linearity is an important property in neural network modeling, and random regularization techniques such as dropout (randomly setting certain outputs to zero, which can be viewed as a form of stochastic non-linear activation) are often employed to enhance model generalization. Although random regularization and non-linear activation are separate concepts, a neuron's input is influenced by both. GELU incorporates the idea of random regularization within the activation function by providing a probabilistic description of neuron inputs, which is more intuitive and closer to our natural understanding. Moreover, Hendrycks et al. [54] found that GELU performs better than ReLU. The approximate expression for GELU is given below:
$$\mathrm{GELU}(x) = 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^{3}\right)\right]\right)$$
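A minimal PyTorch sketch of mix-FFN, following the expression for $X_{\mathrm{out}}$ above and using PyTorch's built-in GELU, is given below. The hidden expansion ratio and the use of a plain (non-depth-wise) 3 × 3 convolution are assumptions not pinned down in the text.

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Mix-FFN: MLP -> 3x3 convolution -> GELU -> MLP with a residual
    connection; the 3x3 convolution supplies positional information, so no
    explicit positional encoding is required."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.conv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, hw):                        # x: B x N x C, hw: (H, W) with N = H*W
        H, W = hw
        B, N, C = x.shape
        h = self.fc1(x)                              # inner MLP
        h = h.transpose(1, 2).reshape(B, -1, H, W)   # tokens -> feature map
        h = self.conv(h)                             # 3x3 conv encodes position
        h = h.flatten(2).transpose(1, 2)             # feature map -> tokens
        h = self.fc2(self.act(h))                    # GELU, then outer MLP
        return x + h                                 # residual connection
```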

3.3. Local Information Extraction Module

The LIE module structure is depicted in Figure 6. Given the input image, it first goes through the ResNet50-C [55] network. Subsequently, it passes through a residual structure comprising the MSF and SC-Att modules. Finally, a 3 × 3 convolution and upsampling are applied to ensure that the output feature map has the same dimensions as the output from the GIE module.

3.3.1. The Multi-Scale Feature Module

The MSF module utilizes the atrous spatial pyramid pooling (ASPP) structure [56], which is based on spatial pyramid pooling [57,58,59], to extract multi-scale information. In this paper, we employ an ASPP structure with four parallel branches, consisting of one 1 × 1 convolutional layer and three 3 × 3 dilated convolutional layers, as illustrated in Figure 7. Each branch uses a different dilation rate, thereby resulting in different receptive fields for each branch. This enables the MSF module to effectively extract multi-scale contextual information, thereby addressing the multi-scale nature of the target. The input feature map is passed through these four parallel branches to obtain outputs, which are then concatenated along the channel dimension.
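A minimal PyTorch sketch of such an MSF block is given below. The specific dilation rates and the final 1 × 1 projection after concatenation are assumptions; the text specifies only one 1 × 1 branch plus three dilated 3 × 3 branches whose outputs are concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class MSFModule(nn.Module):
    """ASPP-style multi-scale feature block: one 1x1 branch and three 3x3
    dilated branches with different dilation rates, concatenated channel-wise."""

    def __init__(self, in_ch, out_ch, dilations=(6, 12, 18)):
        super().__init__()
        branches = [nn.Conv2d(in_ch, out_ch, kernel_size=1)]
        branches += [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ]
        self.branches = nn.ModuleList(branches)
        self.project = nn.Conv2d(4 * out_ch, out_ch, kernel_size=1)  # assumed fusion conv

    def forward(self, x):
        feats = [b(x) for b in self.branches]         # four different receptive fields
        return self.project(torch.cat(feats, dim=1))  # channel-wise concatenation
```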

3.3.2. The Spatial–Channel Dual Attention Module

The SC-Att module incorporates the concept of multi-head attention [60]. It comprises a channel attention sub-module based on a multi-layer perceptron (MLP) and a spatial attention sub-module. This design aims to reduce information loss and enhance the ability of the module to obtain information related to roads in images. The overall structure of the SC-Att module is depicted in Figure 8, and its formula is given below:
$$F_{2} = M_{C}(F_{1}) \otimes F_{1}$$

$$F_{3} = M_{S}(F_{2}) \otimes F_{2}$$
In the above formulas, $F_{1}$ represents the given input feature map, $F_{2}$ represents the intermediate feature map, and $F_{3}$ represents the output feature map; $M_{C}$ corresponds to the channel attention, $M_{S}$ corresponds to the spatial attention, and $\otimes$ denotes element-wise multiplication. First, the input feature $F_{1}$ is processed by the channel attention sub-module to obtain $M_{C}(F_{1})$. Then, $M_{C}(F_{1})$ is multiplied with the original input feature map $F_{1}$ to derive $F_{2}$. Next, $F_{2}$ is processed by the spatial attention sub-module to derive $M_{S}(F_{2})$. Finally, $M_{S}(F_{2})$ is multiplied with $F_{2}$ to obtain the final output $F_{3}$.
The channel attention sub-module is shown in Figure 9. It employs a 3D arrangement to preserve the three-dimensional information of the input features, then utilizes an MLP with a reduction ratio of r in an encoder–decoder structure to amplify the cross-dimensional channel–spatial dependencies. The result is transformed back to the original dimensions and passed through a batch normalization (BN) function before being outputted.
The spatial attention sub-module is shown in Figure 10. It incorporates two 7 × 7 convolutional kernels and the same reduction ratio r to fuse spatial information, thereby ensuring a more concentrated representation of spatial features. Additionally, to prevent information loss, we excluded the max pooling layer from the spatial attention sub-module. Because of this removal, the number of parameters in the spatial attention sub-module can increase significantly. To address this issue, we adopted the approach used in channel group convolution [61].
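The following PyTorch sketch illustrates how the two sub-modules could be chained according to the formulas above. The sigmoid gating, the permutation used to keep the 3D information, and the channel widths are assumptions; the group convolution mentioned above as a parameter-saving measure is omitted for brevity.

```python
import torch
import torch.nn as nn

class SCAtt(nn.Module):
    """Sketch of the SC-Att module: a channel attention sub-module (MLP with
    reduction ratio r applied to the permuted feature map) followed by a
    spatial attention sub-module built from two 7x7 convolutions without max
    pooling; each attention map gates its input by element-wise multiplication."""

    def __init__(self, channels, r=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(            # channel attention M_C
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        self.spatial = nn.Sequential(                # spatial attention M_S
            nn.Conv2d(channels, channels // r, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, f1):                           # f1: B x C x H x W
        a = f1.permute(0, 2, 3, 1)                   # keep 3D info: B x H x W x C
        a = self.channel_mlp(a).permute(0, 3, 1, 2)  # back to B x C x H x W
        f2 = torch.sigmoid(a) * f1                   # F2 = M_C(F1) (x) F1
        f3 = torch.sigmoid(self.spatial(f2)) * f2    # F3 = M_S(F2) (x) F2
        return f3
```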

3.4. Loss Functions and Classifiers

The probability of pixel $l$ belonging to class $i$ is calculated using the softmax normalized exponential function, which maps the multiple outputs of the neural network to the (0, 1) interval and ensures that the probabilities of all classes sum to 1. The formula for the probability of class $i$ for pixel $l$ in the image is given below:
$$p_{i} = \frac{\exp\left(x_{l}^{i}\right)}{\sum_{j=1}^{k}\exp\left(x_{l}^{j}\right)}$$

where $k$ represents the total number of classification classes, and $x_{l}^{i}$ represents the value of the $i$-th channel of pixel $l$.
For a single pixel, correctly distinguishing between roads and backgrounds can be handled as a binary classification problem. Therefore, the binary cross-entropy loss function [62] is used in this paper. The formula for binary cross-entropy loss is as given below:
$$\mathrm{Loss} = -y\log(\hat{y}) - (1 - y)\log(1 - \hat{y})$$

where $y$ is the true value of the label and $\hat{y}$ is the predicted value.
Cross-entropy is a metric that quantifies the dissimilarity between two different probability distributions of the same random variable. In the context of machine learning, it measures the difference between the true probability distribution and the predicted probability distribution. A lower value of cross-entropy indicates better performance of the model in terms of prediction accuracy. Cross-entropy is commonly used in classification problems in conjunction with the softmax function. The softmax function processes the model’s output to ensure that the predicted probabilities for multiple classes sum up to one. The cross-entropy loss is then calculated to measure the discrepancy between the predicted probabilities and the true labels.
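As a quick sanity check of the loss above, the following toy snippet compares a hand-computed binary cross-entropy with PyTorch's built-in implementation; the probabilities and labels are made-up values.

```python
import torch
import torch.nn.functional as F

y_hat = torch.tensor([0.9, 0.2, 0.6])   # hypothetical predicted road probabilities
y     = torch.tensor([1.0, 0.0, 1.0])   # corresponding ground-truth labels

# Per-pixel loss = -y*log(y_hat) - (1 - y)*log(1 - y_hat), averaged over pixels.
manual  = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()
builtin = F.binary_cross_entropy(y_hat, y)
print(manual.item(), builtin.item())    # both ≈ 0.2798
```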

4. Experimental Results and Discussion

4.1. Experimental Setup

All experiments in this study were conducted using the MMSegmentation toolbox [63] based on the PyTorch framework and implemented on two NVIDIA GeForce RTX 2080Ti 11G GPUs. Through multiple experiments and with reference to the recommended configuration of the MMSegmentation toolbox, the following experimental settings were ultimately adopted. A polynomial decay learning-rate strategy was employed, in which the learning rate gradually decreases during training iterations until it reaches 0.0001 at the end of training. Stochastic gradient descent (SGD) [64] was selected as the optimizer, whereby a random batch of samples is selected in each iteration to compute the gradient of the loss function and update the relevant parameters of the network model. The SGD parameters were set as follows: an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005.
After continuous fine-tuning, all models were trained for multiple epochs on the CHN6-CUG road dataset and RDCME, thereby ensuring the convergence of each model. It is worth noting that to prevent overfitting and enhance the robustness of the models, various data augmentation techniques such as random scaling and random flipping were employed during the training process [65]. The random scaling ratio ranges from 0.5 to 2 times the size of the original image, and there is a 50% probability of applying random flipping to the image, including both horizontal and vertical flips.
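For illustration, the settings above roughly correspond to an MMSegmentation-style configuration such as the following sketch. The key names follow MMSegmentation conventions, but this is not the authors' file, and details such as the poly power, crop size, and normalization statistics are assumptions.

```python
# Hypothetical MMSegmentation-style training settings (not the authors' configuration).
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)  # decays to 0.0001

crop_size = (512, 512)                                   # CHN6-CUG tile size
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=crop_size, ratio_range=(0.5, 2.0)),   # random scaling 0.5-2x
    dict(type='RandomCrop', crop_size=crop_size),
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),          # 50% horizontal flip
    dict(type='RandomFlip', prob=0.5, direction='vertical'),            # 50% vertical flip
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
```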

4.2. Evaluation Metrics

Road extraction from images can be effectively treated as a segmentation problem. Consequently, we adopted commonly used evaluation metrics designed for segmentation tasks to assess the performance of road extraction models. In this study, we employed five widely recognized evaluation metrics to comprehensively evaluate the road extraction efficacy of our deep learning models on the two datasets. These evaluation metrics are as described below:
1. Overall Accuracy (OA): OA is a measure of global accuracy, which represents the percentage of correctly predicted road samples from among the total number of samples. It is calculated using the following formula:
$$OA = \frac{TP + TN}{TP + FN + FP + TN}$$
In the above formula, TP represents the true positive, which is the number of positive samples correctly classified by the model; FN represents the false negative, which is the number of positive samples incorrectly classified by the model; FP represents the false positive, which is the number of negative samples incorrectly classified by the model; TN represents the true negative, which is the number of negative samples correctly classified by the model.
2. Precision: This evaluation metric is primarily used to reflect the percentage of accurate classifications in the model’s road extraction results. It represents the proportion of samples predicted as roads that are correctly classified as roads. It can be expressed in the following manner:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
3. Recall: This evaluation metric represents the percentage of correctly predicted road pixels from among the total number of true road pixels in the image. Recall can be expressed in the following manner:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
4. F1 Score: In order to consider both Precision and Recall together, the F1 Score is introduced. It is a weighted harmonic mean of the two above-mentioned evaluation metrics and provides a better evaluation of the performance of road extraction models. The calculation formula is given below:
$$F1\;\mathrm{Score} = \frac{2TP}{2TP + FP + FN}$$
5. Intersection over Union (IoU): The IoU is a metric that measures the overlap between the predicted road extraction result and the ground truth. It is calculated by taking the ratio of the intersection of the two sets to their union. This evaluation metric provides a clear indication of the extent of overlap between the predicted and ground truth values. A higher IoU value indicates a better road extraction result of the model. The calculation formula is given below:
$$IoU = \frac{TP}{TP + FP + FN}$$
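The five metrics reduce to simple functions of the confusion-matrix counts, as in the following NumPy sketch (binary masks with 1 = road, 0 = background; zero-division guards are omitted for brevity).

```python
import numpy as np

def road_metrics(pred, gt):
    """Compute OA, Precision, Recall, F1 Score, and IoU for the road class
    from binary prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)        # road pixels predicted as road
    tn = np.sum(~pred & ~gt)      # background pixels predicted as background
    fp = np.sum(pred & ~gt)       # background pixels predicted as road
    fn = np.sum(~pred & gt)       # road pixels predicted as background
    return {
        'OA':        (tp + tn) / (tp + tn + fp + fn),
        'Precision': tp / (tp + fp),
        'Recall':    tp / (tp + fn),
        'F1':        2 * tp / (2 * tp + fp + fn),
        'IoU':       tp / (tp + fp + fn),
    }
```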

4.3. Ablation Study

We conducted ablation experiments to validate the effectiveness of each module in GLNet. The ablation experiments were performed on the RDCME.
These experimental results are presented in Table 1. We first validated the effectiveness of the MSF and SC-Att modules within the LIE module. The IoU value of the LIE module containing both the MSF and SC-Att modules increased by 3.87% and 0.13% compared to the LIE variants without the SC-Att module and without the MSF module, respectively. This indicates that the simultaneous use of the MSF and SC-Att modules can effectively capture the edge details of roads in the image, thereby achieving optimal performance for the LIE module.
Then, we analyzed the effectiveness of the mix-FFN module within the GIE module. It can be observed that removing the mix-FFN module results in a significant decrease in IoU compared to the complete GIE module. Next, we took the GIE module alone, without the MSF and SC-Att modules, as the baseline model. In terms of IoU, adding the MSF module and the SC-Att module to the baseline model increased the IoU by 1.1% and 3.94%, respectively. Finally, we incorporated both the MSF and SC-Att modules into the baseline, which constitutes our proposed model. At this point, compared to the baseline, our model achieved improvements of 0.53% in OA, 4.92% in IoU, and 2.96% in F1 Score. This demonstrates that each module in our proposed model plays a significant role in road extraction tasks.

4.4. Comparison with State-of-the-Art Models

We comprehensively evaluated the superiority of GLNet by comparing it with other well-known segmentation methods and state-of-the-art road extraction methods on various evaluation metrics, including OA, Precision, Recall, IoU, and F1 Score. The models used for comparison include DANet [49], Deeplabv3+ [22], PSPNet [66], Segformer-b5 [67], UNet [68], Light Roadformer [51], D-LinkNet [69], RADANet [70], and SPBAM-LinkNet [71]. To ensure fair comparisons, all models were evaluated under the same experimental conditions. Among the mentioned approaches, DANet, UNet, Deeplabv3+, and PSPNet are classic segmentation models: PSPNet employs a pyramid pooling structure to gather contextual information, while Deeplabv3+ utilizes the ASPP module with dilated convolutions to enlarge the receptive field and aggregate multi-scale features. Additionally, comparing our model with Light Roadformer, D-LinkNet, RADANet, and SPBAM-LinkNet, which are specifically designed for road extraction, allows a better assessment of the effectiveness of the proposed model.

4.4.1. Experiments Based on CHN6-CUG Road Dataset

The comparative experimental results are presented in Table 2 and Figure 11. Our proposed model achieved state-of-the-art (SOTA) results in terms of IoU (63.27%), Recall (75.51%), and F1 Score (77.51%). It outperformed DANet and Deeplabv3+, the two top-performing baseline networks, by 0.84% and 0.89% in IoU, 3.24% and 1.86% in Recall, and 0.64% and 0.68% in F1 Score, respectively. However, our model slightly lagged behind DANet and Deeplabv3+ in terms of OA and Precision. DANet leverages both position and channel dual attention to capture more global contextual information, while Deeplabv3+ employs a spatial pyramid structure to strengthen the network's multi-scale feature representation and improve segmentation accuracy. Compared to the methods specifically designed for road extraction, our model exhibited an improvement of 2.84–7.31% in terms of IoU. Although the road extraction methods in Table 2 achieved excellent results in their respective previous work, their extraction performance is less satisfactory under the complex environments and high intra-class differences of roads in the Chinese region.
In summary, although our model may have a slightly lower OA compared to DANet, GLNet demonstrates significant superiority over DANet and other baseline models in other evaluation metrics. This confirms the effectiveness and reliability of our proposed model in road extraction in the Chinese region.
For a more intuitive performance evaluation of our proposed model, we selected seven representative images from different cities and scenarios in the dataset. In Figure 12, we present the road extraction results of seven different networks on these seven images. From left to right, the columns show the original image, the ground truth label, and the road extraction results of our proposed network, Deeplabv3+, DANet, Segformer-b5, UNet, D-LinkNet, and Light Roadformer, respectively.
In the comparative experiment, compared with CNN- and attention-based networks that can only extract local information, our model's GIE module demonstrates its effectiveness in capturing long-range feature dependencies. Additionally, compared with transformer-based networks, our model's LIE module shows its advantages in extracting multi-scale information and edge details from the images.
As displayed in the first and fifth rows of Figure 12, our proposed model accurately extracts information from road intersections. In the second and fourth rows, the road extraction results of the compared models struggle to distinguish between the road and surrounding terrain, while our model performs better in discrimination. In the region depicted in the sixth row, amidst a complex background composed of diverse structures, aided by the exceptional global contextual modeling capability of the GIE module, our proposed model takes into account the road’s continuity feature, enabling accurate extraction. In the third and seventh rows, owing to the edge-detail extraction capability of the LIE module, our proposed model successfully extracts roads obscured by shadows cast by trees and buildings, which other models fail to achieve.
Through visualizing the results, we can conclude that the performance of GLNet is superior to the aforementioned baseline methods. The ability of the global–local information fusion model to capture long-range road relationships and extract edge details enables the extracted results to effectively preserve the integrity and continuity of the roads.

4.4.2. Experiments on RDCME

Road extraction on the RDCME presents some challenges that are distinct from the CHN6-CUG road dataset due to the following two main reasons:
(1) Characteristics of RDCME: Given factors such as low image resolution, road occlusion by mountain shadows, and the similarity between road and other terrain features, the road extraction model requires a robust capability to extract multi-scale road features.
(2) Insufficient data: The road extraction model needs to learn sufficient road feature information from a limited set of data samples.
Table 3 and Figure 13 present the performance of different models on the RDCME. It is evident from the table that our model achieves SOTA performance on all evaluation metrics, with road IoU and F1 Score reaching 84.97% and 91.88%, respectively. Compared to other methods, our proposed model improves the road IoU by 1.36–7.07% and the F1 Score by 0.81–4.3%.
Specifically, in terms of IoU, our proposed model improves by 2.72%, 3.84%, 1.53%, 1.36%, and 7.07% compared to DANet, Deeplabv3+, UNet, Light Roadformer, and D-LinkNet, respectively. This also demonstrates the superiority of our proposed model in road extraction tasks and the effectiveness of capturing long-range feature dependencies for domain adaptation.
To provide an intuitive performance evaluation of our proposed model, we selected six representative images from the dataset. In Figure 14, we present the road extraction results of seven different networks for these six images. From left to right, the displayed images include the original image, ground truth label, as well as the road extraction results of our proposed network, Deeplabv3+, DANet, Segformer-b5, UNet, D-LinkNet, and Light Roadformer.
In terms of the extraction results, our proposed model demonstrates superior adaptability in complex scenarios. In the first row, while all models manage to extract roads, our proposed model excels in capturing the continuity and edge details of the road. In the second row, most comparative models erroneously extract features in the upper right corner of the image resembling roads, yet our model accurately discerns them as non-road entities. In the regions depicted in the third and fifth rows, where roads closely resemble other features, competing models struggle to precisely extract roads from the image. In contrast, our model accurately identifies roads while avoiding the extraction of features resembling the road in those areas. In the visualization results of the fourth and sixth rows, compared to other models, our proposed model effectively extracts roads obscured by mountain shadows and ensures edge smoothness.
These visualizations in Figure 14 further substantiate the superior adaptability of our proposed model over other approaches in complex scenarios.

5. Conclusions

To address the challenges in road extraction tasks—such as poor road extraction results, fragmented road segmentation, blurry road boundaries, and the lack of effective verification in the Chinese region—this paper proposed a road extraction method called the GLNet. GLNet consists of three modules: GIE, LIE, and IF modules.
The GIE module can accurately extract roads under conditions of complex backgrounds and high similarity between road features and other objects. In the GIE module, we improved the multi-level feature representation of the model by using an overlapped patch merging technique. Additionally, we enhanced the multi-head attention mechanism to better capture the long-range dependencies of roads and obtain a larger receptive field. To address the challenge posed by shadow occlusion, we introduced the LIE module, in which an MSF module and an SC-Att module extract detailed edge information of roads.
GLNet is evaluated on two publicly available Chinese road datasets with geographic robustness—CHN6-CUG and RDCME. Extensive experimental results demonstrate that our proposed model outperforms other SOTA models in road segmentation tasks. GLNet fills the gap in precise road extraction models specifically designed for the Chinese region and provides a new benchmark for road extraction work in China.
GLNet exhibits extensive prospects in the domains of road traffic, humanitarian relief, and environmental monitoring. In terms of road traffic, GLNet boasts the capability to accurately extract and update traffic information, providing robust support for the development and optimization of intelligent urban transportation systems. In the realm of humanitarian assistance, GLNet aids rescue personnel in swiftly assessing traffic conditions in disaster-stricken areas, thereby expediting the formulation of rescue plans. Regarding environmental monitoring, GLNet excels in monitoring and evaluating the impact of land use changes, urban expansion, and transportation infrastructure construction on the ecological environment through enhanced road extraction.
In the future, we aim to delve deeper into refining the architecture of the GLNet, seeking to enhance the model’s capacity to learn both local and global connectivity of road networks. This will help in reducing gaps in predicted road networks. We will also assess the impact of these improvements on the road extraction performance. Furthermore, we are planning an in-depth analysis of road features in various geographical regions beyond China. This aims to further unearth the capability of GLNet to comprehend diverse road characteristics and explore the potential application of GLNet in road extraction endeavors in different regions.

Author Contributions

Conceptualization, Data curation, Methodology, Software, X.W.; Formal analysis, Validation, X.W. and Y.C.; Writing—original draft preparation, Y.C.; Writing—review and editing, K.H., S.W., Y.L. and Y.D.; Funding acquisition, Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is funded by Geological survey projects conducted by the Geological Survey of China (No. DD20220995, DD20230135, and ZD20220409), the National Natural Science Foundation of China (No. U21A2013 and 41925007), and the Opening Fund of the Key Laboratory of Geological Survey and Evaluation of the Ministry of Education under Grant GLAB2022ZR02.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wei, Y.; Zhang, K.; Ji, S. Simultaneous road surface and centerline extraction from large-scale remote sensing images using CNN-based segmentation and tracing. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8919–8931. [Google Scholar] [CrossRef]
  2. Yang, F.; Wang, H.; Jin, Z. A fusion network for road detection via spatial propagation and spatial transformation. Pattern Recognit. 2020, 100, 107141. [Google Scholar] [CrossRef]
  3. Claussmann, L.; Revilloud, M.; Gruyer, D.; Glaser, S. A review of motion planning for highway autonomous driving. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1826–1848. [Google Scholar] [CrossRef]
  4. Bonafilia, D.; Gill, J.; Basu, S.; Yang, D. Building high resolution maps for humanitarian aid and development with weakly-and semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 1–9. [Google Scholar]
  5. He, K.; Dong, Y.; Han, W.; Zhang, Z. An assessment on the off-road trafficability using a quantitative rule method with geographical and geological data. Comput. Geosci. 2023, 177, 105355. [Google Scholar] [CrossRef]
  6. Panteras, G.; Cervone, G. Enhancing the temporal resolution of satellite-based flood extent generation using crowdsourced data for disaster monitoring. Int. J. Remote Sens. 2018, 39, 1459–1474. [Google Scholar] [CrossRef]
  7. Han, W.; Feng, R.; Wang, L.; Cheng, Y. A semi-supervised generative framework with deep learning features for high-resolution remote sensing image scene classification. ISPRS J. Photogramm. Remote Sens. 2018, 145, 23–43. [Google Scholar] [CrossRef]
  8. Han, W.; Chen, J.; Wang, L.; Feng, R.; Li, F.; Wu, L.; Tian, T.; Yan, J. Methods for small, weak object detection in optical high-resolution remote sensing images: A survey of advances and challenges. IEEE Geosci. Remote Sens. Mag. 2021, 9, 8–34. [Google Scholar] [CrossRef]
  9. Levin, N.; Duke, Y. High spatial resolution night-time light images for demographic and socio-economic studies. Remote Sens. Environ. 2012, 119, 1–10. [Google Scholar] [CrossRef]
  10. Wei, Y.; Wang, Z.; Xu, M. Road structure refined CNN for road extraction in aerial image. IEEE Geosci. Remote Sens. Lett. 2017, 14, 709–713. [Google Scholar] [CrossRef]
  11. Zhu, M.; Xie, G.; Liu, L.; Wang, R.; Ruan, S.; Yang, P.; Fang, Z. Strengthening mechanism of granulated blast-furnace slag on the uniaxial compressive strength of modified magnesium slag-based cemented backfilling material. Process Saf. Environ. Prot. 2023, 174, 722–733. [Google Scholar] [CrossRef]
  12. Liu, R.; Ma, X.; Lu, X.; Wang, M.; Wang, P. Automatic extraction of urban road boundaries using diverse LBP features. Natl. Remote Sens. Bull. 2022, 26, 14. [Google Scholar] [CrossRef]
  13. Tao, J.; Chen, Z.; Sun, Z.; Guo, H.; Leng, B.; Yu, Z.; Wang, Y.; He, Z.; Lei, X.; Yang, J. Seg-Road: A Segmentation Network for Road Extraction Based on Transformer and CNN with Connectivity Structures. Remote Sens. 2023, 15, 1602. [Google Scholar] [CrossRef]
  14. Valero, S.; Chanussot, J.; Benediktsson, J.A.; Talbot, H.; Waske, B. Advanced directional mathematical morphology for the detection of the road network in very high resolution remote sensing images. Pattern Recognit. Lett. 2010, 31, 1120–1127. [Google Scholar] [CrossRef]
  15. Shao, Y.; Guo, B.; Hu, X.; Di, L. Application of a fast linear feature detector to road extraction from remotely sensed imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2010, 4, 626–631. [Google Scholar] [CrossRef]
  16. Kahraman, I.; Karas, I.; Akay, A.E. Road extraction techniques from remote sensing images: A review. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, 42, 339–342. [Google Scholar] [CrossRef]
  17. Mattyus, G.; Wang, S.; Fidler, S.; Urtasun, R. Enhancing road maps by parsing aerial images around the world. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1689–1697. [Google Scholar]
  18. Wang, J.; Song, J.; Chen, M.; Yang, Z. Road network extraction: A neural-dynamic framework based on deep learning and a finite state machine. Int. J. Remote Sens. 2015, 36, 3144–3169. [Google Scholar] [CrossRef]
  19. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  20. Zhong, Z.; Li, J.; Cui, W.; Jiang, H. Fully convolutional networks for building and road extraction: Preliminary results. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1591–1594. [Google Scholar]
  21. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  22. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  23. Alshehhi, R.; Marpu, P.R.; Woon, W.L.; Dalla Mura, M. Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2017, 130, 139–149. [Google Scholar] [CrossRef]
  24. Liu, Y.; Yao, J.; Lu, X.; Xia, M.; Wang, X.; Liu, Y. RoadNet: Learning to comprehensively analyze road networks in complex urban scenes from high-resolution remotely sensed images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2043–2056. [Google Scholar] [CrossRef]
  25. Lian, R.; Huang, L. DeepWindow: Sliding window based on deep learning for road extraction from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1905–1916. [Google Scholar] [CrossRef]
  26. Cui, F.; Feng, R.; Wang, L.; Wei, L. Joint Superpixel Segmentation and Graph Convolutional Network Road Extration for High-Resolution Remote Sensing Imagery. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 2178–2181. [Google Scholar]
  27. Li, X.; Wang, Y.; Zhang, L.; Liu, S.; Mei, J.; Li, Y. Topology-Enhanced Urban Road Extraction via a Geographic Feature-Enhanced Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8819–8830. [Google Scholar] [CrossRef]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  29. Xu, G.; Song, T.; Sun, X.; Gao, C. TransMIN: Transformer-Guided Multi-Interaction Network for Remote Sensing Object Detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6000505. [Google Scholar] [CrossRef]
  30. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  33. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  34. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
  35. Xu, Z.; Liu, Y.; Gan, L.; Sun, Y.; Wu, X.; Liu, M.; Wang, L. Rngdet: Road network graph detection by transformer in aerial images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  36. Wang, C.; Xu, R.; Xu, S.; Meng, W.; Wang, R.; Zhang, J.; Zhang, X. Towards accurate and efficient road extraction by leveraging the characteristics of road shapes. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4404616. [Google Scholar] [CrossRef]
  37. Luo, L.; Wang, J.X.; Chen, S.B.; Tang, J.; Luo, B. BDTNet: Road extraction by bi-direction transformer from remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 2505605. [Google Scholar] [CrossRef]
  38. Wang, B.; Liu, Q.; Hu, Z.; Wang, W.; Wang, Y. TERNformer: Topology-enhanced Road Network Extraction by Exploring Local Connectivity. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4406314. [Google Scholar] [CrossRef]
  39. Chen, T.; Jiang, D.; Li, R. Swin transformers make strong contextual encoders for VHR image road extraction. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 3019–3022. [Google Scholar]
  40. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  41. Chen, Z.; Chang, R.; Pei, X.; Yu, Z.; Guo, H.; He, Z.; Zhao, W.; Zhang, Q.; Chen, Y. Tunnel geothermal disaster susceptibility evaluation based on interpretable ensemble learning: A case study in Ya’an–Changdu section of the Sichuan–Tibet traffic corridor. Eng. Geol. 2023, 313, 106985. [Google Scholar] [CrossRef]
  42. Ma, Y.; Chen, D.; Wang, T.; Li, G.; Yan, M. Semi-supervised partial label learning algorithm via reliable label propagation. Appl. Intell. 2023, 53, 12859–12872. [Google Scholar] [CrossRef]
  43. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181. [Google Scholar]
  44. Tan, J.; Gao, M.; Yang, K.; Duan, T. Remote sensing road extraction by road segmentation network. Appl. Sci. 2021, 11, 5050. [Google Scholar] [CrossRef]
  45. Chen, R.; Hu, Y.; Wu, T.; Peng, L. Spatial attention network for road extraction. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 1841–1844. [Google Scholar]
  46. Zhou, G.; Chen, W.; Gui, Q.; Li, X.; Wang, L. Split depth-wise separable graph-convolution network for road extraction in complex environments from high-resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5614115. [Google Scholar] [CrossRef]
  47. Xu, Q.; Long, C.; Yu, L.; Zhang, C. Road Extraction With Satellite Images and Partial Road Maps. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  48. Van Etten, A.; Lindenbaum, D.; Bacastow, T.M. Spacenet: A remote sensing dataset and challenge series. arXiv 2018, arXiv:1807.01232. [Google Scholar]
  49. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  50. Zhu, Q.; Zhang, Y.; Wang, L.; Zhong, Y.; Guan, Q.; Lu, X.; Zhang, L.; Li, D. A global context-aware and batch-independent network for road extraction from VHR satellite imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 353–365. [Google Scholar] [CrossRef]
  51. Zhang, X.; Jiang, Y.; Wang, L.; Han, W.; Feng, R.; Fan, R.; Wang, S. Complex Mountain Road Extraction in High-Resolution Remote Sensing Images via a Light Roadformer and a New Benchmark. Remote Sens. 2022, 14, 4729. [Google Scholar] [CrossRef]
  52. Chu, X.; Tian, Z.; Zhang, B.; Wang, X.; Wei, X.; Xia, H.; Shen, C. Conditional positional encodings for vision transformers. arXiv 2021, arXiv:2102.10882. [Google Scholar]
  53. Islam, M.A.; Jia, S.; Bruce, N.D. How much position information do convolutional neural networks encode? arXiv 2020, arXiv:2001.08248. [Google Scholar]
  54. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  55. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 558–567. [Google Scholar]
  56. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  57. Grauman, K.; Darrell, T. The pyramid match kernel: Discriminative classification with sets of image features. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Beijing, China, 17–21 October 2005; Volume 2, pp. 1458–1465. [Google Scholar]
  58. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  59. Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 2169–2178. [Google Scholar]
  60. Liu, Y.; Shao, Z.; Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
  61. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  62. Hu, K.; Zhang, Z.; Niu, X.; Zhang, Y.; Cao, C.; Xiao, F.; Gao, X. Retinal vessel segmentation of color fundus images using multiscale convolutional neural network with an improved cross-entropy loss function. Neurocomputing 2018, 309, 179–191. [Google Scholar] [CrossRef]
  63. MMLab Contributors. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 26 August 2023).
  64. Bottou, L. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the COMPSTAT’2010; Physica-Verlag HD: Heidelberg, Germany, 2010; pp. 177–186. [Google Scholar]
  65. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  66. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  67. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  68. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  69. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 182–186. [Google Scholar]
  70. Dai, L.; Zhang, G.; Zhang, R. RADANet: Road augmented deformable attention network for road extraction from complex high-resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602213. [Google Scholar] [CrossRef]
  71. Bai, X.; Feng, X.; Yin, Y.; Yang, M.; Wang, X.; Yang, X. Combining Images and Trajectories Data to Automatically Generate Road Networks. Remote Sens. 2023, 15, 3343. [Google Scholar] [CrossRef]
Figure 1. The challenges in road extraction tasks: (a) Highlights the issue of high intra-class differences in road types; (b) Demonstrates the problem of roads being concealed by shadows; (c) Shows the complexity of image backgrounds and the difficulty in distinguishing roads from surrounding objects.
Figure 2. Sample images from the CHN6-CUG Road Dataset.
Figure 3. Roads in the RDCME situated in complex environments: (a) A road along a mountain ridge; (b) A road on a mountain top; (c) A road in close proximity to a river; (d) A road obscured by shadows.
Figure 4. Overall architecture of GLNet.
Figure 5. GIE module.
Figure 6. LIE module.
Figure 7. Structure of the MSF module.
Figure 8. SC-Att module.
Figure 9. Channel attention sub-module.
Figure 10. Spatial attention sub-module.
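Figures 8–10 only name the SC-Att module and its channel and spatial attention sub-modules; no implementation detail is given in this back matter. As a point of reference, the sketch below shows a generic CBAM-style [61] block in PyTorch that re-weights a feature map with channel attention followed by spatial attention. The class names, the reduction ratio of 16, and the 7 × 7 convolution kernel are illustrative assumptions, not the paper's actual SC-Att design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: global avg/max pooling + shared 1x1-conv MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))   # (B, C, 1, 1)
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))    # (B, C, 1, 1)
        return x * torch.sigmoid(avg + mx)            # re-weight channels

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise avg/max maps + 7x7 convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)             # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)            # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                               # re-weight spatial positions

class ChannelSpatialAttention(nn.Module):
    """Sequential channel -> spatial attention, following CBAM [61]."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)                # dummy feature map
    out = ChannelSpatialAttention(64)(feats)
    print(out.shape)                                  # torch.Size([2, 64, 32, 32])

The channel-then-spatial ordering above follows CBAM [61]; the exact arrangement used inside SC-Att may differ.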
Figure 11. Segmentation performance of different models on the CHN6-CUG road dataset, shown as a bar chart.
Figure 12. Visualization results of different methods on the CHN6-CUG dataset.
Figure 13. Segmentation performance of different models on RDCME, shown as a bar chart.
Figure 14. Visualization results of different methods on the RDCME dataset.
Table 1. Ablation experiment results of our model on RDCME. Bold text indicates the result that performed the best.
Components: Local-MSF | Local-SC-Att | Global-without mix-FFN | Global
OA (%) | IoU (%) | F1-Score (%)
98.22 | 79.49 | 88.57
98.58 | 83.23 | 90.85
98.59 | 83.36 | 90.93
96.68 | 62.66 | 77.04
98.2 | 80.05 | 88.92
98.37 | 81.15 | 89.89
98.66 | 83.99 | 91.3
98.73 | 84.97 | 91.88
Table 2. Segmentation performance of different models on the CHN6-CUG road dataset. DANet, DeepLabv3+, PSPNet, and UNet are CNN-based methods for segmentation tasks. Segformer-b5 is a transformer-based method for segmentation. Light Roadformer is a transformer-based road extraction method, whereas RADANet, D-LinkNet, and SPBAM-LinkNet are CNN-based road extraction networks. RADANet and SPBAM-LinkNet are implemented by [70,71], respectively. ‘-’ indicates that the paper did not provide this result. Bold text indicates the result that performed the best.
Model Name | OA (%) | IoU (%) | Recall (%) | F1-Score (%) | Precision (%) | Params
DANet | 97.51 | 62.43 | 72.27 | 76.87 | 82.09 | 49.82 M
Deeplabv3+ | 97.45 | 62.38 | 73.65 | 76.83 | 80.29 | 43.58 M
PSPNet | 97.28 | 59.68 | 70.21 | 74.75 | 79.92 | 134.76 M
Segformer-b5 | 97.19 | 59.09 | 70.97 | 74.29 | 77.94 | 81.97 M
UNet | 97.24 | 59.46 | 70.71 | 74.58 | 78.9 | 29.06 M
Light Roadformer | 97.18 | 59.25 | 71.61 | 74.41 | 77.44 | 68.72 M
D-LinkNet | 97.21 | 55.96 | 61.86 | 71.76 | 85.44 | 52.36 M
RADANet | - | 60.43 | - | 75.34 | - | 73.85 M
SPBAM-LinkNet | 96.95 | - | - | 73.69 | - | -
Ours | 97.49 | 63.27 | 75.51 | 77.51 | 79.61 | 77.95 M
Table 3. Segmentation performance of different models on RDCME. Bold text indicates the result that performed the best.
Model Name | OA (%) | IoU (%) | Recall (%) | F1-Score (%) | Precision (%)
DANet | 98.47 | 82.25 | 90.69 | 90.26 | 89.84
Deeplabv3+ | 98.39 | 81.13 | 88.96 | 89.58 | 90.21
PSPNet | 98.24 | 79.58 | 88.11 | 88.63 | 89.15
Segformer-b5 | 98.54 | 83.02 | 91.64 | 90.72 | 89.82
UNet | 98.6 | 83.44 | 90.31 | 90.97 | 91.64
Light Roadformer | 98.6 | 83.61 | 91.61 | 91.07 | 90.54
D-LinkNet | 98.07 | 77.9 | 87.34 | 87.58 | 87.82
Ours | 98.73 | 84.97 | 91.66 | 91.88 | 92.08
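The metric columns in Tables 1–3 (OA, IoU, recall, F1-score, and precision) follow their standard definitions for binary road/background segmentation. The sketch below is a minimal NumPy reference, assuming per-pixel 0/1 masks, showing how these quantities are conventionally derived from the confusion matrix; the function name and the toy masks are hypothetical and not taken from the paper's evaluation code.

import numpy as np

def binary_segmentation_metrics(pred, target):
    """Compute OA, IoU, precision, recall, and F1 for a binary road mask.

    pred, target: arrays of 0/1 labels with the same shape (1 = road, 0 = background).
    """
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)

    tp = np.logical_and(pred, target).sum()    # road predicted as road
    fp = np.logical_and(pred, ~target).sum()   # background predicted as road
    fn = np.logical_and(~pred, target).sum()   # road predicted as background
    tn = np.logical_and(~pred, ~target).sum()  # background predicted as background

    oa = (tp + tn) / (tp + tn + fp + fn)       # overall accuracy
    iou = tp / (tp + fp + fn)                  # intersection over union (road class)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"OA": oa, "IoU": iou, "Recall": recall, "F1": f1, "Precision": precision}

if __name__ == "__main__":
    # Toy 4x4 masks: the prediction misses one road pixel, so IoU = 4/5 = 0.8.
    pred = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0]])
    target = np.array([[1, 1, 1, 0],
                       [1, 1, 0, 0],
                       [0, 0, 0, 0],
                       [0, 0, 0, 0]])
    print(binary_segmentation_metrics(pred, target))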