Article

An Enhanced Feature Extraction and Multi-Branch Occlusion Discrimination Network for Road Detection from Satellite Imagery

1 College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
2 National Key Laboratory of Equipment State Sensing and Smart Support, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 3037; https://doi.org/10.3390/rs17173037
Submission received: 5 August 2025 / Revised: 27 August 2025 / Accepted: 29 August 2025 / Published: 1 September 2025
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Extracting road network information from satellite remote sensing images is an effective way of dealing with dynamic changes in road networks. At present, the use of deep learning methods to automatically segment road networks from remote sensing images has become mainstream. However, existing methods often produce fragmented extraction results, usually caused by insufficient feature extraction and occlusion. To address these problems, we propose an enhanced feature extraction and multi-branch occlusion discrimination network (EFMOD-Net) based on an encoder–decoder architecture. Firstly, a multi-directional feature extraction (MFE) module is proposed as the input head of the network, which uses multi-directional strip convolution to better capture the linear features of roads. Subsequently, an enhanced feature extraction (EFE) module is designed to improve the model's performance in the feature extraction stage through a dual-branch structure. The proposed multi-branch occlusion discrimination (MOD) module combines an attention mechanism and strip convolution to learn the topological relationships between pixels, strengthen the network's handling of occlusion and complex backgrounds, and reduce fragmented road predictions. On public datasets, the proposed method is compared with other SOTA methods. The experimental results show that the proposed network achieves IoUs of 64.73% and 63.58% on the DeepGlobe and CHN6-CUG datasets, respectively, which are 1.66% and 1.84% higher than those of the best-performing comparison methods. The proposed method combines multi-directional strip convolution and a multi-branch structure for road extraction, offering a new approach to linear object segmentation in complex backgrounds that could be applied directly to urban renewal, disaster assessment, and other application scenarios.

1. Introduction

Roads constitute a vital traffic infrastructure, laying a foundation for various forms of ground transportation. An accurate road network consistent with the real world is very important for various applications, such as autonomous driving [1,2], urban planning [3,4], and geographic information system (GIS) updates [5,6]. To track road changes caused by accidents, natural disasters, policy planning, and other factors, map service providers such as Google apply various road extraction methods to different types of measurement data (LiDAR point clouds, GPS data, or even manually labeled data) collected by patrol vehicles. These methods require a large amount of manpower and resources, yet the accuracy of the extracted road data still cannot be guaranteed. In response to these challenges, researchers have turned to high-resolution remote sensing images, seeking a more accurate, efficient, and cost-effective approach [7,8,9].
Traditional road extraction from remote sensing images typically relies on hand-designed road features, with roads then extracted from the images according to specific rules [10,11]. The accuracy of manually designed features cannot be guaranteed, and these methods are also time-consuming, making real-time application unattainable. With the advancement of deep learning (DL), convolutional neural networks (CNNs) have found applications in a wide range of computer vision domains. In particular, fully convolutional networks [12] have proven effective in semantic segmentation tasks [13,14,15,16]. Currently, mainstream research also applies convolutional neural networks with encoder–decoder structures to road extraction from remote sensing images, such as LinkNet [17], CoANet [18], and DeepRoadMapper [19]. These networks establish long-distance contextual relationships through cross-layer connections, enabling the network to learn road information more effectively. These models have demonstrated their ability to extract road networks quickly and efficiently on several public datasets [20,21,22]. However, deep learning-based road extraction from remote sensing images still faces several challenges, such as occlusion caused by non-road objects, including buildings, street trees, and vehicles; complex traffic environments; and the irregular shapes of roads themselves. These issues often result in fragmented extraction results. To address the aforementioned challenges, scholars have tried various methods of improving the model's ability, including contextual information modeling, multi-scale and multi-branch feature extraction, feature recombination, and specialized convolution development, to enhance segmentation performance. Lu et al. [23] proposed a global perception road detection network based on multi-scale residual learning (GAMS-Net). This network employs multi-scale residual learning to obtain multi-scale features and utilizes global perception operations to capture spatial contextual dependencies and inter-channel dependencies. Zhu et al. [24] used dilated convolutions with different dilation rates to extract the multi-scale features of roads in parallel. Mosinska et al. [25] designed a new loss term for the U-Net network, combined with other loss functions, to refine the image segmentation results. Yang et al. [26] designed a specialized convolution module that uses strip convolutions in four different directions to process the feature map, thereby capturing the topological structure of the road and addressing the issue of road occlusion. Liu et al. [43] integrated multi-level features such as road edges, centerlines, and road surfaces to provide additional information and enhance the model's learning performance.
In addition to the above methods, some studies have introduced attention mechanisms into the network to enable more effective focus on the feature representation of roads. For example, Zhang et al. [28] designed an extended convolutional strip attention (DCSA) module to focus on the characteristics of the vertical and horizontal directions of the road. Hou et al. [29] proposed a novel attention mechanism that embeds location information while weighting channels to enhance the feature representation of objects of interest. Xu et al. [30] proposed IDANet, which uses an iterative D-LinkNet model with an attention module to improve network segmentation performance. Although attention mechanisms are an effective way to improve feature extraction and the model's understanding of roads, they cannot by themselves learn the relationships between pixels, so road fragmentation caused by occlusion persists.
This paper proposes an enhanced feature extraction and multi-branch occlusion discrimination network (EFMOD-Net) based on an encoder–decoder architecture. Firstly, the multi-directional feature extraction (MFE) module is used as the input head of the network, applying multi-directional strip convolution for feature extraction. Square convolution is inconsistent with the slender and irregular shape of roads, which limits the learning of linear road features. Therefore, the MFE module uses four strip convolutions in different directions to fully capture road features. In addition, the enhanced feature extraction (EFE) module supplements feature learning through an additional branch to enhance the feature extraction ability of the model, while the multi-branch occlusion discrimination (MOD) module uses an attention mechanism and strip convolution to learn the topological relationships between adjacent pixels and alleviate road fragmentation caused by occlusion. Our contributions are summarized as follows:
  • A multi-directional feature extraction module is proposed to improve the model’s ability to extract linear road features.
  • An enhanced feature extraction module is designed, which utilizes additional branches to supplement feature information and enhance the learning of road features.
  • A multi-branch occlusion discrimination module is designed. It uses an attention mechanism and multi-directional strip convolution to learn the topological structure between adjacent pixels and reduce road fragmentation.
  • An enhanced feature extraction and multi-branch occlusion discrimination network is proposed to enhance road extraction performance and improve the accuracy of road extraction from remote sensing images. Compared with other methods, it achieves better results on widely used public datasets.
The remainder of this paper is organized as follows: Section 2 introduces related work on road extraction. Section 3 provides a detailed explanation of the overall network architecture and the structure of each module. Section 4 describes the specific details of the experiments, including datasets and experimental platforms, and presents comparison results with state-of-the-art models on different public datasets. Finally, Section 5 concludes the paper with a summary and discussion.

2. Related Work

2.1. Input Header

The input header, serving as the initial stage of the network architecture, is designed to separate and extract preliminary features from input images through channel expansion. This process separates features into different channels for subsequent feature extraction. Previous studies have used different methods to achieve this operation: Zhou et al. [31] utilized three consecutive 3 × 3 convolutional layers in their SGCN network for channel expansion, followed by downsampling via a stride convolution and a two-layer square convolution for initial feature extraction. Similarly, LinkNet [17] adopted a 7 × 7 convolution kernel for feature separation, complemented by a 3 × 3 convolution kernel for downsampling before transmitting features to downstream layers.
However, conventional square convolutions exhibit inherent limitations in capturing road features. Roads typically have elongated, curvilinear structures, which square convolution kernels struggle to represent effectively due to their isotropic nature and lack of rotational invariance [32]. This often leads to information loss during the initial feature encoding phase; information lost at the input-header stage results in insufficient feature extraction in subsequent stages, which usually causes roads to be misclassified or missed. Most current studies adopt a pretrained model as the encoder network, but few focus on the design of the input header of the network.

2.2. Feature Extraction

Road segmentation in high-resolution remote sensing imagery predominantly employs encoder–decoder architectures due to their inherent capability to establish long-range contextual relationships through cross-layer connections. These networks simultaneously integrate global semantic information with fine-grained local feature representation, thereby optimizing segmentation precision. Within this framework, the encoder module assumes critical responsibility for hierarchical feature extraction, where the efficacy of this process directly determines the ultimate segmentation performance. Consequently, the selection of backbone networks constitutes a pivotal factor influencing model outcomes, as it fundamentally governs the quality of multi-scale feature representation and information propagation throughout the network hierarchy.
The emergence of residual networks (ResNet) [33] marked a significant breakthrough in deep learning, effectively addressing the performance degradation problem in deep neural networks through innovative residual connections. This architectural advancement has enabled the development of substantially deeper networks while maintaining training stability, leading to ResNet's widespread adoption across numerous computer vision tasks, including road segmentation from remote sensing imagery. In road segmentation applications, ResNet's powerful hierarchical feature extraction capabilities have made it a predominant choice for encoder structures. Several notable implementations demonstrate this trend: Lu et al. [34] developed the CasMT framework, incorporating LinkNet50 with a ResNet50 backbone for robust road surface feature extraction. Similarly, Li et al. [35] proposed MBRE-Net, utilizing a U-shaped D-LinkNet architecture with ResNet34 in the encoder stage, while Zhou et al. [31] employed ResNet50 as the primary feature extraction backbone in their SGCN network. Wang et al. [36] adopted a lightweight ResNet18 encoder in UNetFormer to perform semantic segmentation of urban scenes.
A fundamental limitation persists across these architectures: the inevitable loss of global contextual information and progressive focus on local features during deep feature extraction, compounded by information degradation through successive downsampling operations. At the same time, shallow ResNet or simple network layers will lead to insufficient feature extraction, which increases the difficulty of segmenting roads that account for a small proportion of the image.
Research proves that a multi-branch structure can also enhance the feature extraction ability of the model. Xin et al. [37] proposed GPINet and designed an encoder with a dual-branch structure and a local–global interaction module (LGIM) to make full use of the local and global context for feature refinement. Wang et al. [38] proposed an FDNet with a dual-branch structure, which is used to enhance the model’s ability to extract high-frequency and low-frequency information and to reduce distortion while completing image compression.
Multi-scale convolution is a good method for enhancing feature extraction. For example, Kim et al. [39] developed a multi-scale convolutional neural network composed of parallel convolution paths with different kernel sizes, extracted features from multiple timescales, and applied it to the fault diagnosis of rotating machines, achieving good results. Wang et al. [40] proposed MSTA-YOLO, which combines multi-scale convolution with an attention mechanism to enhance the feature extraction ability of the model so as to effectively detect landslides. Xie et al. [41] proposed SDDGRNets for change detection in remote sensing images, using multi-scale convolution to improve the representation of salient features.
In summary, the multi-scale convolution and multi-branch method can help the network extract more feature information and enhance its feature extraction performance, which guides our research.

2.3. Occlusion Discrimination

Connectivity is one of the most important features of roads and is essential for autonomous driving, vehicle navigation, and path planning. The connectivity of roads significantly affects the final path selection and planning results. However, remote sensing images often contain many occlusions on roads, such as roadside trees, tall buildings, shadows, and parked vehicles, which can cover parts of the road surface, leading to road fragmentation in the prediction results. In response to this issue, recent studies have given it considerable attention and proposed various methods to address road occlusion.
The challenge of maintaining road connectivity in occluded scenarios has prompted various methodological innovations. Máttyus et al. [19] developed an approach combining encoder–decoder segmentation with shortest-path-based post-processing to infer missing connections in aerial imagery. Bastani et al. [8] introduced RoadTracer, an iterative graph-based method that progressively constructs road networks through CNN-guided node prediction. Alternative strategies have focused on architectural modifications, such as the topology-aware loss function proposed by Mosinska et al. [25] for U-Net architectures, which explicitly preserves network connectivity during segmentation refinement. Similarly, Batra et al. [9] demonstrated that the joint optimization of directional features and segmentation through multi-branch convolution could enhance topological accuracy, though such iterative methods often incur substantial computational overhead.
Recent advances have explored multi-task learning frameworks to address connectivity challenges. Zhang et al. [42] developed a dual-branch network for simultaneous prediction of node confidence and connectivity graphs, while Liu et al. [43] employed hierarchical feature learning to jointly extract road surfaces, edges, and centerlines. The RoadCorrector framework by Li et al. [35] further advanced this paradigm through specialized extraction and fusion of road surfaces, centerlines, and intersections. Complementary approaches have investigated data fusion strategies, with Zhang et al. [27] and Xu et al. [44] demonstrating that GPS trajectory integration can effectively compensate for occlusion-induced information loss in optical imagery but requires complex preprocessing of trajectory data.
However, the use of post-processing or hierarchical prediction methods increases the time consumption and computational cost of the entire road extraction pipeline. Introducing additional data to supplement the missing information usually requires data collection and complex preprocessing to adapt it to the semantic segmentation task, which also increases the cost of the overall task.

3. Methodology

In this section, we will first introduce the overall architecture of EFMOD-Net and then detail each part of the entire network, namely the MFE, EFE, and MOD modules.

3.1. EFMOD-Net

The proposed network architecture is illustrated in Figure 1. This encoder–decoder structure has gained widespread adoption in image segmentation tasks due to its ability to establish long-range contextual relationships through cross-layer connections while maintaining fine-grained spatial information. The entire network can be divided into three parts: the encoder, the decoder, and the output head. The encoder consists of the MFE module, the EFE module, and the Atrous Spatial Pyramid Pooling (ASPP) module, which work together to complete feature extraction. The MFE module employs a multi-branch structure and multi-directional strip convolution for preliminary feature extraction, aiming to capture more detailed road features. The EFE module utilizes additional residual branches from the input source to reintroduce global information into the corresponding feature extraction stage, supplementing important features that may have been missed in the early stages of feature extraction. As the final part of the encoder, the ASPP module uses convolution kernels with different dilation rates to extract road features and perform feature fusion; convolutions with different dilation rates enlarge the receptive field and capture features at multiple scales. The decoder consists of four MOD modules. In each MOD module, the input feature map is enhanced, and strip convolution is applied to discriminate connectivity. The PSP_Head, a module of the Pyramid Scene Parsing Network [45], is selected as the output head of the network. The final feature map is pooled using four pooling windows of different sizes, and the resulting feature maps are upsampled and fused to integrate information from different scales. Finally, prior information is incorporated into the fused feature map, enabling more detailed modeling of road information and improving segmentation accuracy.
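To make the ASPP component concrete, the sketch below shows a minimal PyTorch implementation of parallel dilated convolutions followed by fusion. The dilation rates and channel widths are illustrative assumptions of ours, since the paper does not list the exact values.

```python
import torch
import torch.nn as nn


class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel 3x3 convolutions with different
    dilation rates (plus a 1x1 branch), concatenated and fused with a 1x1
    convolution, enlarging the receptive field and capturing multi-scale context."""

    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            if r == 1:
                self.branches.append(nn.Conv2d(in_ch, out_ch, 1))
            else:
                self.branches.append(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r))
        self.fuse = nn.Conv2d(len(rates) * out_ch, out_ch, 1)

    def forward(self, x):
        # all branches preserve the spatial size, so their outputs can be concatenated
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```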

3.2. MFE-Block

The input head of a road segmentation network serves as a crucial transitional component between the raw input data and the encoder; it is responsible for initial feature separation and preliminary extraction.
Current implementations, as evidenced by multiple studies [17,18,31], commonly rely on conventional square convolution operations for channel expansion and early-stage feature processing. While this approach demonstrates effectiveness in general computer vision tasks, it presents significant limitations when applied to road segmentation due to a fundamental mismatch between the isotropic nature of square convolution kernels and the anisotropic characteristics of road networks. The symmetric receptive fields of conventional square convolution kernels prove particularly inadequate for capturing the slender, curvilinear geometric patterns characteristic of road networks. This architectural limitation frequently leads to inefficient representation of linear features during the initial encoding phase, where critical topological information may be lost. The loss of critical information during initial processing stages may result in insufficient feature extraction, which can ultimately lead to partial road omission in the final segmentation results.
The slender shape of the strip convolution highly matches the shape of the road, enabling the extraction of linear road features and significantly reducing interference from irrelevant information. However, real-world road networks exhibit complex directional variations that require adaptive feature capture. To address this, our proposed MFE module incorporates strip convolutions along multiple orientations to comprehensively characterize road features. This design significantly enhances the network’s ability to preserve critical structural information during initial feature extraction, effectively mitigating the common problem of early-stage information loss. The detailed architecture and implementation of this innovative MFE module are presented in Figure 2.
First, given an input $X \in \mathbb{R}^{N \times M}$, $Z$ is obtained through feature separation with a $3 \times 3$ convolution kernel. Then, within a residual-block structure, $t_{[h,w]}^{ang}$ denotes strip convolution in four different directions ($ang \in \{0°, 45°, 90°, 135°\}$), where $h$ and $w$ define the shape and size of the strip convolution kernel and $ang$ denotes its orientation. Multi-directional feature extraction of the input feature map yields $z_i$ ($i = 1, 2, 3, 4$), and $y$ is obtained by concatenating these features along the channel dimension. Finally, the feature $Z$ obtained in the first step is added to $y$ to produce the output feature map $F$. The computation of the MFE module can be expressed as
$$Z = k \cdot X, \quad z_i = t_{[h,w]}^{ang} \cdot Z, \quad y = Cat(z_1, z_2, z_3, z_4), \quad F = y + Z$$
where $k$ denotes the $3 \times 3$ separation convolution and $Cat$ stands for concatenation along the channel dimension.
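As a concrete illustration of the MFE computation above, the following PyTorch sketch implements the four-direction strip convolutions and the residual sum F = y + Z. The diagonal (45°/135°) strip convolutions are realized here by masking a square kernel along its diagonal, and the per-branch channel reduction is our own assumption; the paper does not specify these implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiagonalStripConv(nn.Module):
    """Strip convolution along a diagonal (45° or 135°), realized by masking a
    k x k kernel so that only its (anti-)diagonal entries are active."""

    def __init__(self, in_ch, out_ch, k=5, anti=False):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        mask = torch.eye(k)
        if anti:
            mask = torch.flip(mask, dims=[1])
        self.register_buffer("mask", mask.view(1, 1, k, k))
        self.pad = k // 2

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, padding=self.pad)


class MFEBlock(nn.Module):
    """Input head: 3x3 feature separation (Z = k·X), four-direction strip
    convolutions, channel-wise concatenation (y), and a residual sum (F = y + Z)."""

    def __init__(self, in_ch=3, out_ch=64, k=5):
        super().__init__()
        self.sep = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        branch = out_ch // 4                        # each direction produces out_ch/4 channels
        self.reduce = nn.Conv2d(out_ch, branch, 1)
        self.h = nn.Conv2d(branch, branch, (1, k), padding=(0, k // 2))   # 0°
        self.v = nn.Conv2d(branch, branch, (k, 1), padding=(k // 2, 0))   # 90°
        self.d45 = DiagonalStripConv(branch, branch, k, anti=False)       # 45°
        self.d135 = DiagonalStripConv(branch, branch, k, anti=True)       # 135°

    def forward(self, x):
        z = self.sep(x)                             # Z = k · X
        r = self.reduce(z)
        y = torch.cat([self.h(r), self.v(r), self.d45(r), self.d135(r)], dim=1)
        return y + z                                # F = y + Z
```

As a quick shape check, `MFEBlock()(torch.randn(1, 3, 512, 512))` returns a 1 × 64 × 512 × 512 tensor, i.e., the spatial resolution is preserved while the channels are expanded.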

3.3. EFE Module

In the encoder–decoder structure, the encoder is primarily responsible for feature extraction, which critically impacts the final results. As the network depth increases, the extracted features tend to focus more on local details, while information fusion may cause a loss of key global features. To address these issues, this paper proposes the EFE module, whose overall architecture is illustrated in Figure 3. This module comprises two distinct branches designed to capture complementary features for improved road segmentation performance. The multi-branch structure can process the feature map in parallel; by applying different operations in each branch, the model obtains richer semantic features and stronger feature extraction performance. The main branch focuses on extracting detailed spatial features from the encoder's intermediate outputs. The input $F_i$ corresponds to the output of each block in the encoder. In the main branch, the input features first pass through a max pooling layer to reduce the resolution and then undergo horizontal and vertical strip convolutions, $t_{1 \times 3}$ and $t_{3 \times 1}$, to learn road features, resulting in the main-branch feature map $y_b$. Simultaneously, the auxiliary branch processes the original input image $X$ to provide complementary global information. The image is first max-pooled at a specified magnification to match the resolution of the corresponding main-branch feature. A feature separation operation is then applied to align the channel dimensions. Inspired by residual learning paradigms, we employ a modified residual block to extract hierarchical features $y_{res}$. Subsequently, the feature maps of the two branches are concatenated, and the output features are obtained through a $1 \times 1$ convolution layer. The calculation process from the input to the output of the EFE module can be expressed as follows:
$$y_b = Maxpool(F_i) \cdot t_{1 \times 3} \cdot t_{3 \times 1}$$
$$y_{res} = DS_n(X) \cdot k_3 \cdot k_3 + DS(X)$$
$$F_{i+1} = Cat(y_{res}, y_b) \cdot k_1$$
where $Maxpool$ represents the max pooling operation; $DS$ represents the downsampling operation, with subscript $n$ denoting the specific downsampling factor; and $k$ denotes a convolution kernel, with its subscript giving the kernel size (for example, $k_3$ represents a convolution kernel of size $3 \times 3$). Since the additional branch directly downsamples the original input image and injects this information into the corresponding feature extraction stage, the network can learn more features.
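The sketch below is one possible PyTorch realization of the two EFE branches, under our own assumptions about channel widths and the image downsampling factor; the paper specifies the operations but not these hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EFEBlock(nn.Module):
    """Dual-branch enhanced feature extraction. Main branch: max pooling followed
    by 1x3 and 3x1 strip convolutions (y_b). Auxiliary branch: the original image
    is max-pooled to the same resolution, channel-aligned, and passed through a
    small residual block (y_res). The branches are concatenated and fused by a
    1x1 convolution to give the next-stage feature F_{i+1}."""

    def __init__(self, in_ch, out_ch, img_ch=3, img_down=4):
        super().__init__()
        self.img_down = img_down                   # downsampling factor DS_n for the image branch
        self.pool = nn.MaxPool2d(2)
        self.strip_h = nn.Conv2d(in_ch, in_ch, (1, 3), padding=(0, 1))
        self.strip_v = nn.Conv2d(in_ch, in_ch, (3, 1), padding=(1, 0))
        self.align = nn.Conv2d(img_ch, in_ch, 1)   # feature separation / channel alignment
        self.res = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
        )
        self.fuse = nn.Conv2d(2 * in_ch, out_ch, 1)

    def forward(self, feat, image):
        # main branch: y_b = Maxpool(F_i) * t_{1x3} * t_{3x1}
        y_b = self.strip_v(self.strip_h(self.pool(feat)))
        # auxiliary branch: the caller must choose img_down so that the pooled
        # image matches the resolution of y_b
        x = self.align(F.max_pool2d(image, self.img_down))
        y_res = self.res(x) + x
        # fusion: F_{i+1} = Cat(y_res, y_b) * k_1
        return self.fuse(torch.cat([y_res, y_b], dim=1))
```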

3.4. MOD Module

The decoder leverages the semantic features extracted by the encoder to progressively upsample the image’s spatial features, thereby restoring its original resolution. However, remote sensing images frequently suffer from occlusion issues, where roadside trees, vehicles, and tall buildings obstruct road surfaces, leading to fragmented road topology and inadequate feature representation. Conventional square convolution struggles to address these occlusions effectively. Given the elongated and directional nature of roads, strip convolution proves more suitable for capturing their structural patterns [18,26].
Meanwhile, attention mechanisms have been widely adopted due to their ability to enhance feature discrimination. By adaptively weighting feature maps, these mechanisms enable the network to concentrate on semantically critical regions, improving localized feature extraction. Building on these insights, we propose the MOD decoder module, which integrates the strengths of multi-head attention and strip convolution. This design facilitates multi-scale fine-grained representation across network layers while improving the discrimination of road topology under occlusions, as illustrated in Figure 4.
The MOD module processes input features through the following steps. Given the input feature map $E_i$, it is first split into four segments along the channel dimension. For the first three branches, following a lightweight alternative to the Vision Transformer (ViT), a learnable parameter $p_j$ is introduced to multiply the feature map element-wise and capture global semantic information; depthwise separable convolution is applied to the fourth branch. The feature map $s_j$ obtained by each branch is weighted channel-wise, and the branches are reorganized along the channel dimension back to the same channel dimension as the input, resulting in the weighted and recombined feature $L_i$. Then, four strip convolutions of different orientations $t_{[h,w]}^{ang}$ are applied to predict the connectivity of the road. Finally, the output feature map $E_{i+1}$ is obtained. The branch outputs of the MOD module can be expressed as
$$s_j = \begin{cases} e_j \odot p_j, & 1 \le j \le 3 \\ DW(e_j), & j = 4 \end{cases}$$
where D W stands for depthwise separable convolution and ⊙ stands for the Hadamard product. The recombination feature can be expressed as follows:
$$L_i = Cat(SE(s_j)), \quad j = 1, 2, 3, 4$$
where C a t represents the splicing operation, and SE represents the channel attention. The specific implementation of SE can refer to the work of Jie et al. [46]. The implementation process of SE is as follows:
$$Y = X \cdot Sigmoid(LN(LN(G_p(X))))$$
where X is the input, Y is the output, Gp represents the global average pooling, Sigmoid is the activation function, and LN is the fully connected layer. The final output can be expressed as follows:
$$E_{i+1} = Cat\left(t_{[h,w]}^{ang} \cdot L_i\right), \quad ang \in \{0°, 45°, 90°, 135°\}$$
In the multi-branch architecture, feature maps undergo channel-wise grouping and adaptive weighting, enabling the network to prioritize road regions and enhance feature representation. The subsequent occlusion discrimination module dynamically adjusts receptive fields through variable-size strip convolution kernels. This design allows for contextual learning from peripheral pixels for connectivity inference, while slanted strip convolutions, as shown in Figure 4, extend spatial perception beyond conventional horizontal/vertical orientations by capturing diagonal connectivity patterns. Through concatenation and hierarchical fusion of these multi-orientation features, the system effectively corrects occluded road predictions.
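To make the MOD computation concrete, the sketch below gives our hedged PyTorch interpretation of one MOD block, including a minimal SE channel attention in the form of the equation above (global pooling, two fully connected layers, sigmoid gate). It reuses the DiagonalStripConv class from the MFE sketch; the ReLU between the two fully connected layers, the SE reduction ratio, and the per-direction output width are our assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Channel attention: Y = X * Sigmoid(LN(LN(Gp(X)))), with Gp the global
    average pooling and LN fully connected layers."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                                 # Gp(X)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))   # two FC layers + sigmoid gate
        return x * w.view(n, c, 1, 1)


class MODBlock(nn.Module):
    """Multi-branch occlusion discrimination: split E_i into four channel groups,
    scale three groups by learnable parameters p_j and pass the fourth through a
    depthwise-separable convolution, re-weight each group with SE and concatenate
    (L_i), then apply four-direction strip convolutions and concatenate (E_{i+1})."""

    def __init__(self, channels, k=9):
        super().__init__()
        assert channels % 4 == 0
        g = channels // 4
        self.p = nn.ParameterList([nn.Parameter(torch.ones(1, g, 1, 1)) for _ in range(3)])
        self.dw = nn.Sequential(                               # depthwise separable convolution
            nn.Conv2d(g, g, 3, padding=1, groups=g),
            nn.Conv2d(g, g, 1),
        )
        self.se = nn.ModuleList([SEBlock(g) for _ in range(4)])
        d = channels // 4                                      # per-direction output width (assumption)
        self.conv_h = nn.Conv2d(channels, d, (1, k), padding=(0, k // 2))   # 0°
        self.conv_v = nn.Conv2d(channels, d, (k, 1), padding=(k // 2, 0))   # 90°
        self.conv_d45 = DiagonalStripConv(channels, d, k, anti=False)       # 45°
        self.conv_d135 = DiagonalStripConv(channels, d, k, anti=True)       # 135°

    def forward(self, e):
        e1, e2, e3, e4 = torch.chunk(e, 4, dim=1)
        s = [e1 * self.p[0], e2 * self.p[1], e3 * self.p[2], self.dw(e4)]
        l = torch.cat([se(x) for se, x in zip(self.se, s)], dim=1)          # L_i
        return torch.cat([self.conv_h(l), self.conv_v(l),
                          self.conv_d45(l), self.conv_d135(l)], dim=1)      # E_{i+1}
```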

4. Experiments

In this section, we conduct extensive comparative experiments to validate the effectiveness of the proposed model. The experimental details will be systematically presented, including dataset descriptions, evaluation metrics, implementation settings, and result analyses.
To ensure scientific rigor and fairness in the model comparison while enhancing experimental reproducibility, we integrate all baseline models into a unified evaluation framework. The subsequent sections will detail the experimental framework specifications and hardware configurations used.

4.1. Datasets

The models in this experiment were trained and evaluated on two datasets: DeepGlobe (DP) and CHN6-CUG (CHN6).
(1)
DeepGlobe [20]: This dataset was released during the 2018 CVPR DeepGlobe Road Extraction Challenge. It consists of 8570 high-resolution remote sensing images from India, Thailand, and Indonesia, of which 6226 images are annotated. The image size is 1024 × 1024 pixels, with a resolution of 0.5 m per pixel. The dataset covers a variety of scenarios, including but not limited to urban, rural, coastal, and tropical forest areas. For the experiments, we selected the 6226 labeled images and divided them into training and test sets at a ratio of 85% to 15%. During training, a random cropping augmentation strategy was adopted: the 1024 × 1024 images were cropped into 512 × 512 patches using a sliding window with a step size of 340 pixels (a minimal sketch of this cropping follows the dataset list).
(2)
CHN6 [22]: This dataset consists of high-resolution remote sensing images from six cities with varying urbanization levels, urban scales, development degrees, urban structures, and historical and cultural backgrounds. These cities include Chaoyang District of Beijing, Yangpu District of Shanghai, downtown Wuhan, Nanshan District of Shenzhen, Sha Tin District of Hong Kong, and Macao, China. The dataset contains a total of 4511 images, each with a size of 512 × 512 pixels and a resolution of 0.5 m per pixel. In the experiments, the same dataset from the original paper was used, with 3608 images allocated for training and 903 images for testing. No data augmentation strategy was applied to this dataset during the experiments.
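For reference, the following NumPy helper, an illustrative sketch of ours rather than the authors' code, generates the 512 × 512 sliding-window patches with a 340-pixel step described for DeepGlobe; random cropping during training can then sample from these patches.

```python
import numpy as np


def sliding_window_crops(image, label, patch=512, step=340):
    """Crop an image/label pair (H x W x C and H x W) into patch x patch tiles
    with the given step; the last window is clamped to the image border so that
    the full extent is covered."""
    h, w = image.shape[:2]
    ys = list(range(0, h - patch, step)) + [h - patch]
    xs = list(range(0, w - patch, step)) + [w - patch]
    crops = []
    for y in ys:
        for x in xs:
            crops.append((image[y:y + patch, x:x + patch],
                          label[y:y + patch, x:x + patch]))
    return crops


# Example: a 1024 x 1024 tile yields a 3 x 3 grid of patches at offsets 0, 340, and 512.
patches = sliding_window_crops(np.zeros((1024, 1024, 3), dtype=np.uint8),
                               np.zeros((1024, 1024), dtype=np.uint8))
assert len(patches) == 9
```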

4.2. Implementation Details

All the road extraction networks used in this experiment were implemented on the MMSegmentation platform, which is part of the OpenMMLab series jointly developed by SenseTime and the Chinese University of Hong Kong. This framework is built on the deep learning framework PyTorch [47] and integrates numerous open-source algorithms. The hardware used includes 4 × NVIDIA Tesla V100 GPUs, and the operating system is Ubuntu 22.04.3. For training, the optimizer selected is AdamW, with an initial learning rate of 0.002 and a weight decay of 0.05. The loss function combines binary cross-entropy (BCE) loss and Dice loss. Training follows an iterative approach with a total of 320,000 iterations, divided into two stages: a warm-up stage and a formal training stage. Specifically, a linear learning rate is applied during the first 100 iterations for warm-up, followed by a polynomial learning rate schedule for the remaining iterations. This combination accelerates model convergence. The final output of the model is determined by selecting the prediction with the highest confidence. The experimental parameters are summarized in Table 1. The loss function is calculated as follows:
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \cdot \log(\bar{y}_i) + (1 - y_i) \cdot \log(1 - \bar{y}_i) \right]$$
$$L_{Dice} = \frac{2\sum_{i=1}^{N} (y_i \bar{y}_i)}{\sum_{i=1}^{N} y_i^2 + \sum_{i=1}^{N} \bar{y}_i^2}$$
$$Loss = (1 - L_{Dice}) + L_{BCE}$$
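A compact PyTorch sketch of this combined objective is given below; it assumes the network output has already been passed through a sigmoid so that `pred` contains road probabilities.

```python
import torch


def bce_dice_loss(pred, target, eps=1e-6):
    """Loss = (1 - L_Dice) + L_BCE for a predicted road probability map `pred`
    and a binary ground-truth mask `target` of the same shape."""
    pred = pred.clamp(eps, 1 - eps)
    bce = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()
    dice = (2 * (pred * target).sum()) / ((pred ** 2).sum() + (target ** 2).sum() + eps)
    return (1 - dice) + bce
```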

4.3. Evaluation Metrics

In this paper, we choose precision (P), recall (R), F1-score (F1), and Intersection over Union (IoU) as the evaluation metrics for model performance in semantic segmentation tasks. All four metrics can be calculated from the confusion matrix. Among them, F1 and IoU are the most comprehensive evaluation indicators. The calculation formulas of these four indicators are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2TP}{2TP + FP + FN}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
where TP, FP, and FN represent the true positive, false positive, and false negative pixels in the prediction map, respectively.
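The sketch below computes these four metrics from binary prediction and ground-truth masks; it is a straightforward NumPy helper provided for illustration.

```python
import numpy as np


def road_metrics(pred, gt, eps=1e-9):
    """Precision, recall, F1, and IoU for binary masks (road = 1, background = 0)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```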

4.4. Experiment 1: DeepGlobe Dataset

We conducted comparative experiments on the DeepGlobe dataset between the proposed EFMOD-Net and several state-of-the-art road segmentation models from recent years. These models include U-Net [48], DeepLabv3 [49], LinkNet [17], D-LinkNet [50], MACU-Net [51], RCFS-Net [52], MSMDFF-Net [53], CARE-Net [54], and CMLFormer [55]. To ensure a fair comparison, all these models were implemented within the MMSegmentation framework, with unified configuration files for training and testing [56].
The quantitative evaluation presented in Table 2 clearly demonstrates the superior performance of our EFMOD-Net architecture using the DeepGlobe road extraction benchmark. The proposed model achieves SOTA performance with an F1-score of 78.69% and an IoU of 64.73%, representing significant improvements of +1.24% and +1.66%, respectively, over the previous best-performing model MSMDFF-Net. These metrics, which, respectively, reflect the balanced precision–recall characteristics and spatial overlap accuracy, collectively validate the effectiveness of our architectural innovations.
This performance improvement can be attributed to the additional feature compensation branch in the encoder of the proposed network. This branch supplements feature information from the input to the corresponding feature extraction stages, enabling the network to capture more detailed road information during feature extraction. As a result, the proposed network can segment roads that are challenging for other models to predict. Furthermore, the connectivity discrimination module enhances the network’s focus on road regions by weighting the feature maps, thereby improving the modeling of road topology. This enhancement strengthens the connectivity of the segmented roads and reduces road fragmentation. Additionally, the input header designed to reduce information loss in the early stages is also one of the reasons for performance improvement.
Qualitative analysis: Figure 5 shows a partial visualization of the experimental results on the DeepGlobe dataset. The occlusion in this dataset mainly comes from vegetation, and such dense occlusion often causes many models to fail to segment the road. However, as the visualization results show, the multi-scale feature extraction of the EFE and MFE modules enhances the model's ability to capture road linearity and its feature extraction performance, so our proposed model successfully segments roads with high precision. Even in areas where segmentation performance is generally poor, our model extracts the road to the greatest extent, producing more accurate and clearer segmentation results. For example, in the images in rows 2 and 4, dense vegetation covers almost all parts of the road, but the model proposed in this paper still obtains the best segmentation results compared with the other models. In addition, buildings with colors similar to the road often mislead models into identifying non-road areas as roads. For example, in the third row, the color of the road is very similar to that of the buildings, and almost all the models segmented this image incorrectly; MSMDFF-Net, for instance, mistook the field-shaped building areas for roads. In contrast, our proposed network produces the fewest erroneous segmentations among all the models. Moreover, our model also achieves clearer and more accurate results in correctly segmented regions. These experimental results show that the proposed method can extract more comprehensive road information. Furthermore, the connectivity discrimination module enhances road connectivity and reduces road fragmentation, further improving the segmentation quality.

4.5. Experiment 2: CHN6 Dataset

The CHN6 dataset presents substantially greater challenges for road extraction compared to conventional benchmarks, characterized by its complex urban scenes, heterogeneous road typologies, and diverse occlusion patterns. As quantitatively demonstrated in Table 3, our proposed network achieves superior performance, with an IoU of 63.58% and an F1-score of 76.74%, representing improvements of +1.84% for IoU and +1.41% for F1 over the previous state-of-the-art model (MSMDFF-Net). The CHN6 dataset was collected from six representative cities in China, primarily focusing on urban scenes. From the images in the dataset, it can be observed that the roads exhibit diverse shapes and are subject to more complex occlusions compared to the DeepGlobe dataset, particularly due to shadows from dense and tall buildings. This significantly increases the difficulty of road extraction. The experimental results demonstrate that our proposed model maintains strong performance even in such complex scenarios.
Qualitative analysis: Figure 6 shows a partial visualization of the experimental results on the CHN6 dataset. From the visualization results, it can be observed that the roads in the city are complex and changeable, and dense urban agglomerations appear in the images. Some high-rise buildings and their shadows block the roads, so the model cannot obtain road information from the occluded areas, which creates significant challenges for segmentation. Thanks to the MOD module, feature weighting directs attention to the road regions, and strip convolution extracts the relationship between a target pixel and its neighboring pixels. Even in such complex urban environments, our method achieves optimal performance.
We also compare the parameters and computation speeds of the different models. From the data in Table 4, it can be seen that the parameter count of the proposed method is not the lowest; the models with the closest IoU scores have fewer parameters. At the same time, the proposed model is not the fastest: compared with models using standard convolution, each strip convolution covers a smaller area, so its computation is slower. This also results in lower FLOPs for this model.

4.6. Ablation Experiment

The effectiveness of the proposed modules: In EFMOD-Net, we designed the MFE and EFE modules to enhance feature extraction performance in the encoder, aiming to acquire more useful information during the feature extraction phase and enable the network to learn more road features. The MOD module, designed for the decoder, focuses on the road areas of interest and learns the topological structure of roads by weighting feature maps and applying multi-directional strip convolutions, thereby reducing road fragmentation caused by occlusions. To validate the effectiveness of the proposed modules, we conducted experiments with different configurations on the DeepGlobe and CHN6 datasets, selecting various combinations of the three aforementioned modules. The evaluation metrics chosen were F1 and IoU, two of the most representative assessment indicators. The experimental results are presented in Table 5. When the MFE and EFE modules were removed, F1 decreased by 1.4% and IoU by 1.97% on the DeepGlobe dataset, and F1 decreased by 1.02% and IoU by 1.35% on the CHN6 dataset. This reflects that removing these two modules greatly weakens the feature extraction ability of the model.
In the MFE module, we employed strip convolutions and conducted experiments with varying kernel sizes to determine the optimal configuration. The evaluations were performed on the DeepGlobe dataset, with detailed results documented in Table 6.
In order to determine the optimal kernel size for the MOD module, we conducted ablation studies with different kernel sizes. The experimental results in Table 7 show that the model's performance peaks when the kernel size is set to 9.
The MOD module uses the attention mechanism to make the model pay more attention to the road part. Figure 7 shows a heatmap of the ablation experiment for the MOD module. In this experiment, we plotted a heatmap of the last encoder output in the network. It can be seen from the figure that after using the attention mechanism to weigh the feature map, the weight of the road part increased, and the model paid more attention to the road part.

5. Discussion

For road segmentation in remote sensing images, the small proportion of roads in an image, the limited information contained, and the interference of complex backgrounds pose a great challenge for the extraction of roads. Our method aims to more accurately extract a greater number of roads from high-resolution remote sensing images, so three different modules are proposed to improve the performance of road extraction.
From the results of the ablation experiments, it can be seen that the enhanced feature extraction network composed of the EFE and MFE modules can strengthen the feature extraction ability of the model. In the MFE module, multi-directional strip convolution can be used to learn the slender linear features of the road while extracting multi-scale features. At the same time, the EFE module uses an additional auxiliary branch to extract features from the input image and injects them into the corresponding backbone stage after downsampling by the corresponding factor, which reduces the loss of features and supplements the features at each stage. The data in Table 5 show that replacing the MFE and EFE modules in the encoder with standard convolution significantly reduces the final F1 and IoU scores.
For instance, removing both modules results in a decrease of 1.4% in F1 and 1.97% in IoU on the DeepGlobe dataset, and a decrease of 1.02% in F1 and 1.35% in IoU on the CHN6 dataset. The encoder section is crucial for feature extraction, and the removal of these modules severely impairs this capability, leading to insufficient learning of road features and consequent omission of certain road sections. The function of the MOD module is to enhance the network's ability to predict occluded road sections. This module first directs the model's attention toward the road regions by weighting the feature map and then learns the features of the occluded parts through strip convolution, thereby capturing the topological structure of the road. Removing the MOD module results in an insufficient ability to handle occlusions, leading to road fragmentation. In the experimental results on the DeepGlobe dataset, removing the MOD module reduced the F1-score by 0.51% and the IoU score by 0.69%. Similarly, on the CHN6 dataset, removing the MOD module reduced the F1-score by 0.58% and the IoU score by 0.65%. Figure 7 presents a heatmap of the ablation experiment for the MOD module. In this experiment, we plotted a heatmap of the output from the last encoder in the network. As shown in the figure, the addition of the MOD module enables the model to focus more on learning road features. When the MOD module is removed, the attention weight assigned to road regions decreases, particularly in areas that are challenging to segment. For example, in the case of DP_2, the road in the upper-left corner of the building area is narrow and partially obscured by shadows from trees and buildings. The heatmap reveals that the MOD module helps the model pay more attention to this occluded road section. Additionally, as seen in CHN6_2, the introduction of the MOD module reduces the attention given to non-road regions, thereby lowering the likelihood of misclassification.
The experimental results in Table 6 show that the performance of the MFE module peaks when the kernel size is set to 5. This is because, in high-resolution remote sensing images, the road itself is slender, and strip convolution can capture this feature well, but only with a relatively small capture area; an oversized convolution kernel would cover a large amount of background information. The ablation experiment on the strip convolution kernel size of the MOD module shows that the module performs best when the kernel size is 9. In the MOD module, strip convolution is used to learn the relationship between the center pixel of the covered area and the surrounding pixels in order to reduce road fragmentation. An even larger kernel covers more pixels and introduces more background pixels, which adversely affects the model's judgment.
The ablation results demonstrate the proposed modules’ efficacy in enhancing segmentation accuracy for high-resolution remote sensing imagery. Moreover, when all three modules are used, the network achieves its highest performance. This further demonstrates the overall effectiveness of the proposed network.

6. Conclusions and Future Work

This paper proposes an EFMOD-Net based on a new encoder–decoder structure for road extraction from high-resolution remote sensing images. The proposed network aims to solve two key challenges in remote sensing road segmentation: insufficient feature extraction and serious road fragmentation.
In order to solve the above two challenges, we propose EFMOD-Net. In the proposed network, we design three key modules: MFE, EFE, and MOD. The MFE module uses strip convolution in different directions and works with the dual-branch EFE module and the ASPP module to form an encoder that enhances the feature extraction ability of the network. The MOD modules constitute the decoder, which focuses on learning the topology of the road and predicting occluded road sections by integrating multi-directional strip convolution and attention mechanisms. Additionally, an output head is employed to further enhance the modeling of road spatial features. In order to verify the effectiveness of the proposed method, comparative experiments with multiple SOTA models were carried out on the DeepGlobe and CHN6 datasets. The experimental results show that the network designed in this paper achieves IoUs of 64.73% and 63.58% on the DeepGlobe and CHN6-CUG datasets, respectively, which are 1.66% and 1.84% higher than those of the best-performing comparison method. Finally, we performed extensive ablation studies on each proposed module, confirming their individual effectiveness and the rationality of the parameter settings within each module.
However, because the proposed model makes extensive use of strip convolution, whose computation is less efficient than that of standard convolution, the overall computational efficiency of the model is not high, and its parameter count is larger than that of the baseline models. In future research, we will focus on reducing the number of parameters and improving the computational efficiency of the model.

Author Contributions

Conceptualization, R.W., X.N., L.G. and L.Z.; methodology, R.W., L.Z. and X.N.; software, R.W., X.N. and L.G.; validation, L.Z. and L.G.; data curation, R.W. and J.G.; Writing—original draft, R.W. and L.Z.; writing—review and editing, R.W. and L.Z.; visualization, R.W. and X.N.; supervision, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available at https://github.com/foeverai/EFMOD-NET (accessed on 25 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, T.; Xie, Y.; Ding, M.; Yang, L.; Tomizuka, M.; Wei, Y. A road surface reconstruction dataset for autonomous driving. Sci. Data 2024, 11, 459. [Google Scholar] [CrossRef]
  2. Wen, C.; Sun, X.; Li, J.; Wang, C.; Guo, Y.; Habib, A. A deep learning framework for road marking extraction, classification and completion from mobile laser scanning point clouds. ISPRS J. Photogramm. Remote Sens. 2019, 147, 178–192. [Google Scholar] [CrossRef]
  3. Zhong, Z.; Li, J.; Cui, W.; Jiang, H. Fully convolutional networks for building and road extraction: Preliminary results. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1591–1594. [Google Scholar] [CrossRef]
  4. Cui, J.; Liu, F.; Janssens, D.; An, S.; Wets, G.; Cools, M. Detecting urban road network accessibility problems using taxi GPS data. J. Transp Geogr. 2016, 51, 147–157. [Google Scholar] [CrossRef]
  5. Campbell, A.; Both, A.; Sun, Q.C. Detecting and mapping traffic signs from Google Street View images using deep learning and GIS. Comput. Environ. Urban Syst. Comput. 2019, 77, 101350. [Google Scholar] [CrossRef]
  6. Nouriddine, H.; Achraf, O.; Abdelilah, R.; Mohamed, B.; Soufiane, M.; Amine, A. GIS-Based Methodology for Assessing Public Transport Accessibility: A Case Study of Marrakech, Morocco. In Proceedings of the 2024 IEEE 15th International Colloquium of Logistics and Supply Chain Management (LOGISTIQUA), Sousse, Tunisia, 2–4 May 2024; pp. 1–9. [Google Scholar] [CrossRef]
  7. Alshehhi, R.; Marpu, P. Hierarchical graph-based segmentation for extracting road networks from high-resolution satellite images. ISPRS J. Photogramm. Remote Sens. 2017, 126, 245–260. [Google Scholar] [CrossRef]
  8. Bastani, F.; He, S.; Abbar, S.; Alizadeh, M.; Balakrishnan, H.; Chawla, S.; Madden, S.; DeWitt, D. Roadtracer: Automatic extraction of road networks from aerial images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4720–4728. [Google Scholar] [CrossRef]
  9. Patil, D.; Jadhav, S. Road Network Extraction Using Multi-path Cascade Convolution Neural Network from Remote Sensing Images. J. Indian Soc. Remote Sens. 2024, 52, 525–541. [Google Scholar] [CrossRef]
  10. Laptev, I.; Mayer, H.; Lindeberg, T.; Eckstein, W.; Steger, C.; Baumgartner, A. Automatic extraction of roads from aerial images based on scale space and snakes. Mach. Vis. Appl. 2000, 12, 23–31. [Google Scholar] [CrossRef]
  11. Chai, D.; Forstner, W.; Lafarge, F. Recovering line-networks in images by junction-point processes. In Proceedings of the 2013 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 1894–1901. [Google Scholar] [CrossRef]
  12. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell 2017, 39, 640–651. [Google Scholar] [CrossRef]
  13. Yu, H.; Xiao, Z.; Fang, Z. A real-time semantic segmentation network with multi-path structure. In Proceedings of the 2022 10th International Conference on Information Systems and Computing Technology (ISCTech), Virtual, Online, China, 28–30 December 2022; pp. 523–527. [Google Scholar] [CrossRef]
  14. Zhang, Y.; Wang, Y.; Wei, Y.; Feng, B.; Hu, X.; Guo, D.; Zou, B. Weak-shot Semantic Segmentation Based on Encoder-Decoder Structure. In Proceedings of the 2023 6th International Conference on Software Engineering and Computer Science (CSECS), Chengdu, China, 22–24 December 2023; pp. 1–6. [Google Scholar] [CrossRef]
  15. Yin, H.; Jin, G.; Hua, J.; Chen, C. Using Deep Semantic Segmentation Model to Divide the Human Body Images. In Proceedings of the 2023 4th International Conference on Computers and Artificial Intelligence Technology (CAIT), Macau, China, 13–15 December 2023; pp. 143–146. [Google Scholar] [CrossRef]
  16. Das, P.K.; Sahu, A.; Xavy, D.V.; Meher, S. A Deforestation Detection Network Using Deep Learning-Based Semantic Segmentation. IEEE Sens. Lett. 2024, 8, 1–4. [Google Scholar] [CrossRef]
  17. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar] [CrossRef]
  18. Mei, J.; Li, R.; Gao, W.; Cheng, M. CoANet: Connectivity attention network for road extraction from satellite imagery. IEEE Trans. Image Process. 2021, 30, 8540–8552. [Google Scholar] [CrossRef]
  19. Mattyus, G.; Luo, W.; Urtasun, R. DeepRoadMapper: Extracting road topology from aerial images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3458–3466. [Google Scholar] [CrossRef]
  20. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181. [Google Scholar] [CrossRef]
  21. VanEtten, A.; Lindenbaum, D.; Bacastow, T. SpaceNet: A Remote Sensing Dataset and Challenge Series. arXiv 2018, arXiv:1807.01232. [Google Scholar]
  22. Zhu, Q.; Zhang, Y.; Wang, L.; Zhong, Y.; Guan, Q.; Lu, X.; Zhang, L.; Li, D. A Global Context-aware and Batch-independent Network for road extraction from VHR satellite imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 353–365. [Google Scholar] [CrossRef]
  23. Lu, X.; Zhong, Y.; Zheng, Z.; Zhang, L. GAMSNet: Globally aware road detection network with multi-scale residual learning. ISPRS J. Photogramm. Remote Sens. 2021, 175, 340–352. [Google Scholar] [CrossRef]
  24. Kumar, K.M. RoadTransNet: Advancing remote sensing road extraction through multi-scale features and contextual information. Signal Image Video Process. 2024, 18, 2403–2412. [Google Scholar] [CrossRef]
  25. Mosinska, A.; Marquez-Neila, P.; Kozinski, M.; Fua, P. Beyond the pixel-wise loss for topology-aware delineation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 3136–3145. [Google Scholar] [CrossRef]
  26. Yang, R.; Zhong, Y.; Liu, Y.; Lu, X.; Zhang, L. Occlusion-Aware Road Extraction Network for High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  27. Zhang, J.; Hu, Q.; Li, J.; Ai, M. Learning From GPS Trajectories of Floating Car for CNN-Based Urban Road Extraction With High Resolution Satellite Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 1836–1847. [Google Scholar] [CrossRef]
  28. Zhang, Y.; Zhang, L.; Wang, Y.; Xu, W. AGF-Net: Adaptive global feature fusion network for road extraction from remote-sensing images. Complex Intell. Syst. 2024, 10, 4311–4328. [Google Scholar] [CrossRef]
  29. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, Online, USA, 19–25 June 2021; pp. 13708–13717. [Google Scholar] [CrossRef]
  30. Xu, B.; Bao, S.; Zheng, L.; Wu, G.Z.W. IDANet: Iterative D-LinkNets with Attention for Road Extraction from High-Resolution Satellite Imagery. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Beijing, China, 29 October–1 November 2021; pp. 140–152. [Google Scholar] [CrossRef]
  31. Zhou, G.; Chen, W.; Gui, Q.; Li, X.; Wang, L. Split Depth-Wise Separable Graph-Convolution Network for Road Extraction in Complex Environments From High-Resolution Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  32. Phaye, S.S.R.; Sikka, A.; Dhall, A.; Bathula, D. Dense and diverse capsule networks: Making the capsules learn better. arXiv 2018, arXiv:1805.04001. [Google Scholar] [CrossRef]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar] [CrossRef]
  34. Lu, X.; Zhong, Y.; Zheng, Z.; Chen, D.; Su, Y.; Ma, A.; Zhang, L. Cascaded Multi-task Road Extraction Network for Road Surface, Centerline, and Edge Extraction. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  35. Li, J.; He, J.; Li, W.; Chen, J.; Yu, J. RoadCorrector: A Structure Aware Road Extraction Method for Road Connectivity and Topology Correction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–18. [Google Scholar] [CrossRef]
  36. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, W.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  37. Li, X.; Xu, F.; Liu, F.; Tong, Y.; Lyu, X.; Zhou, J. Semantic Segmentation of Remote Sensing Images by Interactive Representation Refinement and Geometric Prior-Guided Inference. IEEE Trans. Geosci. Remote Sens. 2023, 62, 1–18. [Google Scholar] [CrossRef]
  38. Wang, J.; Ling, Q. FDNet: Frequency Decomposition Network for Learned Image Compression. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11241–11255. [Google Scholar] [CrossRef]
  39. Kim, H.; Park, C.H.; Suh, C.; Chae, M.; Yoon, H.; Youn, B.D. MPARN: Multi-scale path attention residual network for fault diagnosis of rotating machines. J. Comput. Des. Eng. 2023, 10, 860–872. [Google Scholar] [CrossRef]
  40. Wang, B.; Su, J.; Xi, J.; Chen, Y.; Cheng, H.; Li, H.; Chen, C.; Shang, H.; Yang, Y. Landslide Detection with MSTA-YOLO in Remote Sensing Images. Remote Sens. 2025, 17, 2795. [Google Scholar] [CrossRef]
  41. Xie, Z.; Wan, G.; Yin, Y.; Sun, G.; Bu, D. SDDGRNets: Level–Level Semantically Decomposed Dynamic Graph Reasoning Network for Remote Sensing Semantic Change Detection. Remote Sens. 2025, 17, 2641. [Google Scholar] [CrossRef]
  42. Zhang, J.; Hu, X.; Wei, Y.; Zhang, L. Road Topology Extraction From Satellite Imagery by Joint Learning of Nodes and Their Connectivity. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  43. Liu, Y.; Yao, J.; Lu, X.; Xia, M.; Wang, X.; Liu, Y. RoadNet: Learning to comprehensively analyze road networks in complex urban scenes from high-resolution remotely sensed images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 2043–2056. [Google Scholar] [CrossRef]
  44. Xu, Y.; Shi, Z.; Xie, X.; Chen, Z.; Xie, Z. Residual Channel Attention Fusion Network for Road Extraction Based on Remote Sensing Images and GPS Trajectories. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8358–8369. [Google Scholar] [CrossRef]
  45. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar] [CrossRef]
  46. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  47. Paszke, A.; Gross, S.; Massa, F. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 1–12. [Google Scholar]
  48. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar] [CrossRef]
  49. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  50. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar] [CrossRef]
  51. Li, R.; Duan, C.; Zheng, S.; Zhang, C.; Atkinson, P.M. MACU-Net for semantic segmentation of fine-resolution remotely sensed images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  52. Yang, Z.; Zhou, D.; Yang, Y.; Zhang, J.; Chen, Z. Road Extraction From Satellite Imagery by Road Context and Full-Stage Feature. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  53. Wang, Y.; Tong, L.; Luo, S.; Xiao, F.; Yang, J. A Multiscale and Multidirection Feature Fusion Network for Road Detection From Satellite Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–18. [Google Scholar] [CrossRef]
  54. Qu, Z.; Li, M.; Chen, Z. CARENet: Satellite Imagery Road Extraction via Context-Aware and Road Enhancement. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
  55. Wu, H.; Zhang, M.; Huang, P.; Tang, W. CMLFormer: CNN and multiscale local-context transformer network for remote sensing images semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7233–7241. [Google Scholar] [CrossRef]
  56. Wang, H.; Bai, L.; Xue, D.; Momi, M.C.; Ye, Z.; Quan, S. FRCFNet: Feature Reassembly and Context Information Fusion Network for Road Extraction. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of EFMOD-Net: the encoder is composed of a multi-directional feature extraction (MFE) module and an enhanced feature extraction (EFE) module, the decoder is built from the multi-branch occlusion discrimination (MOD) module, and a PSP output head forms the final stage of the network.
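To make the data flow in Figure 1 easier to follow, the sketch below gives a minimal, illustrative PyTorch skeleton of the encoder–decoder layout described in the caption. The internals of the MFE, EFE, and MOD modules and of the PSP head are stand-in layers only, and all channel widths and stage counts are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class EFMODSkeleton(nn.Module):
    """Illustrative layout only: MFE stem -> EFE encoder stages
    -> MOD decoder stages with skip connections -> PSP-style output head."""
    def __init__(self, in_ch=3, base=64, num_classes=1):
        super().__init__()
        self.mfe = nn.Conv2d(in_ch, base, 3, padding=1)  # placeholder for the MFE stem
        self.enc1 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2), nn.ReLU(inplace=True))
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 2, stride=2), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(base, num_classes, 1)  # placeholder for the PSP output head

    def forward(self, x):
        x0 = self.mfe(x)
        x1 = self.enc1(x0)        # EFE stage 1 (placeholder)
        x2 = self.enc2(x1)        # EFE stage 2 (placeholder)
        d2 = self.dec2(x2) + x1   # MOD decoder stage with skip connection (placeholder)
        d1 = self.dec1(d2) + x0   # MOD decoder stage with skip connection (placeholder)
        return self.head(d1)      # per-pixel road logits

if __name__ == "__main__":
    out = EFMODSkeleton()(torch.randn(1, 3, 512, 512))
    print(out.shape)  # torch.Size([1, 1, 512, 512])
```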
Figure 2. Multi-directional feature extraction (MFE) module.
Figure 3. The structure of enhanced feature extraction (EFE) module.
Figure 4. The structure of the multi-branch occlusion discrimination (MOD) module.
Figure 5. Qualitative comparison of our method with several other state-of-the-art methods on the DeepGlobe dataset (a–d). White represents true positives, green represents false positives, and red represents false negatives.
Figure 6. Qualitative comparison of our method with several other state-of-the-art methods on the CHN6 dataset (a–d). White represents true positives, green represents false positives, and red represents false negatives.
Figure 7. The ablation experiment for the MOD module and the heatmaps of the final encoder output (a–d).
Table 1. Experimental settings under which, unless otherwise specified, all semantic segmentation models in this experiment were trained and evaluated.
Entries | Detail Setting
System | Ubuntu 22.04.3
GPU | 4 × NVIDIA Tesla V100
CUDA version | 11.8
PyTorch | 1.8.1
MMSegmentation | 1.2.2
Compiler | Python 3.8
Optimizer | AdamW
lr | 0.002
Weight Decay | 0.05
Training Strategy | LinearLR + PolyLR
Loss | $L_{BCE} + L_{Dice}$
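As a reproducibility aid, the following is a minimal PyTorch sketch of the optimizer, learning-rate schedule, and combined loss listed in Table 1. It assumes plain torch APIs from a recent PyTorch release (the paper itself used PyTorch 1.8.1 through MMSegmentation 1.2.2, where these settings are expressed as config entries), and the warmup and total iteration counts below are placeholders not taken from the paper.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LinearLR, PolynomialLR, SequentialLR

def bce_dice_loss(logits, target, eps=1e-6):
    """Combined loss L_BCE + L_Dice on binary road masks (standard formulation)."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)
    return bce + dice

model = nn.Conv2d(3, 1, 1)  # stand-in for the segmentation network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.002, weight_decay=0.05)

# LinearLR warmup followed by polynomial decay, mirroring "LinearLR + PolyLR".
# Warmup length and total iterations are assumptions, not values from the paper.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=1000),
        PolynomialLR(optimizer, total_iters=80000, power=1.0),
    ],
    milestones=[1000],
)
```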
Table 2. The experimental results on the DeepGlobe dataset, showing the evaluation metrics of each network on the test set.
Method | P (%) | R (%) | F1 (%) | Road IoU (%)
U-Net | 84.29 | 69.7 | 76.31 | 61.69
Deeplabv3 | 79.18 | 74.73 | 76.89 | 62.45
LinkNet | 81.16 | 69.61 | 74.94 | 59.93
D-LinkNet | 77.59 | 73.41 | 75.44 | 60.57
MACU-Net | 77.84 | 74.77 | 76.27 | 61.65
RCFS-Net | 76.02 | 77.54 | 74.24 | 62.06
MSMDFF-Net | 79.27 | 75.53 | 77.35 | 63.07
CARE-Net | 81.3 | 73.36 | 77.13 | 62.77
CMLFormer | 80.14 | 70.92 | 75.25 | 60.32
EFMOD-Net (ours) | 79.35 | 77.84 | 78.59 | 64.73
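The precision, recall, F1, and road IoU values in Tables 2 and 3 follow the standard pixel-level definitions; the short function below shows how such numbers are typically computed from binary prediction and ground-truth masks (the paper's exact evaluation code is not shown, so this is the assumed standard formulation).

```python
import numpy as np

def road_metrics(pred, gt):
    """Pixel-level P, R, F1, and IoU for the road (positive) class from binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    iou = tp / (tp + fp + fn + 1e-9)
    return precision, recall, f1, iou
```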
Table 3. The experimental results on the CHN6 dataset, showing the evaluation metrics of each network on the test set.
Method | P (%) | R (%) | F1 (%) | Road IoU (%)
U-Net | 77.0 | 67.93 | 72.18 | 56.47
Deeplabv3 | 76.62 | 69.12 | 72.67 | 57.08
LinkNet | 76.75 | 70.16 | 73.31 | 57.86
D-LinkNet | 75.98 | 71.67 | 73.76 | 58.43
MACU-Net | 77.4 | 73.13 | 75.2 | 60.26
RCFS-Net | 76.15 | 74.5 | 75.31 | 60.40
MSMDFF-Net | 77.02 | 75.68 | 76.33 | 61.74
CARE-Net | 74.95 | 76.15 | 75.54 | 60.7
CMLFormer | 76.13 | 74.23 | 75.17 | 60.22
EFMOD-Net (ours) | 77.61 | 77.87 | 77.74 | 63.58
Table 4. Comparison of FLOPs and Params for different models.
Method | Input Size | Params (M) | FLOPs (G)
U-Net | 512 × 512 | 39.5 | 35.23
Deeplabv3 | 512 × 512 | 65.74 | 270
LinkNet | 512 × 512 | 21.67 | 27.12
D-LinkNet | 512 × 512 | 31.11 | 29.59
MACU-Net | 512 × 512 | 5.15 | 29.7
RCFS-Net | 512 × 512 | 76.74 | 729.59
MSMDFF-Net | 512 × 512 | 39.27 | 138
CARE-Net | 512 × 512 | 45.63 | 160.81
CMLFormer | 512 × 512 | 22.64 | 95.36
EFMOD-Net (ours) | 512 × 512 | 43.6 | 93.89
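The paper does not state which profiler produced the numbers in Table 4. Parameter counts can be read directly from a PyTorch model, and FLOPs are commonly measured with a third-party tool such as thop; the sketch below illustrates one such (assumed) setup, noting that thop reports multiply–accumulate operations, which different papers convert to FLOPs in different ways.

```python
import torch
from thop import profile  # third-party profiler; one common choice, not necessarily the one used here

def complexity(model, input_size=(1, 3, 512, 512)):
    # Parameters in millions, counted directly from the model.
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    # thop returns multiply-accumulate operations (MACs); many papers report this value as "FLOPs".
    macs, _ = profile(model, inputs=(torch.randn(*input_size),), verbose=False)
    gflops = macs / 1e9
    return params_m, gflops
```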
Table 5. Ablation study results for the three proposed modules: MFE, EFE, and MOD.
MFE | EFE | MOD | DeepGlobe F1 (%) | DeepGlobe IoU (%) | CHN6 F1 (%) | CHN6 Road IoU (%)
 | | | 76.58 | 62.04 | 74.64 | 59.51
 | | | 78.08 | 64.04 | 77.16 | 62.93
 | | | 77.19 | 62.76 | 75.34 | 60.44
 | | | 77.57 | 63.36 | 76.14 | 61.47
 | | | 78.59 | 64.73 | 77.74 | 63.58
Table 6. Ablation results of the proposed MFE on DeepGlobe. “Kernel_size” is the size of the strip convolution kernel in MFE.
MFE | P (%) | R (%) | F1 (%) | Road IoU (%)
Kernel size = 3 | 77.11 | 76.39 | 77.95 | 64.39
Kernel size = 5 | 79.35 | 77.84 | 78.59 | 64.73
Kernel size = 7 | 78.18 | 73.35 | 77.78 | 63.63
Kernel size = 9 | 77.61 | 77.86 | 77.47 | 63.59
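Tables 6 and 7 vary the length k of the strip convolution kernels. As an illustration of the basic operation (not the authors' MFE or MOD implementation, which combines more directions and branches with attention), the block below applies horizontal (1 × k) and vertical (k × 1) strip convolutions and sums the results; diagonal directions would require rotated or custom kernels and are omitted here.

```python
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    """Horizontal and vertical strip convolutions of length k, fused by summation.
    Illustrative sketch only."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0))

    def forward(self, x):
        # Long, thin kernels aggregate context along road-like linear structures.
        return self.horizontal(x) + self.vertical(x)

if __name__ == "__main__":
    y = StripConvBlock(channels=64, k=5)(torch.randn(1, 64, 128, 128))
    print(y.shape)  # torch.Size([1, 64, 128, 128])
```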
Table 7. Ablation results of the proposed MOD on DeepGlobe. “Kernel_size” is the size of the strip convolution kernel in MOD.
MOD | P (%) | R (%) | F1 (%) | Road IoU (%)
Kernel size = 3 | 73.73 | 76.26 | 77.79 | 62.32
Kernel size = 5 | 76.21 | 75.68 | 77.74 | 63.59
Kernel size = 7 | 77.04 | 74.54 | 78.18 | 64.17
Kernel size = 9 | 79.35 | 77.84 | 78.59 | 64.73
Kernel size = 11 | 76.33 | 74.83 | 77.87 | 63.76
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
