1. Introduction
Roads constitute a vital traffic infrastructure, laying the foundation for various forms of ground transportation. An accurate road network consistent with the real world is very important for applications such as autonomous driving [1,2], urban planning [3,4], and geographic information system (GIS) updates [5,6]. To track road changes caused by accidents, natural disasters, policy planning, etc., map service providers like Google apply various road extraction methods to different types of measurement data (LiDAR point clouds, GPS data, or even manually labeled data) collected by patrol vehicles. These methods require a large amount of manpower and resources, yet the accuracy of the extracted roads cannot be guaranteed. In response to these challenges, researchers have turned to high-resolution remote sensing images, seeking a more accurate, efficient, and cost-effective approach [7,8,9].
Traditional road extraction from remote sensing images typically requires road features to be designed by hand first, with roads then extracted according to specific rules [10,11]. The accuracy of manually designed features cannot be guaranteed, and these methods are time-consuming, making real-time application unattainable. With the advancement of deep learning (DL), convolutional neural networks (CNNs) have found applications in a wide range of computer vision domains. In particular, fully convolutional networks [12] have proven effective in semantic segmentation tasks [13,14,15,16]. Mainstream research on road extraction from remote sensing images likewise relies on convolutional networks with encoder–decoder structures, such as LinkNet [17], CoANet [18], and DeepRoadMapper [19]. These networks establish long-distance context relationships through cross-layer connections, enabling the network to learn road information more effectively, and they have demonstrated the ability to extract road networks quickly and efficiently on several public datasets [20,21,22]. However, deep learning-based road extraction from remote sensing images still faces several challenges, such as occlusion caused by non-road objects (buildings, street trees, and vehicles), complex traffic environments, and the irregular shapes of roads themselves. These issues often result in fragmented extraction results. To address these challenges, scholars have tried various ways of improving model capability, including contextual information modeling, multi-scale and multi-branch feature extraction, feature recombination, and specialized convolution design, to enhance segmentation performance. Lu et al. [23] proposed a global perception road detection network based on multi-scale residual learning (GAMS-Net), which employs multi-scale residual learning to obtain multi-scale features and global perception operations to capture spatial contextual and inter-channel dependencies. Zhu et al. [24] used dilated convolutions with different dilation rates to extract multi-scale road features in parallel. Mosinska et al. [25] designed a new loss term for the U-Net architecture, combined with other loss functions, to refine segmentation results. Yang et al. [26] designed a specialized convolution module that applies strip convolutions in four different directions to the feature map, thereby capturing the topological structure of the road and addressing the issue of road occlusion. Liu et al. [27] integrated multi-level features such as road edges, centerlines, and road surfaces to provide additional information and enhance the model's learning performance.
In addition to the above methods, some studies have introduced attention mechanisms into the network to focus more effectively on road feature representations. For example, Zhang et al. [28] designed an extended convolutional strip attention (DCSA) module to attend to road characteristics in the vertical and horizontal directions. Hou et al. [29] proposed a novel attention mechanism that embeds location information while weighting channels to enhance the feature representation of objects of interest. Xu et al. [30] proposed IDANet, which uses an iterative D-LinkNet model with an attention module to improve segmentation performance. Although attention mechanisms improve the model's understanding of the road and are an effective way to strengthen feature extraction, they cannot learn the relationships between pixels, so road fragmentation caused by occlusion persists.
This paper proposes an enhanced feature extraction and multi-branch occlusion discrimination network (EFMOD-Net) based on an encoder–decoder architecture. Firstly, a multi-directional feature extraction (MFE) module serves as the input header of the network, performing feature extraction with multi-directional strip convolutions. Square convolutions are ill-matched to the slender, irregular shape of roads, which limits the learning of linear road features; the MFE module therefore uses four strip convolutions in different directions to fully capture road features. In addition, an enhanced feature extraction (EFE) module supplements feature learning through additional branches to strengthen the feature extraction ability of the model, while a multi-branch occlusion discrimination (MOD) module uses an attention mechanism and strip convolutions to learn the topological relationships between adjacent pixels, alleviating road fragmentation caused by occlusion. Our contributions are summarized as follows:
A multi-directional feature extraction module is proposed to improve the model’s ability to extract linear road features.
An enhanced feature extraction module is designed, which utilizes additional branches to supplement feature information and enhance the learning of road features.
A multi-branch occlusion discrimination module is designed. It uses an attention mechanism and multi-directional strip convolutions to learn the topological structure between adjacent pixels and reduce road fragmentation.
An enhanced feature extraction and multi-branch occlusion discrimination network (EFMOD-Net) is proposed to improve the accuracy of road extraction from remote sensing images. Compared with other methods, it achieves better results on widely used public datasets.
The remainder of this paper is organized as follows: Section 2 introduces related work on road extraction. Section 3 provides a detailed explanation of the overall network architecture and the structure of each module. Section 4 describes the specific details of the experiments, including datasets and experimental platforms, and presents comparison results with state-of-the-art models on different public datasets. Section 5 discusses the experimental results, and Section 6 concludes the paper and outlines future work.
2. Related Work
2.1. Input Header
The input header, serving as the initial stage of the network architecture, is designed to separate and extract preliminary features from input images through channel expansion. This process separates features into different channels for subsequent feature extraction. Previous studies have approached this operation differently: Zhou et al. [31] utilized three consecutive convolutional layers in their SGCN network for channel expansion, followed by downsampling via a strided convolution and a two-layer square convolution for initial feature extraction. Similarly, LinkNet [17] adopted a large convolution kernel for feature separation, complemented by a strided operation for downsampling before transmitting features to downstream layers.
However, conventional square convolutions exhibit inherent limitations in capturing road features. Roads typically have elongated, curvilinear structures, which square convolution kernels struggle to represent effectively due to their isotropic nature and lack of rotational invariance [32]. This often leads to information loss during the initial feature encoding phase; losing information at the input header stage results in insufficient feature extraction in subsequent stages, which usually causes roads to be misclassified or missed. Most current studies adopt a pretrained model as the encoder network, but few focus on the design of the input header.
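To illustrate the alternative pursued in this paper, the following is a minimal PyTorch sketch of an input header built from directional strip convolutions. It is a simplified illustration: only the horizontal and vertical strips are shown (diagonal strips require custom kernels), and the channel width and kernel length are placeholder values rather than our actual MFE configuration.

```python
import torch
import torch.nn as nn

class StripConvStem(nn.Module):
    """Illustrative input header built from directional strip convolutions.

    Horizontal (1 x k) and vertical (k x 1) strips respond to elongated
    structures that a square kernel averages away. Channel width and kernel
    length are placeholder values, not the actual MFE configuration.
    """

    def __init__(self, in_ch=3, out_ch=64, k=9):
        super().__init__()
        pad = k // 2
        self.h = nn.Conv2d(in_ch, out_ch, kernel_size=(1, k), padding=(0, pad))
        self.v = nn.Conv2d(in_ch, out_ch, kernel_size=(k, 1), padding=(pad, 0))
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Each branch emphasizes one orientation; a 1x1 conv fuses them.
        return self.fuse(torch.cat([self.h(x), self.v(x)], dim=1))


stem = StripConvStem()
print(stem(torch.randn(1, 3, 512, 512)).shape)  # torch.Size([1, 64, 512, 512])
```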
2.2. Feature Extraction
Road segmentation in high-resolution remote sensing imagery predominantly employs encoder–decoder architectures due to their inherent capability to establish long-range contextual relationships through cross-layer connections. These networks simultaneously integrate global semantic information with fine-grained local feature representation, thereby optimizing segmentation precision. Within this framework, the encoder module assumes critical responsibility for hierarchical feature extraction, where the efficacy of this process directly determines the ultimate segmentation performance. Consequently, the selection of backbone networks constitutes a pivotal factor influencing model outcomes, as it fundamentally governs the quality of multi-scale feature representation and information propagation throughout the network hierarchy.
The emergence of residual networks (ResNet) [33] marked a significant breakthrough in deep learning, effectively addressing the performance degradation problem in deep neural networks through residual connections. This architectural advance has enabled the development of substantially deeper networks while maintaining training stability, leading to ResNet's widespread adoption across numerous computer vision tasks, including road segmentation from remote sensing imagery. In road segmentation applications, ResNet's powerful hierarchical feature extraction has made it a predominant choice for encoder structures. Several notable implementations demonstrate this trend: Lu et al. [34] developed the CasMT framework, incorporating LinkNet50 with a ResNet50 backbone for robust road surface feature extraction. Similarly, Li et al. [35] proposed MBRE-Net, utilizing a U-shaped D-LinkNet architecture with ResNet34 in the encoder stage, while Zhou et al. [31] employed ResNet50 as the primary feature extraction backbone in their SGCN network. Wang et al. [36] adopted a lightweight ResNet18 encoder in UNetFormer for the semantic segmentation of urban scenes.
A fundamental limitation persists across these architectures: the inevitable loss of global contextual information and a progressive focus on local features during deep feature extraction, compounded by information degradation through successive downsampling operations. At the same time, a shallow ResNet or a network with few layers leads to insufficient feature extraction, increasing the difficulty of segmenting roads, which occupy only a small proportion of the image.
Research shows that a multi-branch structure can also enhance the feature extraction ability of a model. Xin et al. [37] proposed GPINet, designing an encoder with a dual-branch structure and a local–global interaction module (LGIM) to make full use of local and global context for feature refinement. Wang et al. [38] proposed FDNet, whose dual-branch structure enhances the model's ability to extract high-frequency and low-frequency information and reduces distortion in image compression.
Multi-scale convolution is another effective way to enhance feature extraction. For example, Kim et al. [39] developed a multi-scale convolutional neural network composed of parallel convolution paths with different kernel sizes, extracting features at multiple timescales for the fault diagnosis of rotating machines with good results. Wang et al. [40] proposed MSTA-YOLO, which combines multi-scale convolution with an attention mechanism to strengthen feature extraction for effective landslide detection. Xie et al. [41] proposed SDDGRNets for change detection in remote sensing images, using multi-scale convolution to improve the representation of salient features.
In summary, multi-scale convolution and multi-branch designs help the network extract more feature information and enhance its feature extraction performance, which guides our research.
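To make the parallel-path idea concrete, the following is a minimal sketch of a multi-scale convolution block of the kind cited above. The kernel sizes and channel widths are illustrative choices, not taken from any specific paper.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convolution paths with different kernel sizes; the outputs
    are concatenated and fused so each spatial position mixes several
    receptive-field scales. All sizes here are illustrative."""

    def __init__(self, in_ch=64, out_ch=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.paths = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(len(kernel_sizes) * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        # Each path sees a different receptive field; fusion mixes the scales.
        return self.fuse(torch.cat([p(x) for p in self.paths], dim=1))


block = MultiScaleBlock()
print(block(torch.randn(1, 64, 128, 128)).shape)  # torch.Size([1, 64, 128, 128])
```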
2.3. Occlusion Discrimination
Connectivity is one of the most important features of roads and is essential for autonomous driving, vehicle navigation, and path planning. The connectivity of roads significantly affects the final path selection and planning results. However, remote sensing images often contain many occlusions on roads, such as roadside trees, tall buildings, shadows, and parked vehicles, which can cover parts of the road surface, leading to road fragmentation in the prediction results. In response to this issue, recent studies have given it considerable attention and proposed various methods to address road occlusion.
The challenge of maintaining road connectivity in occluded scenarios has prompted various methodological innovations. Máttyus et al. [19] developed an approach combining encoder–decoder segmentation with shortest-path-based post-processing to infer missing connections in aerial imagery. Bastani et al. [8] introduced RoadTracer, an iterative graph-based method that progressively constructs road networks through CNN-guided node prediction. Alternative strategies have focused on architectural modifications, such as the topology-aware loss function proposed by Mosinska et al. [25] for U-Net architectures, which explicitly preserves network connectivity during segmentation refinement. Similarly, Batra et al. [9] demonstrated that the joint optimization of directional features and segmentation through multi-branch convolution can enhance topological accuracy, though such iterative methods often incur substantial computational overhead.
Recent advances have explored multi-task learning frameworks to address connectivity challenges. Zhang et al. [42] developed a dual-branch network for the simultaneous prediction of node confidence and connectivity graphs, while Liu et al. [43] employed hierarchical feature learning to jointly extract road surfaces, edges, and centerlines. The RoadCorrector framework by Li et al. [35] further advanced this paradigm through the specialized extraction and fusion of road surfaces, centerlines, and intersections. Complementary approaches have investigated data fusion strategies: Zhang et al. [27] and Xu et al. [44] demonstrated that integrating GPS trajectories can effectively compensate for occlusion-induced information loss in optical imagery, but this requires complex preprocessing of the trajectory data.
However, post-processing and hierarchical prediction methods increase the time and computational cost of the entire road extraction pipeline. Likewise, introducing additional data to supplement information usually entails data collection and complex preprocessing to adapt it to the semantic segmentation task, further raising the cost of the whole task.
4. Experiments
In this section, we conduct extensive comparative experiments to validate the effectiveness of the proposed model. The experimental details will be systematically presented, including dataset descriptions, evaluation metrics, implementation settings, and result analyses.
To ensure scientific rigor and fairness in the model comparison while enhancing experimental reproducibility, we integrate all baseline models into a unified evaluation framework. The subsequent sections will detail the experimental framework specifications and hardware configurations used.
4.1. Datasets
The models in this experiment were trained and evaluated on two datasets: DeepGlobe (DP) and CHN6-CUG (CHN6).
- (1)
DeepGlobe [20]: This dataset was released for the 2018 CVPR DeepGlobe Road Extraction Challenge. It consists of 8570 high-resolution remote sensing images from India, Thailand, and Indonesia, of which 6226 are annotated. The image size is 1024 × 1024 pixels, with a resolution of 0.5 m per pixel. The dataset covers a variety of scenarios, including but not limited to urban, rural, coastal, and tropical forest areas. For the experiments, we selected the 6226 labeled images and divided them into training and test sets at a ratio of 85% to 15%. During training, a data augmentation strategy based on random cropping was adopted: the 1024 × 1024 images were cropped into patches using a sliding window approach with a step size of 340 pixels (a sketch of this cropping scheme follows this list).
- (2)
CHN6 [22]: This dataset consists of high-resolution remote sensing images from six cities with varying urbanization levels, urban scales, development degrees, urban structures, and historical and cultural backgrounds: the Chaoyang District of Beijing, the Yangpu District of Shanghai, downtown Wuhan, the Nanshan District of Shenzhen, the Sha Tin District of Hong Kong, and Macao, China. The dataset contains a total of 4511 images, each 512 × 512 pixels with a resolution of 0.5 m per pixel. In the experiments, the same split as in the original paper was used, with 3608 images for training and 903 for testing. No data augmentation was applied to this dataset.
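The following is a minimal sketch of the sliding-window cropping described in item (1). The 340-pixel stride matches the step size reported above; the 512-pixel patch size is an assumption for illustration, and the deterministic sliding-window variant is shown.

```python
import torch

def _positions(size, patch, stride):
    """Window offsets along one axis, with the border always covered."""
    pos = list(range(0, size - patch + 1, stride))
    if pos and pos[-1] != size - patch:  # cover the right/bottom border
        pos.append(size - patch)
    return pos

def sliding_window_crops(image, patch=512, stride=340):
    """Cut a (C, H, W) tensor into overlapping patches.

    The 340-pixel stride matches the step size reported in the text;
    the 512-pixel patch size is an assumption for illustration."""
    _, h, w = image.shape
    return [image[:, t:t + patch, l:l + patch]
            for t in _positions(h, patch, stride)
            for l in _positions(w, patch, stride)]

# Example: one 1024x1024 DeepGlobe tile yields a 3x3 grid of patches.
print(len(sliding_window_crops(torch.randn(3, 1024, 1024))))  # 9
```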
4.2. Implementation Details
All the road extraction networks used in this experiment were implemented on the MMSegmentation platform, part of the OpenMMLab series jointly developed by SenseTime and the Chinese University of Hong Kong. This framework is built on the deep learning framework PyTorch [47] and integrates numerous open-source algorithms. The hardware comprises 4 × NVIDIA Tesla V100 GPUs, and the operating system is Ubuntu 22.04.3. For training, the AdamW optimizer is used, with an initial learning rate of 0.002 and a weight decay of 0.05. The loss function combines binary cross-entropy (BCE) loss and Dice loss. Training follows an iterative schedule with a total of 320,000 iterations, divided into a warm-up stage and a formal training stage: a linear learning rate is applied during the first 100 iterations for warm-up, followed by a polynomial learning rate for the remaining iterations. This combination accelerates model convergence. The final output of the model is determined by selecting the prediction with the highest confidence. The experimental parameters are summarized in Table 1. The loss function is calculated as follows:
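$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \mathcal{L}_{\mathrm{Dice}},$$

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\,\right], \qquad \mathcal{L}_{\mathrm{Dice}} = 1-\frac{2\sum_{i=1}^{N} y_i\hat{y}_i+\epsilon}{\sum_{i=1}^{N} y_i+\sum_{i=1}^{N}\hat{y}_i+\epsilon},$$

where $y_i$ and $\hat{y}_i$ denote the ground-truth label and predicted probability of pixel $i$, $N$ is the number of pixels, and $\epsilon$ is a smoothing constant (an unweighted sum of the two terms is assumed here).

For reference, the warm-up-then-polynomial schedule described above can be sketched in plain PyTorch; this is a simplified stand-in for the MMSegmentation scheduler configuration actually used, and the decay power of 0.9 is an assumed value.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.002, weight_decay=0.05)

WARMUP, TOTAL, POWER = 100, 320_000, 0.9  # POWER is an assumed value

def lr_factor(it):
    # Linear warm-up over the first 100 iterations, polynomial decay afterwards.
    if it < WARMUP:
        return (it + 1) / WARMUP
    return (1.0 - (it - WARMUP) / (TOTAL - WARMUP)) ** POWER

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```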
4.3. Evaluation Metrics
In this paper, we choose precision (P), recall (R), F1-score (F1), and Intersection over Union (IoU) as the evaluation metrics for model performance in semantic segmentation tasks. All four metrics can be calculated from the confusion matrix. Among them, F1 and IoU are the most comprehensive evaluation indicators. The calculation formulas of these four indicators are as follows:
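$$P = \frac{TP}{TP+FP}, \qquad R = \frac{TP}{TP+FN},$$

$$F1 = \frac{2\times P\times R}{P+R}, \qquad IoU = \frac{TP}{TP+FP+FN},$$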
where P, N, TP, FP, and FN represent the positive, negative, true positive, false positive, and false negative pixels in the prediction map, respectively.
4.4. Experiment 1: DeepGlobe Dataset
We conducted comparative experiments on the DeepGlobe dataset between the proposed EFMOD-Net and several state-of-the-art road segmentation models from recent years: U-Net [48], DeepLabv3 [49], LinkNet [17], D-LinkNet [50], MACU-Net [51], RCFS-Net [52], MSMDFF-Net [53], CARE-Net [54], and CMLFormer [55]. To ensure a fair comparison, all these models were implemented within the MMSegmentation framework, with unified configuration files for training and testing [56].
The quantitative evaluation presented in Table 2 clearly demonstrates the superior performance of the EFMOD-Net architecture on the DeepGlobe road extraction benchmark. The proposed model achieves state-of-the-art performance with an F1-score of 78.69% and an IoU of 64.73%, improvements of +1.24% and +1.66%, respectively, over the previous best-performing model, MSMDFF-Net. These two metrics reflect balanced precision–recall behavior and spatial overlap accuracy, respectively, and collectively validate the effectiveness of our architectural innovations.
This performance improvement can be attributed to the additional feature compensation branch in the encoder of the proposed network. This branch supplements feature information from the input to the corresponding feature extraction stages, enabling the network to capture more detailed road information during feature extraction. As a result, the proposed network can segment roads that are challenging for other models to predict. Furthermore, the connectivity discrimination module enhances the network’s focus on road regions by weighting the feature maps, thereby improving the modeling of road topology. This enhancement strengthens the connectivity of the segmented roads and reduces road fragmentation. Additionally, the input header designed to reduce information loss in the early stages is also one of the reasons for performance improvement.
Qualitative analysis: Figure 5 shows a partial visualization of the experimental results on the DeepGlobe dataset. The occlusion in this dataset mainly comes from vegetation, and such dense occlusion often causes models to fail to segment the road. As the visualization shows, the multi-scale feature extraction of the EFE and MFE modules enhances the model's ability to capture road linearity and its feature extraction performance, so the proposed model successfully segments roads with high precision. Even in areas where segmentation performance is generally poor, our model extracts the road to the greatest extent, producing more accurate and clearer results. For example, in the images in rows 2 and 4, dense vegetation covers almost the entire road, yet the proposed model still obtains the best segmentation results among the compared models. In addition, buildings with colors similar to the road often mislead models into identifying non-road areas as roads. In row 3, the color of the road is very similar to that of the buildings, and almost all the models segmented this image incorrectly; MSMDFF-Net, for instance, mistook the field-shaped building areas for roads. In contrast, our network produces the fewest erroneous segmentations among all the models, and it also achieves clearer and more accurate results in correctly segmented regions. These results show that the proposed method extracts more comprehensive road information, while the connectivity discrimination module enhances road connectivity and reduces road fragmentation, further improving segmentation quality.
4.5. Experiment 2: CHN6 Dataset
The CHN6 dataset presents substantially greater challenges for road extraction than conventional benchmarks, characterized by complex urban scenes, heterogeneous road typologies, and diverse occlusion patterns. As quantitatively demonstrated in Table 3, the proposed network achieves superior performance, with an IoU of 63.58% and an F1-score of 76.74%, improvements of +1.84% in IoU and +1.41% in F1 over the previous state-of-the-art model, MSMDFF-Net. The CHN6 dataset was collected from six representative cities in China, primarily covering urban scenes. The roads in these images exhibit diverse shapes and are subject to more complex occlusions than in the DeepGlobe dataset, particularly shadows cast by dense, tall buildings, which significantly increases the difficulty of road extraction. The experimental results demonstrate that the proposed model maintains strong performance even in such complex scenarios.
Qualitative analysis: Figure 6 shows a partial visualization of the experimental results on the CHN6 dataset. The visualizations show that urban roads are complex and changeable, with dense urban agglomerations in the images. High-rise buildings and their shadows block the roads, so the model cannot obtain road information from the occluded areas, which poses significant challenges for segmentation. Thanks to the MOD module, which weights the feature map to obtain the attended road regions and uses strip convolutions to relate each target pixel to its neighboring pixels, our method achieves optimal performance even in these more complex urban environments.
We also compare the parameters and computational speeds of the different models. From the data in Table 4, the parameter count of the proposed method is not the lowest: the models with the closest IoU scores have fewer parameters. Likewise, the computation speed of the proposed model is not the fastest, because, compared with models using standard convolutions, each strip convolution covers a smaller area, which slows computation; this smaller per-kernel area also gives the model lower FLOPs.
4.6. Ablation Experiment
The effectiveness of the proposed modules: In EFMOD-Net, we designed the MFE and EFE modules to enhance feature extraction in the encoder, aiming to acquire more useful information during the feature extraction phase and enable the network to learn more road features. The MOD module, designed for the decoder, focuses on the road areas of interest and learns the topological structure of roads by weighting feature maps and applying multi-directional strip convolutions, thereby reducing road fragmentation caused by occlusions. To validate the effectiveness of the proposed modules, we conducted experiments with different combinations of the three modules on the DeepGlobe and CHN6 datasets, using F1 and IoU, two of the most representative indicators, as evaluation metrics. The experimental results are presented in Table 5. When the MFE and EFE modules are removed, F1 drops by 1.4% and IoU by 1.97% on the DeepGlobe dataset, and F1 drops by 1.02% and IoU by 1.35% on the CHN6 dataset, reflecting how strongly these two modules contribute to the model's feature extraction ability.
In the MFE module, we employed strip convolutions and conducted experiments with varying kernel sizes to determine the optimal configuration. The evaluations were performed on the DeepGlobe dataset, with detailed results documented in Table 6.
To determine the optimal kernel size for the MOD module, we conducted ablation studies with different sizes. The experimental results in Table 7 show that the model's performance peaks when the kernel size is set to 9.
The MOD module uses an attention mechanism to make the model pay more attention to the road regions. Figure 7 shows a heatmap from the ablation experiment on the MOD module, plotted from the last encoder output in the network. It can be seen from the figure that after the attention mechanism weights the feature map, the weight of the road regions increases and the model attends to them more strongly.
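For reference, the core weighting-plus-strip-convolution step of the MOD module can be sketched as follows. This is a simplified illustration rather than the exact implementation: only two of the four directions are shown, the attention is reduced to a single sigmoid mask, and the channel width is a placeholder; the kernel length of 9 follows the ablation in Table 7.

```python
import torch
import torch.nn as nn

class MODSketch(nn.Module):
    """Simplified sketch of the MOD idea: an attention map re-weights the
    feature map toward road regions, then directional strip convolutions
    relate each pixel to its neighbors along different orientations.
    Sizes are placeholders, not the exact EFMOD-Net configuration."""

    def __init__(self, ch=64, k=9):
        super().__init__()
        pad = k // 2
        # Lightweight spatial attention: a 1-channel sigmoid mask.
        self.attn = nn.Sequential(nn.Conv2d(ch, 1, kernel_size=1), nn.Sigmoid())
        self.h = nn.Conv2d(ch, ch, (1, k), padding=(0, pad))
        self.v = nn.Conv2d(ch, ch, (k, 1), padding=(pad, 0))
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, x):
        x = x * self.attn(x)                  # emphasize likely road pixels
        out = torch.cat([self.h(x), self.v(x)], dim=1)
        return self.fuse(out) + x             # residual keeps original detail


mod = MODSketch()
print(mod(torch.randn(1, 64, 64, 64)).shape)  # torch.Size([1, 64, 64, 64])
```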
5. Discussion
For road segmentation in remote sensing images, the small proportion of roads in an image, the limited information they contain, and the interference of complex backgrounds pose great challenges for road extraction. Our method aims to extract more roads, and extract them more accurately, from high-resolution remote sensing images, so three different modules are proposed to improve road extraction performance.
The ablation results show that the enhanced feature extraction network composed of the EFE and MFE modules strengthens the model's feature extraction ability. In the MFE module, multi-scale strip convolutions learn the slender, linear features of the road while extracting multi-scale features. Meanwhile, the EFE module uses additional auxiliary branches to extract features from the input image and feed them, downsampled to the corresponding magnification, into the matching backbone stages, which reduces feature loss and supplements the features at each stage. The data in Table 5 show that replacing the MFE and EFE modules in the encoder with standard convolutions significantly reduces the final F1 and IoU scores.
For instance, removing both modules decreases F1 by 1.4% and IoU by 1.97% on the DeepGlobe dataset, and F1 by 1.02% and IoU by 1.35% on the CHN6 dataset. The encoder is crucial for feature extraction, and removing these modules severely impairs this capability, leading to insufficient learning of road features and the consequent omission of some road sections. The function of the MOD module is to enhance the network's ability to predict occluded road sections: it first directs the model's attention toward road regions by weighting the feature map and then learns the features of the occluded parts through strip convolution, thereby capturing the topological structure of the road. Removing the MOD module leaves the model unable to handle occlusions adequately, leading to road fragmentation; on the DeepGlobe dataset, this reduced the F1-score by 0.51% and the IoU by 0.69%, and on the CHN6 dataset by 0.58% and 0.65%, respectively.
Figure 7 presents a heatmap of the ablation experiment for the MOD module. In this experiment, we plotted a heatmap of the output from the last encoder in the network. As shown in the figure, the addition of the MOD module enables the model to focus more on learning road features. When the MOD module is removed, the attention weight assigned to road regions decreases, particularly in areas that are challenging to segment. For example, in the case of DP_2, the road in the upper-left corner of the building area is narrow and partially obscured by shadows from trees and buildings. The heatmap reveals that the MOD module helps the model pay more attention to this occluded road section. Additionally, as seen in CHN6_2, the introduction of the MOD module reduces the attention given to non-road regions, thereby lowering the likelihood of misclassification.
The experimental results in Table 6 show that the MFE performs best with a kernel size of 5. In high-resolution remote sensing images, the road itself is slender, and strip convolution captures this feature well, but only over relatively small capture areas: an oversized kernel captures a region containing too much background information. The ablation on the strip convolution kernel size of the MOD module shows that the module performs best with a kernel size of 9. In the MOD module, strip convolution learns the relationship between the center pixel of a cross-sectional area and the surrounding pixels to reduce road fragmentation; a larger kernel covers more pixels and introduces more background, which adversely affects the model's judgment.
The ablation results demonstrate the proposed modules’ efficacy in enhancing segmentation accuracy for high-resolution remote sensing imagery. Moreover, when all three modules are used, the network achieves its highest performance. This further demonstrates the overall effectiveness of the proposed network.
6. Conclusions and Future Work
This paper proposes EFMOD-Net, a new encoder–decoder network for road extraction from high-resolution remote sensing images. The proposed network aims to solve two key challenges in remote sensing road segmentation: insufficient feature extraction and severe road fragmentation.
To address these two challenges, we design three key modules in EFMOD-Net: MFE, EFE, and MOD. The MFE module uses strip convolutions in different directions and works with the dual-branch EFE module and an ASPP module to form an encoder that strengthens the network's feature extraction. The MOD module constitutes the decoder, focusing on learning the topology of the road and predicting occluded road sections by integrating multi-directional strip convolutions with attention mechanisms. Additionally, an output head is employed to further enhance the modeling of road spatial features. To verify the effectiveness of the proposed method, comparative experiments with multiple SOTA models were carried out on the DeepGlobe and CHN6 datasets. The results show that the proposed network achieves IoUs of 64.73 and 63.58 on the DeepGlobe and CHN6-CUG datasets, respectively, 1.66 and 1.84 points higher than those of the best-performing comparison method. Finally, extensive ablation studies on each proposed module confirmed their individual effectiveness and the rationality of the parameter settings within each module.
However, because the proposed model relies heavily on strip convolutions, whose computation is comparatively inefficient, its overall computational efficiency is limited, and its parameter count is higher than that of the baseline model. In future research, we will focus on reducing the number of parameters and improving the computational efficiency of the model.