1. Introduction
As an essential component of transportation, roadways exert a substantial impact on various aspects of contemporary life, encompassing urban and rural development, traffic management, and autonomous vehicle navigation [1]. In tandem with the evolution of satellite remote sensing technology, high-resolution satellite imagery has arisen as a crucial asset for digital image processing in the modern era [2]. As a result, the pursuit of high-precision road extraction from high-resolution satellite imagery has attracted considerable scholarly attention in recent years. However, complications from factors such as illumination, shadow occlusion [3], and noise yield divergent features among identical road targets, while spectral ambiguity resulting from the influence of neighboring materials (non-road targets displaying road-like properties) intensifies the difficulty of accurately and comprehensively extracting roads from high-resolution satellite imagery [4].
Recently, alongside the rapid advancement of deep learning, a growing number of researchers have applied deep learning methods to image classification and semantic segmentation [5]. Convolutional neural networks have demonstrated their efficacy in road extraction tasks due to their remarkable feature extraction capabilities [6]. In comparison to conventional approaches that require manually engineered shallow features, deep learning methods not only enhance the precision of road extraction but also remain robust in large-scale extraction efforts [7].
Mnih et al. [8] pioneered the application of deep learning to road extraction, proposing a method based on restricted Boltzmann machines for identifying roads in high-resolution satellite imagery. Subsequently, Long et al. [9] presented fully convolutional networks (FCNs), effectively addressing semantic segmentation of images. By replacing fully connected layers with convolutions, FCNs enabled the shift from image-level classification to pixel-level classification. Unfortunately, the repeated downsampling operations in FCNs discard fine-grained details on small feature maps, complicating the restoration of the original resolution and causing segmentation problems such as blurred boundaries, indistinct edges, and insufficient feature representation [10].
To counter these limitations, Ronneberger et al. introduced U-Net, integrating supplementary skip connections into an FCN to bolster the network’s ability to manage intricate details. U-Net pioneered the encoder–decoder architecture, in which the encoder reduces spatial dimensions while extracting increasingly abstract features, and the decoder reuses low-level features from the encoding stage to progressively restore the input resolution and recover spatial position information [11]. This approach retrieves a wealth of spatial detail and makes full use of road texture information. However, as the depth of U-Net increases, problems such as vanishing and exploding gradients may occur. Consequently, existing research focuses on enhancing network performance and stability by building on U-Net’s distinctive encoder–decoder architecture while refining the network model [12]. For example, Zhang et al. [13] combined the advantages of U-Net and ResNet to create ResUNet, promoting information propagation through rich skip connections while fully capitalizing on the stable gradient propagation of residual connections. Oktay et al. [14] incorporated an attention mechanism into U-Net, yielding the Attention U-Net model, which highlights segmentation targets by suppressing feature responses in irrelevant background areas, thereby improving segmentation accuracy and robustness. Furthermore, Zhou et al. [15] developed the D-LinkNet model, which integrates dilated convolution modules into the LinkNet architecture to expand the receptive field and enhance the recognition of large-scale objects by scaling and cropping the input image to various dimensions [16]. Nonetheless, the capacity to discern multi-scale objects in the image remains insufficient, potentially leading to lost road details, false positives, and false negatives, which degrade the road segmentation outcome. To address these constraints, several empirical studies [2,17,18,19,20,21] have demonstrated that road extraction algorithms can be improved through the fusion of multi-scale spatial features.
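The skip-connection idea underlying U-Net can be illustrated with a minimal numpy sketch, in which 2×2 max pooling and nearest-neighbour upsampling stand in for the learned convolutional stages (the function names here are illustrative, not from any reference implementation):

```python
import numpy as np

def max_pool2x2(x):
    """Halve the spatial resolution of an (H, W, C) map by 2x2 max pooling."""
    h, w, c = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2x(x):
    """Double the spatial resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def skip_fuse(encoder_feat, decoder_feat):
    """U-Net-style skip connection: concatenate the encoder feature map
    with the upsampled decoder feature map along the channel axis."""
    return np.concatenate([encoder_feat, upsample2x(decoder_feat)], axis=-1)

# Toy forward pass: encode (downsample), then decode with a skip connection.
x = np.random.rand(8, 8, 4)    # input feature map (H, W, C)
low = max_pool2x2(x)           # encoder output at half resolution
fused = skip_fuse(x, low)      # decoder reuses the full-resolution features
print(fused.shape)             # (8, 8, 8): detail + context channels
```

The concatenation is what lets the decoder recover spatial detail that the pooled branch has already discarded.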
For fixed-scale input images, the information acquired by feature extraction operators and classifiers remains constant. Insufficient information can result in misclassification, while excessive information can impede target identification. Consequently, Chen et al. [22] proposed the DeepLab series of networks, which replace some downsampling operations with dilated (atrous) convolutions and combine them with atrous spatial pyramid pooling modules to enlarge the receptive field and obtain multi-scale features, thus enhancing the model’s multi-scale prediction capabilities. However, feature extraction based on dilated convolutions can easily lose information about small-scale targets, as the convolution kernels sample features only at sparse locations. Wang et al. [23] introduced HRNet, which employs a multi-branch structure to maintain high- and low-resolution features simultaneously and repeatedly performs multi-scale fusion to generate rich high-resolution representations. Nevertheless, in occluded images, some resolutions may suffer information deficits because parts of the scene are obscured, ultimately degrading the effectiveness of the resulting feature maps.
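The effect of dilation can be sketched directly: a single-channel "valid" convolution whose kernel taps are spread apart by a dilation rate, so the receptive field grows without extra parameters or downsampling (a minimal numpy illustration, not the DeepLab implementation):

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """'Valid' 2D convolution of a single-channel map with a dilated kernel.
    A 3x3 kernel at rate r covers a (2r+1) x (2r+1) window, enlarging the
    receptive field while sampling only 9 sparse locations."""
    kh, kw = kernel.shape
    span_h, span_w = (kh - 1) * rate + 1, (kw - 1) * rate + 1
    h, w = x.shape
    out = np.zeros((h - span_h + 1, w - span_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + span_h:rate, j:j + span_w:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
k = np.ones((3, 3))
print(dilated_conv2d(x, k, rate=1).shape)  # (5, 5): ordinary 3x3 convolution
print(dilated_conv2d(x, k, rate=2).shape)  # (3, 3): same kernel, 5x5 receptive field
```

The sparse sampling visible in `patch` is exactly why small targets can slip between the kernel taps at large rates.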
To effectively address the aforementioned challenges and further enhance road extraction performance, this paper presents a deep learning network model, HRU-Net, specifically tailored for road features in remote sensing imagery, drawing inspiration from U-Net and HRNet. The HRU-Net model inherits and incorporates the encoder–decoder architecture of the U-Net, establishing an information propagation pathway for replicating low-level features to their corresponding high-level representations, thus enriching high-level semantic features with the finer details of low-level features. The model employs a parallel structure within the sub-network to simultaneously preserve high- and low-resolution semantic information. Concurrently, multiple UMR and MPF modules are designed between the encoding and decoding components to optimally exploit multi-scale information. Both modules progressively combine feature maps of different resolutions, integrating the global contextual information of low-resolution feature maps with the robust detail information of high-resolution feature maps to generate high-resolution feature map representations. Furthermore, the different resolution feature maps acquired after fusion by the two modules undergo frequent information exchange via a parallel structure, enhancing the utilization of advantageous information, such as regions and boundaries, and mitigating the impact of extraneous information, such as shadow occlusion. This process yields a high-resolution feature map through consistent information interaction, ensuring the final prediction results closely approximate pixel-level accuracy and achieve a more precise local feature discrimination capacity.
The proposed HRU-Net offers the following contributions: the design of the UMR and MPF modules within the network. The UMR module is a multi-scale fusion module with upsampling functionality, merging features of varying resolutions after upsampling to generate higher-resolution features for input into the subsequent sub-network. In the MPF module, features of identical resolution are combined and decoded synchronously via upsampling. Meanwhile, the overall network employs a parallel structure to keep low- and high-resolution feature maps side by side, continuously executing multi-scale fusion operations to acquire multi-scale information and perpetually exchanging information between different resolutions. Through this constant exchange, the model integrates multi-scale semantic information more effectively, accounting for the semantics of both high- and low-level features and thereby augmenting the expressive capacity of the network. In comparison to existing models such as U-Net, ResNet, DeepLabV3, ResUNet, and HRNet, the HRU-Net model’s unique parallel connection structure and continuous multi-scale feature integration and information exchange not only preserve detailed information but also capture global features, achieving more accurate road recognition.
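As a rough illustration of the fusion idea behind the UMR module — the paper’s actual module uses learned convolutions, so the sketch below is hypothetical and its names (`umr_fuse`, `upsample_to`) are illustrative only — low-resolution context can be upsampled and concatenated with high-resolution detail:

```python
import numpy as np

def upsample_to(x, target_hw):
    """Nearest-neighbour upsampling of an (H, W, C) map to a target size."""
    th, tw = target_hw
    rows = np.arange(th) * x.shape[0] // th
    cols = np.arange(tw) * x.shape[1] // tw
    return x[rows][:, cols]

def umr_fuse(high_res, low_res):
    """Hypothetical UMR-style fusion: bring the low-resolution map (global
    context) up to the high-resolution grid and concatenate it with the
    high-resolution map (fine detail) along the channel axis."""
    up = upsample_to(low_res, high_res.shape[:2])
    return np.concatenate([high_res, up], axis=-1)

high = np.random.rand(16, 16, 8)   # detail branch
low = np.random.rand(4, 4, 32)     # context branch
fused = umr_fuse(high, low)
print(fused.shape)                 # (16, 16, 40)
```

Repeating this exchange across parallel branches, as the text describes, is what keeps both resolutions informed of each other throughout the network.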
The content arrangement of this paper in subsequent sections is as follows:
Section 2 provides detailed information regarding the road detection network proposed in this paper;
Section 3 presents the experimental results, encompassing data introduction, experimental settings, and result analysis;
Section 4 constitutes the discussion segment;
Section 5 offers concluding remarks.
4. Discussion
In this investigation, we assessed the efficacy of the HRU-Net model for road extraction from high-resolution remote sensing imagery. The experimental findings support our hypothesis that the HRU-Net model can proficiently delineate road information from these types of images. Evaluated on the metrics of precision, recall, and intersection over union (IoU), the HRU-Net model outperformed other state-of-the-art road extraction methods for remote sensing imagery, such as U-Net, ResNet, DeepLabV3, ResUNet, and HRNet [36,37,38]. These comparative findings position the HRU-Net model as a promising candidate for road extraction from high-resolution remote sensing images. The results also bear significant practical implications: the ability to precisely delineate road information from high-resolution imagery can benefit a range of domains including urban planning, traffic management, and disaster response [39]. The capabilities of the HRU-Net model could therefore advance these domains by providing higher precision and detail in road information.
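For reference, the three reported metrics can be computed from binary road masks as follows (a minimal numpy sketch, not the evaluation code used in the paper):

```python
import numpy as np

def road_metrics(pred, truth):
    """Pixel-wise precision, recall, and IoU for binary road masks
    (1 = road, 0 = background)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)    # road pixels correctly predicted
    fp = np.sum(pred & ~truth)   # background predicted as road
    fn = np.sum(~pred & truth)   # road pixels missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou = tp / (tp + fp + fn)
    return precision, recall, iou

pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
print(road_metrics(pred, truth))  # tp=2, fp=1, fn=1: precision=recall=2/3, IoU=0.5
```

IoU is the strictest of the three, since it penalizes both false positives and false negatives in a single ratio.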
Notwithstanding, our study was constrained by certain limitations. The precision of the HRU-Net model was compromised in certain intricate scenarios: when the MPF module was omitted, the results exhibited blurred edges, breaks, or omissions, and when the UMR module was excluded, road continuity was disrupted, leading to interruptions or incomplete road segmentation. These constraints imply that while the HRU-Net model demonstrates promise, it requires further refinement to enhance its performance in intricate scenarios.
Interestingly, our observations indicate that the number of network modules had a consequential effect on the model’s performance. Incorporating the UMR and MPF modules into the HRU-Net model was instrumental in boosting road extraction performance: their design principles and functionalities acted in unison, augmenting the model’s perceptual capabilities, feature expression competency, and precision. Nevertheless, increasing the number of modules also increased computational complexity, which could impact the model’s efficiency.
For future investigations, we advocate a deeper exploration of the UMR and MPF modules. Our findings indicate that these modules contribute significantly to the performance of the road extraction task. Subsequent research could focus on optimizing these modules to further improve the model’s performance in intricate scenarios, while maintaining the balance between performance gains and computational efficiency.
To summarize, our study provides substantial evidence that the HRU-Net model is a potent tool for road extraction from high-resolution remote sensing images. Despite certain limitations, the model demonstrated superior performance in comparison to other advanced methodologies in our experiments. The UMR and MPF modules, in particular, were critical in augmenting the model’s performance. These findings not only contribute to the discipline of remote sensing image analysis but also lay the groundwork for future investigations.