Article

Improved Road Extraction Models through Semi-Supervised Learning with ACCT

1 School of Earth and Space Sciences, Peking University, Beijing 100871, China
2 College of Urban and Environmental Sciences, Peking University, Beijing 100871, China
3 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
4 School of Information Management, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2024, 13(10), 347; https://doi.org/10.3390/ijgi13100347
Submission received: 23 August 2024 / Revised: 18 September 2024 / Accepted: 27 September 2024 / Published: 29 September 2024
(This article belongs to the Special Issue Advances in AI-Driven Geospatial Analysis and Data Generation)

Abstract

Improving the performance and reducing the training cost of road extraction models when labeled samples are scarce is important for updating road maps. Despite the success of recent road extraction models on standard datasets, they often perform poorly when applied to new datasets or real-world scenarios where labeled samples are not available. In this paper, our focus diverges from the typical quest to pinpoint the optimal road extraction model or to evaluate generalization across models. Instead, we propose a method called Asymmetric Consistent Co-Training (ACCT) to train existing road extraction models faster and make them perform better in new scenarios lacking samples. ACCT uses two models with different structures and a supervision module to enhance accuracy through mutual learning. Labeled and unlabeled images are processed by both models to generate road maps from different perspectives. The supervision module ensures consistency between predictions by computing losses based on labeling status. ACCT iteratively adjusts parameters using unlabeled data, improving generalization. Empirical evaluations show that ACCT improves IoU by 2.79% to 10.26% using only 1/8 of the labeled data compared to fully supervised methods. It also reduces parameters by over 49% compared to state-of-the-art semi-supervised methods while maintaining similar accuracy. These results highlight the potential of leveraging large amounts of unlabeled data to enhance road extraction models as data acquisition technology advances.

1. Introduction

Road extraction from remote sensing imagery holds significant importance across various domains, including urban planning [1], land use [2], and traffic navigation [3]. With advancements in Earth observation technology, high spatial resolution remote sensing imagery has developed rapidly, ushering in the era of Remote Sensing (RS) big data [4]. This abundance of data provides a wealth of learnable samples for road extraction. However, road extraction remains challenging due to the complexity and variability of road scenarios [5,6,7]. Furthermore, extraction rules learned from a limited number of scenarios do not generalize well.
The rise of deep learning has significantly accelerated road extraction from RS imagery [8]. Compared to traditional methods such as visual interpretation [9], deep learning-based models typically perform better [10]. However, the effectiveness of deep learning relies on accurately labeled data [11], which is difficult to obtain because of complex road features. In road extraction tasks, labels typically assign a class to each pixel in an image, distinguishing road pixels from non-road pixels. As seen in Figure 1, even conventional datasets may lack complete and accurate labeling for all roads in an image.
Additionally, it is essential to streamline training, reduce costs, and ensure effective results [8]. First, the substantial variability in model accuracy across different datasets necessitates transfer learning for cost-effective sample labeling and model retraining in new scenarios. Second, given that the daily volume of unlabeled remote sensing imagery in China vastly exceeds the size of available road datasets, efficient and cost-effective data utilization strategies are needed to optimize model performance.
To address these challenges, we propose a novel semi-supervised learning method for road extraction called Asymmetric Consistent Co-Training (ACCT), a framework that simultaneously enhances training accuracy, minimizes sample labeling requirements, and reduces training costs, while preserving the integrity of the original model architecture. ACCT consists of two road extraction models with distinct structures, along with a supervision module. First, the data are randomly shuffled and grouped according to the proportion of labeled data and fed into the two models batch by batch. Each model then independently generates pixel-wise road extraction results from the input data. The supervision module applies different forms of supervision to these results and updates the parameters of both models simultaneously, depending on whether the input image is labeled. This setup allows the models to share learned features. ACCT involves all unlabeled images throughout feature mining and loss calculation, ensuring that the potential information within unlabeled images is fully utilized. Additionally, ACCT enables models with different structures to share feature learning perspectives, rather than relying on a single perspective with models of the same structure but varying parameters, as in previous Cross Pseudo Supervision frameworks [12,13]. As a result, our method optimally utilizes unlabeled images to enhance the road extraction capability of the model.
This paper presents several significant contributions:
1. A novel semi-supervised learning method, ACCT, is proposed, enabling the training of road extraction models using a combination of labeled and unlabeled images to improve the performance. The method is compatible with most contemporary deep learning models.
2. To address the underutilization of unlabeled images, particularly those with complex content, we introduce a cross-supervision module and an efficient mechanism for leveraging pseudo-labels. This approach eliminates the need for intricate pseudo-label filtering and dataset reconstruction, allowing for full exploitation of latent information within unlabeled data.
3. Empirical validation of the proposed methodology through extensive experimentation across five authentic datasets demonstrates its effectiveness and efficiency. Comprehensive comparisons with several state-of-the-art supervised and semi-supervised learning methods show that our method achieves higher training accuracy at less than half the training cost required by existing methods.

2. Related Work

To address the challenge of inadequate road labels in vast remote sensing image datasets, researchers have proposed unsupervised, weakly supervised, and semi-supervised methods to improve road extraction in remote sensing datasets [14]. Unsupervised methods, such as clustering and generative algorithms, do not rely on precise labels but achieve lower accuracy [15,16,17]. Weakly supervised methods use scribble annotations but still require manual effort and suffer from data incompleteness [18,19]. In contrast, semi-supervised methods are well suited to road extraction tasks, as they require fewer precise labels or scribble annotations and yield higher accuracy [20]. In road extraction tasks, the labels used for semi-supervised learning usually comprise pixel-level manual annotations and pixel-level pseudo-labels generated by the model during training.
Semi-supervised learning methods aim to improve model performance by utilizing unlabeled data when labeled samples are scarce. While extensively used in natural image segmentation, their application in road extraction tasks is limited. This is because road extraction from remote sensing imagery is inherently complex, posing challenges even with precisely labeled samples [21,22]. Recent efforts, like the method using Generative Adversarial Network (GAN), have limitations such as fixed model structure and complex training processes, hindering their adaptability to various road extraction scenarios [23].
In this paper, we synthesize insights from multiple reviews on semi-supervised learning, categorizing methods into self-training, co-training, and consistency learning. We then assess the applicability of these methods to road extraction tasks and analyze the challenges associated with controlling training costs, particularly in scenarios involving substantial quantities of unlabeled data.

2.1. Self-Training and Co-Training

Self-training, introduced by Yarowsky [24], is a fundamental pseudo-labeling technique. It follows a four-step iterative process: training a supervised classifier on labeled data, predicting labels for unlabeled data and sorting them by confidence, integrating high-confidence predictions as pseudo-labels into a new dataset, and retraining iteratively until convergence. Widely used in image segmentation, it has driven significant advancements, such as the Naive-Student model by Chen et al. [25], which achieved state-of-the-art results on the Cityscapes dataset [26], and the model proposed by Ibrahim et al. [27] with a small fully supervised ensemble, which reduces labeling effort without sacrificing performance.
Co-training, introduced by Blum and Mitchell [28], extends self-training to multiple classifiers. This iterative process involves four similar steps; the difference is that in each iteration, two or more supervised classifiers are trained simultaneously on labeled data, and their most confident predictions are added to the labeled datasets of the other classifiers [29]. Co-training can be categorized as either multi-view or single-view, depending on whether the classifiers observe features from different perspectives or from the same perspective [30].
The sheer volume of daily remote sensing images creates a notable gap between labeled and unlabeled data. Initially, training on a small labeled set is manageable, but handling the abundance of unlabeled data becomes challenging. As unlabeled images increase, stabilizing the dataset with pseudo-labels becomes more time-consuming due to iterative refinement. This extended iteration time is a result of classifiers adjusting predictions based on new data. Therefore, larger pools of unlabeled data require more extensive iterations for dataset stability.
Furthermore, the rigid and simplistic rules in self-training and co-training compel the model to miss learning opportunities from complex samples. Unfortunately, these complex samples are common in remote sensing imagery, and the model’s ability to correctly handle them is crucial for achieving high performance.
Thus, considering the training costs, self-training and co-training may not be the most efficient for utilizing large amounts of unlabeled remote sensing images to enhance model performance.

2.2. Consistency Learning

Consistency learning is extensively researched in semi-supervised semantic segmentation to ensure stable model output under input perturbations [31]. Adding noise to input images is a common approach. For example, Laine and Aila [32] proposed the Π-model and Temporal Ensembling, applying stochastic augmentation before prediction. FixMatch [3] and PseudoSeg [33] differentiate image augmentation levels across consistency learning branches, supervising predictions on strongly augmented images with pseudo-labels from weakly augmented images.
In addition, effectively managing pseudo-labels is vital for improving consistency learning, alongside handling perturbations. However, these aspects are often discussed separately. For instance, Mean Teacher prioritizes refining model parameters over integrating image augmentation components [34]. Similarly, Cross Pseudo Supervision (CPS) first establishes a semi-supervised framework and only briefly explores integrating existing image augmentation methods to improve training accuracy [17].
Accordingly, as shown in Figure 2, mainstream consistency learning networks can be divided into three parts: image augmentation, model prediction, and loss supervision. Image augmentation preprocesses samples by adding noise perturbations, model prediction generates road extraction results from the models, and loss supervision calculates the loss for backpropagation.
In current consistency learning research, a notable issue is the fixed design of the model prediction component. There is a prevailing belief that its role is solely to generate predictions and to introduce perturbations through initialization parameters. However, this narrow view hampers progress. Our experiments show that a thoughtful design of the prediction component, such as the structure proposed in ACCT, greatly enhances training performance.

3. Methodology of ACCT Networks for Road Extraction

In this section, we outline the architecture of the proposed ACCT, followed by explanations of the model branches and the supervision module. Next, we define the input and loss function mechanisms. Lastly, we analyze the effectiveness of the ACCT structure.

3.1. Framework of ACCT

As the name implies, Asymmetric Consistent Co-Training (ACCT) integrates the principles of consistency learning with co-training. This approach enhances the foundational structure of consistency learning by incorporating the multi-perspective advantages of co-training. In ACCT, samples are examined and learned from various viewpoints, leading to an asymmetric design in the consistency learning framework.
The framework of ACCT is illustrated in Figure 3. Following the structural dismantling of the consistency learning framework in Section 2.2, ACCT focuses on the detailed design of the model prediction and loss supervision components. While we have not yet explored the optimal image augmentation method for ACCT, it is a direction we plan to investigate in future studies.
ACCT consists of three main elements: a high-precision principal model $F_p$, a low-precision vice model $F_v$, and a cross pseudo supervision module.
For ACCT, the training image set is divided into two parts: a labeled satellite image dataset $D_l$ with $N_l$ samples and an unlabeled satellite image dataset $D_u$ with $N_u$ samples. The semi-supervised road extraction task aims to train a road extraction network by leveraging both the labeled and unlabeled images.
First, we calculate the data ratio $r$ between the volume of the labeled dataset $D_l$ and the unlabeled dataset $D_u$:
$r = N_l : N_u$ (1)
Next, we randomly shuffle the data within both the labeled dataset $D_l$ and the unlabeled dataset $D_u$. We then combine the shuffled data into a number of sample batches, ensuring each batch has the same size and contains labeled and unlabeled data in proportion to the ratio $r$.
Each sample batch is fed simultaneously into both the principal model $F_p$ and the vice model $F_v$:
$P_p = F_p(X; \theta_p),$ (2)
$P_v = F_v(X; \theta_v).$ (3)
The principal model $F_p$ and the vice model $F_v$ have distinct structures and different initial weights $\theta_p$ and $\theta_v$. $X$ represents the input satellite images, while $P_p$ ($P_v$) denotes the segmentation confidence map output by the respective network.
The predictions from both models are used to compute the loss within the supervision module. This loss updates the parameters of both models simultaneously through backpropagation. The supervision module adjusts its loss computation according to whether the input data are labeled and to the ratio $r$ of labeled to unlabeled data within each sample batch. The specific loss computation methods are described in detail in Section 3.4.
ACCT incorporates the benefits of multi-view learning from Co-training and makes adaptive enhancements based on the two-branch training framework of consistency learning. Hence, we enable two models with entirely different structures to observe training samples from diverse perspectives. We employ a cross pseudo supervision module to encourage the two models to share their learned features and generate predictions that closely resemble each other. As shown in the pseudo code below (Algorithm 1), ACCT’s training methodology is succinct. In the experimental section, we demonstrate the effectiveness of this structural design. Additionally, in Section 3.5, we elucidate why models with different structures can share learned features within the framework of ACCT.
Algorithm 1: Pseudo code for the ACCT training process in semantic segmentation
Input: Unlabeled image set $D_u = \{x_i\}_{i=1}^{N_u}$, labeled image set $D_l = \{(x_i, y_i)\}_{i=1}^{N_l}$
Output: Final best model $F_p$
Randomly combine $D_u$ and $D_l$ into $A$ batches, each with the same ratio of labeled to unlabeled samples.
until $L$ converges do
  for each batch $\{(x_k, y_k)\}_{k=1}^{A} \subset (D_u \cup D_l)$ do
    $B = (N_u + N_l) / A$
    $P_p = F_p(x_k)$
    $P_v = F_v(x_k)$
    for $k \in \{1, \dots, A\}$ do
      for $j \in \{1, \dots, B\}$ do
        if $x_k \in D_u$ then
          calculate the loss $L_c$ with Equation (5)
        else
          calculate the loss $L_s$ with Equation (4)
    Combine $L_c$ and $L_s$ and calculate $L$ with Equation (6)
  Update $F_p$ and $F_v$ to minimize $L$ over $\{(x_k, y_k)\}_{k=1}^{A}$
return $F_p$
To clearly delineate our contributions and emphasize the innovations of our work, we offer the following comparison. In contrast to traditional consistency learning, ACCT enables the concurrent training of two models with significantly different architectures on separate branches. Unlike co-training, which involves iterative updates to the dataset, ACCT adopts a streamlined approach in which each training iteration comprises a single pass of inputs and outputs. Additionally, unlike knowledge distillation, which focuses on transferring the predictive power of a larger model to a smaller one, ACCT emphasizes mutually assisted learning: both models collaborate to improve prediction accuracy, as evidenced by our experimental results.
The specifics of the components of ACCT and their training procedures are delineated in the subsequent subsections.

3.2. The Components of the Principal Model and the Vice Model

While ACCT facilitates the simultaneous training of two models with distinct structures, in practical applications we typically use only one of the trained models to extract road network distributions from remote sensing images. Consequently, during training, the models on the two branches are designated as the principal model and the vice model: the model intended for practical application is the principal model, while the other serves as the vice model.
If the objective of training is solely to train a high-precision target model, then once the loss curve of the principal model has converged during the training process, training can be halted, even if the loss curve of the vice model has not yet converged. However, if the aim is to train both models simultaneously, training can be continued until the loss curves of both models have converged before stopping.
For the road extraction task, large models typically exhibit higher extraction accuracy compared to small models. This discrepancy arises because small models aim to minimize the model’s parameter count while maintaining an acceptable extraction accuracy, making them suitable for deployment on other devices or systems. Consequently, in practical ACCT application scenarios, a combination of a principal model with a large number of parameters and a vice model with a small number of parameters is common. Nonetheless, both scenarios—using a large model as the principal model and using a small model as the principal model—are tested in Section 4.

3.3. The Sample Batch and Model Inputs

In real road extraction tasks, the ratio $r$ of labeled to unlabeled data is often irregular. While this irregularity theoretically does not significantly affect our algorithm, the size of each training sample batch must be limited due to GPU memory constraints.
We begin by adjusting the sample ratio $r$ so that its denominator (and hence the batch size) is a power of 2, either by removing some unlabeled data or by adding a small amount of labeled data; ratios such as 1/4:3/4, 1/8:7/8, and 3/16:13/16 are used. The samples are then restructured into batches according to the processed dataset's new ratio $r$ of labeled to unlabeled data. For example, if $r$ is 1/4:3/4, the sample batch size is 4, with each batch comprising one labeled sample and three unlabeled samples.
The sorted batches of samples are sequentially input into the two models in ACCT. Throughout this process, unlabeled and labeled data are treated equally; the models do not expend less effort on unlabeled data simply because they lack true labels. The outputs of both models are fed into the supervision module for loss calculation based on the ratio $r$, as depicted in Figure 4.
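As a minimal sketch of this batching scheme (assuming a ratio $r$ of 1/4:3/4; the helper name and data representation are ours, not the authors' released code), the mixed batches could be built as follows:

```python
import random

def build_mixed_batches(labeled, unlabeled, n_labeled=1, n_unlabeled=3, seed=0):
    """Shuffle both pools and emit fixed-size batches that preserve the
    labeled:unlabeled ratio r (here 1:3, i.e. a batch size of 4)."""
    rng = random.Random(seed)
    labeled, unlabeled = list(labeled), list(unlabeled)
    rng.shuffle(labeled)
    rng.shuffle(unlabeled)
    n_batches = min(len(labeled) // n_labeled, len(unlabeled) // n_unlabeled)
    batches = []
    for b in range(n_batches):
        batch = (labeled[b * n_labeled:(b + 1) * n_labeled] +
                 unlabeled[b * n_unlabeled:(b + 1) * n_unlabeled])
        batches.append(batch)  # each element is e.g. (image, label) or (image, None)
    return batches
```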

3.4. The Loss Function for ACCT

The sample batches are input into the principal model $F_p$ and the vice model $F_v$, producing the predictions $P_p$ and $P_v$, respectively. The supervision module then performs the loss calculation. Depending on whether the input sample $X$ in the batch has a real segmentation map $Y$, the supervision module employs different loss calculation methods.
As depicted in Figure 3, $Y_p$ is utilized to supervise $P_v$, and $Y_v$ is used to supervise $P_p$, where $Y_p$ and $Y_v$ represent the label maps. If the input $X$ belongs to the set $D_l$ and has a real segmentation map $Y$, then $Y$ serves as both $Y_p$ and $Y_v$. However, if the input $X$ belongs to the set $D_u$ and lacks a real segmentation map, then $P_p$ and $P_v$ serve as $Y_p$ and $Y_v$, respectively.
The training objective comprises two losses: the supervision loss $L_s$ and the cross pseudo supervision loss $L_c$. The supervision loss $L_s$ is formulated using the standard pixel-wise cross-entropy loss on the labeled images across the two parallel segmentation networks:
$L_s = \frac{1}{|D_l|} \sum_{X \in D_l} \frac{1}{W \times H} \sum_{i=0}^{W \times H} \left( l_{ce}(p_p^i, y^i) + l_{ce}(p_v^i, y^i) \right)$ (4)
Here, $l_{ce}$ represents the cross-entropy function, $p_p^i$ ($p_v^i$) denotes the segmentation confidence map of $F_p$ ($F_v$), $y^i$ is the ground truth, and $W$ and $H$ represent the width and height of the input image.
The cross pseudo supervision loss is bidirectional: the pseudo label map derived from the principal model $F_p$'s prediction $P_p$ supervises the confidence map $P_v$ of the vice model $F_v$, and conversely the pseudo label map derived from $F_v$'s prediction $P_v$ supervises the confidence map $P_p$ of $F_p$. The cross pseudo supervision loss on the unlabeled data is expressed as:
$L_c = \frac{1}{|D_u|} \sum_{X \in D_u} \frac{1}{W \times H} \sum_{i=0}^{W \times H} \left( l_{ce}(p_p^i, y_v^i) + l_{ce}(p_v^i, y_p^i) \right)$ (5)
where $y_p^i$ and $y_v^i$ denote the pixel-wise pseudo labels derived from $P_p$ and $P_v$, respectively.
The total loss $L$ of the training is defined as:
$L = r L_s + \lambda (1 - r) L_c$ (6)
where $\lambda$ is the trade-off weight. In principle, $\lambda$ can take any value; however, for semi-supervised training it should not be set too high, because the confidence of pseudo labels is far inferior to that of precise manual labels. Therefore, the loss $L_c$ computed from pseudo labels should not carry a higher weight in backpropagation than the loss $L_s$ computed from precise manual labels. The appropriate value of $\lambda$ is examined in Section 4; the results indicate that a value between 0.1 and 0.2 is most suitable.
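To make the supervision module concrete, the following PyTorch-style sketch shows how Equations (4)-(6) could be computed for one mixed batch. The function name, the use of hard (argmax) pseudo-labels, and the treatment of $r$ as the labeled fraction of the batch are our assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def acct_losses(logits_p, logits_v, labels, is_labeled, r, lam=0.1):
    """Compute the ACCT training loss for one mixed batch.
    logits_p, logits_v: (B, C, H, W) outputs of the principal and vice models.
    labels: (B, H, W) ground truth, valid only where is_labeled is True.
    is_labeled: (B,) boolean mask marking labeled samples.
    r: labeled fraction of the batch; lam: trade-off weight lambda."""
    # Supervision loss L_s: cross-entropy of both models against the true labels.
    L_s = (F.cross_entropy(logits_p[is_labeled], labels[is_labeled]) +
           F.cross_entropy(logits_v[is_labeled], labels[is_labeled]))
    # Cross pseudo supervision loss L_c: each model's hard prediction
    # (detached) supervises the other model on the unlabeled images.
    pseudo_p = logits_p[~is_labeled].argmax(dim=1).detach()
    pseudo_v = logits_v[~is_labeled].argmax(dim=1).detach()
    L_c = (F.cross_entropy(logits_p[~is_labeled], pseudo_v) +
           F.cross_entropy(logits_v[~is_labeled], pseudo_p))
    # Total loss L = r * L_s + lambda * (1 - r) * L_c.
    return r * L_s + lam * (1 - r) * L_c
```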

3.5. Analysis of the Effectiveness of ACCT Structure

In contrast to traditional consistency learning architectures, ACCT enhances both model prediction and loss supervision. Specifically, we introduce new semi-supervised structures that enable models of various architectures to share features learned from diverse perspectives during training, rather than relying solely on a single viewpoint.
Figure 5 illustrates the predictions of the model trained on the CHN6-CUG dataset using the conventional method; consider the positions marked with circles and boxes. As shown in Figure 5c, although there are fewer noise points, road continuity is compromised compared with Figure 5d. This observation underscores that, although the overall accuracy of different models may vary, each model possesses unique strengths in local accuracy at specific locations. Consequently, our approach emphasizes the integration of diverse features learned by different models to mutually enhance performance. In traditional consistency learning, both branches use the same structure, which often results in a narrow focus on the road features of interest due to the absence of external stimuli. This limited perspective can negatively impact the model's overall performance. The reduced training cost can be attributed to our designs in both the model prediction and loss supervision components. Unlike self-training and co-training methods, the cross-pseudo-supervised design in ACCT eliminates the need to compare and screen pseudo-labels. Additionally, employing a small model with a negligible number of parameters to assist in training a larger model further reduces training costs by nearly half.
The validity of the ACCT structure will be verified in Section 4 and Section 5, and further analyzed in Section 5.5.

4. Experiment

4.1. Dataset Description

To comprehensively demonstrate the effectiveness of the proposed semi-supervised road extraction method, we conduct extensive experiments on five separate datasets. Details for all these datasets are provided in Table 1.
The CHN6-CUG Dataset contains 4511 aerial images from six Chinese cities, with detailed pixel-wise annotations [37]. Since the CHN6-CUG dataset contains a large number of images and labels without any road information, repeated training on such data not only contributes little to improving model capability but also consumes considerable computational resources. Therefore, we filtered the dataset before the experiment and retained 2048 images containing road information for training. The whole dataset was divided into two parts: 1600 images were used as the training set, and the remaining 448 images were used for testing.
The Massachusetts Roads Dataset, widely used in road extraction research [38], consists of 1108 training images and 49 testing images. These images are 1500 × 1500 pixels in size with a resolution of 1.2 m/pixel. To align with the CHN6-CUG Dataset, each original image was divided into 9 pieces, yielding 9972 training images sized 512 × 512 and 441 testing images.
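One way to obtain nine 512 × 512 tiles from a 1500 × 1500 image is a 3 × 3 grid of slightly overlapping crops; the exact cropping scheme used here is not stated, so the following Python sketch is only an illustrative assumption:

```python
import numpy as np

def tile_image(image, tile=512):
    """Cut a 1500 x 1500 image into a 3 x 3 grid of 512 x 512 tiles.
    Tile origins are evenly spaced, so neighbouring tiles overlap slightly."""
    h, w = image.shape[:2]
    starts = [0, (h - tile) // 2, h - tile]   # e.g. 0, 494, 988 for h = 1500
    return [image[y:y + tile, x:x + tile] for y in starts for x in starts]
```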
The Ottawa Roads Dataset represents typical urban areas in Ottawa, Canada [35]. Roads in this dataset vary in width from 10 to 80 pixels and feature numerous shadows and occlusions due to cars and avenue trees. Consequently, extracting roads from this dataset presents a more challenging task.
The LoveDA Dataset was collected from the cities of Nanjing, Changzhou, and Wuhan, encompassing 18 different administrative districts [39]. For our experiments, we randomly selected 1366 road images, primarily from rural areas. This subset includes 1088 images for training and 278 images for testing, with each image having a resolution of 1024 × 1024 pixels and a pixel resolution of 0.3 m per pixel.
The DeepGlobe Road Extraction Dataset comprises 6226 training images and 1101 test images sourced from Thailand, India, and Indonesia [40]. This dataset features a variety of scenes, including urban areas, countryside, desolate rural regions, coastal areas, and tropical rainforests. All images are sized at 1024 × 1024 pixels with a resolution of 0.5 m per pixel. To optimize the data for GPU processing, we used 6224 images for training and 1008 images for testing.
By experimenting with the CHN6-CUG, Massachusetts Roads, and Ottawa Roads Datasets, we can assess the effectiveness of various semi-supervised methods for training road extraction models in urban scenarios. Similarly, experiments on the LoveDA Dataset allow us to evaluate these methods in rural contexts. Additionally, testing on the DeepGlobe Road Extraction Dataset enables us to verify their effectiveness in complex and hybrid scenes.
We adhere to the partition protocols of Guided Collaborative Training by dividing the entire training set into two groups. This division is achieved by randomly subsampling 1/2, 1/4, 1/8, and 1/16 of the entire set to create the labeled set, while the remaining images are designated as the unlabeled set [41].
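As an illustration of this partition protocol (a minimal sketch with hypothetical helper names; the exact sampling procedure of [41] may differ in detail), the labeled/unlabeled split could be generated as follows:

```python
import random

def split_labeled_unlabeled(train_images, labeled_fraction=1/8, seed=0):
    """Randomly subsample a fraction of the training set as the labeled set;
    the remaining images form the unlabeled set (their labels are ignored)."""
    rng = random.Random(seed)
    shuffled = list(train_images)
    rng.shuffle(shuffled)
    n_labeled = int(len(shuffled) * labeled_fraction)
    return shuffled[:n_labeled], shuffled[n_labeled:]
```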

4.2. Baselines and Experiment Setting

In practice, the volume of unlabeled remote sensing image data far exceeds that of labeled data. Traditional self-training and co-training methods require extensive data filtering and iterative updates, making them impractical for large-scale road extraction tasks. Consequently, we limited our comparison between ACCT and these methods to a few experiments focused on training accuracy. In contrast, consistency learning approaches are more comparable to ACCT in terms of framework and training costs. Thus, we conducted extensive experiments across all datasets to demonstrate ACCT’s effectiveness.
To compare with ACCT, we selected three consistency learning methods: CR-Seg [42], CPS [17], and GCT [41]. CR-Seg is a leading semi-supervised learning method that shows significant advantages in multiple experiments. Introduced at CVPR 2021, CPS is recognized for its top performance on benchmarks like PASCAL VOC 2012 and Cityscapes. GCT excels in various tasks, including semantic segmentation and image denoising, serving as a solid baseline. Among these, CPS shares the closest structure to ACCT, differing mainly in loss supervision and model prediction. Thus, this comparative evaluation serves as an ablation study to highlight the advancements and effectiveness of the ACCT framework.
For the models trained on the two branches, we selected RUW-Net [43], U-Net, D-LinkNet [44], RoadNet, and CGNet [45]. RUW-Net is currently one of the best available road extraction models. U-Net is a widely used classical semantic segmentation model that serves as a benchmark for road extraction comparisons. D-LinkNet gained prominence in the CVPR 2018 DeepGlobe Road Extraction Challenge, demonstrating its effectiveness in road extraction tasks. Both CGNet and RoadNet are lightweight networks: CGNet is one of the most advanced lightweight semantic segmentation models, while RoadNet is known for its exceptional road extraction accuracy despite its small model size.
While ACCT is compatible with both CNN-based and transformer-based models, it is important to note that transformer-based models are less suitable for our dataset. According to Alexey Dosovitskiy et al.’s seminal work on Vision Transformers (ViTs), ViTs generally show lower efficiency compared to CNNs when trained on small to medium-sized datasets (around 1 million samples or fewer) [46]. To demonstrate that ACCT can effectively support the training of transformer architectures, we conducted experiments with two ViT models: SegFormer and Swin-UNet [47,48].
Furthermore, for comparisons involving self-training and co-training methods, we use the widely adopted frameworks ST++ and DMT as baselines against ACCT. ST++ enhances training outcomes even in scenarios of significant label scarcity by employing data augmentation to progressively utilize unlabeled images. DMT, on the other hand, is a co-training approach that facilitates mutual training between two models through dynamic re-weighting of the loss function, demonstrating strong performance in both image classification and semantic segmentation tasks.
As with most self-training and co-training methods, the ST++ and DMT frameworks require manual tuning of hyperparameters, such as the pseudo-label threshold and the number of dataset iterations. To prevent potential accuracy degradation from hyperparameter adjustments, we adhered to the original settings specified by the authors in our experiments.

4.3. Evaluation

We evaluate the segmentation performance using the Intersection-over-Union (IoU), Precision, Recall, F1 metric, Overall Accuracy (OA) and the Kappa coefficient.
The evaluation metric of IoU is formulated as follows:
$IoU = \frac{TP}{TP + FP + FN}$
where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively.
The evaluation metric of Precision is formulated as follows:
$Precision = \frac{TP}{TP + FP}$
The evaluation metric of Recall is formulated as follows:
$Recall = \frac{TP}{TP + FN}$
The evaluation metric of F1 is formulated as follows:
$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$
The evaluation metric of OA is formulated as follows:
$OA = \frac{TP + TN}{TP + FN + FP + TN}$
The evaluation metric of the Kappa coefficient is formulated as follows:
$Kappa = \frac{P_0 - P_e}{1 - P_e}$
where $P_0$ is the actual agreement rate and $P_e$ is the theoretical (chance) agreement rate.
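For reference, the following sketch computes all six accuracy metrics from a binary prediction and ground-truth mask. The Kappa chance-agreement term follows the standard two-class formulation; this is our illustrative code, not the authors' evaluation script.

```python
import numpy as np

def road_metrics(pred, gt):
    """Compute IoU, Precision, Recall, F1, OA and Kappa for binary road masks.
    pred, gt: numpy arrays of 0/1 values with the same shape."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    n = tp + fp + fn + tn
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / n                                              # observed agreement P0
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)  # chance agreement Pe
    kappa = (oa - pe) / (1 - pe)
    return dict(IoU=iou, Precision=precision, Recall=recall, F1=f1, OA=oa, Kappa=kappa)
```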
To evaluate the overall training cost, we consider both equipment and time costs associated with the training process. Equipment cost is determined by the model’s parameter size (Params) and the number of floating-point operations (FLOPs). Time cost is measured by recording the duration required for each method to complete a single training iteration on the dataset. This comprehensive approach provides a detailed understanding of the training costs from multiple perspectives.
The evaluation metric of Params is formulated as:
$Params = \sum_{i=0}^{m} Params_{conv} + \sum_{i=0}^{n} Params_{fc}$
where $m$ and $n$ represent the number of convolutional and fully connected layers, respectively, and $Params_{conv}$ and $Params_{fc}$ denote the number of parameters in the convolutional and fully connected layers.
$Params_{fc} = (n_{in} + bias) \cdot n_{out}$
where $n_{in}$ and $n_{out}$ denote the number of input and output nodes, and $bias$ stands for the bias term.
The evaluation metrics of FLOPs are formulated as:
$FLOPs = \sum_{i=0}^{m} FLOPs_{conv} + \sum_{i=0}^{n} FLOPs_{fc}$
where $FLOPs_{conv}$ and $FLOPs_{fc}$ represent the floating-point operations in the convolutional and fully connected layers.
$FLOPs_{conv} = (k_w \cdot k_h \cdot c_{in} + bias) \cdot c_{out} \cdot H_{out} \cdot W_{out}$
where $k_w$ and $k_h$ represent the width and height of the convolution kernel, $c_{in}$ and $c_{out}$ indicate the number of input and output channels, and $H_{out}$ and $W_{out}$ signify the height and width of the output feature layer.
$FLOPs_{fc} = (n_{in} + bias) \cdot n_{out}$
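A small sketch of these counting rules follows; the helper functions are ours, and the convolutional parameter count uses the standard $(k_w \cdot k_h \cdot c_{in} + bias) \cdot c_{out}$ rule, which the text implies but does not spell out.

```python
def conv_params(k_w, k_h, c_in, c_out, bias=True):
    """Parameters of one 2-D convolutional layer: (kw*kh*c_in + bias) * c_out."""
    return (k_w * k_h * c_in + (1 if bias else 0)) * c_out

def fc_params(n_in, n_out, bias=True):
    """Parameters of one fully connected layer: (n_in + bias) * n_out."""
    return (n_in + (1 if bias else 0)) * n_out

def conv_flops(k_w, k_h, c_in, c_out, h_out, w_out, bias=True):
    """FLOPs of one convolutional layer: (kw*kh*c_in + bias) * c_out * H_out * W_out."""
    return (k_w * k_h * c_in + (1 if bias else 0)) * c_out * h_out * w_out

def fc_flops(n_in, n_out, bias=True):
    """FLOPs of one fully connected layer: (n_in + bias) * n_out."""
    return (n_in + (1 if bias else 0)) * n_out

# Example: a 3x3 convolution mapping 64 to 128 channels on a 256x256 feature map.
print(conv_params(3, 3, 64, 128), conv_flops(3, 3, 64, 128, 256, 256))
```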

4.4. Implementation Details

Our proposed ACCT is implemented in PyTorch and trained on RTX 3090 GPUs. To ensure a fair comparison and to focus on the structural design of the semi-supervised learning framework, we did not use pre-trained models or image augmentation techniques in any of the experiments. Models were trained with a mini-batch SGD optimizer using a fixed momentum of 0.9, a batch size of 8, and an initial learning rate of 0.01. We set the number of epochs to 1000 to ensure model convergence.
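A minimal sketch of this optimizer configuration is given below; whether both branches share one optimizer or use two separate ones is our assumption, and the model variables are placeholders.

```python
import torch
import torch.nn as nn

def build_optimizer(principal: nn.Module, vice: nn.Module) -> torch.optim.SGD:
    """One SGD optimizer over both branches with the reported settings:
    momentum 0.9 and an initial learning rate of 0.01."""
    params = list(principal.parameters()) + list(vice.parameters())
    return torch.optim.SGD(params, lr=0.01, momentum=0.9)
```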
To investigate the impact of the weight ratio between the supervision loss L s and the cross pseudo supervision loss L c on training outcomes, we varied λ from 0.1 to 1.0 in increments of 0.1 and conducted three sets of semi-supervised learning experiments using U-Net on the CHN6-CUG dataset. To maintain fairness and reduce the effect of hyperparameter tuning on the results, we fixed λ at a specific value when comparing the accuracy of different semi-supervised learning methods.

5. Results

5.1. Impact of the Trade-Off Weight between $L_s$ and $L_c$ on Training Effectiveness

We investigated the influence of the weight $\lambda$, which balances the supervision loss and the cross pseudo supervision loss as shown in Equation (6). As shown in Table 2, the model performs best on the CHN6-CUG dataset when $\lambda$ is set between 0.1 and 0.2. Therefore, we set $\lambda$ to 0.1 for all subsequent experiments.

5.2. Comparison of Accuracy with Consistency Learning

Table 3 presents the effectiveness of various semi-supervised methods for model training across different partition protocols. The results indicate that no single method excels in all evaluation metrics. However, the ACCT method consistently maintains 4 to 5 out of 6 metrics at optimal levels, whereas other methods achieve optimal performance in at most 1 metric. This demonstrates a significant advantage for the ACCT approach.
Using Intersection over Union (IoU) as an illustrative example, we present a comprehensive comparison of our method against the baseline across all partitioning protocols in Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11.
Taking Figure 6 as an example, both our method and the baseline consistently outperform supervised learning on the CHN6-CUG Dataset using U-Net (Figure 6a). The dashed line represents the performance of a model trained using supervised learning. Figure 6b focuses on the upper curves from Figure 6a, showing our method's superiority over the baseline under the 1/16, 1/8, 1/4, and 1/2 partition protocols. The blue curve in Figure 6 shows U-Net's training results using our method with CGNet as the secondary model, while the purple curve shows the results with RoadNet as the secondary model. The baseline approach does not differentiate between primary and secondary models. The pink curve represents the CPS method's U-Net training results, the green curve represents the GCT method's results, and the orange curve shows the supervised learning results. As shown in Figure 6, even with only 1/16 of the data labeled, ACCT outperforms fully supervised training. This suggests that for real-world road extraction tasks, ACCT can achieve better performance with significantly less labeled data, which is very promising.
Table 4, Table 5 and Table 6 present detailed scores of the training results for various methods on different datasets, demonstrating the generalizability of ACCT’s training approach.
Table 7 displays the performance of various semi-supervised methods on the CHN6-CUG dataset when training a model based on a transformer architecture, rather than a CNN architecture. As discussed in Section 4.2, while the amount of data used remains insufficient to fully leverage the potential of transformer-based models, the ACCT method achieves superior training results even with limited data.
Table 8 presents the performance of various semi-supervised methods in rural areas and other complex scenes. The experiments once again demonstrate the superiority of ACCT.
Figure 12 and Figure 13 display segmentation results of U-Net on the CHN6-CUG Dataset, Massachusetts Roads Dataset, and Ottawa Roads Dataset.
Comparing the images in groups c and d, it is clear that the ACCT-trained model performs better than the CPS-trained model, particularly in maintaining road continuity. A further comparison of the images in groups c, d, and f shows that ACCT outperforms fully supervised training with only 1/4 or 1/16 of the labeled data. This indicates that with just 6.25% to 25% of previously labeled road samples, we can train a superior road extraction model when given an unlabeled set of road samples in a new environment. Additionally, comparing the images in groups c and f highlights that ACCT training significantly surpasses fully supervised training when supplemented with additional unlabeled road images, which are often readily available.
In summary, ACCT effectively enhances model performance.
Furthermore, we conducted experiments to demonstrate the effectiveness of our method for training lightweight models, as shown in Figure 14. The dashed line represents the performance of a model trained using supervised learning. The figure illustrates the improvement over the baseline for all partitioning protocols, even with lightweight models. Table 9 provides specific scores of the training results for various methods on different datasets.

5.3. Comparison of Accuracy with Self-Training and Co-Training

Table 10 and Table 11 present detailed training scores for the ACCT method compared with ST++ and DMT on the CHN6-CUG dataset. The results indicate that when training large-volume models with high accuracy, the performance of ACCT is only marginally better than that of ST++ and DMT. However, when applied to lightweight models, ACCT demonstrates a significant advantage in training accuracy.

5.4. Comparison in Training Cost

Self-training and co-training methods, such as ST++ and DMT, involve intricate, sequential model training processes and iterative dataset updates. Their training costs are variable and significantly affected by the number of dataset iterations and the confidence thresholds for pseudo-labels, making them notably more expensive than ACCT. Specifically, these methods can be more than three times slower than ACCT when using three dataset iterations, and the time required to sort pseudo-labels by confidence increases as the dataset size grows.
In contrast, the training cost of consistency learning is more comparable to ACCT. Table 12 compares the computational costs and model parameters of different methods. ACCT achieves nearly half the equipment costs of the baseline, primarily due to its streamlined training structure. Unlike CPS, which utilizes large models on both branches, we use a lightweight auxiliary model, reducing parameter requirements while still enabling mutual supervision.
Table 13 presents a comparison of the time required by different methods to complete one round of training on the DeepGlobe Road Extraction Dataset. This includes both the time spent training on the training set and evaluating accuracy on the test set. ACCT reduces time consumption by approximately 30% per iteration compared to the baseline. The percentage of time saved is smaller than the percentage of equipment cost saved because some fundamental code and tool libraries required for training cannot be streamlined by the semi-supervised approach.
Figure 15 displays the convergence curves for U-Net training on the CHN6-CUG dataset. Our method converges more rapidly than the baseline, typically within 400–800 iterations, compared to 800–1200 iterations for the CPS method. The fewer iterations required for convergence, coupled with reduced parameters, computations, and time per iteration, contribute to lower overall training costs for our method.

5.5. Ablation Experiments

To further elucidate why the structural design of ACCT enhances model training effectiveness, we performed ablation experiments to analyze the early training dynamics, as illustrated in Figure 16. Several key observations can be drawn from the figure. First, the ACCT method demonstrates superior training efficiency, initiating the acquisition of useful road structures by the 20th epoch. This stage requires 30 epochs in Figure 16b and 60 epochs in Figure 16c, consistent with the findings presented in Section 5.4.
Subsequently, we compare how different methods learn the red-circled area as the number of epochs increases. The performance observed between the 20th and 100th epochs in Figure 16b indicates that the U-Net model fails to learn this area accurately. The misclassified features learned by U-Net not only persist but also become more entrenched with additional iterations. In contrast, RoadNet effectively learns this area, showing no misclassifications from epoch 20 to epoch 100. This indicates that models with distinct architectures provide different feature observation perspectives. Models with identical architectures may not aid each other in addressing specific learning challenges, instead reinforcing erroneous knowledge. The ACCT method effectively resolves this challenge. As depicted in Figure 16a during epochs 20–40, although the ACCT-trained model is initially misled by the U-Net model on a branching road in the red-circled area, resulting in some misjudgment, the situation improves during epochs 60–100. At this stage, the RoadNet model on the other branch begins to learn new features and share them with the U-Net model on the first branch, facilitating the rapid correction of previously misclassified features. This process highlights why the ACCT method significantly enhances the model's ability to learn and generalize sample features.
This conclusion is further supported by Figure 17. When we use a rural road scene as the training set with the state-of-the-art road extraction model, the feature-sharing property of ACCT remains effective. As shown in Figure 17b, while RUW-Net quickly learns the road features, it misclassifies the jungle within the red circle as a road, and this misclassification tends to deepen over time. In contrast, CGNet, despite having lower overall performance, remains unaffected in that area. ACCT corrects this erroneous classification through feature sharing. Figure 17a indicates that the errors deepen between the 60th and 80th epochs, but from the 80th to the 100th epoch, CGNet's feature sharing rectifies the misclassification.

6. Discussions

Many studies have fused multimodal data to extract roads and achieved good results, for example by combining high-resolution satellite imagery, synthetic aperture radar, digital elevation models, and mobile laser scanning data [49,50]. However, in this paper, only high-resolution remote sensing images were used as inputs in the experiments. This choice reduces the complexity of the experiments and focuses the comparison on the training methods of different semi-supervised approaches. For the same reason, we did not intentionally choose an ultra-high-precision road extraction model to compare the training effectiveness of different models vertically; rather, we compared horizontally the effectiveness of the same model trained in different ways.
Table 3 illustrates the superiority of ACCT over recent semi-supervised methods by comparing six different accuracy evaluation metrics. Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11 and Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 demonstrate, through experiments conducted on multiple datasets across various scenarios, that ACCT exhibits strong generalization capabilities. This holds true for urban road datasets, rural road datasets, and mixed scenario datasets, across resolutions of 1 m, 0.5 m, 0.3 m, and 0.21 m. Additionally, ACCT is effective whether training large parameter models or lightweight models, and whether employing CNN or transformer architectures. Compared to other methods, ACCT significantly enhances model performance while reducing training costs. Remarkably, even with only 1/16 of the labels utilized in fully supervised methods, ACCT enables effective feature learning from unlabeled images, allowing the model to achieve comparable accuracy.
Table 12 and Table 13 demonstrate that, compared to the baseline, ACCT reduces FLOPs by over 48.83%, parameters by more than 49.56%, and the time required for each iteration by approximately 30%. Additionally, Figure 15 illustrates that the ACCT method facilitates faster model convergence. Consequently, the overall training cost of ACCT is significantly lower than that of the benchmark method.
Additionally, Figure 16 and Figure 17 illustrate the effectiveness of the structural design of the ACCT method through ablation experiments.
However, Table 7 indicates that the applicability of ACCT is somewhat limited. Although ACCT achieves improved training accuracy compared to benchmark methods, it still struggles with effectively training transformer-based models. Future research will focus on how to fully leverage the vast amount of unlabeled samples to enhance the feature extraction accuracy of transformer-based models.

7. Conclusions

To address the challenges associated with a scarcity of accurately labeled samples and the substantial costs of model training in real-world applications, this paper introduces a novel semi-supervised road extraction framework, referred to as ACCT. Unlike conventional semi-supervised learning approaches, ACCT facilitates synchronous training of models with diverse architectures, allowing them to share features learned from multiple perspectives. This innovative strategy not only enhances model performance but also minimizes training expenses. Experimental evaluations across various models and datasets representing different scenarios demonstrate that ACCT exhibits robust training efficacy and compatibility. These results indicate that a thoughtfully implemented multi-perspective learning approach can provide significant advantages, thereby paving the way for further advancements in the performance of road extraction models.

Author Contributions

Conceptualization, Hao Yu and Zhijiang Li; methodology, Hao Yu; validation, Hao Yu and Zhijiang Li; formal analysis, Hao Yu; resources, Shihong Du and Zhijiang Li; data curation, Hao Yu; writing—original draft preparation, Hao Yu; writing—review and editing, Zhijiang Li, Zhenshan Tan and Xiuyuan Zhang; supervision, Zhijiang Li; project administration, Shihong Du; funding acquisition, Zhijiang Li. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under Grant 2021YFE0117100.

Data Availability Statement

Data can be provided upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jiang, F.; Ma, L.; Broyd, T.; Chen, W.; Luo, H. Digital twin enabled sustainable urban road planning. Sustain. Cities Soc. 2022, 78, 103645. [Google Scholar] [CrossRef]
  2. Li, Y.; Yan, B.; Yan, J. Correlation between Road Network Accessibility and Urban Land Use: A Case Study of Fuzhou City. Pol. J. Environ. Stud. 2022, 31, 2915–2922. [Google Scholar]
  3. Soni, P.K.; Rajpal, N.; Mehta, R. Road network extraction using multi-layered filtering and tensor voting from aerial images. Egypt. J. Remote Sens. Space Sci. 2021, 24, 211–219. [Google Scholar] [CrossRef]
  4. Chi, M.; Plaza, A.; Benediktsson, J.A.; Sun, Z.; Shen, J.; Zhu, Y. Big Data for Remote Sensing: Challenges and Opportunities. Proc. IEEE 2016, 104, 2207–2219. [Google Scholar] [CrossRef]
  5. Wang, Y.; Peng, Y.; Li, W.; Alexandropoulos, G.C.; Yu, J.; Ge, D.; Xiang, W. DDU-Net: Dual-Decoder-U-Net for Road Extraction Using High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4412612. [Google Scholar] [CrossRef]
  6. Li, Y.; Xiang, L.; Zhang, C.; Jiao, F.; Wu, C. A Guided Deep Learning Approach for Joint Road Extraction and Intersection Detection from RS Images and Taxi Trajectories. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8008–8018. [Google Scholar] [CrossRef]
  7. Gao, F.; Tu, J.; Wang, J.; Hussain, A.; Zhou, H. RoadSeg-CD: A Network with Connectivity Array and Direction Map for Road Extraction from SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3992–4003. [Google Scholar] [CrossRef]
  8. Yang, X.; Song, Z.; King, I.; Xu, Z. A Survey on Deep Semi-Supervised Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 8934–8954. [Google Scholar] [CrossRef]
  9. Miao, Z.; Shi, W.; Zhang, H.; Wang, X. Road centerline extraction from high resolution imagery based on shape features and multivariate adaptive regression splines. IEEE Geosci. Remote Sens. Lett. 2012, 10, 583–587. [Google Scholar] [CrossRef]
  10. Lian, R.; Wang, W.; Mustafa, N.; Huang, L. Road extraction methods in high resolution remote sensing images: A comprehensive review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5489–5507. [Google Scholar] [CrossRef]
  11. Li, Y.; Shi, T.; Zhang, Y.; Chen, W.; Wang, Z.; Li, H. Learning deep semantic segmentation network under multiple weakly-supervised constraints for cross-domain remote sensing image semantic segmentation. ISPRS J. Photogramm. Remote Sens. 2021, 175, 20–33. [Google Scholar] [CrossRef]
  12. Dixit, Y.; Srivastava, N.; Joy, J.D.; Olikara, R.; Ramesh, R. Cross Psuedo Supervision Framework for Sparsely Labelled Geo-spatial Images. arXiv 2024, arXiv:2408.02382. [Google Scholar]
  13. Xu, Y.; Wei, F.; Sun, X.; Yang, C.; Shen, Y.; Dai, B.; Zhou, B.; Lin, S. Cross-model pseudo-labeling for semi-supervised action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2959–2968. [Google Scholar]
  14. Abdollahi, A.; Pradhan, B.; Shukla, N.; Chakraborty, S.; Alamri, A. Deep learning approaches applied to remote sensing datasets for road extraction: A state-of-the-art review. Remote Sens. 2020, 12, 1444. [Google Scholar] [CrossRef]
  15. Wu, L.; Fang, L.; He, X.; He, M.; Ma, J.; Zhong, Z. Querying labeled for unlabeled: Cross-image semantic consistency guided semi-supervised semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8827–8844. [Google Scholar] [CrossRef]
  16. Cira, C.-I.; Kada, M.; Manso-Callejo, M.-Á.; Alcarria, R.; Bordel Sanchez, B. Improving Road Surface Area Extraction via Semantic Segmentation with Conditional Generative Learning for Deep Inpainting Operations. ISPRS Int. J. Geo-Inf. 2022, 11, 43. [Google Scholar] [CrossRef]
  17. Chen, X.; Yuan, Y.; Zeng, G.; Wang, J. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  18. Wu, S.; Du, C.; Chen, H.; Xu, Y.; Guo, N.; Jing, N. Road Extraction from Very High Resolution Images Using Weakly labeled OpenStreetMap Centerline. ISPRS Int. J. Geo-Inf. 2019, 8, 478. [Google Scholar] [CrossRef]
  19. Zhou, M.; Sui, H.; Chen, S.; Liu, J.; Shi, W.; Chen, X. Large-scale road extraction from high-resolution remote sensing images based on a weakly-supervised structural and orientational consistency constraint network. ISPRS J. Photogramm. Remote Sens. 2022, 193, 234–251. [Google Scholar] [CrossRef]
  20. Engelen, J.V.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020, 109, 373–440. [Google Scholar] [CrossRef]
  21. Wang, Y.; Xuan, Z.; Ho, C.; Qi, G.-J. Adversarial Dense Contrastive Learning for Semi-Supervised Semantic Segmentation. IEEE Trans. Image Process. 2023, 32, 4459–4471. [Google Scholar] [CrossRef]
  22. Hoyer, L.; Dai, D.; Wang, Q.; Chen, Y.; Van Gool, L. Improving semi-supervised and domain-adaptive semantic segmentation with self-supervised depth estimation. Int. J. Comput. Vis. 2023, 131, 2070–2096. [Google Scholar] [CrossRef]
  23. Chen, H.; Li, Z.; Wu, J.; Xiong, W.; Du, C. SemiRoadExNet: A semi-supervised network for road extraction from remote sensing imagery via adversarial learning. ISPRS J. Photogramm. Remote Sens. 2023, 198, 169–183. [Google Scholar] [CrossRef]
  24. Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, USA, 26–30 June 1995. [Google Scholar]
  25. Chen, L.C.; Lopes, R.G.; Cheng, B.; Collins, M.D.; Cubuk, E.D.; Zoph, B.; Adam, H.; Shlens, J. Naive-student: Leveraging semi-supervised learning in video sequences for urban scene segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX 16. Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  26. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. arXiv 2017, arXiv:1704.06857. [Google Scholar]
  27. Ibrahim, M.S.; Vahdat, A.; Ranjbar, M.; Macready, W.G. Semi-supervised semantic image segmentation with self-correcting networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  28. Blum, A.; Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998. [Google Scholar]
  29. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  30. Chen, M.; Weinberger, K.Q.; Chen, Y. Automatic Feature Decomposition for Single View Co-training. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; Volume 2. [Google Scholar]
  31. Fan, Y.; Kukleva, A.; Dai, D.; Schiele, B. Revisiting Consistency Regularization for Semi-Supervised Learning. Int. J. Comput. Vis. 2023, 131, 626–643. [Google Scholar] [CrossRef]
  32. Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. arXiv 2016, arXiv:1610.02242. [Google Scholar]
  33. Zou, Y.; Zhang, Z.; Zhang, H.; Li, C.L.; Bian, X.; Huang, J.B.; Pfister, T. Pseudoseg: Designing pseudo labels for semantic segmentation. arXiv 2020, arXiv:2010.09713. [Google Scholar]
  34. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204. [Google Scholar]
  35. Liu, Y.; Yao, J.; Lu, X.; Xia, M.; Wang, X.; Liu, Y. RoadNet: Learning to comprehensively analyze road networks in complex urban scenes from high-resolution remotely sensed images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2043–2056. [Google Scholar] [CrossRef]
  36. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer International Publishing: Cham, Switzerland, 2015. [Google Scholar]
  37. Zhu, Q.; Zhang, Y.; Wang, L.; Zhong, Y.; Guan, Q.; Lu, X.; Zhang, L.; Li, D. A global context-aware and batch-independent network for road extraction from VHR satellite imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 353–365. [Google Scholar] [CrossRef]
  38. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013. [Google Scholar]
  39. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733. [Google Scholar]
  40. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. arXiv 2018, arXiv:1805.06561. [Google Scholar]
  41. Ke, Z.; Qiu, D.; Li, K.; Yan, Q.; Lau, R.W. Guided collaborative training for pixel-wise semi-supervised learning. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII 16. Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  42. Xiao, Y.; Dong, J.; Zhang, Q.; Yi, P.; Liu, R.; Wei, X. Semi-supervised Semantic Segmentation with Complementary Reconfirmation Mechanism. In Proceedings of the 22nd UK Workshop on Computational Intelligence (UKCI 2023), Birmingham, UK, 6–8 September 2023; Springer Nature: Cham, Switzerland; pp. 182–194. [Google Scholar]
  43. Yang, J.; Gu, Z.; Wu, T.; Ahmed, Y.A. RUW-Net: A Dual Codec Network for Road Extraction from Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1550–1564. [Google Scholar] [CrossRef]
  44. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  45. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. Cgnet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
  46. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  47. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  48. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
  49. Fu, L.; Chai, H.; Lv, X. Enhancing Road Extraction in Large-Scale Complex Terrain through Multi-Source Remote Sensing Image Fusion and Optimization. Remote Sens. 2024, 16, 297. [Google Scholar] [CrossRef]
  50. Guan, H.; Li, J.; Yu, Y.; Chapman, M.; Wang, C. Automated Road Information Extraction from Mobile Laser Scanning Data. IEEE Trans. Intell. Transp. Syst. 2015, 16, 194–205. [Google Scholar] [CrossRef]
Figure 1. Challenges in achieving complete and accurate road labeling, illustrated with examples from the Massachusetts Roads Dataset and the CHN6-CUG Dataset. (b) is the label of the real image (a), and (d) is the label of the real image (c). Red circles mark areas that are not fully labeled, orange circles mark areas that are difficult to label accurately due to building occlusion, and yellow circles mark areas that are difficult to label accurately due to tree occlusion.
Figure 2. Illustrating the architectures for (a) common consistency learning methods, (b) mean teacher, (c) PseudoSeg, and (d) CPS. Solid arrows indicate forward operations and dashed arrows indicate loss supervision; ‘//’ on a solid arrow denotes stop-gradient. ‘X’ denotes the input sample, ‘X1’ and ‘X2’ denote augmented versions of the sample, ‘P1’ and ‘P2’ denote the prediction results, ‘Y1’ and ‘Y2’ denote the generated pseudo-labels, and ‘YW’ denotes the true label.
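As a concrete reference for the teacher-student branch in Figure 2b, the snippet below is a minimal sketch of the exponential moving average (EMA) weight update used by mean teacher [34]. The model objects and the smoothing coefficient alpha are illustrative assumptions and do not come from this paper.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    # Mean teacher [34]: each teacher weight tracks an exponential moving
    # average of the corresponding student weight; alpha is illustrative.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```

The consistency loss is then computed between the student's predictions and the stop-gradient teacher's predictions on differently augmented views of the same image.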
Figure 3. Overall framework of ACCT. Solid arrows indicate forward operations and dashed arrows indicate loss supervision; ‘//’ on a solid arrow denotes stop-gradient. ‘X’ denotes the input sample, ‘Fp’ and ‘Fv’ denote two road extraction models with different structures, ‘Pp’ and ‘Pv’ denote the segmentation confidence maps, ‘Y’ denotes the true label, and ‘Yp’ and ‘Yv’ denote the generated label maps.
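To make the data flow of Figure 3 concrete, the following is a minimal PyTorch-style sketch of one asymmetric co-training step in which two structurally different networks supervise each other through hard pseudo-labels with stopped gradients. The loss functions, the weight lam, and the way labeled and unlabeled images are combined are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def co_training_step(f_p, f_v, x_lab, y_lab, x_unlab, lam=0.2):
    # f_p and f_v: two road extraction models with different structures
    # (e.g., U-Net as the model to be trained and RoadNet as the assist model).
    # y_lab: long tensor of pixel classes for the labeled images.
    x = torch.cat([x_lab, x_unlab], dim=0)

    logits_p = f_p(x)  # segmentation confidence maps from Fp
    logits_v = f_v(x)  # segmentation confidence maps from Fv

    # Hard label maps with stopped gradients ('//' in Figure 3).
    y_p = logits_p.detach().argmax(dim=1)
    y_v = logits_v.detach().argmax(dim=1)

    n_lab = x_lab.size(0)
    # Supervised loss on the labeled part of the batch, against the true label Y.
    loss_sup = (F.cross_entropy(logits_p[:n_lab], y_lab)
                + F.cross_entropy(logits_v[:n_lab], y_lab))

    # Cross supervision: each model is supervised by the other's label map.
    loss_cross = (F.cross_entropy(logits_p, y_v)
                  + F.cross_entropy(logits_v, y_p))

    return loss_sup + lam * loss_cross
```

Here lam plays the role of the trade-off weight λ studied in Table 2; in this sketch the cross term covers both labeled and unlabeled images, following the CPS formulation [17].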
Figure 4. Illustration of sample batch composition and model processing using r = 1/4:3/4 as an example.
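As an illustration of the batch composition in Figure 4, the sketch below draws labeled and unlabeled samples in a fixed ratio r for every mini-batch. The loader construction, the default batch size, and the assumption that the unlabeled dataset yields images only are illustrative and not taken from the paper.

```python
from itertools import cycle
from torch.utils.data import DataLoader

def mixed_batches(labeled_ds, unlabeled_ds, batch_size=8, r=(1, 3)):
    # r = (1, 3) reproduces the r = 1/4:3/4 example of Figure 4:
    # 1/4 of each mini-batch is labeled and 3/4 is unlabeled.
    n_lab = batch_size * r[0] // (r[0] + r[1])   # 2 labeled images with the defaults
    n_unlab = batch_size - n_lab                 # 6 unlabeled images with the defaults
    lab_loader = DataLoader(labeled_ds, batch_size=n_lab, shuffle=True, drop_last=True)
    unlab_loader = DataLoader(unlabeled_ds, batch_size=n_unlab, shuffle=True, drop_last=True)
    # The smaller labeled loader is repeated (cycle) so that every unlabeled
    # batch is visited once per pass over the unlabeled data.
    for (x_lab, y_lab), x_unlab in zip(cycle(lab_loader), unlab_loader):
        yield x_lab, y_lab, x_unlab
```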
Figure 5. Pseudo-labels generated by different models during CPS training in CHN6-CUG Dataset. (a) The original satellite image; (b) ground truth marked by experts; (c) the pseudo-label generated by RoadNet [35]; and (d) the pseudo-label generated by U-Net [36].
Figure 6. Comparison of segmentation performance between our method and the baseline on the CHN6-CUG Dataset when employing U-Net as the model to be trained. (a) illustrates the performance of different methods under 1/16, 1/8, 1/4, and 1/2 partition protocols. (b) is an enlarged view of a portion of (a).
Figure 7. Comparison of segmentation performance between our method and the baseline on the CHN6-CUG Dataset when employing D-LinkNet as the model to be trained. (a) illustrates the performance of different methods under 1/16, 1/8, 1/4, and 1/2 partition protocols. (b) is an enlarged view of a portion of (a).
Figure 8. Comparison of segmentation performance between our method and the baseline on the Massachusetts Roads Dataset when employing U-Net as the model to be trained. (a) illustrates the performance of different methods under 1/16, 1/8, 1/4, and 1/2 partition protocols. (b) is an enlarged view of a portion of (a).
Figure 9. Comparison of segmentation performance between our method and the baseline on the Massachusetts Roads Dataset when employing D-LinkNet as the model to be trained. (a) illustrates the performance of different methods under 1/16, 1/8, 1/4, and 1/2 partition protocols. (b) is an enlarged view of a portion of (a).
Figure 10. Comparison of segmentation performance between our method and the baseline on the Ottawa Roads Dataset when employing U-Net as the model to be trained. (a) illustrates the performance of different methods under 1/16, 1/8, 1/4, and 1/2 partition protocols. (b) is an enlarged view of a portion of (a).
Figure 11. Comparison of segmentation performance between our method and the baseline on the Ottawa Roads Dataset when employing D-LinkNet as the model to be trained. (a) illustrates the performance of different methods under 1/16, 1/8, 1/4, and 1/2 partition protocols. (b) is an enlarged view of a portion of (a).
Figure 12. Example qualitative results from the CHN6-CUG Dataset. Differences between the results are highlighted with circles. (a) Original satellite images; (b) ground truth; (c) results of our method under 1/4 partition protocols; (d) results of CPS under 1/4 partition protocols; (e) results of fully supervised training using all labeled data; (f) results of our method under 1/16 partition protocols; (g) results of fully supervised training under 1/16 partition protocols.
Figure 13. Example qualitative results from the Massachusetts Roads Dataset. Differences between the results are highlighted with circles. (a) Original satellite images; (b) ground truth; (c) results of our method under 1/4 partition protocols; (d) results of CPS under 1/4 partition protocols; (e) results of fully supervised training using all labeled data; (f) results of our method under 1/16 partition protocols; (g) results of fully supervised training under 1/16 partition protocols.
Figure 14. Comparison of the segmentation performance of our method with the baseline for different models to be trained on the CHN6-CUG Dataset. (a) Performance of different methods when using RoadNet as the model to be trained; (b) performance of different methods when using CGNet as the model to be trained.
Figure 15. Convergence curve of our method compared to the baseline on the CHN6-CUG Dataset.
Figure 16. Early trends in model training: an example from the first 100 epochs. Differences between the methods are highlighted with circles. (a) The ACCT method using RoadNet to assist U-Net training; (b) an approach using two U-Net models of the same structure to supervise each other; (c) a method using two RoadNet models to supervise each other; (d) original satellite images; (e) ground truth.
Figure 17. Early trends in model training: an example from the first 100 epochs. Differences between the methods are highlighted with circles. (a) The ACCT method using CGNet to assist RUW-Net training; (b) an approach using two RUW-Net models of the same structure to supervise each other; (c) a method using two CGNet models to supervise each other; (d) original satellite images; (e) ground truth.
Table 1. Dataset description (for Datasets 1 and 2, the numbers in parentheses are the amount of data after filtering and cropping; for Datasets 4 and 5, the numbers in parentheses are the amount of data after random selection).
Dataset | Resolution | Area | Train | Test
CHN6-CUG Dataset [37] | 0.5 m | Beijing, China; Shanghai, China; Shenzhen, China; Wuhan, China; Macau, China; Hong Kong, China | 3608 (1600) | 903 (448)
Massachusetts Roads Dataset [38] | 1 m | Massachusetts, USA | 1108 (9972) | 49 (441)
Ottawa Roads Dataset [35] | 0.21 m | Ottawa, Canada | 512 | 128
LoveDA Dataset [39] | 0.3 m | Nanjing, Changzhou, and Wuhan | (1088) | (272)
DeepGlobe Road Extraction Dataset [40] | 0.5 m | Thailand, India, and Indonesia | 6226 (6224) | 1101 (1008)
Table 2. Segmentation performance evaluation on CHN6-CUG Dataset under 1/2 partition protocols when λ takes different values. The top two scores are bolded.
λ | ACCT (U-Net [36]/RoadNet [35]): IoU / Precision / Recall / F1 | ACCT (U-Net/CGNet [45]): IoU / Precision / Recall / F1 | CPS (U-Net/U-Net): IoU / Precision / Recall / F1
0.1 | 0.4607 / 0.7259 / 0.5582 / 0.6257 | 0.4647 / 0.6806 / 0.5918 / 0.6281 | 0.4604 / 0.7031 / 0.5714 / 0.6243
0.2 | 0.4698 / 0.6979 / 0.5870 / 0.6337 | 0.4638 / 0.7097 / 0.5703 / 0.6268 | 0.4614 / 0.7194 / 0.5614 / 0.6244
0.3 | 0.4610 / 0.7319 / 0.5519 / 0.6247 | 0.4622 / 0.7086 / 0.5696 / 0.6260 | 0.4571 / 0.7124 / 0.5577 / 0.6188
0.4 | 0.4670 / 0.7062 / 0.5808 / 0.6324 | 0.4598 / 0.6974 / 0.5729 / 0.6236 | 0.4451 / 0.6786 / 0.5646 / 0.6085
0.5 | 0.4439 / 0.7211 / 0.5356 / 0.6097 | 0.4407 / 0.7457 / 0.5195 / 0.6059 | 0.4322 / 0.7007 / 0.5308 / 0.5966
0.6 | 0.4442 / 0.7477 / 0.5225 / 0.6087 | 0.4455 / 0.7289 / 0.5317 / 0.6103 | 0.4333 / 0.7393 / 0.5123 / 0.5977
0.7 | 0.4239 / 0.7480 / 0.4935 / 0.5871 | 0.4395 / 0.7428 / 0.5156 / 0.6036 | 0.4184 / 0.7276 / 0.4976 / 0.5803
0.8 | 0.4404 / 0.7463 / 0.5199 / 0.6055 | 0.4388 / 0.7598 / 0.5099 / 0.6032 | 0.4169 / 0.7226 / 0.4954 / 0.5797
0.9 | 0.4245 / 0.7529 / 0.4952 / 0.5897 | 0.4213 / 0.7725 / 0.4801 / 0.5856 | 0.4111 / 0.7543 / 0.4750 / 0.5750
1.0 | 0.4001 / 0.7878 / 0.4491 / 0.5648 | 0.4106 / 0.7722 / 0.5678 / 0.5735 | 0.3913 / 0.6791 / 0.4822 / 0.5548
Table 3. Performance evaluation on CHN6-CUG Dataset when the model to be trained is RUW-Net [43] (for ACCT, the assist model is CGNet).
Method | 1/16: IoU / Precision / Recall / F1 / OA / Kappa | 1/8: IoU / Precision / Recall / F1 / OA / Kappa
GCT [41] | 0.4508 / 0.7273 / 0.5424 / 0.6162 / 0.6442 / 0.2557 | 0.4549 / 0.7348 / 0.5429 / 0.6191 / 0.6299 / 0.3014
CPS [17] | 0.4588 / 0.7296 / 0.5530 / 0.6238 / 0.6713 / 0.3114 | 0.4622 / 0.7086 / 0.5696 / 0.6260 / 0.5468 / 0.2717
CR-Seg [42] | 0.4585 / 0.7208 / 0.5580 / 0.6230 / 0.6640 / 0.3253 | 0.4637 / 0.7106 / 0.5694 / 0.6278 / 0.6133 / 0.2877
ACCT | 0.4609 / 0.7195 / 0.5596 / 0.6247 / 0.7031 / 0.2684 | 0.4667 / 0.7003 / 0.5809 / 0.6297 / 0.6679 / 0.3272
Method | 1/4: IoU / Precision / Recall / F1 / OA / Kappa | 1/2: IoU / Precision / Recall / F1 / OA / Kappa
GCT | 0.4576 / 0.7303 / 0.5483 / 0.6210 / 0.5351 / 0.2418 | 0.4621 / 0.7094 / 0.5670 / 0.6251 / 0.6835 / 0.3309
CPS | 0.4655 / 0.7148 / 0.5708 / 0.6298 / 0.5976 / 0.2976 | 0.4695 / 0.7082 / 0.5831 / 0.6341 / 0.6719 / 0.3358
CR-Seg | 0.4690 / 0.7184 / 0.5737 / 0.6334 / 0.6132 / 0.3079 | 0.4711 / 0.6961 / 0.5924 / 0.6351 / 0.6602 / 0.3103
ACCT | 0.4720 / 0.6867 / 0.6014 / 0.6367 / 0.6484 / 0.2842 | 0.4743 / 0.6953 / 0.5977 / 0.6379 / 0.6923 / 0.3429
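For reference, the metrics reported in Table 3 (IoU, Precision, Recall, F1, OA, and Kappa) can be computed from a binary pixel-wise confusion matrix as in the sketch below; this is the standard formulation of these measures and is not the authors' evaluation code.

```python
import numpy as np

def road_metrics(pred, gt):
    # pred and gt are arrays of 0 (non-road) and 1 (road) pixels.
    pred, gt = pred.astype(bool).ravel(), gt.astype(bool).ravel()
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    n = tp + fp + fn + tn

    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / n  # overall accuracy
    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (oa - p_e) / (1 - p_e)
    return iou, precision, recall, f1, oa, kappa
```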
Table 4. Performance evaluation on CHN6-CUG Dataset.
Method | Training U-Net: 1/16 / 1/8 / 1/4 / 1/2 | Training D-LinkNet: 1/16 / 1/8 / 1/4 / 1/2
Supervised learning | 0.2367 / 0.2869 / 0.3359 / 0.3986 | 0.3372 / 0.3566 / 0.3969 / 0.4501
GCT | 0.3678 / 0.4013 / 0.4427 / 0.4546 | 0.4416 / 0.4622 / 0.4893 / 0.5083
CPS | 0.4421 / 0.4544 / 0.4604 / 0.4644 | 0.4717 / 0.4884 / 0.5007 / 0.5124
ACCT (under CGNet-assist) | 0.4574 / 0.4621 / 0.4647 / 0.4661 | 0.4733 / 0.5001 / 0.5098 / 0.5152
ACCT (under RoadNet-assist) | 0.4433 / 0.4559 / 0.4607 / 0.4651 | 0.4813 / 0.5099 / 0.5173 / 0.5234
Table 5. Performance evaluation on Massachusetts Roads Dataset.
Method | Training U-Net: 1/16 / 1/8 / 1/4 / 1/2 | Training D-LinkNet: 1/16 / 1/8 / 1/4 / 1/2
Supervised learning | 0.4449 / 0.4804 / 0.5167 / 0.5358 | 0.2901 / 0.3406 / 0.4216 / 0.4907
GCT | 0.5214 / 0.5408 / 0.5504 / 0.5611 | 0.4978 / 0.5134 / 0.5172 / 0.5297
CPS | 0.5430 / 0.5613 / 0.5746 / 0.5780 | 0.5223 / 0.5261 / 0.5303 / 0.5327
Ours (under CGNet-assist) | 0.5657 / 0.5718 / 0.5798 / 0.5832 | 0.5263 / 0.5288 / 0.5329 / 0.5369
Ours (under RoadNet-assist) | 0.5695 / 0.5730 / 0.5786 / 0.5824 | 0.5266 / 0.5291 / 0.5337 / 0.5403
Table 6. Performance evaluation on Ottawa Roads Dataset.
Method | Training U-Net: 1/16 / 1/8 / 1/4 / 1/2 | Training D-LinkNet: 1/16 / 1/8 / 1/4 / 1/2
Supervised learning | 0.5003 / 0.6350 / 0.7313 / 0.7778 | 0.4287 / 0.5029 / 0.5503 / 0.6079
GCT | 0.8117 / 0.8257 / 0.8379 / 0.8472 | 0.6711 / 0.6963 / 0.7138 / 0.7562
CPS | 0.8263 / 0.8318 / 0.8423 / 0.8502 | 0.6803 / 0.7169 / 0.7416 / 0.7689
Ours (under CGNet-assist) | 0.8295 / 0.8325 / 0.8484 / 0.8504 | 0.7008 / 0.7411 / 0.7686 / 0.7703
Ours (under RoadNet-assist) | 0.8286 / 0.8449 / 0.8475 / 0.8528 | 0.6901 / 0.7306 / 0.7513 / 0.7736
Table 7. Performance evaluation on CHN6-CUG Dataset under 1/2 partition protocols when the model to be trained is Transformer-based.
Method | Training SegFormer: IoU / Precision / Recall / F1 | Training Swin-UNet: IoU / Precision / Recall / F1
Supervised learning | 0.2043 / 0.4658 / 0.2651 / 0.3272 | 0.1311 / 0.4702 / 0.1539 / 0.2200
GCT | 0.2461 / 0.5986 / 0.2976 / 0.3858 | 0.1591 / 0.4525 / 0.1985 / 0.2663
CPS | 0.2892 / 0.5989 / 0.3624 / 0.4407 | 0.1717 / 0.4700 / 0.2136 / 0.2862
Ours (under CGNet-assist) | 0.2913 / 0.6108 / 0.3593 / 0.4428 | 0.1858 / 0.4747 / 0.2392 / 0.3062
Ours (under RoadNet-assist) | 0.2993 / 0.5760 / 0.3846 / 0.4506 | 0.2010 / 0.3879 / 0.2978 / 0.3300
Table 8. Performance evaluation on LoveDA Dataset and DeepGlobe Road Extraction Dataset under 1/2 partition protocols when the model to be trained is U-Net.
Method | LoveDA Dataset: IoU / Precision / Recall / F1 | DeepGlobe Road Extraction Dataset: IoU / Precision / Recall / F1
Supervised learning | 0.4087 / 0.8112 / 0.4530 / 0.5761 | 0.5884 / 0.7899 / 0.6990 / 0.7390
GCT | 0.5002 / 0.7909 / 0.5769 / 0.6627 | 0.6135 / 0.8190 / 0.7077 / 0.7579
CPS | 0.5285 / 0.8091 / 0.6065 / 0.6891 | 0.6297 / 0.8422 / 0.7147 / 0.7697
Ours (under CGNet-assist) | 0.5326 / 0.8180 / 0.6100 / 0.6917 | 0.6312 / 0.8000 / 0.7532 / 0.7721
Ours (under RoadNet-assist) | 0.5389 / 0.8084 / 0.6182 / 0.6981 | 0.6329 / 0.8424 / 0.7191 / 0.7731
Table 9. Performance evaluation on CHN6-CUG Dataset when the model to be trained is a lightweight model.
Method | Training RoadNet: 1/16 / 1/8 / 1/4 / 1/2 | Training CGNet: 1/16 / 1/8 / 1/4 / 1/2
Supervised learning | 0.1669 / 0.1867 / 0.2253 / 0.2384 | 0.2047 / 0.2314 / 0.2526 / 0.2757
GCT | 0.2002 / 0.2283 / 0.2474 / 0.2517 | 0.2513 / 0.2642 / 0.2831 / 0.3007
CPS | 0.2254 / 0.2451 / 0.2561 / 0.2685 | 0.2869 / 0.3023 / 0.3141 / 0.3179
Ours (under U-Net-assist) | 0.2263 / 0.2679 / 0.3034 / 0.3232 | 0.2989 / 0.3073 / 0.3153 / 0.3234
Table 10. Comparison of performance with self-training and co-training methods on the CHN6-CUG Dataset when the model to be trained is a large-volume model.
Method | Training U-Net: 1/16 / 1/8 / 1/4 / 1/2 | Training D-LinkNet: 1/16 / 1/8 / 1/4 / 1/2
Supervised learning | 0.2367 / 0.2869 / 0.3359 / 0.3986 | 0.3372 / 0.3566 / 0.3969 / 0.4501
ST++ | 0.4509 / 0.4598 / 0.4611 / 0.4646 | 0.4731 / 0.4886 / 0.5023 / 0.5138
DMT | 0.4513 / 0.4601 / 0.4621 / 0.4644 | 0.4709 / 0.4867 / 0.5039 / 0.5111
Ours (under CGNet-assist) | 0.4574 / 0.4621 / 0.4647 / 0.4661 | 0.4733 / 0.5001 / 0.5098 / 0.5152
Ours (under RoadNet-assist) | 0.4433 / 0.4559 / 0.4607 / 0.4651 | 0.4813 / 0.5099 / 0.5173 / 0.5234
Table 11. Comparison of performance with self-training and co-training methods on the CHN6-CUG Dataset when the model to be trained is a lightweight model.
Method | Training RoadNet: 1/16 / 1/8 / 1/4 / 1/2 | Training CGNet: 1/16 / 1/8 / 1/4 / 1/2
Supervised learning | 0.1669 / 0.1867 / 0.2253 / 0.2384 | 0.2047 / 0.2314 / 0.2526 / 0.2757
ST++ | 0.1958 / 0.2228 / 0.2335 / 0.2520 | 0.2628 / 0.2742 / 0.2809 / 0.2866
DMT | 0.1973 / 0.2204 / 0.2411 / 0.2541 | 0.2617 / 0.2799 / 0.2851 / 0.2913
Ours (under U-Net-assist) | 0.2263 / 0.2679 / 0.3034 / 0.3232 | 0.2989 / 0.3073 / 0.3153 / 0.3234
Table 12. FLOPs and Params of Different Methods. FLOPs and Params are estimated for an input size of 3 × 512 × 512 at epoch 8.
Model trained | Method | FLOPs (G) | Proportion of FLOPs Reduced by ACCT | Parameters (M) | Proportion of Parameters Reduced by ACCT
Training U-Net | CPS | 3503.49 | – | 62.08 | –
Training U-Net | Ours (under CGNet-assist) | 1780.28 | 49.19% | 31.31 | 49.56%
Training U-Net | Ours (under RoadNet-assist) | 1752.58 | 49.98% | 31.05 | 49.98%
Training D-LinkNet | CPS | 2238.23 | – | 473.28 | –
Training D-LinkNet | Ours (under CGNet-assist) | 1147.65 | 48.83% | 236.91 | 49.94%
Training D-LinkNet | Ours (under RoadNet-assist) | 1119.95 | 49.96% | 236.65 | 50.00%
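The parameter and FLOP counts of Table 12 can be approximated with standard profiling utilities; the sketch below counts parameters with plain PyTorch and estimates multiply-accumulate operations with the third-party thop package at the 3 × 512 × 512 input size. The choice of thop is an assumption; the paper does not state which profiler was used.

```python
import torch
from thop import profile  # third-party profiler; an assumed tool, not named in the paper

def count_cost(model, input_size=(1, 3, 512, 512)):
    # Parameters in millions, read directly from the model's tensors.
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    # Multiply-accumulate operations (in billions) for one forward pass at the
    # Table 12 input size; whether these are reported as MACs or FLOPs is a
    # matter of convention.
    dummy = torch.randn(*input_size)
    macs, _ = profile(model, inputs=(dummy,))
    return params_m, macs / 1e9
```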
Table 13. Time consumption per training round for different methods on the DeepGlobe Road Extraction Dataset.
Model trained | Method | Time Consumption (s)
Training U-Net | CPS | 171.04
Training U-Net | Ours (under CGNet-assist) | 123.44
Training U-Net | Ours (under RoadNet-assist) | 119.58
Training D-LinkNet | CPS | 209.02
Training D-LinkNet | Ours (under CGNet-assist) | 149.57
Training D-LinkNet | Ours (under RoadNet-assist) | 142.53