Article

Improving the Universal Performance of Land Cover Semantic Segmentation Through Training Data Refinement and Multi-Dataset Fusion via Redundant Models

National Satellite Operation & Application Center, Korea Aerospace Research Institute, Daejeon 34133, Republic of Korea
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2669; https://doi.org/10.3390/rs17152669
Submission received: 3 June 2025 / Revised: 28 July 2025 / Accepted: 30 July 2025 / Published: 1 August 2025

Abstract

Artificial intelligence (AI) has become a mainstream analysis tool in remote sensing. Various semantic segmentation models have been introduced to segment land cover from aerial or satellite images, and remarkable results have been achieved. However, they often lack universal performance on unseen images, making them difficult to provide as a service. One of the primary reasons for this lack of robustness is overfitting, resulting from errors and inconsistencies in the ground truth (GT). In this study, we propose a method to mitigate these inconsistencies by utilizing redundant models and verify the improvement using a public dataset based on Google Earth images. Redundant models share the same network architecture and hyperparameters but are trained with different combinations of training and validation data from the same dataset. Because of the variations in sample exposure during training, these models yield slightly different inference results. This variability allows for the estimation of pixel-level confidence levels for the GT. The confidence level is incorporated into the GT to influence the loss calculation during the training of the enhanced model. Furthermore, we implemented a consensus model that employs modified masks, where classes with low confidence are substituted by the dominant classes identified through a majority vote of the redundant models. To further improve robustness, we extended the same approach to fuse the dataset with another dataset that has a different class composition, based on imagery from the Korea Multipurpose Satellite 3A (KOMPSAT-3A). Performance evaluations were conducted on three network architectures: a simple network, U-Net, and DeepLabV3. In the single-dataset case, the performance of the enhanced and consensus models improved by an average of 2.49% and 2.59%, respectively, across the network architectures. In the multi-dataset scenario, the enhanced and consensus models showed average performance improvements of 3.37% and 3.02%, respectively, compared to an average increase of 1.55% without the proposed method.


1. Introduction

Currently, high-resolution aerial imagery is available from a variety of sources, including aircraft, drones, and satellites, allowing for more precise segmentation of land use. Satellite images are widely used because they can be captured regardless of regional restrictions. WorldView, Pleiades, QuickBird, Ikonos, and SkySat are the major satellite series providing high-resolution imagery. As of 2025, four high-resolution optical satellites with a ground sample distance (GSD) of less than 1 m are in operation in Korea. They are the New-space Earth Observation Satellite Constellation No.1 (GSD 1 m), KOMPSAT-3 (GSD 70 cm), KOMPSAT-3A (GSD 55 cm), and Compact Advanced Satellite No.1 (GSD 50 cm). As high-resolution satellite imagery becomes more widely available, a variety of land-use segmentation datasets have been produced, and related models based on these data have been proposed.
In the early stages of adopting deep-learning techniques in remote sensing, datasets were created that focused on the segmentation of major land-use classes, such as buildings, roads, cropland, water, and forests, based on images collected at specific locations by specific satellites or sensors. The International Society for Photogrammetry and Remote Sensing released the Potsdam–Vaihingen dataset, which provides aerial imagery with a GSD of 5–9 cm, land-cover labels for six classes (impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background), and a digital surface model; it is still used to study new network architecture applications [1,2]. The DeepGlobe dataset comprises seven land-cover classes (urban, agricultural, rangeland, forest, water, wasteland, and unclassified), utilizing imagery captured by DigitalGlobe’s satellites at a GSD of 50 cm [3]. The Inria Aerial Image Labeling Dataset provides building masks with aerial imagery at a GSD of 30 cm spanning approximately 810 km2 of cities in the United States and Europe [4].
Recent datasets focus not only on land-cover masks but also on specific, practical topics. The BONAI dataset is a building segmentation dataset that provides not only roof masks but also footprint masks and offset vectors, clarifying building structural information in off-nadir image acquisitions [5]. The LEVIR Multilevel Change Interpretation dataset contains descriptive caption data for changes to roads and buildings, enabling the use of large-scale language models to describe these changes accurately [6]. The Semantic Change Detection Dataset addresses the limitation of binary change detection, which only distinguishes changes in category, by providing an additional non-change mask for multitemporal images at the same location [7]. The French National Institute of Geography and Forestry (IGN) has developed the “French Land Cover with Aerospace ImageRy” (FLAIR) dataset, which includes data acquired by multiple sensors at a large scale with temporal shifts, as well as height information above the ground [8]. The Land-Cover Dataset for Domain Adaptive Semantic Segmentation (LoveDA) emphasizes the balance between data acquisition in urban and rural areas [9]. Recent datasets tend to improve label accuracy by supplying complementary information rather than restricting the image source, location, or collection time. This indicates that the practical aspect of analytics is becoming increasingly important and that more accurate data are required to build robust models.
As more advanced semantic segmentation models and backbone networks emerge in the field of computer vision, many researchers are actively applying popular architectures to remote sensing images with notable success [10]. Some key examples of effective segmentation architectures include the Fully Convolutional Network (FCN) [11], U-Net [12], DeepLab [13], and HRNet [14]. Additionally, there have been substantial advancements in backbone networks that can be paired with segmentation networks, such as ResNet [15], ResNeSt [16], EfficientNet [17], Swin Transformer [18], and ConvNeXt [19]. Despite its simple structure, U-Net [12] remains a popular choice for aerial image segmentation tasks [20,21,22]. It is not easy to conclude that more complex models yield more reliable results in remote sensing applications. This may stem from limitations in ground resolution and the challenges associated with creating perfect ground truth data. Additionally, more complex networks often risk overfitting the errors present in the dataset [23]. Even in the absence of human error, it is common for different class areas to appear similar, making them difficult to distinguish through human observation. Therefore, we need a method to address these human and intrinsic errors [24] in the GT.
There is another trend that aims to enhance the robustness of models without requiring the creation of new datasets. The first approach is the multi-model ensemble. Fusing the inference results of multiple models can produce more robust results than a single, more complex network, and the performance advantage of multi-model ensembles has been well verified on various datasets [25]. The second approach is to improve performance by supplementing the GT information. Bressan et al. demonstrated that performance is enhanced by computing the training loss with pixel weights assigned according to their relative importance, based on class distribution and geometric uncertainty [26]. Tong et al. demonstrated that virtual labels generated by a model trained on the original dataset of Gaofen-2 images can enhance robustness across images from various sensors, including Gaofen-1, Jilin-1, Ziyuan-3, Sentinel-2A, and Google Earth images [27]. This can also be viewed as a performance improvement due to the supplementation of GT information.
Ensemble methods offer a promising approach for enhancing model robustness. A classic example of this is the Random Forest algorithm, which utilizes multiple decision trees to produce more reliable outcomes. This method has the advantage of reasonable run-time inference since the computational complexity of individual decision trees is relatively low. However, recent classifiers based on deep neural networks exhibit enormous computational complexity, making it challenging to implement an ensemble consisting of a large number of models during run-time.
To address this issue, we propose a novel approach that consolidates the collective intelligence of ensemble methods into a single dataset. This allows us to train one model that incorporates the strengths of the ensemble and makes their impact clearly visible through pixel-level confidence levels. The approach uses redundant models trained on various combinations of training and validation sets to enhance the original datasets. The confidence level estimated by the redundant models for each pixel is incorporated into the original GT mask and is reflected during the loss calculation of the enhanced model. The supplemented mask highlights regions that are hard to distinguish because of inherent ambiguities caused by human error or limited resolution. We introduce one more revised mask, the modified mask, which is generated by replacing low-confidence GT classes with the dominant class voted by the majority of the redundant models. Additionally, we apply the method to fuse heterogeneous datasets from different image sources, which have slightly different class compositions.

2. Materials

This study uses two datasets. The first is the LoveDA dataset [9], which is the primary dataset used to test the single-dataset improvement and also serves as the host dataset for dataset fusion. The second is a custom dataset based on KOMPSAT-3A images, which is used as the guest dataset for dataset fusion. It is referred to in the text as CITY-20 because it was produced in 2020 and most of its scenes cover city areas.
The LoveDA dataset comprises 5987 pairs of images and corresponding masks. The images were collected from Google Earth in urban and rural China, and the masks are divided into seven classes: background, buildings, roads, barren, agriculture, forest, and water. Google Earth images come from a variety of sources, making it difficult to specify the sensors used to capture the images. However, the images in this dataset have a spatial resolution adjusted to 30 cm per pixel and consist of three bands: Red, Green, and Blue, with 8 bits per pixel of quantized data. While the images are georeferenced, allowing us to pinpoint their location, we cannot determine the exact time when each picture was taken. Figure 1 shows some samples of the dataset.
We adjusted the spatial resolution to 1 m per pixel to harmonize the various image sources from Google Earth and maintain consistency with the CITY-20 dataset. Furthermore, we applied sample augmentation techniques to enhance the diversity of the data (Figure 2). Although sample augmentation can enhance performance [28], many geometric transformations are not suitable for the small patches in the LoveDA dataset. Therefore, we implemented a straightforward approach by rotating the patches by 90 degrees three times. Additionally, we converted each input patch’s color model from RGB to YCbCr, added Gaussian noise with a standard deviation of 5.0 to the Cb and Cr components, and then converted it back to RGB. This process adjusts the color without affecting the brightness. As a result, we obtained a total of 23,948 patches.
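For illustration, the sketch below shows how such an augmentation step could be implemented with NumPy and OpenCV. The function name augment_patch, the handling of the mask, and the decision to emit the three rotations and one chroma-jittered copy as separate samples are assumptions for this example rather than the authors' exact pipeline.

```python
import cv2
import numpy as np

def augment_patch(image_rgb, mask, noise_std=5.0):
    """Return augmented (image, mask) pairs: three 90-degree rotations plus
    one chroma-jittered copy (Gaussian noise on the chroma channels only,
    leaving brightness untouched)."""
    samples = []

    # Three successive 90-degree rotations, applied to image and mask alike.
    for k in (1, 2, 3):
        samples.append((np.rot90(image_rgb, k).copy(), np.rot90(mask, k).copy()))

    # RGB -> YCbCr, perturb only the chroma channels, convert back to RGB.
    # (OpenCV stores the channels in Y, Cr, Cb order.)
    ycrcb = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2YCrCb).astype(np.float32)
    ycrcb[..., 1:] += np.random.normal(0.0, noise_std, size=ycrcb[..., 1:].shape)
    jittered = cv2.cvtColor(np.clip(ycrcb, 0, 255).astype(np.uint8),
                            cv2.COLOR_YCrCb2RGB)
    samples.append((jittered, mask.copy()))

    return samples
```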
The CITY-20 dataset consists of images captured by the KOMPSAT-3A satellite, which provides panchromatic data with a resolution of 0.55 m and multispectral data with a resolution of 2.2 m, comprising Red, Green, Blue, and Near-Infrared (NIR) bands. For this study, we utilize the pan-sharpened products and exclude the NIR band. The spatial resolution is adjusted to 1 m per pixel to ensure compatibility with the host dataset. The original imagery contains 16 bits of quantized data, which we compress to 8 bits using a 1% percentile stretch. The dataset comprises patches from fifteen scenes collected around major cities in South Korea, with imagery sourced from 2015 to 2019. It contains 145,287 image–mask pairs, including augmented samples. The class configuration is similar to that of the LoveDA dataset, except for the absence of the barren class (Figure 3). Despite their class similarity, the two datasets differ in annotation style and quality. In the CITY-20 dataset, adjacent small buildings are annotated individually, whereas in the LoveDA dataset they are merged into a single large mask. The CITY-20 dataset has better building and road mask quality, but a large part of its forest mask is missing.
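The bit-depth reduction can be sketched as follows, interpreting the 1% percentile stretch as clipping each band at its 1st and 99th percentiles before rescaling to the 8-bit range; the function name and the exact clipping bounds are assumptions.

```python
import numpy as np

def stretch_to_8bit(band, lower_pct=1.0, upper_pct=99.0):
    """Compress a 16-bit band to 8 bits by clipping at the given percentiles
    and linearly rescaling the remaining range to 0-255."""
    lo, hi = np.percentile(band, [lower_pct, upper_pct])
    scaled = (band.astype(np.float32) - lo) / max(hi - lo, 1e-6)
    return (np.clip(scaled, 0.0, 1.0) * 255.0 + 0.5).astype(np.uint8)
```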

3. Methods

3.1. Overview of the Proposed Method

Figure 4 illustrates the overall process of the proposed method. In a typical training procedure, a single model is generated from the original dataset. By varying the training and validation combinations, we can produce multiple redundant models based on the same dataset. Although these models share similar properties and performance, they exhibit differences due to variations in data exposure during training. These discrepancies reflect the confidence level of each pixel in the GT, and this information is incorporated to create supplemented masks. Furthermore, modified masks are created by substituting the GT class of confusing pixels with the dominant class determined through the voting of redundant models. The process for creating the datasets is explained in detail in Section 3.3.
As a result, we derived three models for comparison: the raw model from the original dataset, the enhanced model trained using the supplemented dataset, and the consensus model trained on the modified dataset.
There are numerous projects and datasets available for land-cover segmentation of aerial images. However, various constraints—such as project duration, image capture location, timing, and environmental conditions—can make it challenging to achieve consistent performance from the models trained on them. Heterogeneous dataset fusion is a vital strategy for enhancing performance in real-world applications. Nonetheless, inconsistencies often arise between these distinct datasets concerning class composition, imaging sensors, annotation styles, and quality. Similar to the case of a single dataset, redundant models can help reduce discrepancies in heterogeneous datasets. Figure 5 illustrates a dataset fusion configuration based on redundant models utilizing two distinct datasets.
The synthetic dataset (SD) is created by merging two distinct datasets. To generate the SD, we oversampled the host dataset to ensure the number of training samples matched that of the guest dataset. The redundant models trained on the host dataset assess confidence levels for both datasets to produce the supplemented SD. This process aligns the two datasets from the perspective of the host dataset. The modified SD is then created from the supplemented SD using the same method that is applied to a single dataset. As a result, we now have three additional models: the raw fusion model (FM), the enhanced FM, and the consensus FM.
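A minimal sketch of how the synthetic dataset could be assembled is shown below, assuming the host samples are oversampled with replacement until their count matches that of the guest dataset. The function name build_synthetic_dataset and the random-sampling strategy are assumptions, since the paper does not specify how the oversampling is performed.

```python
import random

def build_synthetic_dataset(host_samples, guest_samples, seed=0):
    """Merge two datasets, oversampling the host (with replacement)
    so it contributes as many training pairs as the guest."""
    rng = random.Random(seed)
    deficit = len(guest_samples) - len(host_samples)
    oversampled_host = list(host_samples)
    if deficit > 0:
        oversampled_host += rng.choices(host_samples, k=deficit)
    merged = oversampled_host + list(guest_samples)
    rng.shuffle(merged)
    return merged
```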

3.2. Segmentation Networks

A simplified network architecture is preferred for this study because it helps reduce overfitting and lowers the computational complexity of iterative training. We adopted a convolutional neural network (CNN) architecture with a straightforward structure [29] that extracts features through ResNeSt-50 [16] and recovers resolution via deconvolution layers. Each output layer corresponds to the score for a specific class, and the estimated class index is determined by identifying the layer index that yields the highest score (Figure 6). We used this architecture to train the redundant models and to build the supplemented and modified datasets.
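The sketch below outlines a network of this kind in PyTorch: a backbone truncated before its classification head, followed by three stride-2 deconvolution layers and a 1 × 1 classifier, loosely in the style of [29]. A torchvision ResNet-50 stands in for the ResNeSt-50 backbone used in the paper, and the class count, channel widths, and final upsampling step are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SimpleSegNet(nn.Module):
    """Backbone feature extractor followed by deconvolution layers that
    recover spatial resolution; one output channel per land-cover class."""

    def __init__(self, num_classes=7, deconv_channels=256):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to the last residual stage (output stride 32).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])

        # Three stride-2 deconvolutions: 1/32 -> 1/4 of the input resolution.
        layers, in_ch = [], 2048
        for _ in range(3):
            layers += [
                nn.ConvTranspose2d(in_ch, deconv_channels, kernel_size=4,
                                   stride=2, padding=1, bias=False),
                nn.BatchNorm2d(deconv_channels),
                nn.ReLU(inplace=True),
            ]
            in_ch = deconv_channels
        self.decoder = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(deconv_channels, num_classes, kernel_size=1)

    def forward(self, x):
        scores = self.classifier(self.decoder(self.encoder(x)))
        # Upsample the class scores to the input size; argmax over channels
        # then yields the estimated class index per pixel.
        return nn.functional.interpolate(scores, size=x.shape[-2:],
                                         mode="bilinear", align_corners=False)
```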
We extended our experiment by incorporating two additional semantic segmentation models commonly used in remote sensing: U-Net [12] and DeepLabV3 [13]. Both models employ ResNet-50 [15] as their backbone, and as shown in Table 1, the number of trainable parameters is comparable across the three models. We trained models with each of the three network structures on the datasets and compared their performance to verify the effect of the proposed method across different architectures.

3.3. Creation of the Supplemented Mask and the Modified Mask Using Redundant Models

Redundant models are trained using various combinations of training and validation sets from the original dataset. A total of twenty redundant models were trained, each with 70% of the data used for training and 30% for validation. The estimated class index, $\bar{C}$, of a redundant model is determined as

$$\bar{C}_m = \arg\max_i \, s_m(i) \quad (1)$$

where $s_m(i)$ is the score for class $i$ of the $m$th model.
For any given sample, 6 of the 20 redundant models use it for validation. The inference results from these six models are similar; however, differences in their estimated classes can be used to evaluate the probability (P) of making a correct decision at each pixel location, as shown in Equation (2):

$$P = \mathrm{E}\{\delta[\bar{C} - C_{GT}]\} = \frac{1}{6}\sum_{n=1}^{6}\delta[\bar{C}_n - C_{GT}] \quad (2)$$

where $\delta[\cdot]$ denotes the Kronecker delta, equal to 1 when the two class indices match and zero otherwise.
Conversely, 1 − P indicates the level of confusion and is quantized to 5 bits, resulting in 32 levels. These levels are then incorporated into the upper 5 bits of the GT mask to create the supplemented mask. The colormap is expanded accordingly to illustrate the levels of confusion (Figure 7): the whiter the color, the more confusing the pixel.
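The construction of the supplemented mask can be sketched as follows. It assumes that the class index occupies the lower 3 bits of the 8-bit mask (seven classes) and that redundant_preds holds the class maps predicted by the redundant models that held the sample out for validation; function and variable names are illustrative.

```python
import numpy as np

def build_supplemented_mask(gt_mask, redundant_preds):
    """Pack a per-pixel confusion level (upper 5 bits) together with the
    original class index (lower 3 bits, assumed) into one 8-bit mask.
    redundant_preds: list of class maps, one per validating model (H x W)."""
    preds = np.stack(redundant_preds).astype(np.uint8)

    # P: fraction of the validating redundant models that agree with the GT.
    p_correct = (preds == gt_mask[None]).mean(axis=0)

    # Confusion = 1 - P, quantized to 32 levels (5 bits).
    confusion = np.clip(((1.0 - p_correct) * 31).round(), 0, 31).astype(np.uint8)

    # Class index is assumed to fit in the lower 3 bits (seven classes).
    return (confusion << 3) | (gt_mask.astype(np.uint8) & 0x07)
```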
Figure 8 presents four examples of supplemented masks. In Figure 8a, the building mask in the center does not correspond to any actual building and is therefore marked as confusing in the supplemented mask. In Figure 8b, the absence of masks for the croplands in the center of the image is likewise flagged as confusing. In Figure 8c, a pond on the left side of the image has no corresponding mask and is similarly marked as confusing. Finally, in Figure 8d, artificial edges and missing annotations for forest areas are indicated as confusing. A common trend across the four samples is that areas along the boundaries between different classes are identified as confusing pixels. This is understandable, as it is challenging to define clear boundaries due to human error and limited image resolution.
The precision of the confidence level is determined by multiplying the number of redundant models (M) by the validation data rate for each model. In this study, we set M to 20 and the validation data rate to 0.3, resulting in 6 confidence levels. Figure 9 illustrates the effect of varying M. As M increases, the confidence levels become smoother; however, the training time for the redundant models also increases. Therefore, it is a trade-off parameter, and we concluded that setting M to 20 is appropriate for balancing training time and confidence level precision.
Typically, the parameters (θ) of a CNN model are optimized to minimize the loss (L) together with a regularization term (R) over the training set T, as shown in Equation (3):

$$\min_{\theta \in \Theta} \sum_{(I, M) \in T} L(\hat{M}, M) + \lambda R(\theta) \quad (3)$$

where $I$ is the input image, $M$ is the GT score, and $\hat{M}$ is the estimated score.
In the enhanced model, we weight the loss by the probability of making the correct decision (P), decoded from the confusion levels stored in the class mask (Equation (4)). This ensures that pixels with higher confidence drive stronger parameter updates, whereas more ambiguous pixels have less influence:

$$L(\hat{M}, M) = \sum_{x} P(x) \cdot L(\hat{M}_x, M_x) \quad (4)$$

where $x$ is the pixel location.
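A minimal PyTorch sketch of this confidence-weighted loss is given below, assuming a per-pixel cross-entropy term and a confidence map P(x) already decoded from the supplemented mask. The normalization by the total confidence is an assumption made to keep the loss scale comparable across batches.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, target, confidence):
    """Per-pixel cross-entropy scaled by the GT confidence P(x), so that
    ambiguous pixels contribute less to the parameter update.
    logits: (B, C, H, W); target: (B, H, W) class indices;
    confidence: (B, H, W) values in [0, 1] decoded from the mask."""
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    weighted = per_pixel * confidence
    # Normalize by the total confidence so the loss scale stays comparable.
    return weighted.sum() / confidence.sum().clamp_min(1e-6)
```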
If the probability of making the correct decision (P) is zero, the parameters of the enhanced model are not updated at that pixel. In these situations, the redundant models predict a different class, which we refer to as the dominant class, instead of the GT class. If the probability of the dominant class surpasses that of the GT class, it suggests that the GT may be incorrect. In this case, we replace the GT class with the dominant class, and P is adjusted to the difference in probability between the dominant class and the original GT class. This results in the modified dataset, with which the consensus model is trained. If the dominant class’s probability is only slightly higher than that of the GT, the GT is replaced but with a very low confidence level, so the replacement has a minimal effect on the parameter updates while training the consensus model. For the experiment, the enhanced and consensus models used the same training and validation combination as the raw model, i.e., the first redundant model.
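The replacement rule for the modified mask can be sketched as follows, reusing the 5-bit confusion encoding from the previous listing. The vote counting and the margin-based confidence are written out explicitly, and the names are illustrative rather than the authors' implementation.

```python
import numpy as np

def build_modified_mask(gt_mask, redundant_preds, num_classes=7):
    """Replace low-confidence GT classes with the dominant class voted by
    the redundant models; the new confidence is the probability margin
    between the dominant class and the original GT class."""
    preds = np.stack(redundant_preds)                      # (N, H, W)
    n_models = preds.shape[0]

    # Per-pixel vote counts for every class.
    votes = np.stack([(preds == c).sum(axis=0) for c in range(num_classes)])
    p_gt = np.take_along_axis(votes, gt_mask[None].astype(np.intp), axis=0)[0] / n_models
    dominant = votes.argmax(axis=0).astype(np.uint8)
    p_dom = votes.max(axis=0) / n_models

    # Swap in the dominant class only where it is more probable than the GT.
    replace = p_dom > p_gt
    new_class = np.where(replace, dominant, gt_mask).astype(np.uint8)
    new_conf = np.where(replace, p_dom - p_gt, p_gt)

    confusion = np.clip(((1.0 - new_conf) * 31).round(), 0, 31).astype(np.uint8)
    return (confusion << 3) | (new_class & 0x07)
```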
Figure 10 shows a comparison of three types of masks. The supplemented mask identifies the boundaries between different classes as confusing areas, while the modified mask replaces those areas with the background, resulting in greater precision.
In the multi-dataset scenario, the main distinction is that the redundant models trained on the host dataset are also used to assess the confusion level of the guest dataset. An intrinsic difference between the two datasets is that the guest dataset lacks the barren class, and we anticipate that the intervention of the redundant models will help address this anomaly. Figure 11 illustrates examples from the guest dataset, comparing the three types of masks with the corresponding images. The first and second samples show that barren areas designated as background in the original mask are identified as confusing in the supplemented mask and are substituted with the barren class in the modified mask. The third and fourth samples indicate that missing forest areas are marked as confusing in the supplemented mask and are replaced by the forest class in the modified mask. However, the modified mask also distorted some building boundaries and eliminated certain roads. This suggests that the host dataset has lower-quality annotations for these classes, which may lead to degraded performance in the consensus model.

3.4. Assessment of Universal Performance

Estimating universal performance based solely on a single dataset can be challenging. To ensure independence between training and validation, we divided the main dataset into two parts: Part A and Part B (Figure 12). Since the samples are separated prior to augmentation, their image locations do not overlap completely. To evaluate performance considering the GT quality, we used the supplemented mask rather than the original mask. The performance metric used in this study is the mean intersection over union (mIOU), which is derived from the confusion matrix. The correspondence between the GT class and the estimated class is accumulated within the confusion matrix using the confidence level (P) of the GT. The final performance for each model type is calculated as the average of the scores from both parts.
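A sketch of this confidence-weighted evaluation is shown below: each pixel contributes its confidence P to the confusion-matrix cell indexed by its GT and predicted classes, and the mIOU is derived from that matrix. The function name and the handling of classes with an empty union are assumptions.

```python
import numpy as np

def weighted_miou(pred, gt_class, confidence, num_classes=7):
    """Accumulate a confusion matrix weighted by the GT confidence P and
    derive the mean intersection over union (mIOU).
    pred, gt_class: integer class maps; confidence: values in [0, 1]."""
    cm = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(cm, (gt_class.ravel(), pred.ravel()), confidence.ravel())

    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1e-9)
    return iou.mean()
```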

4. Results

Table 2 presents a comparison of the estimated universal performance of the six models across the three network structures. The enhanced and consensus models outperformed the raw models regardless of network structure and dataset configuration. In the single-dataset case, the performance of the enhanced and consensus models improved by an average of 2.49% and 2.59%, respectively, across the network structures. In the multi-dataset scenario, the enhanced FMs and consensus FMs improved by an average of 3.37% and 3.02%, respectively, while the raw FMs showed an average improvement of only 1.55%.
In the single-dataset scenario, the consensus model outperformed others for the simple network, showing an improvement of 3.46%. Meanwhile, the enhanced models were the most effective for both U-Net and DeepLabV3, achieving improvements of 2.21% and 2.43%, respectively.
In the multi-dataset scenario, we first compared the performance of single-dataset raw models with raw FMs. Then, we compared raw FMs with enhanced and consensus FMs. The raw FMs of the simple network and U-Net outperformed single-dataset raw models by 3.54% and 1.61%, respectively. However, the performance of the raw FM of DeepLabV3 decreased by 0.51% compared to the single-dataset raw model. The enhanced FMs were the most effective for both the simple network and U-Net, showing improvements of 2.19% and 1.97%, compared to the performance of the raw FMs. In contrast, the consensus FM showed the best performance for DeepLabV3, achieving an improvement of 2.76% over the performance of the raw FM.
Table 3 compares the performance across the individual classes. Across all classes and configurations, the enhanced models improved over the raw models. The consensus models also improved in almost all cases, with two notable exceptions: the barren class in the multi-dataset case for both the simple network and U-Net.
Figure 13 compares the class-wise performance of enhanced and consensus FMs against raw FMs. For both the simple network and U-Net, the enhanced FMs demonstrate more reliable performance compared to the consensus FMs. However, in the case of DeepLabV3, the consensus FM outperforms the enhanced FM.
In Figure 14, we provide examples of inference results along with their corresponding images and labels to examine the qualitative trends across the models. In the single-dataset case, the class boundaries in the raw model results tend to fluctuate, and the fluctuation decreases in both the enhanced and consensus model results. The multi-dataset model results appear to have a higher quality than the corresponding single-dataset results. They not only reduce boundary fluctuations but also provide a more accurate representation of land cover shapes.

5. Discussion

Both the enhanced models and the consensus models outperformed the raw models, irrespective of the network structures and dataset configurations. We can attribute this improvement to the supplementation or modification of GT masks, as the redundant models help compensate for inconsistencies both within and across datasets. The proposed method offers not only improved performance but also computational efficiency. Traditional ensemble methods require multiple models to be used during runtime, which causes execution time to rise in direct proportion to the number of models involved. In contrast, the proposed method enhances performance without extending runtime, as it integrates redundancy within the generated training data, allowing for the training of a single, improved model (see Figure 15).
We trained the redundant models using the simple network structure as a baseline to verify the effectiveness of the method. Using heterogeneous network structures in practical applications could increase the variability of the redundant models, leading to more reliable collective intelligence. However, how to select such network structures requires further research, because large performance discrepancies among them could pull the collective decision down.
In the multi-dataset case, we expected all models to improve thanks to the additional guest dataset. However, for DeepLabV3, the raw FM performed 0.51% worse than the single-dataset raw model. The discrepancies between the two datasets appear to be the cause, and the major difference between DeepLabV3 and the other two network structures, the lack of a decoder structure, could be the main factor. The consensus FM outperformed the raw FM by 2.76%, but it still fell short of the best single-dataset model, the enhanced model, by 0.18%.
In the scenario involving multiple datasets, we distinguished between host and guest datasets; therefore, the guest dataset was expected to align with the host dataset. However, it is possible to generate redundant models for each dataset, which can aid in estimating the confidence levels of another dataset. The differences in class configurations between the two datasets used in this study were minor; thus, more complex relationships, such as inclusion relationships, should be taken into account. With these points in mind, our future research will focus on developing a more configurable and interactive multi-dataset fusion approach based on redundant models (see Figure 16).
Finally, this method has the potential to be extended to various regression tasks, including denoising, pan-sharpening, super-resolution, and depth estimation. In regression, the model estimates continuous values instead of discrete class indices. Consequently, pixel-wise confidence levels may be based on the statistics of the estimated values themselves, which can be derived from redundant models, rather than on the probability of accurate predictions. Similar to semantic segmentation, we could adjust the strength of parameter updates based on the uncertainty of the estimated values or correct the target values when the majority of models consistently predict distant values.

6. Conclusions

We proposed a method to achieve consistent performance in land-cover semantic segmentation using redundant models for scenarios involving both single and multiple datasets. Redundant models utilize the same network architecture and hyperparameters, but they are trained with different combinations of training and validation data from the same dataset. Using these redundant models, two revised datasets were created. The supplemented dataset retains the same GT class with the confidence levels estimated by the redundant models. Conversely, the modified dataset includes some replacements of the GT class where the confidence was deemed low, based on the dominant class determined by the majority vote of the redundant models. These two datasets were then used to train both the enhanced model and the consensus model.
Both models outperformed the raw models, irrespective of the network structures and dataset configurations, implying that redundant models can reduce inconsistencies both within and across datasets. Three network structures (the simple network, U-Net, and DeepLabV3) were examined, and improvements were consistently observed. The method applies to any pixel-level classifier, as it is independent of the network architecture. Its strength lies in the intuitive assessment achieved by examining the altered sections of the original GT masks. In addition to improving performance, it requires no additional execution time at runtime, because the redundancy is integrated into the revised training data, allowing a single, improved model to be trained.
Finally, we examined the multi-dataset scenario, noted its current limitations, and outlined future research directions to address them through a more configurable and interactive multi-dataset fusion approach based on redundant models.

Author Contributions

Conceptualization, J.Y.C. and K.-J.L.; methodology, J.Y.C. and K.-Y.O.; software, J.Y.C.; validation, J.Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The LoveDA dataset is available online at https://github.com/Junjue-Wang/LoveDA (accessed on 19 May 2025). The CITY-20 dataset is not publicly accessible because of copyright and security concerns.

Acknowledgments

The authors thank the anonymous reviewers for their thorough review and valuable suggestions on our manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI Artificial intelligence
GT ground truth
GSD ground sample distance
IGN French National Institute of Geography and Forestry
FLAIR French Land Cover with Aerospace ImageRy
LoveDA Land-Cover Dataset for Domain Adaptive Semantic Segmentation
FCN Fully Convolutional Network
SD synthetic dataset
FM fusion model
CNN convolutional neural network

References

  1. Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
  2. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16794–16805. [Google Scholar]
  3. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 17200–17209. [Google Scholar]
  4. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The Inria Aerial Image Labeling Benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, 23–28 July 2017; pp. 3226–3229. [Google Scholar]
  5. Wang, J.; Meng, L.; Li, W.; Yang, W.; Yu, L.; Xia, G.-S. Learning to extract building footprints from off-nadir aerial images. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1294–1301. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, C.; Chen, K.; Zhang, H.; Qi, Z.; Zou, Z.; Shi, Z. Change-agent: Towards interactive comprehensive remote sensing change interpretation and analysis. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  7. Yang, K.; Xia, G.-S.; Liu, Z.; Du, B.; Yang, W.; Pelillo, M.; Zhang, L. Asymmetric Siamese networks for semantic change detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18. [Google Scholar] [CrossRef]
  8. Garioud, A.; Gonthier, N.; Landrieu, L.; Wit, A.D.; Valette, M.; Poupée, M.; Giordano, S.; Wattrelos, B. FLAIR: A country-scale land cover semantic segmentation dataset from multi-source optical imagery. In Proceedings of the Advances in Neural Information Processing Systems 2023, New Orleans, LA, USA, 2 November 2023. [Google Scholar] [CrossRef]
  9. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Online, 6–14 December 2021; Volume 1. [Google Scholar]
  10. Yu, A.; Quan, Y.; Yu, R.; Guo, W.; Wang, X.; Hong, D.; Zhang, H.; Chen, J.; Hu, Q.; He, P. Deep learning methods for semantic segmentation in remote sensing with small data: A survey. Remote Sens. 2023, 15, 4987. [Google Scholar] [CrossRef]
  11. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Lecture Notes in Computer Science, Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. ISBN 978-3-319-24573-7. [Google Scholar]
  13. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder—Decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  14. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5693–5703. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  16. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R. Resnest: Split-Attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 2736–2746. [Google Scholar]
  17. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  19. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A Convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 11976–11986. [Google Scholar]
  20. Feng, W.; Sui, H.; Huang, W.; Xu, C.; An, K. Water body extraction from very high-resolution remote sensing imagery using deep U-Net and a superpixel-based conditional random field model. IEEE Geosci. Remote Sens. Lett. 2018, 16, 618–622. [Google Scholar] [CrossRef]
  21. Alsabhan, W.; Alotaiby, T.; Dudin, B. Detecting buildings and nonbuildings from satellite images using U-Net. Comput. Intell. Neurosci. 2022, 2022, 4831223. [Google Scholar] [CrossRef] [PubMed]
  22. Pan, Z.; Xu, J.; Guo, Y.; Hu, Y.; Wang, G. Deep learning segmentation and classification for urban village using a worldview satellite image based on U-Net. Remote Sens. 2020, 12, 1574. [Google Scholar] [CrossRef]
  23. Zhong, X.; Liu, C. Toward mitigating architecture overfitting on distilled datasets. IEEE Trans. Neural Netw. Learn. Syst. 2025, 1–13. [Google Scholar] [CrossRef] [PubMed]
  24. Foody, G.M. Status of land cover classification accuracy assessment. Remote Sens. Environ. 2002, 80, 185–201. [Google Scholar] [CrossRef]
  25. Dimitrovski, I.; Spasev, V.; Loshkovska, S.; Kitanovski, I. U-Net ensemble for enhanced semantic segmentation in remote sensing imagery. Remote Sens. 2024, 16, 2077. [Google Scholar] [CrossRef]
  26. Bressan, P.O.; Junior, J.M.; Martins, J.A.C.; Gonçalves, D.N.; Freitas, D.M.; Osco, L.P.; Silva, J.d.A.; Luo, Z.; Li, J.; Garcia, R.C.; et al. Semantic segmentation with labeling uncertainty and class imbalance. Int. J. Appl. Earth Obs. Geoinf. 2022, 108, 102690. [Google Scholar] [CrossRef]
  27. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 2020, 237, 111322. [Google Scholar] [CrossRef]
  28. Alomar, K.; Aysel, H.I.; Cai, X. Data augmentation in classification and segmentation: A survey and new strategies. J. Imaging 2023, 9, 46. [Google Scholar] [CrossRef] [PubMed]
  29. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
Figure 1. Four samples from the LoveDA dataset. On the left are the images, and on the right are the class masks. Each mask is represented using 8-bit raster data along with the color map.
Figure 2. Sample augmentation conducted on the LoveDA dataset. The first row shows the augmented images, and the second row shows the corresponding class masks.
Figure 3. Four samples from the CITY-20 dataset. On the left are the images, and on the right are the corresponding class masks.
Figure 4. Three types of models trained from the original dataset: the raw model (e.g., Model-1), the enhanced model, and the consensus model.
Figure 5. A dataset fusion configuration of two distinct datasets using redundant models.
Figure 6. The simple network consisting of the base network to extract features and the deconvolution (Deconv.) layers to recover the resolution.
Figure 7. The colormap adjusted to intuitively visualize confusion levels.
Figure 8. Four examples (a–d) that compare the image, original mask, and supplemented mask. The red circles indicate the locations of discrepancies in comparison to the original masks.
Figure 9. The effect of varying the number of redundant models on the confusion levels: (a) the image, (b) the original GT mask, and (c–f) the supplemented masks when M is 8, 12, 16, and 20.
Figure 10. Comparison of three types of masks: (a) the image, (b) the original mask, (c) the supplemented mask, and (d) the modified mask.
Figure 11. Comparison of the image and three types of masks in the guest dataset, CITY-20: (a) the image, (b) the original mask, (c) the supplemented mask, and (d) the modified mask.
Figure 12. Performance evaluation by splitting the host dataset and merging the cross-validation results on the supplemented masks.
Figure 13. Comparison of class-wise performance between enhanced and consensus fusion models (FMs) over raw FMs: (a) simple network, (b) U-Net, (c) DeepLabV3. The class labels are abbreviated as follows: background (BG), building (BD), roads (RD), barren (BR), agricultural (AG), forests (FG), and waterbodies (WB).
Figure 14. The inference results for six models, each based on different network structures, are presented along with their corresponding images and ground truth (GT) masks. The layout is as follows: (a) the image is shown at the top and the GT mask at the bottom; (b) results from the Simple Network; (c) results from the U-Net; and (d) results from DeepLabV3. The inference results for each network are organized in a 2 × 3 grid. The first row contains the results from single-dataset models, while the second row displays the results from multi-dataset models. In the column direction, the results are listed in the order of the raw model, enhanced model, and consensus model.
Figure 15. Comparison of the training and inference phases: (a) conventional ensemble methods and (b) the proposed method.
Figure 16. Creating a unified model by integrating heterogeneous datasets through interactive fusion.
Table 1. Comparison of the number of trainable parameters by network structure.

Network Structure                                  Simple Network   U-Net        DeepLabV3
Number of trainable parameters                     38,021,511       41,694,574   39,759,047
Percentage difference w.r.t. the simple network    -                +9.7%        +4.6%
Table 2. Comparison of the estimated performance of six models across three network structures. The highest scores for each network and dataset configuration are emphasized in bold.

Network Structure   Dataset          Model Type   mIOU (%)                        Improvement (%)
                                                  Part A    Part B    Average
Simple network      Single-Dataset   Raw          70.48     69.84     70.16       -
                                     Enhanced     73.69     72.29     72.99       +2.83
                                     Consensus    74.37     72.88     73.62       +3.46
                    Multi-Dataset    Raw          74.30     73.10     73.70       +3.54
                                     Enhanced     76.51     75.27     75.89       +5.73
                                     Consensus    74.89     74.15     74.52       +4.36
U-Net               Single-Dataset   Raw          72.14     71.28     71.71       -
                                     Enhanced     74.45     73.38     73.92       +2.21
                                     Consensus    74.13     73.40     73.76       +2.05
                    Multi-Dataset    Raw          73.96     72.68     73.32       +1.61
                                     Enhanced     75.97     74.62     75.29       +3.58
                                     Consensus    74.70     73.61     74.16       +2.45
DeepLabV3           Single-Dataset   Raw          72.09     70.86     71.48       -
                                     Enhanced     74.51     73.30     73.91       +2.43
                                     Consensus    73.79     73.70     73.75       +2.27
                    Multi-Dataset    Raw          71.05     70.88     70.96       −0.51
                                     Enhanced     72.89     71.64     72.26       +0.79
                                     Consensus    74.31     73.15     73.73       +2.25
Table 3. Comparison of performance across classes from both the single-dataset and multi-dataset configurations. The highest scores in each category are highlighted in bold.

Network Structure   Dataset          Model Type   IOU of Each Class (%)
                                                  Background   Building   Road    Barren   Agricultural   Forest   WaterBody
Simple network      Single-Dataset   Raw          67.08        62.36      73.75   51.85    79.27          73.64    83.18
                                     Enhanced     70.86        68.05      77.00   52.97    80.71          76.11    85.24
                                     Consensus    71.39        69.43      76.99   54.57    80.91          76.58    85.49
                    Multi-Dataset    Raw          70.79        71.84      77.66   54.35    80.63          75.80    84.83
                                     Enhanced     72.95        74.07      81.17   57.56    81.65          77.52    86.30
                                     Consensus    72.33        71.69      79.26   53.85    81.17          77.54    85.79
U-Net               Single-Dataset   Raw          68.55        68.41      75.61   52.27    79.30          73.74    84.07
                                     Enhanced     71.56        71.58      78.26   54.42    80.13          75.86    85.61
                                     Consensus    71.59        70.88      76.68   54.93    80.46          76.18    85.60
                    Multi-Dataset    Raw          70.09        72.25      77.18   54.62    79.39          74.96    84.75
                                     Enhanced     72.36        75.48      80.03   55.36    80.81          77.26    85.77
                                     Consensus    71.80        72.38      78.54   53.97    80.06          76.75    85.59
DeepLabV3           Single-Dataset   Raw          68.79        68.02      73.12   52.61    79.45          74.35    83.97
                                     Enhanced     71.69        71.46      76.10   55.68    80.76          76.24    85.42
                                     Consensus    71.67        70.83      75.29   54.91    81.10          76.90    85.53
                    Multi-Dataset    Raw          69.22        71.32      73.85   49.53    78.33          71.89    82.60
                                     Enhanced     69.12        73.99      75.99   50.54    77.87          75.80    82.55
                                     Consensus    71.37        71.51      76.54   54.37    80.28          76.83    85.19