Article

F-Segformer: A Feature-Selection Approach for Land Resource Management on Unseen Domains

Faculty of Electrical and Electronics Engineering, HCMC University of Technology and Education, Ho Chi Minh City 7000, Vietnam
*
Author to whom correspondence should be addressed.
Sustainability 2025, 17(6), 2640; https://doi.org/10.3390/su17062640
Submission received: 6 February 2025 / Revised: 13 March 2025 / Accepted: 14 March 2025 / Published: 17 March 2025

Abstract

Satellite imagery segmentation is essential for effective land resource management. However, diverse geographical landscapes may limit segmentation accuracy in practical applications. To address this challenge, we propose the F-Segformer network, which incorporates a Variational Information Bottleneck (VIB) module to enhance feature selection within the SegFormer architecture. The VIB module serves as a feature selector that provides improved regularization, while SegFormer adapts well to unseen domains. By combining these methods, our F-Segformer robustly improves segmentation performance in new regions that do not appear in the training process. Additionally, we employ Online Hard Example Mining (OHEM) to prioritize challenging samples during training, which accelerates model convergence even with the co-trained VIB loss. Experimental results on the LoveDA dataset show that our method achieves results comparable to well-known domain-adaptation methods without using any data from the target domain. In the practical scenario where the segmentation model is trained on one domain and tested on an unseen domain, our method shows a significant improvement. Last but not least, OHEM helps the model converge three times faster than training without it.

1. Introduction

Effective land resource management is essential for sustainable development. Conventionally, this task requires considerable time and human effort. Advances in satellite technology now enable the capture of high-quality imagery, and deep learning-based segmentation models can predict land categories at the pixel level. Because each pixel in satellite imagery corresponds to a known ground resolution, this combination significantly enhances our ability to monitor the Earth's surface with greater efficiency.
Several well-known segmentation models, such as Unet [1], DDRNet [2], and PID [3], have been introduced to address various challenges in segmentation tasks. Unet [1] and its variants address scale issues by encoding high-level features and decoding spatial details. However, such designs can incur high computational costs, for example when combined with a Pyramid Pooling Module decoder [4]. DDRNet [2] enhances computational efficiency by introducing a two-branch network, where one branch handles low-resolution features and the other focuses on high-resolution details. Although this design speeds up the process, it may compromise accuracy, as small details risk merging with the background. PID [3], built on DDRNet, addresses the overshot problem by adding a third branch for edge-focused boundary extraction. This three-branch network leverages the strengths of both DDRNet and Unet, achieving higher accuracy with faster processing speeds. Recent transformer-based approaches, such as SegFormer [5], MaskFormer [6], and UperNet [7], employ self-attention mechanisms to capture long-range dependencies, further improving segmentation accuracy in complex scenes.
Although many promising results have been reported, many challenges remain. Conventional methods train an image-segmentation model using a large, high-quality dataset and then apply this model to new areas to generate land resource reports. For reliable performance, the testing dataset must be collected and processed in the same manner as the training data. Although we can ensure technical properties such as resolution or spectral bands between training and testing samples, the diverse geographical landscape is a challenge that cannot be controlled. Figure 1 introduces the differences between urban and rural regions in building patterns in the LoveDA dataset [8]. In urban areas, buildings (marked red) are typically aligned along roads in straight lines and exhibit uniform architectural styles. In contrast, buildings in rural areas are arranged more irregularly, often without road alignment. In addition, the orientation of rural buildings can vary widely, with no consistent pattern.
Further distinctions are evident in Figure 2, where a detailed comparison between the LoveDA dataset [8] and the Dalat dataset [9] is presented. Figure 2a illustrates differences in the “Water” class (blue). The color of the water varies due to natural environmental factors, leading to differences between regions. In LoveDA, water in agricultural areas appears different from water in rivers within the Dalat dataset. Figure 2b highlights variations in the “Barren” class (gray). Barren land, defined as areas without vegetation, is clearly visible on the surface. However, in the LoveDA dataset, some background areas also contain barren land, causing confusion in classification. In contrast, the Dalat dataset provides more detailed labeling, reducing misclassification. Additionally, LoveDA contains a significant number of unlabeled pixels, whereas the Dalat dataset ensures more comprehensive labeling. Figure 2c compares the “Building” (red) and “Road” (yellow) classes. In urban areas, LoveDA buildings appear more uniform, whereas Dalat buildings are more varied and irregular. Moreover, roads in LoveDA exhibit a distinct concrete color, while in Dalat, the roads are generally narrower and blend with the surrounding soil. This characteristic is also evident in Figure 2d, which focuses on the “Forest” class. Due to the different geographical characteristics of Wuhan and Dalat, variations in crop types result in different forest colors. Figure 2e further examines the differences in the “Agricultural” class (orange). In Dalat, agricultural land is often cultivated in small plots and greenhouses. In contrast, the agricultural land of Wuhan has diverse soil colors. These distinctions highlight the unique geographic and structural characteristics of each dataset.
To meet practical requirements, we aim to train a segmentation model that generalizes better to unseen domains that contain domain gap issues. We hypothesize that a model capable of extracting sparse feature maps is more robust when tested on an unseen domain. Figure 3 compares conventional features with sparse features. As shown in Figure 3a, domain gaps lead to distributional shifts between the training and testing domains, causing some test samples to be misclassified. When features become sparse, the distribution of samples in the feature space aligns along the axes, as illustrated in Figure 3b. This indicates that only a few critical features are utilized for classification. Although classification errors may still occur, the available features remain useful for decision-making.
Motivated by this observation, we propose a method to train a segmentation model using sparse features on satellite imagery. At the core of our approach is the patch-level Variational Information Bottleneck (P-VIB) [10] module, which learns sparse yet semantically rich feature maps. Unlike traditional dropout, which randomly deactivates a subset of neurons to promote robustness and generalization, P-VIB leverages KL divergence to selectively retain fewer but more semantically meaningful features. These features are expected to remain robust across diverse domains. To ensure that the model learns only the “just enough” features required for effective segmentation, our loss function incorporates two additional constraints.
  • The selected features should be well suited for handling segmentation tasks and any auxiliary tasks typically used in established segmentation methods. To achieve this, we use cross-entropy to guide feature extraction toward this objective.
  • To avoid extracting redundant features, the feature map should remain as sparse as possible. Furthermore, the learned features should not contain redundant information that could be used to reconstruct the original input image. This property is enforced through the Deep Variational Information Bottleneck (DVIB) [11], which applies a Kullback–Leibler (KL) divergence loss to restrict feature redundancy, ensuring that only essential information is retained.
The proposed P-VIB module is designed as a plug-and-play module that can be seamlessly integrated into existing neural network models without significantly modifying or retraining the entire architecture. This adaptability makes it suitable for enhancing generalization across any standard segmentation model. In our work, we implemented the P-VIB module within the SegFormer backbone, as SegFormer has been proven to be effectively adapted in various domains [12,13,14,15].
The P-VIB introduces uncertainty during training through random sampling, which makes the model take longer to converge. To address this issue, we employ an Online Hard Example Mining (OHEM) [16] strategy to prioritize hard samples throughout training. Using OHEM, the model focuses on challenging examples (those it finds more difficult to classify or segment) rather than treating all samples equally. This approach accelerates training by helping the model learn from its mistakes and improves its capacity to handle complex cases. In satellite image segmentation, hard samples often represent rare features, so focusing on these cases avoids overfitting and enables the model to achieve optimal performance more quickly. When OHEM is combined with P-VIB, it helps the model converge faster even though VIB-based training is more complex.
To assess the generalization of the proposed method in unseen domains, we trained and evaluated the segmentation model on the LoveDA dataset and then tested it on a new dataset [9] collected in Dalat City. Since the new dataset [9] is a small-scale dataset, it alone may not comprehensively validate our contributions. Therefore, we also evaluated the model using images from rural and urban areas in the LoveDA dataset [8] to provide a more robust analysis of the effectiveness of our approach in various environments.
In summary, the main contributions of this paper are as follows.
  • F-Segformer Network with Enhanced Generalization: We introduce a novel F-Segformer network that incorporates a Variational Information Bottleneck (VIB) module to improve the generalization capability of SegFormer. The model is trained and validated on data from one domain and tested on a different domain. Our approach not only enhances performance in the same domain but also yields significantly better results when applied to a new domain.
  • Online Hard Example Mining (OHEM) for Faster Learning: Although P-VIB can learn a more generalized model, its random sampling increases the training cost. We address this challenge by integrating OHEM into the training process. Conventionally, OHEM prioritizes learning hard samples, allowing the model to learn more effectively. In our application, we find that the technique enables faster convergence despite the added uncertainty introduced by the VIB module.
  • Extensive Experiments on Diverse Datasets: Experimental results were obtained on the widely recognized LoveDA dataset [8] and on a custom dataset collected from Dalat City [9], highlighting the improvements of the proposed methods.

2. Related Work

2.1. AI Methods and Remote Sensing Applications

Satellites are efficient tools for capturing the Earth's surface. They provide valuable data for scientific research and management applications. Automating satellite image analysis offers significant benefits and improves decision-making in various fields. Motivated by this potential, numerous datasets have been developed to train segmentation models for land classification.
Due to the high cost of labeling satellite imagery, early datasets such as DeepGlobe [17] provided only 803 high-resolution images. More recently, the LoveDA dataset introduced 5987 images with a well-defined benchmarking framework. In addition to these datasets, various research efforts have focused on the development of segmentation networks [18] to improve the accuracy of land segmentation. However, despite these advancements, the performance of multiclass segmentation models remains insufficient for practical land-management applications.
In applications where high accuracy is required, one possible approach is to reduce the number of prediction classes so that segmentation performance improves. In this approach, only critical classes are analyzed. For instance, studies such as [19,20] focus solely on detecting water bodies, while [21,22,23] concentrate on identifying forests and deforestation. More recently, an emerging research direction [24,25] involves detecting surface changes rather than specific land classes. This method highlights potential changes for human analysts to verify, reducing task complexity while saving significant labor costs. The DynamicEarthNet dataset [26] is a well-known resource for change detection, providing daily satellite images of the same locations, with one labeled image per month.
Beyond accuracy, domain adaptation is a critical challenge attracting significant attention in the research community. Different satellite systems capture data in varying formats—some provide multispectral images or Near-Infrared (NIR) bands, while others only offer standard RGB images. The Potsdam and Vaihingen datasets [27] were among the first designed to support domain adaptation research. However, these datasets are relatively small and may not fully encompass the complexities of real-world domain-adaptation challenges. The Potsdam dataset consists of 38 very-high-resolution True Orthophotos (VHR TOP), each with a fixed size of 6000 × 6000 pixels. It offers three imaging modes: IR-R-G, R-G-B, and R-G-B-IR (with the latter being a four-channel image format). The Vaihingen dataset comprises 33 VHR TOP images, each approximately 2000 × 2000 pixels in size, with a single imaging mode (IR-R-G). More recently, the LoveDA dataset has been introduced with a specific focus on domain adaptation. It provides multiple classes in both urban and rural areas, enabling more robust domain-adaptation research.

2.2. Image Segmentation

A straightforward approach to image segmentation involves using a fully convolutional network (FCN), as established by Long et al. in their seminal work [28]. In this architecture, fully connected layers are converted into convolutional layers, allowing the network to predict dense semantic maps for pixel-wise image segmentation directly. Although FCNs effectively produce semantic maps, they encounter notable limitations in achieving high segmentation accuracy. A key challenge is handling scale variations, as objects within an image may appear at significantly different sizes due to varying distances from the camera or inherent spatial differences. Moreover, the overshot problem, where small objects are difficult to distinguish from the background because of limited resolution, further complicates accurate segmentation. These issues often result in suboptimal segmentation performance, limiting the applicability of FCNs in complex real-world scenarios.
To better manage scale-related issues, researchers have proposed a range of encoder–decoder architectures, such as the widely used UNet [1]. In these architectures, the encoder progressively enlarges its receptive field through operations like strided convolutions or pooling, capturing increasingly abstract and high-level features as the resolution of the feature maps decreases. The decoder then reconstructs spatial details from these high-level semantics through techniques like deconvolutions or upsampling, aiming to restore the original image size while preserving important information. However, the inherent downsampling process employed in the encoding phase can lead to the loss of fine spatial details, which are crucial for accurately segmenting small objects and maintaining precise boundaries.
To address this limitation, dilated convolutions [29] were introduced, offering an elegant solution by expanding the receptive field without sacrificing spatial resolution. By introducing holes or gaps between kernel elements, dilated convolutions allow the model to capture larger contexts while retaining detailed spatial information. This concept was further developed in the DeepLab series [30], where multiple dilation rates were strategically used across different layers of the network to achieve multiscale context aggregation. This technique significantly improved segmentation performance, especially in capturing global and local features. However, while dilated convolutions offer a substantial boost in accuracy, they pose challenges for hardware implementation due to their non-contiguous memory-access patterns, which can lead to inefficiencies in computational speed and resource usage.
Building on the need to capture the multiscale context, PSPNet [4] introduced the Pyramid Pooling Module (PPM), which processes features at multiple scales to better model the varying sizes of objects within an image. By pooling information from different levels of granularity, PSPNet successfully integrates fine and coarse details, resulting in more accurate segmentation. Despite its effectiveness in enhancing accuracy, this multiscale processing has a trade-off: the added computational complexity can result in significantly increased processing times, limiting its practicality in real-time applications.
To address the overshot problem, which commonly impacts small, boundary-adjacent objects, it is essential to balance background context with fine spatial detail. Large receptive fields capture background context and help distinguish objects from their surroundings, while spatial detail aids in accurately delineating object boundaries. DDRNet [2] addresses this challenge by employing a two-branch network architecture, where a deep branch captures global context and a shallow branch focuses on spatial details. These outputs are then combined through a feature-fusion module, enhancing both segmentation accuracy and efficiency, especially for small objects, without adding substantial computational overhead. Building on the two-branch network approach, PID [3] introduced a three-branch network, adding a specialized branch for boundary-focused segmentation. This branch applies a Canny filter to detect and emphasize image edges. During training, the model focuses on small, boundary-adjacent objects, leading to improved accuracy while maintaining a low computational cost.
Recently, segmentation research has been shifting to transformer-based architectures, such as SegFormer [5], MaskFormer [6], and UperNet [7]. These methods leverage self-attention and multi-head attention mechanisms, enabling them to capture long-range dependencies that CNN-based models typically struggle with. By capturing interactions across the entire image, these transformers achieve enhanced accuracy, especially in scenes with diverse object sizes and intricate object relationships. Among these methods, SegFormer [5] has been considered a successful backbone for domain adaptation [31], where there is a domain gap between training and testing images.

2.3. Solutions to Avoid Overfitting

Overfitting is a significant challenge in the training of neural networks. Traditional approaches often apply $\ell_p$ regularization terms [32] to mitigate overfitting. However, while these methods have some impact, they are limited in the context of deep learning, where models contain millions or even billions of parameters, resulting in high expressiveness and susceptibility to overfitting. Although $\ell_p$ regularization can penalize large weights to a certain degree, it does not inherently prevent the network from learning complex patterns in the training data that may not generalize to new data. Techniques such as dropout, batch normalization, data augmentation, and early stopping have proven to be more effective regularization strategies for deep learning, providing stronger resistance against overfitting.
  • Dropout is a regularization technique that randomly “drops out” or deactivates neurons during each forward pass in the training phase. In each iteration, a subset of neurons is randomly set to zero, preventing them from participating in the forward and backward passes. This strategy forces the network to learn distributed and redundant representations, discouraging individual neurons from becoming overly specialized. By promoting more generalized feature learning, dropout improves the generalization ability of the network, helping to prevent overfitting and improve performance on unseen data.
  • Batch normalization [33] is another essential technique for stabilizing and accelerating deep neural network training by normalizing the input of the layer. This method adjusts and scales activations within each layer, reducing sensitivity to variations in data scale and distribution. Batch normalization helps mitigate overfitting by reducing dependency on initial weight values, and it has become a widely used component in modern deep neural network architectures.
  • Data augmentation expands the diversity and size of the training dataset by creating modified versions of existing samples. The augmentation improves the generalization of the model by exposing it to a wider range of variations during training. Advanced augmentation methods, such as Mix-up [34] and CutMix [35], have been introduced and are now standard in many state-of-the-art (SoTA) segmentation models, significantly improving robustness by minimizing overfitting.
  • Early stopping is another regularization technique that monitors the performance of the model in a validation set during training, stopping training once the performance in the validation set ceases to improve. This approach prevents the model from overfitting by avoiding unnecessary training epochs, ensuring that the model maintains optimal performance without learning noise from the training data.
Recently, most segmentation methods have integrated dropout, batch normalization, data augmentation, and early stopping during training, yielding promising results. However, in satellite imagery, both dropout and data augmentation struggle to model irregular geographical landscapes. Therefore, they may not help improve performance in an unseen domain. In addition, these techniques lack an objective function specifically designed to encourage sparse and compressed feature learning for improved accuracy. The Variational Information Bottleneck (VIB) [11] addresses this gap by introducing objective functions that improve the generalization of neural networks by focusing on the essential information required for the task rather than learning extraneous details of the input data. The core idea of the information bottleneck is to develop a compressed representation ($Z$) that retains only the information needed to predict the target output ($Y$) while removing unnecessary details from the input ($X$). This is achieved by maximizing the mutual information between $Z$ and $Y$ while also minimizing the mutual information between $Z$ and $X$. In essence, VIB [11] encourages the model to learn representations that are highly informative about the output, but as compact and concise as possible.
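For reference, this trade-off can be written compactly in its standard form; the following is the generic information bottleneck objective from [11], restated here for convenience rather than an equation taken from this paper:

$\max_{p(z \mid x)} \; I(Z; Y) - \beta \, I(Z; X)$

where $I(\cdot\,;\cdot)$ denotes mutual information and $\beta \ge 0$ controls the balance between predictiveness and compression; VIB optimizes a variational bound of this objective.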

3. Proposed Method

Figure 4 presents an overview of the proposed method. The network is based on the SegFormer architecture [5]. It uses the Mix Transformer (MiT) [36] backbone to extract hierarchical features, which are subsequently resized and fused via an MLP layer. The VIB module then calculates the mean and variance of these features, allowing new features to be sampled from these distributions. A KL-divergence loss [37] is applied to select sparse and informative features for the classification head. In addition, OHEM, which learns from hard samples, helps the model converge faster. The remainder of this section provides further details: Section 3.1 describes the network architecture, Section 3.2 explains the loss function, and Section 3.3 discusses the OHEM strategy for the training process.

3.1. Network Architecture

The MiT backbone is a hierarchical multistage transformer architecture optimized for efficient visual understanding. Figure 5 illustrates the hierarchical structure within the MiT backbone.
The MiT backbone has four stages for hierarchical feature extraction. Each stage contains a specific number of transformer blocks and produces feature maps of varying shapes based on the input resolution and model variant (for example, MiT-B0 to MiT-B5). Among these variants, MiT-B5 is the most complex and achieves the highest accuracy. In this work, we employ the MiT-B5 variant to maximize precision. Each stage comprises blocks that include Patch Merging, Multihead Self-Attention, and a MixFFN module [5]. The number of blocks varies across stages, and Table 1 provides detailed information for each stage of the MiT-B5 backbone.
In the decoder, the multilevel features $F_i$ of the MiT encoder are first processed through an MLP layer to unify their channel dimension to $C$. These features are then upsampled to a resolution of $\frac{W}{4} \times \frac{H}{4}$ and concatenated to form a composite feature map $F$ with $4C$ channels, which is subsequently fed into the VIB module, as illustrated in Figure 4. Within the VIB module, the mean $\mu$ and the variance $\sigma$ of $F$ are obtained using $3 \times 3$ convolutional layers, denoted $VIB_{\mu}$ and $VIB_{\sigma}$, while maintaining the output channel count at $4C$, as shown in Equations (1) and (2).
$\mu = VIB_{\mu}(F)$ (1)
$\sigma = VIB_{\sigma}(F)$ (2)
Lastly, the selected features are estimated using Equation (3). Here, sampling is applied at every spatial position of the feature map, with $\epsilon \sim p(\epsilon) = \mathcal{N}(0, I)$, where $\epsilon$ is a matrix of the same size as $\mu$ and $\sigma$ and $\odot$ denotes element-wise multiplication.
$z = \mu + \epsilon \odot \sigma$ (3)
Finally, an MLP network takes $z$ as input and outputs a feature vector with dimensions corresponding to the number of output classes ($N_{cls}$). This produces a mask of size $(\frac{H}{4}, \frac{W}{4}, N_{cls})$, which is subsequently upsampled by a factor of four to match the original image dimensions.
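To make the decoder-side computation concrete, the following minimal PyTorch sketch reimplements the VIB step described above. It is our illustrative reconstruction rather than the authors' released code: the class name VIBHead is hypothetical, the $\sigma$ branch is parameterized through a log-variance for numerical stability (the paper predicts the variance directly), and using $z = \mu$ at inference time is a common convention that the paper does not specify.

import torch
import torch.nn as nn

class VIBHead(nn.Module):
    # Minimal sketch of the VIB step in the decoder. It assumes the four MiT
    # feature maps have already been projected to C channels, upsampled to
    # (H/4, W/4), and concatenated into a 4C-channel map F.
    def __init__(self, channels_4c, num_classes):
        super().__init__()
        # 3x3 convolutions producing the mean and (log-)variance maps,
        # keeping the channel count at 4C as in Equations (1) and (2).
        self.vib_mu = nn.Conv2d(channels_4c, channels_4c, kernel_size=3, padding=1)
        self.vib_logvar = nn.Conv2d(channels_4c, channels_4c, kernel_size=3, padding=1)
        # 1x1 classifier producing N_cls logits per spatial position.
        self.classifier = nn.Conv2d(channels_4c, num_classes, kernel_size=1)

    def forward(self, fused):
        mu = self.vib_mu(fused)                          # Equation (1)
        sigma = torch.exp(0.5 * self.vib_logvar(fused))  # our parameterization of Equation (2)
        if self.training:
            eps = torch.randn_like(sigma)                # eps ~ N(0, I), same shape as mu and sigma
            z = mu + eps * sigma                         # Equation (3), element-wise
        else:
            z = mu                                       # deterministic features at test time (our choice)
        logits = self.classifier(z)                      # (B, N_cls, H/4, W/4); upsampled x4 afterwards
        return logits, mu, sigma

At training time, the returned $\mu$ and $\sigma$ feed the feature-selection loss described in Section 3.2.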

3.2. Objective Functions

In Figure 4, the model is trained using a classification loss and a feature-selection loss. Let $i$ denote the $i$-th pixel in an image, with $y_i$ and $\hat{y}_i$ representing the ground-truth label and the predicted label for that pixel, respectively. The classification loss $L_i^{cls}(y_i, \hat{y}_i)$ is formulated as a standard classification loss, as shown in Equation (4). Furthermore, the feature-selection loss $L_i^{fea}(\mu_i, \sigma_i)$ is based on the Kullback–Leibler (KL) divergence, enforcing the learned feature distribution to approximate a predefined distribution $q(z)$. Typically, $q(z)$ is chosen as a standard normal distribution, $\mathcal{N}(z; 0, I)$. Based on $q(z)$, the feature-selection loss is defined by Equation (5).
$L_i^{cls}(y_i, \hat{y}_i) = -\sum_{c=1}^{N_{cls}} y_i^{c} \log \hat{y}_i^{c}$ (4)
$L_i^{fea}(\mu_i, \sigma_i) = \mathrm{KL}\left[\,\mathcal{N}(z_i; \mu_i, \sigma_i)\,\|\,q(z_i)\,\right] = \frac{1}{4C}\sum_{k=1}^{4C}\left(\mu_{i,k}^{2} + \sigma_{i,k}^{2} - 2\log(\sigma_{i,k}) - 1\right)$ (5)
where:
  • $4C$ is the dimension of the latent features;
  • $i$ is the index of a spatial position;
  • $k$ is the index of a channel.
It is worth noting that $L_i^{fea}$ is minimized when $\mu = 0$; consequently, the loss function encourages the extraction of sparse features, as illustrated in Figure 3. As discussed in Section 1, this property can help identify valuable features in both the source and the target domains.
Let $\beta$ be a hyperparameter that balances the contributions of the feature-selection loss and the classification loss. The model is trained end-to-end with the loss function $L(y, \hat{y}, \mu, \sigma)$, as defined in Equations (6) and (7). Here, $H$ and $W$ are the numbers of rows and columns in the input images.
$L(y, \hat{y}, \mu, \sigma) = \frac{16}{WH}\sum_{i=1}^{\frac{W}{4}\cdot\frac{H}{4}} L_i(y_i, \hat{y}_i, \mu_i, \sigma_i)$ (6)
$L_i(y_i, \hat{y}_i, \mu_i, \sigma_i) = L_i^{cls}(y_i, \hat{y}_i) + \beta\, L_i^{fea}(\mu_i, \sigma_i)$ (7)
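As a worked illustration of Equations (4)–(7), the per-pixel loss can be written as in the sketch below. This is our sketch, not the released code; it assumes that mu and sigma come from the VIB step of Section 3.1, that target holds class indices at 1/4 resolution, and it treats the factor $\frac{16}{WH}$ in Equation (6) simply as the mean over the $(H/4) \times (W/4)$ positions.

import torch
import torch.nn.functional as F

def fsegformer_loss(logits, mu, sigma, target, beta=0.05):
    # logits: (B, N_cls, H/4, W/4) class scores; target: (B, H/4, W/4) labels.
    # mu, sigma: (B, 4C, H/4, W/4) statistics from the VIB module.

    # Classification term, Equation (4): per-pixel cross-entropy.
    cls_loss = F.cross_entropy(logits, target, reduction="none")   # (B, H/4, W/4)

    # Feature-selection term, Equation (5): KL(N(mu, sigma) || N(0, I)),
    # averaged over the 4C channels at each spatial position.
    kl = mu.pow(2) + sigma.pow(2) - 2.0 * torch.log(sigma + 1e-8) - 1.0
    fea_loss = kl.mean(dim=1)                                      # (B, H/4, W/4)

    per_pixel = cls_loss + beta * fea_loss                         # Equation (7)
    return per_pixel.mean()                                        # Equation (6)

Returning per_pixel instead of its mean would expose exactly the per-pixel values that the OHEM selection of Section 3.3 operates on.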

3.3. Online Hard Example Mining

Online Hard Example Mining (OHEM) [16] is a technique used in image segmentation to improve the training efficiency and performance of neural networks, especially when dealing with unbalanced data or challenging samples. OHEM helps the model focus on the most informative or difficult-to-classify samples (the “hard examples”) during training rather than treating all samples equally. In the segmentation scenario, easy samples are pixels belonging to large, well-defined objects that are easier for the model to segment, whereas hard samples are pixels belonging to small objects, objects with unclear boundaries, or occluded objects, which are harder to segment correctly.
Equation (8) represents the loss function when OHEM is used in a segmentation application.
$L_{OHEM} = \dfrac{\sum_i M_i\, L(x_i, y_i)}{\sum_i M_i}$ (8)
where:
  • $L(x_i, y_i)$ is any loss at the $i$-th pixel of an image. In our work, this loss is $L_i(y_i, \hat{y}_i, \mu_i, \sigma_i)$ from Equation (7).
  • $M_i$ is the value at pixel $i$ of a binary mask $M$ that marks the hardest pixels in an image. This mask is dynamically selected based on the highest values of $L(x_i, y_i)$.
To estimate the mask $M$ in Equation (8), the loss of each pixel $i$ is first computed as $L(x_i, y_i)$. All pixels are then ranked by their loss values, and the $K$ pixels with the highest losses are chosen to form the binary mask $M$.
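A pixel-level selection consistent with Equation (8) can be sketched as follows. The functions are our illustrative reconstruction; the confidence threshold of 0.7 and the floor of 100,000 retained pixels are the settings reported later in Section 4.3, and gt_probs is assumed to hold the predicted probability of the ground-truth class at each pixel.

import torch

def ohem_select(per_pixel_loss, gt_probs, conf_thresh=0.7, min_kept=100_000):
    # per_pixel_loss: (N,) flattened per-pixel losses L_i from Equation (7).
    # gt_probs: (N,) predicted probability of the ground-truth class per pixel.
    # Returns the binary mask M of Equation (8) marking the hard pixels.
    hard = gt_probs < conf_thresh
    if hard.sum() < min_kept:
        # Too few low-confidence pixels: keep the top-K pixels by loss instead.
        k = min(min_kept, per_pixel_loss.numel())
        _, idx = torch.topk(per_pixel_loss, k)
        hard = torch.zeros_like(hard)
        hard[idx] = True
    return hard

def ohem_loss(per_pixel_loss, mask):
    # Equation (8): average the per-pixel loss over the selected hard pixels only.
    mask = mask.float()
    return (per_pixel_loss * mask).sum() / mask.sum().clamp(min=1.0)

In this form, the plain average of Equation (6) is replaced by the masked average of Equation (8).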

4. Experimental Results

This section aims to demonstrate the contributions of the proposed method and is organized as follows:
  • Section 4.1: This section introduces the datasets used in the experiments. Among them, the LoveDA dataset is a large-scale dataset with a rigorous evaluation system. Performance on this dataset is evaluated using a public evaluation server, which ensures the reliability of the experiments. In contrast, the Dalat dataset is a custom dataset used to evaluate the applicability of models trained on LoveDA in real-world settings.
  • Section 4.2: The proposed method consists of two important objective functions: one for classification and one for feature selection. Therefore, a hyperparameter is needed to control the balance between these two losses. The main purpose of this experiment is to select appropriate hyperparameters for the training process. The convergence curves of the objective functions and the evaluation results during training are also provided to assess the sensitivity of the feature-selection loss.
  • Section 4.3: This section compares state-of-the-art segmentation methods in a real-world application context. Accordingly, training is performed on a large-scale dataset, and the model is tested on a new region that does not appear in the training process. In this experiment, the LoveDA dataset serves as the large-scale dataset, and the Dalat dataset serves as the dataset collected in a new region.
  • Section 4.4: This section compares our method with unsupervised domain-adaptation (UDA) methods. UDA methods incorporate both target-domain data and specialized techniques to transfer knowledge from the source domain during training. Thus, UDA serves as an upper bound for comparison with the proposed method. Through this experiment, we aim to rigorously demonstrate the generalization of our method.
  • Section 4.5: This section investigates the two main components of the proposed method: the VIB module and the OHEM sampling process. Ablation studies assess the role of each component during training.
  • Section 4.6: This section explains common errors and their causes. Section 4.6.1 analyzes the common errors, and Section 4.6.2 analyzes the features to explain the causes of those errors.

4.1. Dataset

4.1.1. LoveDA Dataset

LoveDA [8] is a large-scale dataset designed for semantic segmentation and unsupervised domain adaptation (UDA) [31,38], specifically aimed at mapping land cover in urban and rural environments. Urban areas within the dataset are characterized by densely packed, human-made structures, while rural regions primarily consist of natural elements. This dataset addresses key challenges in geographic generalization, offering a unique resource that spans diverse landscapes to promote model adaptability across different domains.
The dataset comprises 5987 images, partitioned into 2522 images for training, 1669 for validation, and 1796 for testing. The urban and rural regions in the training and validation sets are fully labeled. Although labels are not provided for the test set, evaluation is possible by submitting predictions to an online evaluation server [39]. With its well-defined benchmark, LoveDA has established itself as a prominent dataset in satellite image-segmentation research.

4.1.2. Dalat Dataset

To evaluate the performance of the model in a real-world setting, we collected and labeled a new dataset in District 12 of Dalat City. Some examples of the dataset are shown in Figure 6. To reduce the domain gap, we strictly follow the settings of the LoveDA dataset when collecting and labeling data. The resolution for each pixel is 0.3 m, and images are labeled with the classes defined by the LoveDA dataset: building, road, water, barren, forest, agriculture, and background. In total, 64 high-resolution (1926 × 1825 pixels) images capture various types of land cover within Dalat City. As discussed in Section 1 (Figure 2), there are clear geographical differences between LoveDA and Dalat City.

4.2. Hyperparameter Selection

In this section, we evaluate the impact of the learning rate ($lr$) and the contribution of the feature-selection loss ($\beta$). These two hyperparameters are critical settings in the training process. The urban area of the LoveDA dataset [8] serves as the training and validation data, while the rural area is used for testing. Since labels are unavailable for the rural test set, we utilize the rural validation set as the test set in this experiment. Consistent with the LoveDA challenge-evaluation protocol [8], segmentation performance is assessed using the mIoU metric. In the following experiments, we train our model using the PyTorch framework (version 2.1) and an RTX Titan GPU. The optimizer is AdamW with weight_decay = 0.01 and drop_out = 0.1, and the learning rate in the head is ten times higher than in the backbone. During training, an input image is randomly resized and cropped to 640 × 640; the patch is then randomly flipped, and the PhotoMetricDistortion augmentation is applied. Each training process runs for 160K iterations, and the model is evaluated every 10K iterations. The best model across these evaluations is selected and used to infer results on the testing set.
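For clarity, the optimizer configuration described above can be expressed roughly as in the sketch below. The DummySegmenter class is only a placeholder standing in for SegFormer (MiT-B5 backbone plus decode head), and the attribute names backbone and decode_head are our assumptions about how the two parts are exposed.

import torch
import torch.nn as nn

class DummySegmenter(nn.Module):
    # Placeholder with the same two-part structure as SegFormer:
    # a feature-extraction backbone and a decode head.
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.decode_head = nn.Conv2d(64, 7, kernel_size=1)

model = DummySegmenter()
base_lr = 6e-5  # the default SegFormer learning rate (0.00006)

# AdamW with weight decay 0.01; the decode head is trained with a
# learning rate ten times higher than the backbone, as stated above.
optimizer = torch.optim.AdamW(
    [
        {"params": model.backbone.parameters(), "lr": base_lr},
        {"params": model.decode_head.parameters(), "lr": base_lr * 10},
    ],
    lr=base_lr,
    weight_decay=0.01,
)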
By default, the learning rate ($lr$) of SegFormer is set to $lr = 0.00006$; we experimented with alternative $lr$ values of 0.00004 and 0.001, as well as various $\beta$ values (0.01, 0.3, 0.5, and 0.7), to determine an optimal configuration. The results, shown in Figure 7, indicate that a lower learning rate generally yields better results. For example, when $\beta = 0.01$ and $lr = 0.00004$, the model achieves an mIoU of 51.44%, compared to 51.19% when $\beta = 0.01$ and $lr = 0.00006$. This suggests that reducing the learning rate from the default value slightly improves mIoU. In contrast, increasing the learning rate to $lr = 0.001$ results in a significant drop in mIoU to 50.14%, regardless of $\beta$.
Based on these findings, we conclude that the default $lr$ setting of SegFormer [5] is near-optimal for training: increasing $lr$ tends to reduce mIoU, while reducing $lr$ below the default yields only a marginal improvement. Additionally, the $\beta$ hyperparameter should be smaller than 0.1, as higher values do not produce better mIoU outcomes.
In addition to the learning rate, $\beta$ is a critical hyperparameter that must be carefully optimized. A lower value of $\beta$ implies minimal feature selection, while a higher value of $\beta$ may lead to excessive reduction of features, potentially resulting in insufficient information to handle challenging samples effectively. Figure 8 illustrates the mIoU on both the validation and testing datasets. In particular, the validation dataset consists of urban-area images, while the testing dataset comprises rural-area images, leading to a notable drop in testing performance relative to the validation set.
In Figure 8, $\beta = 0$ represents the baseline SegFormer; the other settings mean that the VIB module is applied with the corresponding parameter $\beta$. The setting $\beta = 0.01$ improves performance on both the validation and testing datasets. This result indicates that the VIB module helps with unseen data in the same domain (the validation dataset) as well as with data from an unseen domain (the testing dataset). Performance improves further with $\beta = 0.05$, suggesting that VIB effectively strengthens the generalization of the model. However, increasing $\beta$ to 0.1 slightly reduces performance compared to $\beta = 0.05$, probably because over-sparse features limit the information necessary for a robust representation. Thus, increasing $\beta$ beyond a certain threshold can affect the ability to extract sufficient information for accurate segmentation.
Figure 9 illustrates the convergence of errors during the hyperparameter search. The result helps to better understand the model’s sensitivity to different hyperparameter settings. The figure visualizes both feature-selection loss and segmentation loss under varying values of β in the training set.
It can be observed that both losses converge smoothly in all scenarios. However, the slope of the curves varies with $\beta$. When $\beta$ is small ($\beta = 0.01$), the segmentation loss decreases faster than the feature-selection loss. In contrast, when $\beta$ is larger ($\beta = 0.1$), the segmentation loss decreases more slowly than the feature-selection loss. Furthermore, increasing $\beta$ results in a greater final segmentation loss. For example, the final segmentation losses are $4.5 \times 10^{-4}$, $1.8 \times 10^{-4}$, and $2.5 \times 10^{-5}$ when $\beta$ is set to 0.1, 0.05, and 0.01, respectively. This suggests that a larger $\beta$ value may negatively impact segmentation performance by increasing the segmentation loss.

4.3. Compare with SoTA

This section compares our method with state-of-the-art (SoTA) approaches to develop deep learning models for land resource management in new geographic regions. The model is trained and validated on a large-scale dataset and then tested on data from a new location. In these experiments, we used the LoveDA dataset as the large-scale training dataset, incorporating urban and rural areas for training and validation. We also used the Dalat dataset for testing. Here, the LoveDA validation set serves as unseen samples within a familiar domain, while the Dalat dataset, collected independently in Dalat City, represents an unseen domain, simulating real-world deployment in new geographic areas.
Figure 10 presents the mIoU results on the validation dataset (LoveDA) and the test dataset (Dalat). In this figure, several deep learning-based architectures are compared, including DDRNet [2], U-Net [1], MaskFormer [6], PID [3], UperNet [7], and SegFormer [5]. Among them, SegFormer is the baseline of our method. “VIB” denotes our proposed method, in which the SegFormer backbone is accompanied by a VIB module, and “VIB+OHEM” means that OHEM is additionally applied to learn from hard samples. In “VIB+OHEM”, only pixels with a confidence score under 0.7 are used for training, and we keep at least 100,000 pixels during training. Due to the domain gap, the mIoU on the test dataset is lower than on the validation dataset; for example, SegFormer achieves an mIoU of 53.43% on the validation set but drops to 19.94% on the Dalat test set. However, our proposed method, which integrates VIB and OHEM, shows a significant performance improvement in the unseen domain (Dalat test set) even without incorporating Dalat data into the training process. Specifically, “Segformer+VIB” achieves an mIoU of 36.90% in the unseen domain. When OHEM is applied, the mIoU increases to 37.13%. On the validation dataset, our method also slightly outperforms the baseline (SegFormer) as well as other well-known segmentation methods. This suggests that VIB is most beneficial on unseen domains and provides smaller gains when enough labeled samples are available in the training dataset.
Table 2 provides detailed IoU values on the Dalat dataset across different land cover classes, including “Background”, “Building”, “Road”, “Water”, “Barren”, “Forest”, and “Agricultural”, for each method illustrated in Figure 10. In detail, MaskFormer attains the highest IoU for the “Background” class (15.03%), while U-Net has the lowest score (1.75%). However, “Background” is a special category whose pixels are inherently difficult to classify; hence, it is not a reliable category for evaluation. In the “Building” class, “VIB+OHEM” achieves the highest performance (32.49%), while UperNet achieves the highest score for the “Road” class (21.92%). In particular, “VIB+OHEM” outperforms the other methods in the “Water” class with an IoU of 60.29% and demonstrates superior performance in the “Forest” class (64.83%), while UperNet leads in the “Agricultural” class with an IoU of 64.45%. Across all classes, “VIB+OHEM” achieves the highest mIoU score (37.13%), closely followed by VIB (36.90%). This result indicates that the proposed models provide robust performance across diverse types of land cover. MaskFormer ranks third in overall performance, with an mIoU of 26.61%.
Compared to the baseline SegFormer, our method achieves significantly better performance. Although SegFormer achieves only 19.94% mIoU in the unseen domain, our method reaches 36.90%, demonstrating a substantial improvement. This performance gain can be attributed to the use of sparse feature maps. Since segmentation is performed at the pixel level, the sparsity at each pixel should be analyzed in detail. Given a feature map $F \in \mathbb{R}^{H \times W \times C}$, the sparsity score for the $i$-th pixel is calculated as $s_i = \frac{1}{C}\sum_{k=1}^{C} \mathbb{1}\left[\,|F_{i,k}| < th\,\right]$, where $th$ is an activation threshold. The overall sparsity score of the feature map is then $S_s = \frac{1}{HW}\sum_i s_i$. On the Dalat dataset, the sparsity score achieved by the proposed method is 0.75, whereas the SegFormer baseline only reaches 0.12. The relationship between the sparsity score and the mIoU score in the unseen domain highlights the importance of the Variational Information Bottleneck (VIB) in our proposed method.
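The sparsity score defined above can be computed directly from a feature map; the sketch below is our reading of that definition, with the threshold th left as a free parameter because its value is not stated in the paper.

import torch

def sparsity_score(feature_map, th):
    # feature_map: (H, W, C) activations; th: activation threshold.
    # s_i is the fraction of channels with |F_{i,k}| < th at pixel i,
    # and S_s is the mean of s_i over all H*W positions.
    near_zero = feature_map.abs() < th
    per_pixel = near_zero.float().mean(dim=-1)   # s_i for every pixel
    return per_pixel.mean()                      # S_s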
Table 3 presents a comparison of the computational complexity and training time between the proposed method and other segmentation solutions. The training time is measured over 160K iterations for all methods, while the complexity is evaluated using FLOPs and the number of parameters. FLOPs measure the total number of floating-point operations performed by a model during inference or training. A higher FLOP count indicates greater computational demands, increased power consumption, and higher latency. In contrast, a higher number of parameters (weights and biases) allows the model to capture more complex patterns.
The results show that U-Net and UperNet are the most computationally complex due to their high FLOP values. This is a well-documented phenomenon, as these models rely on dense prediction networks. Although they may perform well on seen domains, their large number of parameters increases the risk of overfitting to the source domain, reducing their generalizability to unseen domains.
In contrast, PID and DDRNet are lightweight solutions designed for efficiency, with significantly lower FLOP values compared to U-Net. However, their mean Intersection over Union (mIoU) performance remains poor on unseen domains.
MaskFormer, SegFormer, and the proposed method are based on a transformer architecture, which makes them computationally heavier compared to lightweight models. Furthermore, many research studies have pointed out that training a transformer-based model takes significantly longer compared to CNN-based methods. However, since segmentation for land resource management is not a real-time application, the higher inference cost is not a critical drawback.
The Dalat dataset [9] is a small custom dataset; hence, good performance on the Dalat dataset alone may not robustly prove the benefit of the proposed method. In the rest of this section, we therefore use the well-known LoveDA dataset [8] to evaluate the performance of our proposed method. Here, the training and validation datasets are from urban areas, and the testing set is from rural areas of the LoveDA dataset. The setting in SSDA [40] is used to prepare the training and validation datasets on urban areas. Similarly to SSDA [40], we compare our method with supervised learning methods such as LANet [41], PSPNet [4], DeepLabV3 [30], HRNet [42], and MAE+UperNet [43]. The results in Table 4 also demonstrate the improvement of our method.

4.4. Compare to Upper Bound

In this section, we compare the proposed method with UDA-based methods. UDA defines a source domain with labeled data and a target domain without labels. Both the source and target domains are used for training, and only images from the target domain are used for testing. Note that the training and testing datasets in the target domain are different. This scenario is based on the assumption that segmentation labels for satellite images are very costly to obtain, but raw input images are available. Hence, unlabeled satellite images from a new domain can be used to increase the generalization of the segmentation model. In this experiment, we used training and validation datasets from urban areas of the LoveDA dataset, while the testing dataset was sourced from rural areas within LoveDA. To ensure a fair comparison, we adhere to the dataset-preparation guidelines established in LoveDA [8]. Given that the testing labels are not publicly available, we uploaded our results to the LoveDA evaluation service [39] for performance assessment.
We compare our method with various methods, including CLAN [44], IAST [45], DCA [46], FADA [47], TransNorm [48], PyCDA [49], DAFormer [13], MIC [15], and DASegNet [50]. Among them, CLAN [44], FADA [47], TransNorm [48], PyCDA [49], and IAST [45] are well-known UDA methods that have been applied to the LoveDA dataset [8]. DAFormer [13] and MIC [15] are SoTA UDA segmentation methods. Recently, DCA [46] and DASegNet [50] have been specially customized for the LoveDA dataset. Unlike these methods, our method does not use any images of the target domain during training. For clarity, we denote our approach as “without target domain data” and designate the comparison methods as “with target domain data”.
Table 5 presents an IoU comparison between these methods. The results demonstrate that our approach (VIB, with and without online hard example mining) achieves performance comparable to the domain-adaptation approaches even though no data from the target domain were used for training. While CLAN [44] and IAST [45] achieve 30.9% and 38.44% mIoU, our method (VIB+OHEM) reaches 43.86%. It should be noted that our method does not use data from the target domain, whereas CLAN [44] and IAST [45] do. The mIoU scores given by FADA [47], TransNorm [48], and PyCDA [49] are 28.65%, 24.62%, and 32.14%, respectively.
Our approach, which uses neither explicit UDA techniques nor target-domain data, does not surpass SoTA solutions such as DCA [46], DAFormer [13], MIC [15], and DASegNet [50]. With SoTA techniques in the UDA field, DAFormer [13] achieves an mIoU of 46.25%. DASegNet [50], using a DeepLabV3 backbone [30], attains an mIoU of 45.69%, slightly higher than our “VIB+OHEM” method. When using the SegFormer backbone, DASegNet achieves an mIoU of 50.08%, placing it at the top of the leaderboard in the LoveDA challenge. From these results, we recognize that the SegFormer backbone is a critical factor in improving performance.
Additionally, Table 5 compares our method with other regularization techniques, such as dropout. By default, the dropout ratio in the SegFormer baseline is set to 0.1. To examine its effect on performance, we increase this ratio to 0.15 and decrease it to 0.05. The results indicate that SegFormer alone may not be well suited to handling unseen domains, as its mIoU reaches only 26.72%, which is significantly lower than the VIB-based methods. Notably, increasing the dropout rate does not lead to consistent performance improvements. For example, in road segmentation, the IoU at dropout = 0.1 is 32.71%, slightly higher than at dropout = 0.05 (31.70%), but it drops again at dropout = 0.15 (30.90%). In contrast, water segmentation achieves its best performance at dropout = 0.15 (45.95%), suggesting that different classes respond differently to regularization. These findings highlight that VIB outperforms the dropout technique in making the model more generalizable, especially in challenging segmentation tasks.

4.5. Ablation Study

This section evaluates the effect of the feature-selection loss and OHEM through an ablation study. The training dataset is from urban areas, and the model is tested on both urban and rural areas. The datasets for both areas are taken from the LoveDA [8] dataset. The urban-area data are used to select the best model, which is then tested on the rural-area data.
Figure 11 presents the mIoU and accuracy of the baseline method and the improvements given by our methods. On the urban validation set ($Val_{Urban}$), the baseline model achieves an mIoU of 50.63%. Introducing the feature-selection loss (VIB) increases the mIoU slightly to 51.73%. This modest improvement suggests that VIB introduces beneficial features for urban segmentation, although the gains are incremental. The “VIB+OHEM” configuration further improves the urban mIoU to 51.84%, indicating that OHEM marginally improves segmentation performance in urban areas compared to VIB alone.
On the rural validation set ($Val_{Rural}$), the enhancements are more pronounced because the model is trained only on urban-area data. The mIoU of the baseline model is 36.02%, significantly lower than its performance in urban areas, reflecting the challenges of segmenting rural landscapes. Adding VIB increases the rural mIoU to 39.54%, a substantial improvement over the baseline that highlights the effectiveness of VIB in unseen environments. The “VIB+OHEM” configuration provides an additional boost, achieving an mIoU of 40.66%, making it the best model in the unseen domain.
One critical observation is that OHEM improves the mACC metric more than the mIoU metric in the unseen domain. The mean accuracy given by VIB and “VIB+OHEM” is 67.14% and 66.67%, respectively, on the urban validation set (source domain). However, on the rural validation set (target domain), these values are 51.07% and 53.74%. With an improvement of 2.67%, OHEM demonstrates its ability to make the model more generalizable. Overall, the results in Figure 11 indicate that VIB and “VIB+OHEM” outperform the baseline method in both urban and rural settings, with more notable gains in rural areas that do not appear in the training process.
Although the feature-selection loss (VIB) improves generalization, the uncertainty introduced by the sampling process makes training more difficult, requiring more iterations for convergence. Specifically, the baseline model reaches its highest performance at around 90K iterations, while the VIB method requires more than 110K iterations to achieve its best result. Notably, with the addition of OHEM, the model achieves optimal performance at around the 40K-th iteration. This suggests that focusing on hard samples helps to accelerate the training process, even when the KL loss is applied. Table 6 summarizes the iteration at which each method obtains its best performance in the two scenarios of Section 4.3: one in which the model is trained on the LoveDA dataset and tested on the Dalat dataset, and another in which the model is trained on the urban dataset and tested on the rural dataset. The results show that VIB converges more slowly (from 90K to 120K iterations), whereas OHEM reduces the number of iterations needed to reach the best model to 40K.

4.6. Qualitative Results

4.6.1. Result-Based Analysis

Figure 12 presents visualization results that compare SegFormer, “Segformer+VIB”, and “Segformer+VIB+OHEM”. Here, the model is trained with data from urban areas, and the test image is from rural areas. In the new data domain, many pixels are misclassified as background, represented by an abundance of white pixels. In addition, agricultural land (orange), forest land (green), and barren land (gray) are often confused. In zoomed-in views, barren land is often unrecognized in the test set or misclassified as forest or agricultural land.
Without the inclusion of VIB, the standard SegFormer struggles to accurately predict land types, resulting in a high number of unclassified pixels (background category), as shown in the zoomed-in sections of Figure 12a and Figure 12b (second row). Incorporating VIB improves classification, with more pixels correctly identified. However, some misclassifications remain, such as in Figure 12a (second row). Here, some agricultural land (orange) is mistakenly identified as barren land (gray), a reasonable error given the visual similarity of these categories.
Adding OHEM further refines segmentation by focusing on challenging patterns, reducing overfitting, and simplifying decision paths. This results in less fragmented segmentation. For example, in Figure 12a (second row), while SegFormer+VIB produces a two-color segmentation, SegFormer+VIB+OHEM generates a more cohesive single-color result for agricultural land.
In general, OHEM promotes more coherent and complete segmentation. In Figure 12b (second row), OHEM fills in agricultural land more effectively than the results without OHEM. However, there are cases where OHEM introduces limitations. For example, in Figure 12a (third row), OHEM does not detect a narrow road (yellow) and classifies a larger area as background (white), while the “Segformer+VIB” method partially identifies the road in yellow. Thus, OHEM does not always guarantee superior results but expands the decision boundary for each category. This is evident in Figure 12b (third row) and Figure 12c (second row), where more classes are detected with OHEM compared to using VIB alone.

4.6.2. Feature-Based Analysis

In Section 4.6.1, a qualitative analysis was conducted to identify common errors in land classification. Two primary error patterns were observed:
  • Confusion Between Similar Land Types: Agricultural land, forest land, and barren land are often misclassified as one another.
  • Unrecognizable Land Types: Due to domain gaps, certain land types do not appear correctly and are instead classified as background.
This section analyzes the feature maps for each test case in Section 4.6.1 to further investigate the root causes of these errors. Each analysis includes:
  • The original and ground truth images are represented in subfigures (a) and (e).
  • Feature maps with high activation strengths for both the proposed method and the baseline SegFormer model. Feature maps extracted by our method are presented in subfigures (b, c, d) and those obtained from the traditional SegFormer are presented in subfigures (f, g, h).
These feature maps are evaluated following two criteria. In the first criterion (C1), pixels belonging to the same class should exhibit similar responses within a feature map. In the second criterion (C2), pixels with similar response strengths in a feature map should ideally belong to the same class.
(a) Case 1
Figure 13 presents the feature maps for a given image, highlighting the differences between the traditional SegFormer method and our proposed approach. The traditional SegFormer method does not fully satisfy Criterion C1. For example, in Figure 13f, the strongest responses appear in the “Building” class, but only certain “Building” pixels exhibit high activation, while others do not. This inconsistency suggests that the feature extractor fails to generalize to all instances of the same class. In contrast, the proposed method performs significantly better in maintaining feature consistency within classes. Both Figure 13b,c demonstrate uniform responses across pixels belonging to the same class. In detail, Figure 13b shows strong responses for “Building”; and Figure 13c highlights “Barren” and “Forest” land with consistent activation patterns. This improvement explains why traditional SegFormer tends to produce fragmented segmentations, whereas our method generates more continuous and coherent segments, as discussed in Section 4.6.1.
Both the SegFormer baseline and our method face some challenges in meeting Criterion C2. In the SegFormer baseline, the feature maps display extreme contrast, with a few pixels appearing highly activated while most remain nearly inactive. Combined with the observations for Criterion C1, this suggests that the filters fail to capture meaningful information, likely due to the domain gap, so the feature maps are barely activated in the new environment. Our method produces more structured feature maps, although certain land types still exhibit overlapping responses. As seen in Figure 13b, "Agriculture", "Barren", and "Forest" land share similar activation patterns, which explains why these categories are sometimes misclassified.
A notable strength of our method is that each feature map contributes distinct, nonoverlapping information, leading to more specialized feature representations. In Figure 13b, the feature map responds strongly to the “Building” class but minimally to the “Water” class. In Figure 13c, the feature map activates primarily in the “Agricultural” and “Forest” classes while deactivating the “Building” class. This differentiation suggests that the proposed method extracts more meaningful and discriminative features, ultimately improving segmentation performance across diverse land types.
(b)
Case 2
Figure 14 further illustrates the differences in feature extraction in another example. Similarly to the first case, the traditional SegFormer struggles with criterion C1, as seen in Figure 14f, where only a few “Forest” pixels exhibit strong responses. In contrast, our method ensures better class consistency. Figure 14b,c show that pixels belonging to the same classes have similar responses.
Regarding Criterion C2, both methods face challenges. The SegFormer baseline produces highly imbalanced feature maps that remain largely unresponsive, likely due to the domain gap. Our method is more structured but still shows some overlap in responses between agricultural land, barren land, and forest land, leading to occasional misclassification. However, a key advantage of our approach is that the feature maps remain distinct, with Figure 14c focused on the "Building" class and Figure 14b highlighting the "Agricultural" and "Forest" classes while responding only minimally to the "Building" class. This indicates that the proposed method extracts meaningful features.
(c)
Case 3
Figure 15 presents another example of feature maps, where our method effectively meets Criterion C2 because similar pixels exhibit consistent responses. For instance, Figure 15b strongly highlights the "Agricultural", "Barren", "Forest", and "Road" classes while suppressing the "Building" class. Figure 15c shows consistent activation for the "Barren" and "Agricultural" classes, while Figure 15d partially detects the "Forest" class but misses some areas, leading to segmentation errors. In contrast, the traditional SegFormer continues to exhibit fragmented feature maps, causing many regions to be misclassified as background, further reinforcing the advantage of our approach in maintaining spatial consistency.

5. Potential Limitations and Future Work

5.1. Potential Limitations

This paper focuses on developing a satellite image-segmentation solution to support land resource management in both practice and research. The proposed method uses VIB to create sparse features so that a model trained on the source data can predict more reliably on the target data. Although there are clear improvements, the method still cannot guarantee that a model trained with VIB will match its source-domain performance, because the target domain is highly diverse and contains a large amount of unknown information. Moreover, injecting noise during training improves generalization on the one hand, but the added randomness also increases the sensitivity of the training process.
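To make the noise-injection mechanism concrete, the following sketch outlines a minimal VIB-style bottleneck in the spirit of Alemi et al. [11]. It is an illustrative PyTorch example under our own naming assumptions rather than the exact F-Segformer module: the layer predicts a mean and log-variance for each feature, samples with the reparameterization trick during training, and returns a KL term that, weighted by β, pushes uninformative dimensions toward the prior and thereby yields sparse features.

```python
import torch
import torch.nn as nn

class VIBLayer(nn.Module):
    """Minimal variational information bottleneck over channel features."""

    def __init__(self, in_ch, bottleneck_ch):
        super().__init__()
        self.mu = nn.Conv2d(in_ch, bottleneck_ch, kernel_size=1)
        self.logvar = nn.Conv2d(in_ch, bottleneck_ch, kernel_size=1)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        if self.training:
            # Reparameterization trick: inject Gaussian noise during training only.
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        else:
            z = mu  # deterministic features at test time
        # KL(q(z|x) || N(0, I)), averaged over batch, channels, and positions.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).mean()
        return z, kl

# Training-time usage, where beta weights the feature-selection term:
#   z, kl = vib(features)
#   loss = seg_loss(decoder(z), labels) + beta * kl
```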
A possible solution to increase accuracy is to use labeled data from the target domain. However, this approach has its own challenges. While collecting satellite images is relatively easy, labeling them is very expensive; a single satellite image may require more than 45 min to label carefully. Therefore, unsupervised domain adaptation (UDA) can be applied to make the most of the unlabeled data from the target domain. Many of these methods rely on an adversarial loss to extract features shared by the source and target domains, so that the classifier can be trained on the labeled source data. More recently, self-training techniques have achieved strong results in UDA for segmentation.
Although UDA can exploit a large amount of unlabeled data to train the model, it is often sensitive and unstable; several reports note that reproduced UDA results fall short of the published numbers, possibly because of hardware limitations during reproduction. For applications that require very high accuracy, some labeled target data are still needed, and semi-supervised learning with domain adaptation becomes a viable solution. With the support of a small amount of labeled data, the model can reduce the uncertainty introduced by adversarial losses or self-training in UDA.

5.2. Potential Future Work

VIB is a plug-and-play module, which means it can be integrated into any network architecture or training framework. In this paper, applying VIB to the SegFormer model, without any additional support from target-domain data or domain-adaptation techniques, achieves accuracy surpassing most traditional domain-adaptation methods. Therefore, a viable research direction is to apply the VIB module within domain-adaptation frameworks; we expect that the sparse features produced by VIB can help address the remaining challenges of domain adaptation. To the best of our knowledge, no prior work discusses the effectiveness of VIB in UDA for segmentation. In addition, semi-supervised domain adaptation could be used to further improve robustness. Last but not least, owing to its plug-and-play nature, VIB can be applied in many other applications, such as object detection in satellite images, where it can be integrated into advanced object-detection methods to improve detection results. A minimal integration pattern is sketched below.
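As a hedged sketch of this plug-and-play pattern (module and argument names are illustrative, not taken from a specific framework), any backbone can be followed by a VIB bottleneck whose KL term is simply added to the task loss; the vib argument can be the VIBLayer sketched in Section 5.1, and the head can be a segmentation decoder or a detection head.

```python
import torch.nn as nn

class VIBWrapper(nn.Module):
    """Generic plug-and-play pattern: backbone -> VIB bottleneck -> task head.

    backbone, vib, and head are arbitrary nn.Module instances; vib must return
    a tuple (features, kl_term), e.g., the VIBLayer sketched in Section 5.1.
    """

    def __init__(self, backbone, vib, head, beta=0.05):
        super().__init__()
        self.backbone, self.vib, self.head = backbone, vib, head
        self.beta = beta

    def forward(self, images, labels=None, criterion=None):
        z, kl = self.vib(self.backbone(images))   # bottlenecked features
        logits = self.head(z)                     # task-specific prediction
        if labels is None or criterion is None:
            return logits                         # inference path
        return logits, criterion(logits, labels) + self.beta * kl
```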

6. Conclusions

This paper presents a segmentation method for land resource management, designed for practical scenarios where the model is trained and tested in different domains. The approach integrates VIB and OHEM into the SegFormer framework: VIB encourages sparse features, while OHEM accelerates learning. Through comprehensive experiments, the optimal hyperparameter setting was determined to be a learning rate of 0.0006 and β = 0.05. When the domain gap significantly affects performance, as seen on the Dalat dataset, our method improves the mIoU from 19.94% to 36.90%. Without using the target dataset or domain-adaptation techniques, it achieves 43.86% mIoU on the LoveDA dataset for the domain-adaptation task. Although the method does not surpass the state-of-the-art domain-adaptation methods (the upper bound), it outperforms several existing approaches under strict benchmarking conditions. Last but not least, visualizations of both prediction results and feature maps demonstrate how VIB enhances the learned features, providing valuable insight into its effectiveness.

Author Contributions

Conceptualization, M.-H.N.; Methodology, M.-H.N.; Validation, C.-C.V.; Visualization, C.-C.V.; Writing—original draft, M.-H.N.; Writing—review & editing, C.-C.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by project grant No. T2024-125, funded by Ho Chi Minh City University of Technology and Education, Vietnam.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data supporting the findings of this study are openly available at https://github.com/Junjue-Wang/LoveDA and https://zenodo.org/records/5706578. The code for the proposed method is openly available at https://github.com/HungNguyen224/VIB-STIC (accessed on 13 March 2025).

Acknowledgments

We extend our sincere appreciation to Van Linh-Vo and Long-Thien Bui of the Faculty of Electrical and Electronic Engineering at UTE for their invaluable assistance with data preparation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VIB: Variational Information Bottleneck
PPM: Pyramid Pooling Module
MiT: Mix Transformer
UDA: Unsupervised Domain Adaptation
IoU: Intersection over Union
SoTA: State-of-the-Art
OHEM: Online Hard Example Mining

Technical Terms

The following technical terms are used in this manuscript:
Term: Explanation
Source domain: The domain on which the model is originally trained; it typically has a large, often labeled, dataset.
Target domain: The domain in which the model is applied; it may have few labeled samples or only unlabeled data.
Domain gap: The difference in data distribution between the source domain (where a model is trained) and the target domain (where the model is applied). This gap can cause a drop in performance when transitioning from the source to the target domain.
Unseen domain: A testing dataset affected by a domain gap; its data are not seen during training.
Sparse feature: A feature in which many values are zero.
Data augmentation: A technique used to artificially increase the size and diversity of a dataset by applying transformations or modifications to existing data.
FLOPs: Floating-point operations; a measure of the computational complexity of a model obtained by counting the floating-point operations (multiplications, additions, etc.) required for one forward pass through the model.
Plug-and-play (PnP): A self-contained module that can be integrated into an existing system without requiring major modifications.
Best model selection: The model's best performance, recorded according to the validation metric over the training process.
Iteration: One update of the model's parameters using a single batch (a subset) of data.

References

  1. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  2. Pan, H.; Hong, Y.; Sun, W.; Jia, Y. Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes. IEEE Trans. Intell. Transp. Syst. 2023, 24, 3448–3460. [Google Scholar] [CrossRef]
  3. Xu, J.; Xiong, Z.; Bhattacharyya, S. PIDNet: A Real-time Semantic Segmentation Network Inspired from PID Controller. In Proceedings of the Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  4. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. arXiv 2016, arXiv:1612.01105. [Google Scholar]
  5. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Neural Information Processing Systems (NeurIPS), Virtual Event, 6–14 December 2021. [Google Scholar]
  6. Cheng, B.; Misra, I.; Schwing, A.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference On Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. [Google Scholar]
  7. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
  8. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. arXiv 2022, arXiv:2110.08733. [Google Scholar]
  9. Bui, L.; Vo, V.; Pham, T.; Tong, V.; Nguyen, M. Land Resources Statistics on Satellite Images. In Proceedings of the 2024 7th International Conference On Green Technology and Sustainable Development (GTSD), Ho Chi Minh City, Vietnam, 25–26 July 2024; pp. 247–252. [Google Scholar]
  10. Hung-Nguyen, M. Patch-Level Feature Selection for Thoracic Disease Classification by Chest X-ray Images Using Information Bottleneck. Bioengineering 2024, 11, 316. [Google Scholar] [CrossRef] [PubMed]
  11. Alemi, A.; Fischer, I.; Dillon, J.; Murphy, K. Deep Variational Information Bottleneck. International Conference on Learning Representations. Available online: https://openreview.net/forum?id=HyxQzBceg (accessed on 1 July 2024).
  12. Chen, M.; Zheng, Z.; Yang, Y.; Chua, T. Pipa: Pixel-and patch-wise self-supervised learning for domain adaptative semantic segmentation. In Proceedings of the ACM Multimedia, Lisbon, Portugal, 23–27 October 2023. [Google Scholar]
  13. Hoyer, L.; Dai, D.; Van Gool, L. Domain Adaptive and Generalizable Network Architectures and Training Strategies for Semantic Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 220–235. [Google Scholar] [CrossRef]
  14. Hoyer, L.; Dai, D.; Van Gool, L. HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 372–391. [Google Scholar]
  15. Hoyer, L.; Dai, D.; Wang, H.; Van Gool, L. MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  16. Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769. [Google Scholar]
  17. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–17209. [Google Scholar]
  18. Wei, X.; Rao, L.; Fan, G.; Chen, N. MLFMNet: A Multilevel Feature Mining Network for Semantic Segmentation on Aerial Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 17, 16165–16179. [Google Scholar] [CrossRef]
  19. Yang, Q.; Rao, L.; Fan, G.; Chen, N.; Cheng, S.; Song, X.; Yang, D. WatNet: A high-precision water body extraction method in remote sensing images under complex backgrounds. J. Appl. Remote. Sens. 2024, 18, 11. [Google Scholar] [CrossRef]
  20. Yu, Y.; Huang, L.; Lu, W.; Guan, H.; Ma, L.; Jin, S.; Yu, C.; Zhang, Y.; Tang, P.; Liu, Z.; et al. WaterHRNet: A multibranch hierarchical attentive network for water body extraction with remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2022, 115, 103103. [Google Scholar] [CrossRef]
  21. Vorotyntsev, P.; Gordienko, Y.; Alienin, O.; Rokovyi, O.; Stirenko, S. Satellite Image Segmentation Using Deep Learning for Deforestation Detection. In Proceedings of the 2021 IEEE 3rd Ukraine Conference on Electrical and Computer Engineering (UKRCON), Lviv, Ukraine, 26–28 August 2021; pp. 226–231. [Google Scholar]
  22. Javed, A.; Kim, T.; Lee, C.; Oh, J.; Han, Y. Deep Learning-Based Detection of Urban Forest Cover Change along with Overall Urban Changes Using Very-High-Resolution Satellite Images. Remote. Sens. 2023, 15, 4285. [Google Scholar] [CrossRef]
  23. John, D.; Zhang, C. An attention-based U-Net for detecting deforestation within satellite sensor imagery. Int. J. Appl. Earth Obs. Geoinf. 2022, 107, 102685. [Google Scholar] [CrossRef]
  24. Ding, Q.; Shao, Z.; Huang, X.; Wang, F.; Wang, M. MLFA-Net: Multi-level feature-aggregated network for semantic change detection in remote sensing images. Int. J. Digit. Earth. 2024, 17, 12. [Google Scholar] [CrossRef]
  25. Selvaraj, R.; Nagarajan, S. Chapter 6—Change detection techniques for a remote sensing application: An overview. In Cognitive Systems and Signal Processing in Image Processing; Academic Press: London, UK, 2022; pp. 129–143. [Google Scholar]
  26. Toker, A.; Kondmann, L.; Weber, M.; Eisenberger, M.; Andres, C.; Hu, J.; Hoderlein, A.; Senaras, C.; Davis, T.; Cremers, D.; et al. DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  27. Gerke, M. Use of the Stair Vision Library within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen); University of Twente: Twente, The Netherlands, 2015. [Google Scholar]
  28. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  29. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  30. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 834–848. [Google Scholar] [CrossRef]
  31. Csurka, G.; Volpi, R.; Chidlovskii, B. Unsupervised Domain Adaptation for Semantic Image Segmentation: A Comprehensive Survey. arXiv 2021, arXiv:2112.03241. [Google Scholar]
  32. Kukačka, J.; Golkov, V.; Cremers, D. Regularization for Deep Learning: A Taxonomy. In Proceedings of the 2018 International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  33. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  34. Arpaci, S.; Varli, S. Semantic Segmentation with the Mixup Data Augmentation Method. In Proceedings of the 2022 30th Signal Processing and Communications Applications Conference (SIU), Safranbolu, Turkey, 18–20 May 2022; pp. 1–4. [Google Scholar]
  35. Fang, F.; Hoang, N.; Xu, Q.; Lim, J. Data Augmentation Using Corner CutMix and an Auxiliary Self-Supervised Loss. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 830–834. [Google Scholar]
  36. Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An all-MLP Architecture for Vision. arXiv 2021, arXiv:2105.01601. [Google Scholar]
  37. Xue, Y.; Zhang, L.; Wang, B.; Li, F. Feature Selection Based on the Kullback–Leibler Distance and its Application on Fault Diagnosis. In Proceedings of the 2019 Seventh International Conference on Advanced Cloud and Big Data (CBD), Suzhou, China, 21–22 September 2019; pp. 246–251. [Google Scholar]
  38. Li, J.; Yu, Z.; Du, Z.; Zhu, L.; Shen, H. A Comprehensive Survey on Source-Free Domain Adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5743–5762. [Google Scholar] [CrossRef] [PubMed]
  39. Codalab LoveDA Semantic Segmentation Challenge. Available online: https://codalab.lisn.upsaclay.fr/competitions/424 (accessed on 3 October 2024).
  40. Gao, K.; Yu, A.; You, X.; Qiu, C.; Liu, B.; Zhang, F. Cross-Domain Multi-Prototypes with Contradictory Structure Learning for Semi-Supervised Domain Adaptation Segmentation of Remote Sensing Images. Remote Sens. 2023, 15, 3398. [Google Scholar] [CrossRef]
  41. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local Attention Embedding to Improve the Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote. Sens. 2021, 59, 426–435. [Google Scholar] [CrossRef]
  42. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  43. Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A Remote Sensing Foundation Model with Masked Image Modeling. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–22. [Google Scholar] [CrossRef]
  44. Luo, Y.; Zheng, L.; Guan, T.; Yu, J.; Yang, Y. Taking a Closer Look at Domain Shift: Category-Level Adversaries for Semantics Consistent Domain Adaptation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2502–2511. [Google Scholar]
  45. Mei, K.; Zhu, C.; Zou, J.; Zhang, S. Instance Adaptive Self-Training for Unsupervised Domain Adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  46. Wu, L.; Lu, M.; Fang, L. Deep Covariance Alignment for Domain Adaptive Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  47. Wang, H.; Shen, T.; Zhang, W.; Duan, L.; Mei, T. Classes Matter: A Fine-Grained Adversarial Approach to Cross-Domain Semantic Segmentation. In Proceedings of the Computer Vision–ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 642–659. [Google Scholar]
  48. Wang, J.; Zhong, Y.; Zheng, Z.; Ma, A.; Zhang, L. RSNet: The Search for Remote Sensing Deep Neural Networks in Recognition Tasks. IEEE Trans. Geosci. Remote. Sens. 2021, 59, 2520–2534. [Google Scholar] [CrossRef]
  49. Lian, Q.; Duan, L.; Lv, F.; Gong, B. Constructing Self-Motivated Pyramid Curriculums for Cross-Domain Semantic Segmentation: A Non-Adversarial Approach. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6757–6766. [Google Scholar]
  50. Zhao, Q.; Lyu, S.; Zhao, H.; Liu, B.; Chen, L.; Cheng, G. Self-training guided disentangled adaptation for cross-domain remote sensing image semantic segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103646. [Google Scholar] [CrossRef]
Figure 1. Diverse geographical landscape on the LoveDA dataset. (a) Image and label in urban areas; (b) image and label in rural areas.
Figure 2. Diverse geographical landscape on the LoveDA dataset and the Dalat dataset. Diversity examples are in (a) “Water”; (b) “Barren”; (c) “Building” and “Road”; (d) “Forest”; and (e) “Agricultural”. The first row contains images from the LoveDA dataset; the second row contains labels of images from the LoveDA dataset; the third row contains images from the Dalat dataset; the fourth row contains labels of images from the Dalat dataset.
Figure 3. An intuition between conventional and sparse features under the domain gap issue. Blue color means features on the source domain; green color means features on the target domain (unseen domain). The circle is the classifier trained on the source domain.
Figure 4. Network overview.
Figure 5. MiT backbone architecture.
Figure 6. Examples of satellite images and their labels in the Dalat dataset. The first row shows the original images, and the second row shows their corresponding labels.
Figure 7. Learning rate selection.
Figure 8. Effect of the β parameter. The validation dataset is from urban areas, and the testing dataset is from rural areas.
Figure 9. The feature-selection loss and the segmentation loss during training. The first, second, and third rows show L_fea and L_cls for β = 0.1, β = 0.05, and β = 0.01, respectively.
Figure 10. mIoU comparison with SoTA. Blue color means mIoU on the validation dataset (LoveDA), and orange color means mIoU on the testing dataset (Dalat).
Figure 11. mIoU and mACC in ablation studies. The values are percentages (%).
Figure 12. Qualitative results. The first row is the original image; the second and third rows are zoomed-in areas for each (ac) example.
Figure 13. Feature analysis in case 1.
Figure 14. Feature analysis in case 2.
Figure 15. Feature analysis in case 3.
Table 1. Detailed settings of MiT-B5.
StateOutput SizePatch SizeStrideChannel NumberNumber of Blocks
State 1743643
State 23211286
State 332132018
State 432151224
Table 2. IoU comparison on Dalat dataset. Training/validation datasets are from LoveDA. The values are percentages (%).
Class | DDRNet | Unet | MaskFormer | PID | UperNet | SegFormer | VIB | VIB+OHEM
Background | 14.45 | 1.75 | 15.03 | 6.46 | 11.87 | 6.46 | 6.67 | 6.31
Building | 13.70 | 19.73 | 28.39 | 22.49 | 18.00 | 25.42 | 32.07 | 32.49
Road | 11.61 | 2.05 | 18.50 | 14.75 | 21.92 | 5.24 | 19.47 | 19.11
Water | 52.63 | 45.93 | 36.27 | 56.57 | 51.25 | 19.54 | 58.68 | 60.29
Barren | 2.93 | 16.04 | 19.74 | 7.01 | 5.74 | 19.39 | 14.18 | 17.29
Forest | 11.63 | 27.42 | 13.59 | 6.27 | 5.74 | 26.64 | 63.71 | 64.83
Agricultural | 45.08 | 11.11 | 63.73 | 24.55 | 64.45 | 36.92 | 63.55 | 59.60
Mean | 21.72 | 17.72 | 26.61 | 20.95 | 25.57 | 19.94 | 36.90 | 37.13
Table 3. Complexity comparison.
DDRNetUnetMask-FormerPIDUperNetSeg-FormerVIBVIB+
OHEM
FLOPs (G)71.734815.2329323.733948.12425.02451.12440.26
Params (M)20.29628.991637.71864.04483.67285.26184.126
Training time (h)19.3630.9938.519.5225.2240.8243.2642.12
Table 4. IoU comparison on LoveDA dataset (urban areas). The values are percentages (%).
Class | LANet | PSPNet | DeepLabv3 | HRNet | MAE+UPerNet | Ours
Background | 43.99 | 51.59 | 50.21 | 50.25 | 51.09 | 42.86
Building | 45.77 | 51.32 | 45.21 | 50.23 | 46.12 | 65.83
Road | 49.22 | 53.34 | 46.73 | 53.26 | 50.88 | 61.60
Water | 64.96 | 71.07 | 67.06 | 73.20 | 74.93 | 69.78
Barren | 29.95 | 24.77 | 29.45 | 28.95 | 33.24 | 33.06
Forest | 31.91 | 22.29 | 31.42 | 33.07 | 29.89 | 46.03
Agricultural | 24.90 | 32.02 | 31.27 | 23.64 | 37.60 | 43.70
Mean | 41.53 | 43.77 | 43.05 | 44.66 | 46.25 | 51.84
Table 5. A comparison to domain-adaptation methods and other dropout settings. The source domain is the urban area; the target domain is the rural area in the LoveDA dataset. Results on the testing dataset are evaluated by services in the LoveDA domain-adaptation challenge [39]. The values are percentages (%).
Method | Background | Building | Road | Water | Barren | Forest | Agricultural | Mean
CLAN [44] | 22.93 | 44.78 | 25.99 | 46.81 | 10.54 | 37.21 | 24.45 | 30.39
IAST [45] | 29.97 | 49.48 | 28.29 | 64.49 | 2.13 | 33.36 | 61.37 | 38.44
FADA [47] | 24.39 | 32.97 | 25.61 | 47.59 | 15.34 | 34.35 | 20.29 | 28.65
TransNorm [48] | 19.39 | 36.30 | 22.04 | 36.68 | 14.00 | 40.62 | 3.30 | 24.62
PyCDA [49] | 12.36 | 38.11 | 20.45 | 57.16 | 18.32 | 36.71 | 41.90 | 32.14
DAFormer [13] | 37.39 | 52.84 | 41.99 | 72.05 | 11.46 | 46.79 | 61.27 | 46.25
MIC [15] | 36.20 | 47.84 | 39.23 | 70.05 | 13.27 | 45.52 | 60.74 | 44.89
DCA [46] | 36.38 | 55.89 | 40.56 | 62.03 | 22.01 | 38.92 | 60.52 | 45.17
DASegNet (DeeplabV3) | 33.79 | 55.95 | 39.69 | 69.28 | 14.19 | 44.79 | 62.16 | 45.69
DASegNet (SegFormer) | 36.78 | 59.83 | 43.77 | 73.83 | 19.38 | 49.96 | 67.01 | 50.08
VIB | 30.31 | 56.64 | 39.54 | 63.75 | 13.51 | 37.91 | 50.65 | 41.76
VIB+OHEM | 31.31 | 54.92 | 43.18 | 63.82 | 22.28 | 38.35 | 53.19 | 43.86
SegFormer (dropout = 0.05) | 20.62 | 34.12 | 31.7 | 37.31 | 7.33 | 33.15 | 20.31 | 26.36
SegFormer (dropout = 0.1) | 21.61 | 34.14 | 32.71 | 30.24 | 9.53 | 34.53 | 24.15 | 26.70
SegFormer (dropout = 0.15) | 18.82 | 33.31 | 33.9 | 45.95 | 4.683 | 33.04 | 14.69 | 26.34
Table 6. The iteration when the model obtains the best performance.
Setting | Base | SegFormer+VIB | SegFormer+VIB+OHEM
LoveDA → Dalat | 90k | 120k | 40k
Urban → Rural | 96k | 110k | 40k
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
