This section first introduces the overall network architecture of DA-AT and then describes its three components: channel-level data augmentation (CLDA), the adaptive threshold (AT), and the adaptive class weight (ACW).
3.1. DA-AT Network Architecture
As shown in Figure 1, training consists of a supervised branch using a labeled dataset and an unsupervised branch using an unlabeled dataset. We first outline the SSCD settings, then introduce the encoder–decoder structure, and finally delineate the flow of the supervised and unsupervised branches.
3.1.1. SSCD Settings
In the semi-supervised change detection setting, the following notation is used. The training data consist of two parts: a labeled dataset $\mathcal{D}^{l}$ and an unlabeled dataset $\mathcal{D}^{u}$. Specifically, the labeled dataset is denoted as $\mathcal{D}^{l}=\{(x_{i}^{T_{0}}, x_{i}^{T_{1}}, y_{i})\}_{i=1}^{n}$, where $(x_{i}^{T_{0}}, x_{i}^{T_{1}})$ represents the $i$th pair of bi-temporal remote sensing images, $x_{i}^{T_{0}}$ represents the image at time $T_{0}$, $x_{i}^{T_{1}}$ represents the image at time $T_{1}$, $y_{i}$ represents the corresponding binary pixel-level ground truth, and $n$ represents the size of the dataset. The unlabeled dataset is denoted as $\mathcal{D}^{u}=\{(u_{i}^{T_{0}}, u_{i}^{T_{1}})\}_{i=1}^{N}$, where $(u_{i}^{T_{0}}, u_{i}^{T_{1}})$ represents the $i$th pair of bi-temporal remote sensing images and $N$ represents the size of the dataset. Unlike the labeled dataset, it carries no labels for its image pairs. The number of labeled image pairs $n$ is far smaller than the number of unlabeled pairs $N$.
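For illustration, the two datasets can be organized as simple containers of bi-temporal pairs; the following is a minimal sketch under that assumption (the class name `BiTemporalPair` and the shapes are illustrative, not part of the method):

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class BiTemporalPair:
    """One sample: two co-registered images and an optional binary change mask."""
    img_t0: np.ndarray            # image at time T0, shape (3, H, W)
    img_t1: np.ndarray            # image at time T1, shape (3, H, W)
    label: Optional[np.ndarray]   # change map, shape (H, W); None for unlabeled pairs

# labeled set D^l (size n) carries ground truth; unlabeled set D^u (size N >> n) does not
labeled_set   = [BiTemporalPair(np.zeros((3, 256, 256)), np.zeros((3, 256, 256)), np.zeros((256, 256)))]
unlabeled_set = [BiTemporalPair(np.zeros((3, 256, 256)), np.zeros((3, 256, 256)), None)]
```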
3.1.2. Shared Encoder–Decoder Model
In semi-supervised remote sensing change detection based on consistency regularization, the core lies in exploring effective training strategies, whereas the network structure itself is not the focus of research. Therefore, we adopt a basic encoder–decoder structure [26,45]; the model mainly includes two identical encoders E, a decoder D, and a pyramid pooling module (PPM) [46]. The two encoders extract features from the two remote sensing images, respectively, and the decoder processes the difference features to obtain the change probability map.
Specifically, a pair of remote sensing images $(x^{T_{0}}, x^{T_{1}})$ is fed into the two encoders $E$ after the same data augmentation, where $x^{T_{0}}$ and $x^{T_{1}}$ have the same dimensions $\mathbb{R}^{3\times H\times W}$. This results in a pair of feature maps $f^{T_{0}}$ and $f^{T_{1}}$, which are both of size $\mathbb{R}^{C\times\frac{H}{s}\times\frac{W}{s}}$ and have high dimensionality. Here, $H$ and $W$ represent the height and width of the input images, respectively, while $C$ and $s$ denote the number of feature channels and the spatial downsampling ratio. Notably, we employ a pre-trained ResNet50 model [47], where $C$ and $s$ are set to 2048 and 8, respectively.
Next, to compute the feature difference of the two images after applying the same type of data augmentation (either weak or strong), $f^{T_{0}}$ and $f^{T_{1}}$ are used in the difference operation. However, a simple difference operation may result in the loss of important details. To address this, we utilize the PPM to further process the feature difference map. This allows us to capture feature information at different scales, which is crucial for understanding the context of the image.
Finally, the decoder D is employed to predict from the feature difference map. The decoder D is composed of a series of convolutional upsampling modules [48], which restore the spatial resolution of the input and output a probability map $p\in\mathbb{R}^{2\times H\times W}$. The number 2 in $p$ represents the two classes: changed and unchanged. The map $p$ is then normalized by applying a softmax function along the category dimension, resulting in pixel-wise predictions in the range [0, 1]. The number of input channels and output classes in our model can be flexibly adjusted according to the specific requirements of the dataset and task at hand. Here, $(h, w)$ denotes a spatial pixel location in the probabilistic prediction map $p$.
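As a rough illustration, the forward pass described above can be sketched in PyTorch as follows. This is a minimal sketch under our own assumptions: the module names, the absolute-difference fusion, and the placeholder PPM and decoder are illustrative, and the plain ResNet50 used here has stride 32 rather than the stride-8 variant mentioned above.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ChangeDetector(nn.Module):
    """Siamese encoder-decoder sketch: two encoders, feature difference, PPM, decoder."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        # two identical ResNet50 encoders (C = 2048 output channels); whether they share
        # weights is not specified here, so two separate instances are used for clarity
        self.encoder_t0 = nn.Sequential(*list(backbone.children())[:-2])
        self.encoder_t1 = nn.Sequential(*list(backbone.children())[:-2])
        self.ppm = nn.Identity()          # placeholder for the pyramid pooling module (PPM)
        self.decoder = nn.Sequential(     # placeholder for convolutional upsampling modules
            nn.Conv2d(2048, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, x_t0: torch.Tensor, x_t1: torch.Tensor) -> torch.Tensor:
        f_t0 = self.encoder_t0(x_t0)                  # (B, C, H/s, W/s)
        f_t1 = self.encoder_t1(x_t1)
        diff = torch.abs(f_t0 - f_t1)                 # simple feature difference
        logits = self.decoder(self.ppm(diff))         # (B, 2, H/s, W/s)
        logits = nn.functional.interpolate(           # restore spatial resolution
            logits, size=x_t0.shape[-2:], mode="bilinear", align_corners=False)
        return torch.softmax(logits, dim=1)           # pixel-wise probabilities p in [0, 1]
```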
3.1.3. DA-AT Framework
In Figure 1, our training is divided into two parts according to whether labels are used: the supervised branch and the unsupervised branch are trained simultaneously and share the same network weights. The labeled dataset $\mathcal{D}^{l}$ drives the supervised branch. Initially, the input bi-temporal remote sensing image pair $(x^{T_{0}}, x^{T_{1}})$ undergoes weak augmentation to obtain $(x_{w}^{T_{0}}, x_{w}^{T_{1}})$, which is then input into the encoder–decoder. This process yields the pixel-level change probability map $p^{l}$. For the supervised part, the cross-entropy (CE) loss [49] is used to minimize the discrepancy between the probability map $p^{l}$ and the label $y$. It is expressed as follows:
$$\mathcal{L}_{s}=\frac{1}{|\Omega|}\sum_{(h,w)\in\Omega}\mathrm{CE}\big(p^{l}_{(h,w)},\,y_{(h,w)}\big),$$
where $\Omega$ denotes the set of all spatial pixel locations.
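A minimal sketch of the supervised step, assuming raw decoder logits of shape (B, 2, H, W) and integer labels of shape (B, H, W); the function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits_weak: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pixel-wise CE between predictions on weakly augmented labeled pairs and ground truth."""
    # logits_weak: (B, 2, H, W) decoder outputs; labels: (B, H, W) with values {0, 1}
    return F.cross_entropy(logits_weak, labels.long())
```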
The unlabeled dataset $\mathcal{D}^{u}$ is utilized in the unsupervised branch. Initially, the input bi-temporal remote sensing image pair $(u^{T_{0}}, u^{T_{1}})$ undergoes both weak and strong augmentations to obtain $(u_{w}^{T_{0}}, u_{w}^{T_{1}})$ and $(u_{s}^{T_{0}}, u_{s}^{T_{1}})$. These two augmented image pairs are then fed separately into the encoder–decoder network, resulting in the pixel-level change probability maps $p_{w}$ and $p_{s}$. The map $p_{w}$ is subsequently used to update the adaptive threshold filter, and the corresponding pseudo-label $\hat{y}$ is generated through binarization. Here, "stop gradient" refers to the fact that the results of the weak augmentation branch only provide pseudo-labels for the self-training of the strong augmentation branch. The consistency loss function is then applied to minimize the discrepancy between the probability map $p_{s}$ and the pseudo-label $\hat{y}$. This process can be summarized as follows:
$$\mathcal{L}_{u}=\frac{1}{|\Omega|}\sum_{(h,w)\in\Omega}\mathbb{1}\big(\max_{c}p_{w,(h,w)}(c)\geq\tau\big)\,H\big(p_{s,(h,w)},\,\hat{y}_{(h,w)}\big).$$
In this context, $\tau$ represents the predefined confidence threshold used to filter out noisy labels. The condition $\mathbb{1}\big(\max_{c}p_{w,(h,w)}(c)\geq\tau\big)$ indicates that if the predicted probability exceeds $\tau$, the pixel is considered a high-quality pseudo-label and the indicator is assigned a value of 1; otherwise, it is set to 0. The function $H$ typically refers to the CE loss. The details of the weak and strong augmentations used here are described in the Channel-Level Data Augmentation section.
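A minimal sketch of this fixed-threshold consistency step, assuming raw logits from the weak and strong branches; function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(logits_weak: torch.Tensor, tau: float = 0.95):
    """Binarize weak-branch predictions and build a confidence mask (stop-gradient branch)."""
    probs = torch.softmax(logits_weak, dim=1)          # (B, 2, H, W)
    conf, pseudo = probs.max(dim=1)                    # per-pixel max probability and its class
    mask = (conf >= tau).float()                       # 1 = high-quality pseudo-label, 0 = ignored
    return pseudo, mask

def consistency_loss(logits_strong: torch.Tensor, pseudo: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """CE between strong-branch predictions and pseudo-labels, restricted by the mask."""
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")  # (B, H, W)
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```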
Specifically, in the experimental stage, the proposed semi-supervised framework integrates CLDA, AT, and ACW to achieve robust and effective change detection. First, weak augmentation is applied to the input bi-temporal remote sensing image pair $(u^{T_{0}}, u^{T_{1}})$, generating $(u_{w}^{T_{0}}, u_{w}^{T_{1}})$, which is processed by the network to produce the pixel-level change probability map $p_{w}$. Simultaneously, CLDA introduces strong augmentation to the same input pair, yielding $(u_{s}^{T_{0}}, u_{s}^{T_{1}})$ and the corresponding probability map $p_{s}$. By enforcing consistency between $p_{w}$ and $p_{s}$, CLDA improves the model's robustness to input variations and strengthens its feature representations. Based on $p_{w}$, the AT module dynamically adjusts thresholds for pseudo-label selection, balancing their quality and quantity to enhance training stability. Recognizing the inherent imbalance in change detection tasks, the ACW module applies targeted optimization constraints, assigning higher weights to the changed class to mitigate class imbalance. Together, these components work synergistically: CLDA ensures robust and consistent learning across augmentations, AT refines pseudo-labels for effective semi-supervised learning, and ACW emphasizes minority-class learning, forming a unified framework that addresses the challenges of semi-supervised change detection.
3.2. Channel-Level Data Augmentation
In consistency regularization-based training, utilizing both strong and weak augmentation methods is essential. General weak augmentation techniques, such as random flipping, cropping, and resizing, increase the diversity of the training data and help the model learn a broader range of features. In contrast, strong augmentations involve more substantial perturbations, such as adjustments to brightness and color and image masking [50], which introduce a greater degree of change. However, traditional augmentation methods primarily operate on the superficial appearance of the image and often fail to fully exploit the channel information. To address this limitation, we build our strong augmentation on the randomized quantization (RQ) data augmentation method proposed by Wu et al. [51].
In Figure 2, the weak augmentation result is achieved by sequentially applying resize, crop, and flip operations to the original image. Specifically, the resize operation randomly rescales the image by a factor within the range [0.5, 2.0]. The crop operation randomly generates a crop size and extracts a corresponding region of the image. Finally, the flip operation applies a horizontal flip with a 50% probability. Building on this, RQ is then applied to the RGB channels of the image to achieve strong augmentation. Specifically, the data in each channel are divided into a number of intervals defined by $n$, and the original value $x$ within each interval is mapped to a value $y$ randomly sampled from the same interval. This approach introduces a unique quantization value for each interval, thereby enhancing the diversity of the data. Based on our experiments and the referenced paper, we set $n$ to 8, as a smaller number of intervals results in a more pronounced augmentation effect.
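A minimal sketch of the channel-level randomized quantization described above; the function name and the use of uniform, fixed interval boundaries over [0, 255] are our assumptions, and the original RQ method may differ in such details:

```python
import torch

def randomized_quantization(img: torch.Tensor, n_intervals: int = 8) -> torch.Tensor:
    """Channel-level strong augmentation: every value in an interval is replaced by one
    value randomly sampled from that interval, independently for each RGB channel."""
    # img: (3, H, W) tensor with values in [0, 255]
    out = img.clone()
    edges = torch.linspace(0.0, 255.0, n_intervals + 1)
    for c in range(img.shape[0]):                       # apply RQ per channel
        for i in range(n_intervals):
            lo, hi = edges[i].item(), edges[i + 1].item()
            last = (i == n_intervals - 1)
            in_bin = (img[c] >= lo) & ((img[c] <= hi) if last else (img[c] < hi))
            # one random replacement value per interval, sampled uniformly from the interval
            out[c][in_bin] = torch.empty(1).uniform_(lo, hi).item()
    return out
```

With n_intervals = 8, each channel keeps at most eight distinct values per image, which is what makes the perturbation noticeably stronger than conventional color jitter.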
3.3. Adaptive Threshold
To fully utilize the unlabeled dataset and enhance the model's generalization ability, we generate a pseudo-label $\hat{y}$ from the predicted probability map $p_{w}$, which is obtained after weak augmentation and then binarized. This pseudo-label $\hat{y}$ is used to supervise the predicted probability map $p_{s}$, which is obtained after strong augmentation. The formula for generating the pseudo-label is expressed as follows:
$$\hat{y}_{(h,w)}=\arg\max_{c}\,p_{w,(h,w)}(c).$$
Specifically, $\arg\max_{c}p_{w,(h,w)}(c)$ represents the class with the maximum predicted probability in $p_{w}$ for the image pair at spatial position $(h,w)$. Here, $c=1$ denotes the changed class, while $c=0$ denotes the unchanged class.
As mentioned earlier, a fixed threshold $\tau$ (e.g., 0.5, 0.95, 0.99) is commonly used to filter out high-quality pixel-wise pseudo-labels to constrain the results of strong augmentation. For clarity, we denote the confidence mask corresponding to $p_{w}$ as $M$. The corresponding calculation formula is expressed as follows:
$$M_{(h,w)}=\mathbb{1}\big(\max_{c}p_{w,(h,w)}(c)\geq\tau\big).$$
Generally, setting a threshold of 0.5 allows the use of all pseudo-labels, but it introduces excessive noise, which can reduce training accuracy. Alternatively, setting a higher threshold, such as 0.99, yields higher-quality pseudo-labels, but this approach may miss the opportunity to learn diverse predictions during the early stages of training, leading to low data utilization and hindering the model’s ability to fully learn. To achieve a balance between quantity and quality, we implement an adaptive threshold strategy. This approach gradually increases the confidence threshold based on the model’s predictions at round t.
First, the model is used to predict on the weakly augmented unlabeled data, and the maximum prediction probability for each pixel is calculated. Then, these probabilities are averaged per class $c$ to obtain a local prediction confidence $\tilde{p}_{t}(c)$:
$$\tilde{p}_{t}(c)=\frac{\sum_{k=1}^{B}\sum_{(h,w)\in\Omega}\mathbb{1}\big(\hat{y}_{k,(h,w)}=c\big)\,\max_{c'}p_{w,k,(h,w)}(c')}{\sum_{k=1}^{B}\sum_{(h,w)\in\Omega}\mathbb{1}\big(\hat{y}_{k,(h,w)}=c\big)}.$$
Here, $B$ represents the batch size, and $\max_{c'}p_{w,k,(h,w)}(c')$ denotes the maximum probability, corresponding to class $c$, at the spatial location $(h,w)$ in the $k$-th image pair.
Considering the imbalance between categories, we resample the data. Specifically, we calculate the proportion $\sigma_{t}(c)$ of pixels whose maximum-probability prediction is class $c$, and then determine the inverse weight $w_{t}(c)$ of class $c$ based on this proportion. The expression is as follows:
$$\sigma_{t}(c)=\frac{N_{c}}{\sum_{c'}N_{c'}},\qquad w_{t}(c)=\frac{1}{\sigma_{t}(c)}.$$
Here, $N_{c}$ refers to the number of pixels belonging to class $c$.
Then, to facilitate the update, we estimate the adaptive threshold $\tau_{t}(c)$ as the exponential moving average (EMA) of the confidence over training rounds. It is initialized to $\tau_{0}(c)=\frac{1}{C}$, where $C$ is the number of classes:
$$\tau_{t}(c)=\lambda\,\tau_{t-1}(c)+(1-\lambda)\,w_{t}(c)\,\tilde{p}_{t}(c).$$
Here, $\lambda$ represents the momentum decay of the EMA, which falls within the range $(0,1)$. For binary change detection, $C$ is set to 2, meaning the initial threshold is 0.5.
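A minimal sketch of the adaptive-threshold update, assuming per-class statistics are collected from the weak-branch predictions of one batch; how the inverse class weight modulates the EMA update and the clamping of the result are our assumptions, and all names are illustrative:

```python
import torch

class AdaptiveThreshold:
    """Per-class confidence threshold updated as an EMA over training rounds."""
    def __init__(self, num_classes: int = 2, momentum: float = 0.999):
        self.tau = torch.full((num_classes,), 1.0 / num_classes)  # initialized to 1/C (0.5 for C = 2)
        self.momentum = momentum
        self.num_classes = num_classes

    @torch.no_grad()
    def update(self, probs_weak: torch.Tensor) -> torch.Tensor:
        # probs_weak: (B, C, H, W) softmax outputs of the weak branch
        conf, pred = probs_weak.max(dim=1)                        # per-pixel max prob and class
        for c in range(self.num_classes):
            sel = pred == c
            if sel.any():
                p_local = conf[sel].mean()                        # local confidence for class c
                sigma = sel.float().mean()                        # proportion of pixels predicted as c
                w = 1.0 / sigma.clamp(min=1e-6)                   # inverse class weight (assumed usage)
                self.tau[c] = (self.momentum * self.tau[c]
                               + (1 - self.momentum) * (w * p_local).clamp(max=1.0))
        return self.tau

    @torch.no_grad()
    def mask(self, probs_weak: torch.Tensor) -> torch.Tensor:
        """Confidence mask M: keep pixels whose max probability exceeds the threshold of their predicted class."""
        conf, pred = probs_weak.max(dim=1)
        return (conf >= self.tau.to(conf.device)[pred]).float()
```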
After adopting the adaptive threshold, the confidence mask corresponding to the pseudo-labels obtained through our screening process, which balances their quality and quantity, can be redefined as
$$M_{(h,w)}=\mathbb{1}\big(\max_{c}p_{w,(h,w)}(c)\geq\tau_{t}(\hat{y}_{(h,w)})\big).$$
3.4. Adaptive Class Weight
However, the adaptive threshold alone is insufficient because it overlooks the varying learning difficulties of different classes. Intuitively, in the semi-supervised change detection task, predicting the changed class is generally more challenging than predicting the unchanged class. To address this, we propose adaptive class weights, which encourage the model to focus more on the minority class rather than predominantly on the majority class. Similarly, we count the number of pixels in each category to obtain the corresponding resampling rate $r_{t}(c)$ for each category. Unlike the adaptive threshold, where we adjust by multiplying inverse weights, here we utilize all the minority-class pixels without inversely scaling the weights, which ensures that the model pays greater attention to the minority-class information in the image:
$$r_{t}(c)=\frac{\max_{c'}N_{c'}}{N_{c}}.$$
Here, $\max_{c'}N_{c'}$ refers to the largest pixel count across all classes.
Therefore, we can adjust the weight of the loss function based on the resampling rate $r_{t}$ of each pixel. The weight map of pixels can be represented as
$$W_{(h,w)}=r_{t}\big(\hat{c}_{(h,w)}\big).$$
Here, $\hat{c}_{(h,w)}=\arg\max_{c}p_{w,(h,w)}(c)$ refers to the class with the highest probability predicted by the model at location $(h,w)$.
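A minimal sketch of the adaptive class weight, assuming the resampling rate is computed from the pixel counts of the weak-branch predictions; the exact normalization is our assumption, and names are illustrative:

```python
import torch

@torch.no_grad()
def adaptive_class_weights(probs_weak: torch.Tensor, num_classes: int = 2) -> torch.Tensor:
    """Per-pixel loss weights that emphasize the minority (changed) class."""
    pred = probs_weak.argmax(dim=1)                               # (B, H, W) predicted classes
    counts = torch.stack([(pred == c).sum() for c in range(num_classes)]).float()
    rate = counts.max() / counts.clamp(min=1.0)                   # resampling rate r_t(c); majority class -> 1
    return rate[pred]                                             # weight map W, shape (B, H, W)
```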
Thus, the loss used in the unsupervised part can be reformulated as follows:
$$\mathcal{L}_{u}=\frac{1}{|\Omega|}\sum_{(h,w)\in\Omega}M_{(h,w)}\,W_{(h,w)}\,H\big(p_{s,(h,w)},\,\hat{y}_{(h,w)}\big).$$
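Putting the pieces together, a minimal sketch of the reformulated unsupervised loss, combining the adaptive-threshold mask and the adaptive class weights; normalizing by the number of masked pixels is our assumption:

```python
import torch
import torch.nn.functional as F

def weighted_consistency_loss(logits_strong: torch.Tensor,
                              pseudo: torch.Tensor,
                              mask: torch.Tensor,
                              weight: torch.Tensor) -> torch.Tensor:
    """Unsupervised loss: class-weighted CE on the strong branch, gated by the adaptive-threshold mask."""
    ce = F.cross_entropy(logits_strong, pseudo, reduction="none")   # (B, H, W)
    return (ce * mask * weight).sum() / mask.sum().clamp(min=1.0)
```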