Semi-Supervised Object Detection for Remote Sensing Images Using Consistent Dense Pseudo-Labels

Zhao, Tong; Zeng, Yujun; Fang, Qiang; Xu, Xin; Xie, Haibin

doi:10.3390/rs17081474

Open Access Article


by Tong Zhao †, Yujun Zeng †, Qiang Fang *, Xin Xu and Haibin Xie

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2025, 17(8), 1474; https://doi.org/10.3390/rs17081474

Submission received: 8 March 2025 / Revised: 11 April 2025 / Accepted: 12 April 2025 / Published: 21 April 2025

(This article belongs to the Special Issue Deep Learning-Driven Remote Sensing Image Processing for Object Detection and Localization)


Abstract:

Semi-supervised learning aims to improve the generalization performance of a model by exploiting the large quantity of unlabeled data together with limited labeled data during training. When applied to object detection in remote sensing images, semi-supervised learning can not only effectively alleviate the time-consuming and costly labeling of bounding boxes but also improve the performance and generalization of corresponding object detection methods. However, most current semi-supervised learning-based object detection methods (especially combined with pseudo-labels) for remote sensing images ignore a key issue, that is, the consistency of pseudo-labels. In this paper, a novel semi-supervised learning-based method for object detection in remote sensing images called CDPL is proposed, which includes an adaptive mechanism that directly incorporates the potential object information into the dense pseudo-label selection process and carefully selects the appropriate dense pseudo-labels in the scenes where objects are densely distributed. CDPL consists of two main components: feature-aligned dense pseudo-label selection and sparse pseudo-label-based regression object alignment. The experimental results for typical remote sensing datasets show that the proposed method results in a satisfactory performance improvement.

Keywords: object detection; remote sensing; semi-supervised learning; pseudo-label


1. Introduction

Object detection is one of the core problems in computer vision and machine learning. Currently, the most widely used object detection methods are based on deep learning and can be divided into four main categories. First, multi-stage object detection methods such as RCNN [1], Fast RCNN [2], Faster RCNN [3], FPN [4], Cascade RCNN [5], and Mask RCNN [6] first extract candidate regions that may contain objects and then extract features from those candidate regions for classification and localization. The most representative of these is Faster RCNN, which realizes end-to-end detection at close to real-time speed. It introduces the Region Proposal Network (RPN), which greatly reduces the cost of generating candidate regions. Second, single-stage object detection methods such as YOLO [7], SSD [8], and RetinaNet [9] discard the candidate region generation stage and directly predict offsets relative to anchor boxes to obtain the final detection result. Third, anchor-free object detection methods such as CornerNet [10], CenterNet [11], and FCOS [12] remove the step of presetting anchors and detect objects directly from the feature map. Note that FCOS introduces several optimizations, such as an additional branch that measures the center-ness of the detection results and a center sampling strategy that improves the quality of positive samples. Fourth, transformer-based object detection methods such as ViT [13], Swin Transformer [14], and DETR [15] use transformers [16] to extend various aspects of deep learning-based object detection. All of the above can be regarded as object detection methods for general scenes, and all belong to the category of supervised learning.

Recently, object detection in remote sensing images [17,18] has gradually become a research hotspot in the field of computer vision and artificial intelligence. Because objects in remote sensing images usually exhibit arbitrary angles, dense distributions, and large aspect ratios, object detection in remote sensing images usually uses a rotated bounding box to represent the detected target. In order to effectively alleviate the large-scale data labeling burden when using supervised learning-based object detection methods, which is time-consuming, costly, and extremely difficult, more attention has been paid to object detection in remote sensing images based on semi-supervised learning, which tries to utilize limited labeled data and a large quantity of easily available unlabeled data to improve performance. It is of great significance to fully explore the characteristics of remote sensing images and design appropriate mechanisms to improve the performance and generalization of semi-supervised learning-based object detection methods for remote sensing images.

SOOD [19] first explored semi-supervised object detection for remote sensing images. It crafts adaptive loss weights for rotated objects based on the angular difference between the teacher model and the student model, so that training focuses on difficult samples with large angular deviations. In addition, SOOD exploits the object distribution in remote sensing images by introducing a global consistency loss. With these designs, SOOD achieved excellent detection performance. PST [20] employs a new semi-supervised learning framework that centers on cross-checking predictions and collaboratively generating high-quality pseudo-labels through two teacher models, each of which is optimally updated by the same student model. To reduce the unreliability of pseudo-labels in terms of localization, scale, and orientation, PST introduces Gaussian distribution modeling combined with the symmetric and bounded Jensen–Shannon divergence to evaluate the bias between the predictions of the different teacher models, in order to exclude low-quality pseudo-labels generated under inconsistent regression estimates. In addition, to solve the scale invariance problem in object detection for remote sensing images, PST proposes a scale-adaptive knowledge distillation mechanism that aligns the features the student model extracts from multi-scale images with the interpolated multi-scale features generated by the teacher model. Focal Teacher [21], on the other hand, analyzes the factors affecting semi-supervised object detection in remote sensing images and proposes a global focus learning method that does not require any manually designed priors. The method blurs the boundaries between positive and negative samples through global regions and soft regression methods, while utilizing the localization consistency between the teacher model and the student model to focus on difficult regions.

In this paper, a novel semi-supervised learning-based method for object detection in remote sensing images called CDPL is proposed, which innovatively introduces feature-aligned dense pseudo-label selection and sparse pseudo-label-based regression object alignment to tackle the feature inconsistency problem in dense pseudo-label selection and the regression object inconsistency of single-instance dense pseudo-labels, to improve both the accuracy and the generalization performance of object detection in remote sensing images. The experimental results for typical benchmarks with related comparative methods show that the proposed method achieves SOTA performance.

The rest of this paper is organized as follows: Section 2 provides a necessary preliminary description and introduces related work. Section 3 introduces the proposed CDPL, detailing its mechanism, architecture, and implementation. Section 4 provides the corresponding experimental results for typical datasets (including from ablation and comparison experiments), together with the implementation details. Finally, Section 5 summarizes the findings and innovations regarding the proposed CDPL.

2. Preliminary

Semi-supervised learning, a representative research area in machine learning, aims to improve the generalization performance of a model by exploiting a large quantity of unlabeled data during training, thus alleviating the need for labeled data. It usually assumes two datasets whose samples are independent and identically distributed: the labeled dataset $D_{l} = \{(X_{i}^{l}, Y_{i}^{l})\}_{i=1}^{N_{l}}$ and the unlabeled dataset $D_{u} = \{X_{i}^{u}\}_{i=1}^{N_{u}}$, containing $N_{l}$ labeled samples and $N_{u}$ unlabeled samples, respectively, where usually $N_{u} \gg N_{l}$. Semi-supervised learning is usually based on several assumptions:

Smoothness assumption: If two samples located in a high-density region are close together, then their corresponding labels should also be close. Conversely, if two samples are separated by a low-density region, their corresponding labels will tend to be different.
Cluster assumption: when two samples are located in the same cluster, they have the same label with high probability.
Low-density separation assumption: decision boundaries should cut through sparse data regions, while avoiding splitting samples from high-density data regions on either side of the decision boundary.
Manifold assumption: the input space consists of multiple low-dimensional manifolds on which all samples are located, and samples located on the same manifold have the same label.

Pseudo-labeling is a key technique in semi-supervised learning, which trains a model by making iterative predictions on unlabeled data and using the predictions that satisfy specific conditions as pseudo-labels. Related works include Pseudo-Label [22], Noisy Student [23], Meta Pseudo Labels [24], CSD [25], STAC [26], FixMatch [27], Mean Teacher [28], Unbiased Teacher [29], Instant-Teacher [30], Soft Teacher [31], Consistent Teacher [32], PseCo [33], DualPolish [34], and so on. Among them, Unbiased Teacher, Soft Teacher, Consistent Teacher, and PseCo are representative examples.

Unbiased Teacher uses focal loss [9] to replace cross-entropy to address class imbalance in pseudo-labeling. It introduces adversarial distortion and label smoothing to enhance fairness. Adversarial distortion forces the model to focus on non-sensitive features by applying small perturbations to the input data, while label smoothing helps the model avoid relying too heavily on any specific feature by adding noise to the label assignment.

Soft Teacher utilizes a teacher model to provide pseudo-labels for unlabeled data, thereby enhancing the performance of a student model. The teacher model generates soft labels (probability distributions) for unlabeled data, which are then used to supervise the training of the student model. The teacher model is typically updated as an exponential moving average of the student model’s parameters, ensuring that it remains accurate and stable throughout the training process.

Consistent Teacher proposes a unified framework combining adaptive anchor assignment, 3D feature alignment, and adaptive thresholding based on Gaussian mixture models to mitigate inconsistencies during semi-supervised training.

The core idea of PseCo is to improve pseudo-label quality and to incorporate multi-level consistency into consistency training. In pseudo-labeling, PseCo focuses on generating high-quality pseudo-bounding boxes for unlabeled data, ensuring not only accurate classification scores but also precise localization. For consistency training, PseCo emphasizes both label-level and feature-level consistency. It highlights the importance of feature-level consistency in ensuring scale invariance.

DualPolish aims to refine pseudo-labels generated from unlabeled data to bridge the gap between noisy pseudo-annotations and ground-truth labels. It focuses on polishing noisy pseudo-labels to make them more accurate and reliable, which leverages the additional information contained in the unlabeled data while mitigating the negative impact of noisy annotations.

It is worth mentioning that all the above methods are based on the traditional sparse pseudo-labeling paradigm. As a more direct form of pseudo-label representation, dense pseudo-labeling retains more information and is gradually gaining popularity. In contrast, Dense Teacher [35] replaces sparse pseudo-label bounding boxes with the dense pseudo-label ones in order to mitigate the impact of post-processing and confidence thresholds on detection performance. Denser Teacher [36] tries to generate dense pseudo-labels by utilizing unlabeled data, which are then used to train the detection model alongside the labeled data. The key innovation lies in the ability to produce high-quality pseudo-labels that are close to ground-truth annotations, enabling the model to learn more effectively from the additional data. ARSL [37] also adopts the form of dense pseudo-labeling, focusing on the selection ambiguity and assignment ambiguity of dense pseudo-labeling. To address the selection ambiguity, ARSL proposes joint confidence estimation to jointly quantify the classification and localization quality of pseudo-labels. To address the assignment ambiguity, ARSL introduces a task-separated assignment mechanism to assign labels based on dense pseudo-labels and exploits pseudo-labels for classification and localization tasks to improve the robustness to assignment ambiguity.

However, the existing methods mentioned above ignore a key issue in the semi-supervised learning process, i.e., the consistency of pseudo-labels. In object detection for remote sensing images, this issue manifests as inconsistent features selected by dense pseudo-labels and inconsistent regression targets among the dense pseudo-labels of a single instance, both of which adversely affect the accuracy and generalization ability of the detector. The issue is particularly prominent in single-stage object detection methods, because the classification and regression branches of the detection model are relatively independent. Even though a method like ARSL [37] employs an extra, relatively lightweight localization quality branch to mitigate feature inconsistency, its subsequent selection of dense pseudo-labels still relies on a threshold-based design, which is less effective in complex and changing remote sensing scenes. The performance of ARSL and other representative methods under different label scale settings is compared in Figure 1. Even though dense pseudo-labels provide richer information than sparse pseudo-labels, the inconsistency of their regression targets still adversely affects the final detection results, so more effective solutions are needed to overcome feature inconsistency in semi-supervised object detection on remote sensing images. For a single object, the regression targets of the dense pseudo-labels belonging to the same instance are not always consistent, and sometimes the difference is very large.

It can be seen from Figure 2 that for the detected object (i.e., ship), the feature region with the highest score and the one with the highest IoU are not the same. Similarly, the feature region with a high IoU can suffer from a relatively low score. Therefore, it is necessary to provide more accurate and consistent regression targets for dense pseudo-labels. In this situation, relying on the confidence score to select dense pseudo-labels is not suitable.

3. Methods

In this section, more details on the proposed CDPL are given. Figure 3 shows the framework. It has two core functions, namely feature-aligned dense pseudo-label selection and sparse pseudo-label-based regression object alignment. Specifically, it contains a student model and a teacher model. During the training process, the teacher model parameters are updated from the student model using the exponential moving average as follows:

$$\theta_{t+1}^{T} = (1 - \lambda)\,\theta_{t}^{S} + \lambda\,\theta_{t}^{T} \tag{1}$$

where $\theta^{T}$ and $\theta^{S}$ stand for the weights of the teacher model and the student model, respectively; $t$ denotes the number of training episodes; and $\lambda$ is a momentum coefficient that maintains the difference between the teacher model and the student model.
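As a minimal sketch of the update in Equation (1), the snippet below applies the exponential moving average to a dictionary of NumPy arrays. The function name `ema_update`, the dictionary layout, and the value of `lam` are illustrative assumptions; the paper does not state the momentum value or how the weights are stored.

```python
import numpy as np

def ema_update(teacher, student, lam=0.99):
    """Eq. (1): theta_T <- (1 - lam) * theta_S + lam * theta_T, per parameter.

    `teacher` and `student` are dicts mapping parameter names to arrays.
    `lam` is a hypothetical momentum value, not taken from the paper.
    """
    return {name: (1.0 - lam) * student[name] + lam * teacher[name]
            for name in teacher}

# Toy weights: the teacher moves a small step toward the student.
teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 2.0])}
updated = ema_update(teacher, student, lam=0.9)
```

A value of `lam` close to one makes the teacher change slowly, which is what keeps it distinct from the student between updates.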

CDPL employs a feature-aligned global dynamic K estimation method to select appropriate dense pseudo-labels for the unlabeled data based on the K values, and a sparse pseudo-label-based regression object alignment method is used to improve the regression quality of the dense pseudo-labels.

In each training episode, training on labeled data follows the conventional procedure and is fully supervised by ground-truth labels. For unlabeled data, the teacher model first generates pseudo-labels on the weakly augmented images, which provide a supervisory signal for the student model on the strongly augmented images. Subsequently, the student model is updated using the loss on labeled data and the loss on unlabeled data. The overall training objective can be formulated as follows:

$$L = L_{s} + \alpha L_{u} \tag{2}$$

where $L_{s}$ and $L_{u}$ denote the loss on labeled images and the loss on unlabeled images, respectively, and $\alpha$ controls the contribution of the unsupervised loss.

3.1. Global Dynamic K-Estimation with Feature Alignment

Taking the feature alignment problem in object detection for remote sensing images into consideration and inspired by earlier work [12,37,38], we propose global dynamic K-estimation with feature alignment (FA-GDE), which uses a separate auxiliary branch to estimate the localization quality, as shown in Figure 4.

The confidence of the prediction results after feature alignment, $\hat{S}$, is defined as follows:

$$\hat{S} = \hat{S}_{cls} \cdot \hat{S}_{iou} \tag{3}$$

where $\hat{S}_{cls}$ and $\hat{S}_{iou}$ are the classification confidence and the localization confidence (represented by the IoU) of the prediction results, respectively. For labeled data, the learning objective of $\hat{S}$ is defined as:

$$S = \{0, \dots, \mathrm{IoU}, \dots, 0\} \tag{4}$$

Because $S$ is a continuous value between 0 and 1, the continuous extension of the focal loss [9], i.e., the quality focal loss [39], is used for the classification branch:

$$L_{cls} = \mathrm{QFL}(\hat{S}, S) \tag{5}$$

For the auxiliary branch, we use the commonly used binary cross-entropy loss:

$$L_{iou} = \mathrm{BCE}(\hat{S}_{iou}, \mathrm{IoU}) \tag{6}$$

It is worth mentioning that the above formulation can be realized with only minor changes to the single-stage baseline methods commonly used for semi-supervised object detection, such as RetinaNet [9] and FCOS [12]. For example, for RetinaNet, only a lightweight 3 × 3 convolutional layer needs to be added, while for FCOS, it is sufficient to change the center-ness auxiliary branch into a localization auxiliary branch, which ensures the applicability and simplicity of the proposed method.

Then, the confidence based on feature alignment is further used for feature-aligned global dynamic K-estimation. Specifically, for an unlabeled image, the teacher model dynamically estimates the number of dense pseudo-labels $K$, defined as follows:

$$K = \sum_{l=1}^{M} \sum_{i=1}^{W_{l}} \sum_{j=1}^{H_{l}} G_{lij} \tag{7}$$

$$G_{lij} = \max_{c} \hat{S}_{lij,c} \tag{8}$$

where $\hat{S}_{lij,c}$ stands for the probability of class $c$ at the feature map location $(i, j)$ in the $l$th layer of the feature pyramid, and $M$ is the number of layers of the feature pyramid. The selection of dense pseudo-labels then follows the rule:

$$\vec{d_{lij}} = \begin{cases} 1, & \text{if } G_{lij} \text{ is in the top } K, \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

where $\vec{d_{lij}}$ indicates whether the location $(i, j)$ in the $l$th layer of the feature pyramid is selected as a dense pseudo-label. Note that the value of $K$ is rounded down.
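The selection rule in Equations (7) to (9) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `select_dense_pseudo_labels`, the per-level array layout `(H_l, W_l, C)`, and the flattening order are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def select_dense_pseudo_labels(cls_scores, iou_scores):
    """Return (K, mask) over all pyramid locations, flattened level by level.

    cls_scores: list of (H_l, W_l, C) arrays with classification confidences.
    iou_scores: list of (H_l, W_l) arrays with localization confidences.
    """
    # Eq. (3) joint confidence, then Eq. (8): max over classes per location.
    g = np.concatenate([
        (cls * iou[..., None]).max(axis=-1).ravel()
        for cls, iou in zip(cls_scores, iou_scores)
    ])
    # Eq. (7): K is the sum of G over all levels and locations, rounded down.
    k = int(np.floor(g.sum()))
    # Eq. (9): mark the K locations with the largest G.
    mask = np.zeros(g.shape, dtype=bool)
    if k > 0:
        mask[np.argsort(-g, kind="stable")[:k]] = True
    return k, mask

# Toy two-level pyramid with two classes.
cls_scores = [
    np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.1], [0.3, 0.2]]]),
    np.array([[[0.6, 0.4]]]),
]
iou_scores = [np.array([[1.0, 0.5], [0.5, 1.0]]), np.array([[1.0]])]
k, mask = select_dense_pseudo_labels(cls_scores, iou_scores)
```

Because $K$ is derived from the summed confidences of each image, an image with many confident locations yields more dense pseudo-labels than a sparse one, without a fixed threshold.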

Since $S$ is a continuous value between 0 and 1, the quality focal loss [39] is also used as the classification loss function for unlabeled data. Let $\hat{S}^{T}$ and $\hat{S}^{S}$ denote the confidence after feature alignment for the teacher model and the student model, respectively. The classification loss is calculated as follows:

$$L_{u}^{cls} = -\left|\hat{S}^{T} - \hat{S}^{S}\right|^{\gamma} \times \left[\hat{S}^{T} \log(\hat{S}^{S}) + (1 - \hat{S}^{T}) \log(1 - \hat{S}^{S})\right] \tag{10}$$

where $\gamma$ is a modulating factor that suppresses the contribution of locations where the student already agrees with the teacher.
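Equation (10) can be written directly as an element-wise NumPy expression. The sketch below is a hedged illustration: the function name `unlabeled_qfl`, the default `gamma=2.0`, and the clipping constant `eps` are assumptions, and the reduction over locations (sum or mean) is left out because the text does not specify it.

```python
import numpy as np

def unlabeled_qfl(s_teacher, s_student, gamma=2.0, eps=1e-12):
    """Element-wise Eq. (10) between teacher and student confidences.

    gamma=2.0 is an assumed default; eps only guards the logarithms.
    """
    s_student = np.clip(s_student, eps, 1.0 - eps)
    # Cross-entropy between the soft teacher target and the student output.
    bce = -(s_teacher * np.log(s_student)
            + (1.0 - s_teacher) * np.log(1.0 - s_student))
    # The modulating term shrinks the loss where student and teacher agree.
    return np.abs(s_teacher - s_student) ** gamma * bce

loss = unlabeled_qfl(np.array([1.0, 0.5]), np.array([0.5, 0.5]))
```

In this toy call the second location contributes zero loss because the student matches the teacher exactly, while the first contributes $0.25 \log 2$.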

For the regression branch and the auxiliary branch, we use the rotated IoU loss and the cross-entropy loss, respectively. Therefore, the overall loss function for unlabeled data has the following form:

$$L_{u} = L_{u}^{cls} + L_{u}^{reg} + L_{u}^{iou} \tag{11}$$

where $L_{u}^{reg}$ and $L_{u}^{iou}$ denote the regression loss and the auxiliary loss for unlabeled data, respectively.

3.2. Regression Object Alignment Based on Sparse Pseudo-Labeling

For a single instance, the regression objectives of dense pseudo-labels belonging to the same instance are not always the same. This problem limits the learning performance of the model to some extent. In semi-supervised object detection based on dense pseudo-labels, due to the lack of instance-level supervisory signals, there is a lack of connection between the dense pseudo-labels that actually belong to the same instance, and this problem leads to a lack of consistency within the dense pseudo-labels of the same instance.

To address the aforementioned issues, we propose a sparse pseudo-label-based regression target alignment (SPL-based RA). As Figure 5 shows, SPL-based RA introduces sparse pseudo-labels into the dense pseudo-label framework to help improve the consistency of dense pseudo-label regression. Note that the sparse pseudo-labels are generated from the dense pseudo-labels given by the FA-GDE module. To handle overlapping pseudo-label boxes, SPL-based RA applies non-maximum suppression (NMS) to eliminate the overlaps and obtain sparse pseudo-labels. Considering that the generated sparse pseudo-labels contain noise, we then apply a thresholding operation to them to obtain the sparse pseudo-labels used for the regression target alignment. The value of the threshold t was empirically chosen as 0.5. The corresponding ablation experiments showed that the thresholding operation significantly improved the reliability of the object regression and that the final performance was insensitive to the exact threshold value. Next, regression alignment is performed based on the generated sparse pseudo-labels. First, pseudo-positive samples are generated with the help of the sparse pseudo-labels; then, these pseudo-positive samples are intersected with the dense pseudo-labels; finally, the dense pseudo-labels within the intersection are assigned uniform and consistent regression targets based on the sparse pseudo-labels.

It should be noted that the quality of the generated pseudo-positive samples is critical to the final result of the regression object alignment. Inspired by Consistent Teacher [32], we observed that the common center sampling strategy has allocation limitations when generating pseudo-positive samples. Specifically, it tends to sample too small an area for objects with large aspect ratios, which are common in remote sensing images. It therefore cannot correctly reflect the geometric properties of the objects, which leads to too few intersections between the generated pseudo-positive samples and the dense pseudo-labels and ultimately reduces the coverage of the regression alignment. To solve this problem, we propose using Gaussian modeling sampling instead of center sampling, based on existing Gaussian modeling sampling methods (more implementation details can be found in [20,40]). We model the rotated object as a two-dimensional Gaussian distribution and denote the rotated object as $(x_{c}, y_{c}, w, h, \theta)$, where $(x_{c}, y_{c})$ are the object center coordinates, $(w, h)$ are the object's width and height, and $\theta$ is the angle of the object. The object is then treated as a Gaussian distribution with mean $\mu = (x_{c}, y_{c})^{T}$ and covariance matrix $\Sigma$, where $\Sigma$ is defined as:

$$\Sigma = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \frac{w^{2}}{4} & 0 \\ 0 & \frac{h^{2}}{4} \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \tag{12}$$

For any point $X = (x, y)^{T}$ in the sparse pseudo-labeled bounding box, we determine whether it is a pseudo-positive sample, denoted by a flag, based on the following rule:

$$\mathrm{flag} = \begin{cases} \mathrm{True}, & \text{if } (X - \mu)^{\top} \Sigma^{-1} (X - \mu) \leq 1 \\ \mathrm{False}, & \text{otherwise.} \end{cases} \tag{13}$$
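The Gaussian test in Equations (12) and (13), followed by the intersection and target update described for SPL-based RA, can be sketched as follows. This is an illustrative simplification under stated assumptions: the function names, the use of radians for the angle, and the handling of a single sparse pseudo-label at a time are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def gaussian_positive_mask(points, box):
    """Eq. (13) for points of shape (N, 2); box = (xc, yc, w, h, theta in radians)."""
    xc, yc, w, h, theta = box
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Eq. (12): covariance of the rotated box.
    sigma = rot @ np.diag([w ** 2 / 4.0, h ** 2 / 4.0]) @ rot.T
    diff = points - np.array([xc, yc])
    # Squared Mahalanobis distance (X - mu)^T Sigma^{-1} (X - mu) per point.
    d2 = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigma), diff)
    return d2 <= 1.0

def align_regression_targets(points, dense_mask, dense_targets, sparse_box):
    """Give every dense pseudo-label inside the Gaussian region the same target."""
    inside = gaussian_positive_mask(points, sparse_box) & dense_mask
    aligned = dense_targets.copy()
    aligned[inside] = np.asarray(sparse_box)
    return aligned

# An elongated box: points along the long axis stay positive.
points = np.array([[3.0, 0.0], [0.0, 0.9], [3.0, 0.9], [5.0, 0.0]])
box = (0.0, 0.0, 8.0, 2.0, 0.0)
flags = gaussian_positive_mask(points, box)
```

For the elongated box above, the positive region is an ellipse with semi-axes $w/2$ and $h/2$, so it follows the aspect ratio and orientation of the object rather than covering only a small central area.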

4. Experimental Results

4.1. Dataset

DOTA [41]: This is one of the largest datasets for object detection in remote sensing images. Related experiments were performed on DOTA-v1.5 and DOTA-v1.0. Compared to DOTA-v1.0, the images in DOTA-v1.5 remain unchanged, but there are additional annotations for targets smaller than 10 pixels. One category named Container crane (CC) was added. Both DOTA-v1.5 and DOTA-v1.0 contain 2806 large-scale remote sensing images divided into a training set, a validation set, and a test set. The training set contains 1411 images, the validation set contains 458 images, and the test set contains 937 images. The experiments used the standardized mAP as the performance evaluation metric. In the DOTA-v1.5 dataset, we followed the settings of SOOD [19]; we randomly selected 1%, 5%, 10%, 20%, and 30% of the images from the training set as labeled data and set the remaining images as unlabeled data. For each experiment, we provided a fold with a similar distribution to the training set to avoid distribution mismatch [20]. The division details are shown in Table 1 and Table 2. Note that with the 1% setting, only 14 images were provided as labeled data. For DOTA-v1.0, we used the same settings as for DOTA-v1.5.

DIOR-R [42]: This is a challenging dataset and is based on the DIOR dataset, using OBB to re-label the object. In the DIOR-R dataset, 11,725 images are used as the training set and 11,738 images are used as the test set, with a uniform size of 800 × 800, covering 20 categories. The experiments used mAP as the performance evaluation metric. Compared with the DOTA dataset, the DIOR-R dataset includes data of uniform size, resulting in a more balanced target size and density distribution. Similar to the setup in DOTA, we randomly selected 1%, 5%, 10%, 20%, and 30% of the images from the training set of DIOR-R as labeled data, and the rest of the data as unlabeled data. The division details are shown in Table 3.

4.2. Implementation Details

Without loss of generality, we adopted the Rotated FCOS [12] as the base detector and used ResNet-50 with an FPN as the backbone network. The implementation of the base detector followed the MMRotate open source framework [43].

All the experiments were conducted on a server equipped with an Intel Xeon Gold 6326 CPU with 256 GB of memory and an NVIDIA RTX 3090 GPU with 24 GB of graphics memory. For the experiments on the DOTA dataset, the model was trained on two NVIDIA RTX 3090 GPUs for 120,000 iterations, with three images per GPU. The optimizer was SGD with the learning rate initialized to 0.0025. The weight decay and momentum were set to 0.0001 and 0.9, respectively. For a fair comparison, the ratio of samples between labeled and unlabeled data was set to 2:1 in this paper, following the settings in SOOD [19]. Consistent with previous work [19,20], each original image was divided into patches of 1024 × 1024 pixels, with an overlap of 200 pixels between neighboring patches. The experiments on the DIOR-R dataset used the same configuration as those on the DOTA dataset.
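To make the cropping scheme concrete, the sketch below computes the patch offsets along one image axis for a 1024-pixel patch with a 200-pixel overlap, which gives a stride of 824 pixels. The function name `tile_origins` and the choice to shift the last patch back so that it stays inside the image are assumptions; the text does not say how image borders are handled.

```python
def tile_origins(length, tile=1024, overlap=200):
    """Start offsets of patches along one axis of an image of size `length`."""
    stride = tile - overlap
    if length <= tile:
        # Image is no larger than one patch: a single patch at offset 0.
        return [0]
    origins = list(range(0, length - tile, stride))
    # Assumed border policy: the last patch is shifted back to fit the image.
    origins.append(length - tile)
    return origins

offsets = tile_origins(2048)
```

Applying the same function to the width and the height and taking all pairs of offsets yields the top-left corners of the patches for one image.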

All experiments used the asymmetric data augmentation of SOOD [19]. That is, strong augmentation was used for the student model, and weak augmentation was used for the teacher model. Strong augmentation included random flipping, color jittering, random grayscale, and random Gaussian blur. Weak augmentation included only random flipping. Consistent with [19,20,32], we used a warm-up strategy to initialize the teacher model. The value of $\alpha$ was set to one.

4.3. Comparative Experiment

DOTA results: On the DOTA-v1.5 and DOTA-v1.0 datasets, we compared the proposed CDPL with SOTA methods. The results are shown in Table 4 and Table 5. In the DOTA-v1.5 dataset, CDPL achieved the best performance at 1%, 5%, 10%, 20%, and 30% of labeled data, reaching mAP values of 0.2241, 0.4497, 0.5221, 0.5854, and 0.6043, respectively. Similarly, CDPL also outperformed SOTA methods such as Denser Teacher [36], Focal Teacher [21], and PST [20] in several experimental settings. For example, CDPL achieved a performance improvement of 1.43% and 1.57% compared to Denser Teacher under the conditions of extreme sparsity with 1% and 5% scale settings, respectively. This shows that CDPL is still competitive even under sparse labeled data conditions. At 10%, 20%, and 30% scale settings, CDPL still had an advantage over SOTA methods. At 10% of labeled data, CDPL achieved comparable performance to Focal Teacher and Denser Teacher. It outperformed SOTA methods at the 20% and 30% scale settings. In particular, CDPL achieved an mAP of 0.5854 under the 20% scale setting, which was a 1.05% improvement over Denser Teacher.

In addition, we compared CDPL with reproduced dense pseudo-label-based methods on the DOTA-v1.0 dataset; the results are shown in Table 5. CDPL achieved the best performance in the majority of experimental settings. Under the 30% scale setting, CDPL achieved an mAP of 0.6271, slightly below Denser Teacher's mAP of 0.6282. The above experimental results show that CDPL generally outperforms SOTA methods on the DOTA dataset, indicating its strong performance in semi-supervised object detection in remote sensing images.

DIOR-R results: To further validate the effectiveness of CDPL on more datasets, we compared CDPL with the reproduced dense pseudo-labeling-based methods on the DIOR-R dataset. The results are shown in Table 6. It is clear that CDPL achieved the best results for most task settings. Specifically, it achieved mAP values of 0.2671, 0.4866, 0.5503, 0.5824, and 0.6004 under 1%, 5%, 10%, 20%, and 30% of labeled data, outperforming the supervised learning baseline methods by 7.38%, 11.21%, 11.37%, 10.27%, and 7.81%. Compared to Denser Teacher [36], CDPL showed significant improvement at all scale settings but 1%, where CDPL maintained comparable performance to Denser Teacher. The above experimental results further validate the effectiveness and robustness of CDPL.

In order to illustrate the advantages of CDPL more intuitively, Figure 6 shows the results when comparing CDPL with SOTA methods. We found that CDPL had an advantage in detecting objects with large aspect ratios, which to some extent explained CDPL’s ability to provide better supervisory information for such objects. In addition, CDPL was able to significantly reduce false negative (marked by red dashed circles) and false positive (marked by red solid circles) results and make object detection in remote sensing images more robust. Figure 7 demonstrates CDPL’s detection results in different scenarios. It can be seen that CDPL obtained good performance for the detection of remote sensing image objects.

4.4. Model Efficiency Analysis

To analyze the efficiency of the proposed CDPL, we measured the model's GFLOPs, parameter count, and FPS during inference, as well as its training speed and model size during training. For comparison, Dense Teacher [35], SOOD [19], and the previously proposed Denser Teacher [36] were evaluated under identical hardware conditions. The experiments were performed under the 10% partially labeled setting on DOTA-v1.5, and the results are presented in Table 7. Since semi-supervised object detection methods use only the teacher model for inference during testing, inference efficiency is largely determined by the base architecture. As shown in the table, although CDPL modified FCOS's auxiliary branch into a localization auxiliary branch, it maintained the same GFLOPs, parameter count, and FPS as previous works. In terms of training efficiency, compared with Denser Teacher, CDPL trained considerably faster (0.36 vs. 0.59 s/batch) while achieving a higher mAP and a more compact model size (268.93 vs. 288.02 MB). Although CDPL's model size was slightly larger than that of Dense Teacher (268.93 vs. 260.90 MB), it achieved the same training speed (0.36 s/batch) owing to its adaptive dense pseudo-label selection mechanism, which was more efficient than the fixed selection strategy used by Dense Teacher. Meanwhile, CDPL effectively improved the consistency of dense pseudo-label selection and target regression by introducing the FA-GDE and SPL-based RA modules, which could be the main reason why CDPL achieved significantly better performance than Dense Teacher. Notably, training speed was strongly correlated with the computational complexity of the training process. Taking SOOD as an example, despite having a parameter count comparable to that of CDPL, it trained significantly slower (0.54 s/batch) because of the additional computational cost of the optimal transport (OT) loss, a computationally intensive component not used in CDPL. These experiments confirmed CDPL's efficiency.
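To make the training-cost comparison concrete, the following minimal sketch normalizes the training speed and model size reported in Table 7 by the values of CDPL. The dictionary layout and the helper `relative_to` are illustrative choices, not part of the original evaluation code.

```python
# Training-side figures transcribed from Table 7
# (10% labeled setting on DOTA-v1.5).
TABLE7 = {
    "FCOS (fully supervised)": {"map": 0.4297, "s_per_batch": 0.20, "model_mb": 132.83},
    "Dense Teacher": {"map": 0.4690, "s_per_batch": 0.36, "model_mb": 260.90},
    "SOOD": {"map": 0.4863, "s_per_batch": 0.54, "model_mb": 263.54},
    "Denser Teacher": {"map": 0.5205, "s_per_batch": 0.59, "model_mb": 288.02},
    "CDPL": {"map": 0.5221, "s_per_batch": 0.36, "model_mb": 268.93},
}


def relative_to(table, reference, key):
    """Ratio of each method's `key` value to that of `reference`."""
    ref_value = table[reference][key]
    return {name: row[key] / ref_value for name, row in table.items()}


time_ratio = relative_to(TABLE7, "CDPL", "s_per_batch")
size_ratio = relative_to(TABLE7, "CDPL", "model_mb")

for name in TABLE7:
    print(f"{name:<24} time x{time_ratio[name]:.2f}  size x{size_ratio[name]:.2f}")
```

A ratio above 1 means the method needs more time per batch (or a larger training-time model) than CDPL; a ratio below 1 means less.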

4.5. Ablation Experiment

The results of the ablation experiments for the different components of CDPL are shown in Table 8. We performed the ablation experiments under the 20% labeled setting on DOTA-v1.5. The Rotated-FCOS supervised learning baseline achieved an mAP of 0.5132. Introducing FA-GDE significantly improved the mAP from 0.5132 to 0.5712, which surpassed some of the methods in Table 4. Introducing SPL-based RA further improved the model's mAP to 0.5854, thus validating the effectiveness of each module in CDPL.

Because sparse pseudo-label-based regression object alignment involves a threshold hyperparameter t, we also conducted a corresponding ablation experiment. The results are shown in Table 9. The model achieved an mAP of 0.5798 when no thresholding was applied to the sparse pseudo-labels (i.e., when t = 0.0). When t was increased to 0.1, performance improved only slightly (0.5808), indicating that, at such low thresholds, the retained sparse pseudo-labels still failed to provide accurate regression information. When t was increased to 0.3, 0.5, and 0.7, the improvement was larger (0.5849, 0.5854, and 0.5849, respectively). The best mAP of 0.5854 was achieved at t = 0.5, which indicates a balance between selecting high-quality sparse pseudo-labels and retaining a sufficient number of them. Overall, the mAP varied by only 0.0056 across all tested values, so the model was not sensitive to the setting of t, indicating the robustness of the proposed method.
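The trade-off controlled by t can be illustrated with a small sketch. It assumes that t acts as a lower bound on the teacher's confidence score for each sparse pseudo-label, which is one plausible reading of the thresholding step and not a description of the authors' implementation. The `SparsePseudoLabel` class, the box layout, and the toy scores below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SparsePseudoLabel:
    # Hypothetical rotated-box layout: (cx, cy, w, h, angle).
    box: Tuple[float, float, float, float, float]
    # Teacher confidence in [0, 1] (assumed to be what t is compared against).
    score: float


def filter_sparse_pseudo_labels(
    labels: List[SparsePseudoLabel], t: float
) -> List[SparsePseudoLabel]:
    """Keep labels whose score is at least t; t = 0.0 keeps every label."""
    return [label for label in labels if label.score >= t]


# Toy scores (made up) to show how the number of retained labels shrinks
# as the threshold rises.
toy_labels = [
    SparsePseudoLabel((50.0, 50.0, 40.0, 10.0, 0.3), score)
    for score in (0.05, 0.2, 0.45, 0.6, 0.9)
]

kept_counts = {
    t: len(filter_sparse_pseudo_labels(toy_labels, t))
    for t in (0.0, 0.1, 0.3, 0.5, 0.7)
}
for t, count in kept_counts.items():
    print(f"t = {t:.1f}: {count} of {len(toy_labels)} labels kept")
```

A low threshold keeps many labels, including unreliable ones, while a high threshold keeps only confident labels but fewer of them, which is the balance discussed above.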

5. Discussion

Although CDPL was designed for remote sensing scenarios, we expect that it would also work with other types of image data in other object detection tasks. The reasons are twofold. First, CDPL follows a common framework for semi-supervised object detection in general scenarios (i.e., the pseudo-label-based teacher–student network). On top of this framework, two main modules (FA-GDE and SPL-based RA) make it more suitable for remote sensing scenarios, where objects appear at arbitrary angles, in dense distributions, and with large aspect ratios. Object detection in remote sensing scenarios is comparatively hard, whereas object detection in general scenarios is usually easier. Since CDPL handles the harder task well, it may also be applicable to object detection in general scenarios with other types of images.

Second, the experimental results demonstrated CDPL's potential generalization capability through validation across diverse datasets. Evaluations on three distinct benchmarks, i.e., DOTA-v1.0, DOTA-v1.5, and DIOR-R, revealed consistent performance improvements despite their substantial distributional differences. For instance, DOTA-v1.5 introduces small targets (which contain no more than 10 pixels) that are absent in DOTA-v1.0, while DIOR-R exhibits more balanced category distributions and scale variations. Furthermore, comparative evaluations with general semi-supervised detection methods (e.g., Dense Teacher) highlighted CDPL's superior robustness to dataset shifts.

We think that the main limitation of CDPL is that, because of dataset constraints, it has so far been validated only on visible-light remote sensing images. However, remote sensing object detection inherently faces challenges posed by multi-scale, multi-modal (e.g., optical, SAR, infrared), and multi-temporal data in complex scenarios. With the recent emergence of high-quality large-scale SAR datasets, exploring semi-supervised object detection for SAR or multi-modal remote sensing images has great potential.

6. Conclusions

To address the feature inconsistency problem of dense pseudo-label selection and the regression object inconsistency of single-instance dense pseudo-labels, a novel semi-supervised object detection method based on dense pseudo-labels, called CDPL, was proposed in this paper. CDPL consists of two main components: feature-aligned dense pseudo-label selection and sparse pseudo-label-based regression object alignment. First, we analytically identified the consistency issues inherent in dense pseudo-labels. To resolve feature inconsistency during dense pseudo-label selection, we introduced feature-aligned global dynamic K-estimation and used a dedicated localization branch to assist pseudo-label selection. Additionally, to address the often-neglected regression target inconsistency in single-instance dense pseudo-labels, we proposed a sparse pseudo-label-guided regression alignment. By incorporating sparse pseudo-labels into dense pseudo-label supervision, we improved regression consistency. Extensive experimental results on typical remote sensing image datasets showed that the proposed method achieved satisfactory performance improvements over related SOTA work.

Author Contributions

Conceptualization, T.Z., H.X. and X.X.; methodology, T.Z., Q.F. and Y.Z.; writing—original draft preparation, Q.F. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The work in this paper is supported by a grant from the National Natural Science Foundation of China (No. U21A20518).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant U21A20518, for which we are immensely grateful. We would also like to thank the anonymous reviewers for their support and valuable discussion on this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SVM	Support Vector Machine
CNN	Convolutional Neural Network
NMS	Non-Maximum Suppression
mAP	Mean Average Precision
SOTA	State of the Art
IOU	Intersection over Union

References

1. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
2. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
4. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
5. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 21–37.
9. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
10. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.
11. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
12. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635.
13. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
15. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
17. Wang, J.; Zhu, F.; Wang, Q.; Zhao, P.; Fang, Y. An Active Object-Detection Algorithm for Adaptive Attribute Adjustment of Remote-Sensing Images. Remote Sens. 2025, 17, 818.
18. Guan, T.; Chang, S.; Wang, C.; Jia, X. SAR Small Ship Detection Based on Enhanced YOLO Network. Remote Sens. 2025, 17, 839.
19. Hua, W.; Liang, D.; Li, J.; Liu, X.; Zou, Z.; Ye, X.; Bai, X. SOOD: Towards Semi-Supervised Oriented Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 15558–15567.
20. Wu, W.; Wong, H.S.; Wu, S. Pseudo-Siamese Teacher for Semi-Supervised Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1000914.
21. Wang, K.; Xiao, Z.; Wan, Q.; Xia, F.; Chen, P.; Li, D. Global Focal Learning for Semi-Supervised Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5636013.
22. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the ICML 2013 Workshop on Challenges in Representation Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 3, p. 896.
23. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698.
24. Pham, H.; Dai, Z.; Xie, Q.; Le, Q.V. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11557–11568.
25. Jeong, J.; Lee, S.; Kim, J.; Kwak, N. Consistency-based semi-supervised learning for object detection. Adv. Neural Inf. Process. Syst. 2019, 32, 10759–10768.
26. Sohn, K.; Zhang, Z.; Li, C.L.; Zhang, H.; Lee, C.Y.; Pfister, T. A simple semi-supervised learning framework for object detection. arXiv 2020, arXiv:2005.04757.
27. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.L. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Proc. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608.
28. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204.
29. Liu, Y.C.; Ma, C.Y.; He, Z.; Kuo, C.W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased Teacher for Semi-Supervised Object Detection. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021.
30. Zhou, Q.; Yu, C.; Wang, Z.; Qian, Q.; Li, H. Instant-teaching: An end-to-end semi-supervised object detection framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4079–4088.
31. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-End Semi-Supervised Object Detection with Soft Teacher. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3040–3049.
32. Wang, X.; Yang, X.; Zhang, S.; Li, Y.; Feng, L.; Fang, S.; Lyu, C.; Chen, K.; Zhang, W. Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 3240–3249.
33. Li, G.; Li, X.; Wang, Y.; Wu, Y.; Liang, D.; Zhang, S. Pseco: Pseudo labeling and consistency training for semi-supervised object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 457–472.
34. Zhang, L.; Sun, Y.; Wei, W. Mind the Gap: Polishing Pseudo labels for Accurate Semi-supervised Object Detection. arXiv 2022, arXiv:2207.08185.
35. Zhou, H.; Ge, Z.; Liu, S.; Mao, W.; Li, Z.; Yu, H.; Sun, J. Dense Teacher: Dense Pseudo-Labels for Semi-supervised Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 35–50.
36. Zhao, T.; Fang, Q.; Xu, X. Denser Teacher: Rethinking Dense Pseudo-Label for Semi-supervised Oriented Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 1.
37. Liu, C.; Zhang, W.; Lin, X.; Zhang, W.; Tan, X.; Han, J.; Li, X.; Ding, E.; Wang, J. Ambiguity-resistant semi-supervised learning for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15579–15588.
38. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499.
39. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Proc. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012.
40. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11830–11841.
41. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983.
42. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
43. Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. Mmrotate: A rotated object detection benchmark using pytorch. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 7331–7334.

Figure 1. The performance of different methods on the DOTA-v1.5 dataset with different label scale settings.

Figure 2. Example for the problem of feature inconsistency.

Figure 3. The framework of CDPL.

Figure 4. The workflow of the global dynamic K-estimation with feature alignment.

Figure 5. The workflow of sparse pseudo-label-based regression alignment.

Figure 6. Example of detection results with the DOTA-v1.5 dataset. Green rectangles indicate detection predictions. Red dashed circles, red solid circles, and red arrows indicate false negative, false positive, and inaccurate directional predictions, respectively.

Figure 7. Some examples of detection results given by CDPL with the DOTA-v1.5 dataset.

Table 1. The data distribution after data division with different label scales (1% and 5%) on DOTA.

Category	1% Labeled	1% Unlabeled	5% Labeled	5% Unlabeled
Plane	168	14,209	905	13,472
Baseball diamond	10	716	84	642
Bridge	20	3432	247	3205
Ground track field	2	534	33	503
Small vehicle	1687	220,136	17,869	203,954
Large vehicle	240	43,650	1415	42,475
Ship	485	66,613	2945	64,153
Tennis court	21	4510	183	4348
Basketball court	4	935	38	901
Storage tank	65	9308	586	8787
Soccer-ball field	4	582	21	565
Roundabout	3	760	46	717
Harbor	117	11,240	654	10,703
Swimming pool	29	3856	257	3628
Helicopter	14	1072	31	1055

Table 2. The data distribution after data division with different label scales (10%, 20% and 30%) on DOTA.

Category	10% Labeled	10% Unlabeled	20% Labeled	20% Unlabeled	30% Labeled	30% Unlabeled
Plane	1454	12,923	3402	10,975	5645	8732
Baseball diamond	112	614	200	526	250	476
Bridge	378	3074	839	2613	1324	2128
Ground track field	54	482	118	418	161	375
Small vehicle	25,090	196,733	47,760	174,063	79,263	142,560
Large vehicle	2925	40,965	6176	37,714	12,713	31,177
Ship	5549	61,549	10,307	56,791	16,336	50,762
Tennis court	363	4168	892	3639	1387	3144
Basketball court	50	889	109	830	219	720
Storage tank	787	8586	2166	7207	3224	6149
Soccer-ball field	38	548	119	467	211	375
Roundabout	85	678	135	628	198	565
Harbor	1102	10,255	1921	9436	2738	8619
Swimming pool	390	3495	949	2936	1326	2559
Helicopter	57	1029	290	796	294	792

Table 3. The data distribution after data division with different label scales on DIOR-R. L stands for labeled and U stands for unlabeled.

Category	1% (117)		5% (575)		10% (1102)		20% (2129)		30% (3057)
Category	L	U	L	U	L	U	L	U	L	U
Airplane	20	1868	69	1819	195	1693	330	1558	513	1375
Airport	9	653	35	627	69	593	121	541	164	498
Baseball field	22	2362	112	2272	210	2174	432	1952	700	1684
Basketball court	8	1069	63	1014	94	983	194	883	298	779
Bridge	11	1356	70	1297	99	1268	233	1134	395	972
Chimney	10	639	49	600	61	588	126	523	157	492
Expressway service area	8	1072	49	1031	112	968	171	909	285	795
Expressway toll station	3	607	21	589	74	536	99	511	162	448
Dam	6	506	28	484	52	460	106	406	141	371
Golf field	7	504	19	492	35	476	96	415	119	392
Ground track field	8	1154	67	1095	117	1045	242	920	309	853
Harbor	30	2334	95	2269	242	2122	337	2027	658	1706
Overpass	20	1310	51	1279	118	1212	307	1023	354	976
Ship	273	27,078	932	26,419	2480	24,871	4136	23,215	6804	20,547
Stadium	2	593	23	572	56	539	113	482	152	443
Storage tank	19	3023	187	2855	264	2778	692	2350	824	2218
Tennis court	38	4860	248	4650	478	4420	884	4014	1102	3796
Train station	4	497	28	473	50	451	101	400	134	367
Vehicle	135	13,590	738	12,987	1295	12,430	2410	11,315	3633	10,092
Windmill	11	2354	115	2250	176	2189	356	2009	587	1778

Table 4. Comparative experimental results with DOTA-v1.5. * and ^† stand for reproduced results based on Rotated Faster R-CNN and Rotated FCOS.

Setting	Methods	Partially Labeled Data
Setting	Methods	1%	5%	10%	20%	30%
Supervised	Faster RCNN [3]	0.1322	0.3395	0.4343	0.5132	0.5314
Semi-supervised	Unbiased Teacher * [29]	-	-	0.4451	0.5280	0.5333
	Soft Teacher * [31]	-	-	0.4846	0.5489	0.5783
	PseCo * [33]	-	-	0.4804	0.5528	0.5803
	DualPolish * [34]	-	-	0.4902	0.5517	0.5844
	PST * [20]	-	0.4139	0.4963	0.5739	0.6040
Supervised	FCOS [12]	0.1567	0.3338	0.4278	0.5011	0.5479
Semi-supervised	Dense Teacher ^† [35]	0.1838	0.4027	0.4690	0.5393	0.5786
	ARSL ^† [37]	-	-	0.4817	0.5534	0.5902
	SOOD ^† [19]	0.1712	0.4002	0.4863	0.5558	0.5923
	Denser Teacher ^† [36]	0.2098	0.4340	0.5205	0.5749	0.6040
	Focal Teacher ^† [21]	-	-	0.5224	0.5700	0.6021
	CDPL (Ours) ^†	0.2241	0.4497	0.5221	0.5854	0.6043

Table 5. Comparative experimental results with DOTA-v1.0.

Setting	Methods	Partially Labeled Data
Setting	Methods	1%	5%	10%	20%	30%
Supervised	FCOS [12]	0.1555	0.3434	0.4303	0.5140	0.5530
Semi-supervised	Dense Teacher [35]	0.2005	0.4257	0.4953	0.5576	0.5807
	SOOD [19]	0.1752	0.4300	0.5018	0.5647	0.6037
	Denser Teacher [36]	0.1945	0.4584	0.5262	0.5920	0.6282
	CDPL (Ours)	0.2235	0.4835	0.5376	0.6059	0.6271

Table 6. Comparative experimental results with DIOR-R.

Setting	Methods	Partially Labeled Data
Setting	Methods	1%	5%	10%	20%	30%
Supervised	FCOS [12]	0.1933	0.3745	0.4366	0.4796	0.5223
Semi-supervised	Dense Teacher [35]	0.2698	0.4445	0.5105	0.5522	0.5751
	SOOD [19]	0.2502	0.4156	0.4818	0.5261	0.5547
	Denser Teacher [36]	0.2789	0.4646	0.5287	0.5593	0.5873
	CDPL (Ours)	0.2671	0.4866	0.5503	0.5824	0.6004

Table 7. Results of comparative experiments on model efficiency.

Methods	mAP	Inference GFLOPs	Inference Parameters (M)	Inference FPS	Training Speed (s/Batch)	Training Model Size (MB)
FCOS (fully supervised) [12]	0.4297	206.96	31.92	47.7	0.20	132.83
Dense Teacher [35]	0.4690	206.96	31.92	47.7	0.36	260.90
SOOD [19]	0.4863	206.96	31.92	47.7	0.54	263.54
Denser Teacher	0.5205	206.96	31.92	47.7	0.59	288.02
CDPL (ours)	0.5221	206.96	31.92	47.7	0.36	268.93

Table 8. The results of the ablation experiment on components of CDPL. The checkmark means the usage of the corresponding component.

Methods	FA-GDE	SPL-Based RA	DOTA-v1.5
Faster RCNN [3] (supervised)	-	-	0.5132
CDPL	✓	-	0.5718
CDPL	✓	✓	0.5854

Table 9. The results of the ablation experiment on the threshold t of CDPL.

Threshold	0.0	0.1	0.3	0.5	0.7
mAP	0.5798	0.5808	0.5849	0.5854	0.5849

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
