1. Introduction
Object detection is one of the core application problems in computer vision and machine learning. Currently, the most widely used object detection methods are based on deep learning and can be divided into four main categories. Firstly, multi-stage object detection methods like RCNN [1], Fast RCNN [2], Faster RCNN [3], FPN [4], Cascade RCNN [5], and Mask RCNN [6] contain two stages, i.e., extracting candidate regions that may contain objects and subsequently extracting features from the candidate regions for classification and localization. Among them, the representative one is Faster RCNN, which realizes end-to-end detection with a speed close to real time. It introduces the Region Proposal Network (RPN), which greatly reduces the cost of generating candidate regions. Secondly, single-stage object detection methods like YOLO [7], SSD [8], and RetinaNet [9] discard the candidate region generation stage and directly predict the offsets of the anchor boxes to obtain the final detection result. The third type is anchor-free object detection methods like CornerNet [10], CenterNet [11], and FCOS [12], which remove the step of presetting anchors and detect objects directly from the feature map. Note that FCOS introduces several optimizations, such as an additional branch that measures the center-ness of the detection results and a center sampling strategy that improves the quality of positive samples. The fourth is transformer-based object detection methods like ViT [13], Swin Transformer [14], and DETR [15], which use transformers [16] to extend all aspects of deep learning-based object detection. The above methods can be collectively referred to as object detection methods for generalized scenarios, and they belong to the category of supervised learning.
Recently, object detection in remote sensing images [17,18] has gradually become a research hotspot in computer vision and artificial intelligence. Because objects in remote sensing images usually exhibit arbitrary angles, dense distributions, and large aspect ratios, object detection in remote sensing images usually uses a rotated bounding box to represent the detected target. Labeling data at a large scale for supervised object detection is time-consuming, costly, and extremely difficult. To alleviate this burden, more attention has been paid to semi-supervised object detection in remote sensing images, which tries to utilize limited labeled data and a large quantity of easily available unlabeled data to improve performance. It is of great significance to fully explore the characteristics of remote sensing images and design appropriate mechanisms to improve the performance and generalization of semi-supervised learning-based object detection methods for remote sensing images.
SOOD [19] first explored semi-supervised object detection for remote sensing images by crafting adaptive loss weights based on the angular difference between the teacher model and the student model for rotated objects, so as to focus on difficult samples with large angular deviations. In addition, SOOD exploited the object distribution in remote sensing images by introducing a global consistency loss. With the aforementioned designs, SOOD achieved excellent detection performance. PST [20] employs a new semi-supervised learning framework, which centers on cross-checking predictions and collaboratively generating high-quality pseudo-labels through two teacher models that are each optimally updated by the same student model. To reduce the unreliability of pseudo-labels in terms of localization, scale, and orientation, PST introduces Gaussian distribution modeling combined with the symmetric and bounded Jensen–Shannon divergence to evaluate the bias between the predictions of different teacher models, in order to exclude low-quality pseudo-labels generated under inconsistent regression estimates. In addition, to solve the scale invariance problem in object detection for remote sensing images, PST proposes a scale-adaptive knowledge distillation mechanism that aligns the features of the student model extracted from multi-scale images with the interpolated multi-scale features generated by the teacher model. Focal Teacher [21], on the other hand, analyzes the factors affecting semi-supervised object detection in remote sensing images and proposes a global focus learning method that does not require any manually designed priors. The method blurs the boundaries between positive and negative samples through global regions and soft regression methods, while utilizing the localization consistency between the teacher model and the student model to focus on difficult regions.
In this paper, a novel semi-supervised learning-based method for object detection in remote sensing images, called CDPL, is proposed. It introduces feature-aligned dense pseudo-label selection to tackle the feature inconsistency problem in dense pseudo-label selection, and sparse pseudo-label-based regression object alignment to tackle the regression object inconsistency of single-instance dense pseudo-labels. Together, these improve both the accuracy and the generalization performance of object detection in remote sensing images. The experimental results on typical benchmarks against related comparative methods show that the proposed method achieves state-of-the-art (SOTA) performance.
The rest of this paper is organized as follows: Section 2 provides a necessary preliminary description and introduces related work. Section 3 introduces the proposed CDPL, detailing its mechanism, architecture, and implementation. Section 4 provides the corresponding experimental results on typical datasets (including ablation and comparison experiments), together with the implementation details. Finally, Section 5 summarizes the findings and innovations regarding the proposed CDPL.
2. Preliminary
Semi-supervised learning is a representative research area in machine learning. It aims to improve the generalization performance of a model by exploiting a large quantity of unlabeled data during training, thus alleviating the need for labeled data. Semi-supervised learning usually assumes the existence of two sample datasets that follow the assumption of independent and identical distribution, i.e., the labeled dataset $D_l$ and the unlabeled dataset $D_u$, containing $N_l$ labeled data and $N_u$ unlabeled data, respectively, and usually $N_l \ll N_u$. Semi-supervised learning is usually based on several assumptions:
Smoothness assumption: If two samples located in a high-density region are close together, then their corresponding labels should also be close. Conversely, if two samples are separated by a low-density region, their corresponding labels will tend to be different.
Cluster assumption: when two samples are located in the same cluster, they have the same label with high probability.
Low-density separation assumption: decision boundaries should cut through sparse data regions, while avoiding splitting samples from high-density data regions on either side of the decision boundary.
Manifold assumption: the input space consists of multiple low-dimensional manifolds on which all samples are located, and samples located on the same manifold have the same label.
Pseudo-labeling is a key technique in semi-supervised learning, which trains a model by making iterative predictions on unlabeled data and using the predictions that satisfy specific conditions as pseudo-labels. Related works include Pseudo-Label [22], Noisy Student [23], Meta Pseudo Labels [24], CSD [25], STAC [26], FixMatch [27], Mean Teacher [28], Unbiased Teacher [29], Instant-Teaching [30], Soft Teacher [31], Consistent Teacher [32], PseCo [33], DualPolish [34], and so on. Among them, Unbiased Teacher, Soft Teacher, Consistent Teacher, and PseCo are representative examples.
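The basic pseudo-labeling step described above can be sketched as follows. This is a minimal illustration for a classification-style output, not the procedure of any specific cited method; the function name `select_pseudo_labels` and the confidence threshold of 0.9 are assumptions chosen for the example.

```python
def select_pseudo_labels(class_probs, threshold=0.9):
    """Keep predictions whose top class probability reaches the threshold.

    class_probs: list of per-sample class probability lists.
    Returns a list of (sample_index, predicted_class) pairs.
    The threshold value is a hypothetical choice for illustration.
    """
    selected = []
    for idx, probs in enumerate(class_probs):
        # Index of the most probable class for this sample.
        best_class = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best_class] >= threshold:
            selected.append((idx, best_class))
    return selected


# Predictions of a model on three unlabeled samples (made-up numbers).
unlabeled_predictions = [
    [0.95, 0.03, 0.02],  # confident: kept as a pseudo-label
    [0.40, 0.35, 0.25],  # ambiguous: discarded
    [0.05, 0.92, 0.03],  # confident: kept as a pseudo-label
]
pseudo_labels = select_pseudo_labels(unlabeled_predictions, threshold=0.9)
```

Only the confident predictions are reused as training targets, which is why the choice of the selection condition directly affects pseudo-label quality.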
Unbiased Teacher uses focal loss [9] to replace cross-entropy to address class imbalance in pseudo-labeling. It introduces adversarial distortion and label smoothing to enhance fairness. Adversarial distortion forces the model to focus on non-sensitive features by applying small perturbations to the input data, while label smoothing helps the model avoid relying too heavily on any specific feature by adding noise to the label assignment.
Soft Teacher utilizes a teacher model to provide pseudo-labels for unlabeled data, thereby enhancing the performance of a student model. The teacher model generates soft labels (probability distributions) for unlabeled data, which are then used to supervise the training of the student model. The teacher model is typically updated as an exponential moving average of the student model’s parameters, ensuring that it remains accurate and stable throughout the training process.
Consistent Teacher proposes a unified framework combining adaptive anchor assignment, 3D feature alignment, and adaptive thresholding based on Gaussian mixture models to mitigate inconsistencies during semi-supervised training.
The core idea of PseCo is to improve pseudo-label quality and to incorporate multi-level consistency into consistency training. In pseudo-labeling, PseCo focuses on generating high-quality pseudo-bounding boxes for unlabeled data, ensuring not only accurate classification scores but also precise localization. For consistency training, PseCo emphasizes both label-level and feature-level consistency. It highlights the importance of feature-level consistency in ensuring scale invariance.
DualPolish aims to refine pseudo-labels generated from unlabeled data to bridge the gap between noisy pseudo-annotations and ground-truth labels. It focuses on polishing noisy pseudo-labels to make them more accurate and reliable, which leverages the additional information contained in the unlabeled data while mitigating the negative impact of noisy annotations.
It is worth mentioning that all the above methods are based on the traditional sparse pseudo-labeling paradigm. As a more direct form of pseudo-label representation, dense pseudo-labeling retains more information and is gradually gaining popularity. Dense Teacher [35] replaces sparse pseudo-label bounding boxes with dense pseudo-labels in order to mitigate the impact of post-processing and confidence thresholds on detection performance. Denser Teacher [36] generates dense pseudo-labels from unlabeled data, which are then used to train the detection model alongside the labeled data. The key innovation lies in the ability to produce high-quality pseudo-labels that are close to ground-truth annotations, enabling the model to learn more effectively from the additional data. ARSL [37] also adopts dense pseudo-labeling, focusing on the selection ambiguity and assignment ambiguity of dense pseudo-labels. To address the selection ambiguity, ARSL proposes joint confidence estimation to jointly quantify the classification and localization quality of pseudo-labels. To address the assignment ambiguity, ARSL introduces a task-separated assignment mechanism that assigns labels based on dense pseudo-labels and exploits pseudo-labels separately for the classification and localization tasks to improve the robustness to assignment ambiguity.
However, the existing methods mentioned above ignore a key issue in the semi-supervised learning process, i.e., the consistency of pseudo-labels. In object detection for remote sensing images, this issue manifests as inconsistent features selected by dense pseudo-labels and as inconsistent regression targets among the dense pseudo-labels of a single instance, both of which adversely affect the accuracy and the generalization ability of the object detection method. This issue is particularly prominent in single-stage object detection methods, since the classification and regression branches of the detection model are relatively independent, which exacerbates the problem. Even though a method like ARSL [37] employs an extra, relatively lightweight localization quality branch to mitigate feature inconsistency, its subsequent selection of dense pseudo-labels still relies on a threshold-based design, which is less effective in complex and changing remote sensing scenarios. The performance of ARSL and other representative methods under different label scale settings is compared in Figure 1. Even though dense pseudo-labels can provide richer information than sparse pseudo-labels, the consistency problem of their regression still adversely affects the final detection results. More effective solutions are still required to overcome feature inconsistency in semi-supervised object detection on remote sensing images. For a single object, the regression objectives of the dense pseudo-labels belonging to the same instance are not always consistent, and sometimes the difference is very large.
It can be seen from Figure 2 that for the detected object (i.e., ship), the feature region with the highest score and the one with the highest IoU are not the same. Similarly, the feature region with a high IoU can suffer from a relatively low score. Therefore, it is necessary to provide more accurate and consistent regression targets for dense pseudo-labels. In this situation, relying on the confidence score to select dense pseudo-labels is not suitable.
3. Methods
In this section, more details on the proposed CDPL are given. Figure 3 shows the framework. It has two core functions, namely feature-aligned dense pseudo-label selection and sparse pseudo-label-based regression object alignment. Specifically, it contains a student model and a teacher model. During training, the teacher model parameters are updated from the student model using the exponential moving average (EMA) as follows:
$$\theta_{\mathrm{tea}}^{t} = \alpha\,\theta_{\mathrm{tea}}^{t-1} + (1-\alpha)\,\theta_{\mathrm{stu}}^{t}$$
where $\theta_{\mathrm{tea}}$ and $\theta_{\mathrm{stu}}$ stand for the weights of the teacher model and the student model, respectively; t denotes the number of training episodes; and $\alpha$ is a momentum to maintain the difference between the teacher model and the student model.
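The exponential moving average update of the teacher can be sketched in a few lines. This is a minimal sketch that stores each weight as a plain float in a dictionary; the function name `ema_update` and the momentum values are assumptions for illustration, since the momentum used in the experiments is not stated in this section.

```python
def ema_update(teacher, student, momentum=0.99):
    """Update teacher weights in place as an EMA of the student weights.

    teacher, student: dicts mapping parameter names to float values.
    momentum: EMA coefficient; 0.99 is a hypothetical default.
    """
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher


# Toy weights: the teacher moves only a small step toward the student.
teacher_weights = {"w": 1.0, "b": 0.0}
student_weights = {"w": 0.0, "b": 1.0}
ema_update(teacher_weights, student_weights, momentum=0.9)
```

A momentum close to one keeps the teacher changing slowly, so that it differs from the student and provides more stable pseudo-labels.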
CDPL employs a feature-aligned global dynamic K estimation method to select appropriate dense pseudo-labels for the unlabeled data based on the K values, and a sparse pseudo-label-based regression object alignment method is used to improve the regression quality of the dense pseudo-labels.
In each training episode, the training on labeled data follows the conventional way and is fully supervised by ground-truth labels. For unlabeled data, the teacher model first generates pseudo-labels on the weakly augmented images, which provide a supervisory signal to the student model on the strongly augmented images. Subsequently, the student model is updated using the loss of labeled data and the loss of unlabeled data. The overall training objective can be formulated as follows:
$$\mathcal{L} = \mathcal{L}_{s} + \lambda\,\mathcal{L}_{u}$$
where $\mathcal{L}_{s}$ and $\mathcal{L}_{u}$ denote the loss of labeled images and the loss of unlabeled images, respectively, and $\lambda$ controls the contribution of the unsupervised loss.
3.1. Global Dynamic K-Estimation with Feature Alignment
Taking the feature alignment problem during object detection in remote sensing images into consideration and inspired by earlier work [12,37,38], we propose a global dynamic K-estimation with feature alignment (FA-GDE), which uses a separate auxiliary branch to estimate the localization quality as shown in Figure 4.
The confidence level after feature alignment of the prediction results
is defined as follows:
where
and
are the classification confidence and localization confidence (represented by the IOU) of the prediction results, respectively. For labeled data, the learning objective of
is defined as:
Because this learning objective is a continuous value between 0 and 1, the continuous extension of the focal loss [9], i.e., the quality focal loss [39], is used for the classification branch, which is shown as follows:
For the auxiliary branch, we use the commonly used cross-entropy loss:
It is worth mentioning that the above formulation can be realized with only minor changes to the single-stage baseline methods commonly used for semi-supervised object detection, such as RetinaNet [9] and FCOS [12]. For example, for RetinaNet, only a lightweight 3 × 3 convolutional layer needs to be added, while for FCOS, it is sufficient to turn the center-ness auxiliary branch into a localization auxiliary branch, which ensures the applicability and simplicity of the proposed method.
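For readers unfamiliar with the quality focal loss mentioned above, the following is a minimal sketch of its per-element form as commonly written for continuous targets: a binary cross-entropy term scaled by |y − σ|^β. The function name and the default β = 2 are assumptions for illustration and are not taken from this paper's configuration.

```python
import math


def quality_focal_loss(pred, target, beta=2.0, eps=1e-12):
    """Quality focal loss for one predicted score and one soft target.

    pred:   predicted confidence in (0, 1), e.g., after a sigmoid.
    target: continuous quality target in [0, 1] (e.g., an IoU value).
    beta:   modulating exponent; 2.0 is an assumed default.
    """
    # Clamp to avoid log(0).
    pred = min(max(pred, eps), 1.0 - eps)
    # Binary cross-entropy against a soft (continuous) target.
    bce = -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))
    # Down-weight predictions that are already close to the target.
    return abs(target - pred) ** beta * bce


loss_far = quality_focal_loss(0.2, 0.8)    # prediction far from target
loss_near = quality_focal_loss(0.6, 0.8)   # prediction closer to target
loss_exact = quality_focal_loss(0.7, 0.7)  # prediction equals target
```

Unlike the original focal loss, which assumes hard 0/1 labels, this form vanishes whenever the prediction matches the continuous target, which is what makes it suitable for an IoU-aware confidence.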
Then, the confidence based on feature alignment is further used for feature-aligned global dynamic K-estimation. Specifically, for an unlabeled image, the teacher model dynamically estimates the number of dense pseudo-labels K. K is defined as follows:
where
stands for the probability of class
c at the corresponding feature mapping location (
) in the
lth layer of the feature pyramid.
M is the number of layers of the feature pyramid. Thus, the selection of dense pseudo-labels follows the following rule:
where
represents whether the feature pyramid is a dense pseudo-label at the
lth layer at the corresponding feature mapping position (
). Note that the value of K is rounded down.
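The selection rule can be illustrated with a small sketch. It rests on a loud assumption: K is taken here as the floor of the sum, over all pyramid levels and locations, of the highest class confidence at each location, and the K highest-scoring locations are then marked as dense pseudo-labels. This is one plausible reading of a global dynamic K estimate and may differ from the exact formula used by CDPL; the function name and the toy scores are hypothetical.

```python
import math


def select_dense_pseudo_labels(pyramid_scores):
    """Estimate K globally and pick the top-K locations as pseudo-labels.

    pyramid_scores: list over pyramid levels; each level is a list of
    locations; each location is a list of per-class confidences.
    Returns (k, set of (level, location) pairs selected).
    """
    flat = []
    for level, locations in enumerate(pyramid_scores):
        for loc, class_scores in enumerate(locations):
            flat.append((max(class_scores), level, loc))
    # Assumed estimate: sum of per-location max confidences, rounded down.
    k = math.floor(sum(score for score, _, _ in flat))
    # Rank all locations of all levels together and keep the top K.
    flat.sort(key=lambda item: item[0], reverse=True)
    selected = {(level, loc) for _, level, loc in flat[:k]}
    return k, selected


# Two pyramid levels with three and two locations, two classes each.
scores = [
    [[0.9, 0.1], [0.2, 0.1], [0.8, 0.3]],
    [[0.6, 0.2], [0.1, 0.05]],
]
k, dense_labels = select_dense_pseudo_labels(scores)
```

Because K follows the overall confidence mass of each image, images with many confident responses receive more dense pseudo-labels than nearly empty ones, without a fixed ratio or score threshold.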
In order to deal with
, which is a continuous value between 0 and 1, the quality focal loss [
39] is used as the classification loss function for unlabeled data. Let
and
denote the confidence level after feature alignment for the teacher model and student model, respectively. The classification loss is calculated as follows:
where
is an inhibitory factor.
For the regression branch and auxiliary branch, we use rotated IoU loss and cross-entropy loss, respectively. Therefore, the final overall loss function for unlabeled data has the following form:
where
and
denote regression loss and auxiliary loss for unlabeled data, respectively.
3.2. Regression Object Alignment Based on Sparse Pseudo-Labeling
For a single instance, the regression objectives of dense pseudo-labels belonging to the same instance are not always the same. This problem limits the learning performance of the model to some extent. In semi-supervised object detection based on dense pseudo-labels, due to the lack of instance-level supervisory signals, there is a lack of connection between the dense pseudo-labels that actually belong to the same instance, and this problem leads to a lack of consistency within the dense pseudo-labels of the same instance.
To address the aforementioned issues, we propose a sparse pseudo-label-based regression target alignment (SPL-based RA). As Figure 5 shows, SPL-based RA introduces sparse pseudo-labels into the dense pseudo-label framework to help improve the consistency of dense pseudo-label regression. Note that the sparse pseudo-labels are generated on the basis of the dense pseudo-labels given by the FA-GDE module. To solve the problem of overlapping pseudo-label boxes, SPL-based RA uses the NMS operation to eliminate the overlapping boxes and obtain sparse pseudo-labels. Considering that the generated sparse pseudo-labels contain noise, we apply a thresholding operation to them to obtain the sparse pseudo-labels used for the regression target alignment. The value of the threshold t was empirically chosen as 0.5. The results of the corresponding ablation experiments showed that the threshold operation improved the reliability of the object regression and that the final performance was insensitive to the threshold value itself. Next, regression alignment is performed based on the generated sparse pseudo-labels. Firstly, pseudo-positive samples are generated with the help of the sparse pseudo-labels; then, these pseudo-positive samples are intersected with the dense pseudo-labels; finally, the dense pseudo-labels within the intersection are assigned uniform and consistent regression targets based on the sparse pseudo-labels.
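The generation of sparse pseudo-labels (NMS followed by thresholding) can be sketched as below. For brevity, the sketch uses axis-aligned boxes and plain IoU instead of the rotated boxes and rotated IoU required in practice, and it assumes the threshold t is applied to the box score. The NMS IoU threshold of 0.5, the function names, and the toy boxes are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sparse_pseudo_labels(boxes, scores, iou_thr=0.5, score_thr=0.5):
    """Greedy NMS, then keep only boxes whose score reaches score_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in kept):
            kept.append(i)
    # Thresholding step: drop noisy low-score boxes (t = 0.5 in the text).
    return [i for i in kept if scores[i] >= score_thr]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
kept_indices = sparse_pseudo_labels(boxes, scores)
```

In this toy input, the second box is suppressed by the first (IoU of about 0.68), and the fourth survives NMS but is removed by the score threshold.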
It should be noted that the quality of the generated pseudo-positive samples is critical to the final result of the regression object alignment. Inspired by Consistent Teacher [32], we observed that the common center sampling strategy had allocation limitations when generating pseudo-positive samples. Specifically, the common center sampling strategy tends to sample too small an area when generating pseudo-positive samples for objects with large aspect ratios, which are commonly found in remote sensing images. It cannot correctly reflect the geometric properties of the objects, which leads to too few intersections between the generated pseudo-positive samples and the dense pseudo-labels and ultimately affects the coverage of the regression alignment. To solve this problem, we propose using Gaussian modeling sampling instead of center sampling, based on existing Gaussian modeling sampling methods (more implementation details can be found in [20,40]). We modeled the rotated object as a two-dimensional Gaussian distribution and denoted the rotated object as $(x, y, w, h, \theta)$, where $(x, y)$ are the object center coordinates, $(w, h)$ are the object's width and height, and $\theta$ is the angle of the object. Altogether, the object was treated as a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ with a mean value $\mu$ being $(x, y)^{\top}$, as well as a covariance $\Sigma$, where $\Sigma$ is defined as:
$$\Sigma = R\,\Lambda\,R^{\top}, \quad R = \begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix}, \quad \Lambda = \begin{pmatrix} w^{2}/4 & 0\\ 0 & h^{2}/4\end{pmatrix}$$
For any point
in the sparse pseudo-labeled bounding box, we computed and determined whether it was a pseudo-positive sample denoted by a flag based on the following rule:
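To illustrate Gaussian modeling sampling, the sketch below converts a rotated box into a two-dimensional Gaussian using the conversion commonly adopted in Gaussian-based rotated detection (covariance equal to the rotated diag(w²/4, h²/4)) and marks a point as a pseudo-positive sample when its squared Mahalanobis distance to the center is below a bound. The decision bound of 1.0 and all function names are assumptions; the exact rule used by CDPL is not reproduced here.

```python
import math


def gaussian_from_rotated_box(cx, cy, w, h, angle):
    """Return mean and covariance (sxx, sxy, syy) of a rotated box.

    Covariance = R * diag(w^2/4, h^2/4) * R^T, with angle in radians.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    lw, lh = (w / 2.0) ** 2, (h / 2.0) ** 2
    sxx = lw * cos_a ** 2 + lh * sin_a ** 2
    sxy = (lw - lh) * cos_a * sin_a
    syy = lw * sin_a ** 2 + lh * cos_a ** 2
    return (cx, cy), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def is_pseudo_positive(point, box, max_dist_sq=1.0):
    """Flag a point as pseudo-positive; max_dist_sq is a hypothetical bound."""
    mean, cov = gaussian_from_rotated_box(*box)
    return mahalanobis_sq(point, mean, cov) <= max_dist_sq


# A ship-like box: 100 long, 10 wide, centered at the origin, no rotation.
ship = (0.0, 0.0, 100.0, 10.0, 0.0)
far_along_axis = is_pseudo_positive((40.0, 0.0), ship)  # along the long side
off_axis = is_pseudo_positive((0.0, 5.5), ship)         # across the short side
```

The accepted region is an ellipse aligned with the object, so points far along the long side of an elongated object are still sampled, whereas a small fixed-radius region around the center would miss them.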
4. Experimental Results
4.1. Dataset
DOTA [41]: This is one of the largest datasets for object detection in remote sensing images. Related experiments were performed on DOTA-v1.5 and DOTA-v1.0. Compared to DOTA-v1.0, the images in DOTA-v1.5 remain unchanged, but there are additional annotations for targets smaller than 10 pixels, and one category named Container crane (CC) was added. Both DOTA-v1.5 and DOTA-v1.0 contain 2806 large-scale remote sensing images divided into a training set, a validation set, and a test set. The training set contains 1411 images, the validation set contains 458 images, and the test set contains 937 images. The experiments used the standardized mAP as the performance evaluation metric. In the DOTA-v1.5 dataset, we followed the settings of SOOD [19]; we randomly selected 1%, 5%, 10%, 20%, and 30% of the images from the training set as labeled data and set the remaining images as unlabeled data. For each experiment, we provided a fold with a similar distribution to the training set to avoid distribution mismatch [20]. The division details are shown in Table 1 and Table 2. Note that with the 1% setting, only 14 images were provided as labeled data. For DOTA-v1.0, we used the same settings as for DOTA-v1.5.
DIOR-R [42]: This is a challenging dataset based on the DIOR dataset, in which the objects are re-labeled with oriented bounding boxes (OBBs). In the DIOR-R dataset, 11,725 images are used as the training set and 11,738 images are used as the test set, with a uniform size of 800 × 800, covering 20 categories. The experiments used mAP as the performance evaluation metric. Compared with the DOTA dataset, the DIOR-R dataset includes data of uniform size, resulting in a more balanced target size and density distribution. Similar to the setup in DOTA, we randomly selected 1%, 5%, 10%, 20%, and 30% of the images from the training set of DIOR-R as labeled data and used the rest of the data as unlabeled data. The division details are shown in Table 3.
4.2. Implementation Details
Without loss of generality, we adopted Rotated FCOS [12] as the base detector and used ResNet-50 with an FPN as the backbone network. The implementation of the base detector followed the MMRotate open source framework [43].
All the experiments were conducted on a server equipped with an Intel Xeon Gold 6326 CPU, 256 GB of memory, and NVIDIA RTX 3090 GPUs with 24 GB of graphics memory each. For the experiments on the DOTA dataset, the model was trained on two NVIDIA RTX 3090 GPUs for 120,000 iterations, with three images per GPU. The optimizer was SGD with the learning rate initialized to 0.0025. The weight decay and momentum were set to 0.0001 and 0.9, respectively. For a fair comparison, the ratio of samples between labeled and unlabeled data was set to 2:1 in this paper, following the settings in SOOD [19]. Consistent with previous work [19,20], the original images were divided into patches of 1024 × 1024 pixels, with an overlap of 200 pixels between neighboring patches. The experiments on the DIOR-R dataset used the same configuration as the ones on the DOTA dataset.
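The patch layout implied by a 1024-pixel window with a 200-pixel overlap (i.e., a stride of 824 pixels) can be computed as in the sketch below. How the last patch is placed at the image border is an assumption here (it is shifted back so that it ends exactly at the border); the function name is hypothetical and this is not the preprocessing code of the cited frameworks.

```python
def tile_origins(length, tile=1024, overlap=200):
    """Start offsets of patches along one image dimension.

    length:  image width or height in pixels.
    tile:    patch size (1024 in the text).
    overlap: overlap between neighboring patches (200 in the text).
    """
    stride = tile - overlap  # 824 pixels for the settings in the text
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    # Assumed border handling: add a final patch flush with the image edge.
    if origins[-1] + tile < length:
        origins.append(length - tile)
    return origins


# Offsets along a 4000-pixel side of a large remote sensing image.
offsets = tile_origins(4000)
```

Applying the function to the width and the height independently yields the grid of patch corners; the overlap reduces the chance that an object is cut at a patch boundary in every patch that contains it.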
All experiments used the asymmetric data augmentation of SOOD [19]. That is, strong augmentation was used for the student model, and weak augmentation was used for the teacher model. Strong augmentation included random flipping, color jittering, random grayscale, and random Gaussian blur. Weak augmentation included only random flipping. Consistent with [19,20,32], we used a warm-up strategy to initialize the teacher model. The value of
was set to one.
4.3. Comparative Experiment
DOTA results: On the DOTA-v1.5 and DOTA-v1.0 datasets, we compared the proposed CDPL with SOTA methods. The results are shown in Table 4 and Table 5. On the DOTA-v1.5 dataset, CDPL achieved the best performance at 1%, 5%, 10%, 20%, and 30% of labeled data, reaching mAP values of 0.2241, 0.4497, 0.5221, 0.5854, and 0.6043, respectively. CDPL also outperformed SOTA methods such as Denser Teacher [36], Focal Teacher [21], and PST [20] in several experimental settings. For example, CDPL achieved performance improvements of 1.43% and 1.57% over Denser Teacher under the extremely sparse 1% and 5% scale settings, respectively. This shows that CDPL remains competitive even when labeled data are scarce. At the 10% scale setting, CDPL achieved performance comparable to that of Focal Teacher and Denser Teacher, and it outperformed the SOTA methods at the 20% and 30% scale settings. In particular, CDPL achieved an mAP of 0.5854 under the 20% scale setting, which was a 1.05% improvement over Denser Teacher.
In addition, we compared CDPL with reproduced dense pseudo-label-based methods on the DOTA-v1.0 dataset, and the results are shown in Table 5. CDPL achieved the best performance in the majority of the experimental settings. Under the 30% scale setting, CDPL achieved an mAP of 0.6271, which was slightly lower than Denser Teacher's mAP of 0.6282. The above experimental results show that CDPL performs better than the SOTA methods on the DOTA dataset in most settings, indicating its strong performance in semi-supervised object detection in remote sensing images.
DIOR-R results: To further validate the effectiveness of CDPL on more datasets, we compared CDPL with the reproduced dense pseudo-labeling-based methods on the DIOR-R dataset. The results are shown in Table 6. It is clear that CDPL achieved the best results for most task settings. Specifically, it achieved mAP values of 0.2671, 0.4866, 0.5503, 0.5824, and 0.6004 under 1%, 5%, 10%, 20%, and 30% of labeled data, outperforming the supervised learning baseline methods by 7.38%, 11.21%, 11.37%, 10.27%, and 7.81%, respectively. Compared to Denser Teacher [36], CDPL showed significant improvement at all scale settings except 1%, where CDPL maintained performance comparable to that of Denser Teacher. The above experimental results further validate the effectiveness and robustness of CDPL.
In order to illustrate the advantages of CDPL more intuitively, Figure 6 shows a visual comparison of CDPL with SOTA methods. We found that CDPL had an advantage in detecting objects with large aspect ratios, which suggests that CDPL provides better supervisory information for such objects. In addition, CDPL was able to significantly reduce false negative (marked by red dashed circles) and false positive (marked by red solid circles) results, making object detection in remote sensing images more robust. Figure 7 demonstrates CDPL's detection results in different scenarios. It can be seen that CDPL obtained good performance for the detection of remote sensing image objects.
4.4. Model Efficiency Analysis
To analyze the efficiency of the proposed CDPL, we evaluated the model's GFLOPs, parameter count, FPS during inference, training speed, and model size during training. For comparison, the model efficiency of Dense Teacher [35], SOOD [19], and the previously proposed Denser Teacher [36] was also evaluated under identical hardware conditions. The experiments were performed under the 10% partially labeled setting on DOTA-v1.5. The corresponding results are presented in Table 7. Since semi-supervised object detection methods use the teacher model for inference during testing, model efficiency is largely determined by the base architecture. As shown in the table, although CDPL modified FCOS's auxiliary branch into a localization auxiliary branch, it maintained GFLOPs, parameter count, and FPS comparable to those of previous works. In terms of training efficiency, compared with Denser Teacher, CDPL significantly increased the training speed while achieving superior performance metrics and a more compact model size. Even though CDPL had a slightly larger model size than Dense Teacher, it still achieved comparable training speed due to its adaptive dense pseudo-label selection mechanism, which was more efficient than the fixed selection strategy used by Dense Teacher. Meanwhile, CDPL effectively improved the consistency of dense pseudo-label selection and target regression by introducing the FA-GDE and SPL-based RA modules, which could be the main reason why CDPL achieved significantly better performance than Dense Teacher. Notably, the training speed showed a strong correlation with the computational complexity of the training process. Taking SOOD as an example, despite having a parameter count comparable to that of CDPL, it suffered from significantly slower training due to the additional computational complexity introduced by the optimal transport (OT) loss function, a computationally intensive component not employed in CDPL. These experiments validated CDPL's efficiency.
4.5. Ablation Experiment
The results of the ablation experiments for different components of CDPL are shown in Table 8. We performed the ablation experiments under the 20% scale setting on DOTA-v1.5. It can be seen that the Rotated FCOS supervised learning baseline achieved an mAP of 0.5132. By introducing FA-GDE, the performance was significantly improved from an mAP of 0.5132 to 0.5712, which surpassed some of the methods in Table 4. By introducing SPL-based RA, the model's mAP was further improved to 0.5854, thus validating the effectiveness of each module in CDPL.
Because sparse pseudo-labeling-based regression object alignment involves a hyperparameter, the threshold t, we also conducted the corresponding ablation experiment. The experimental results are shown in Table 9. It can be observed that the model achieved an mAP of 0.5798 when no threshold operation was performed on the sparse pseudo-labels (i.e., when t = 0). When the threshold t was increased to 0.1, the model's performance increased only slightly, indicating that, at such a low threshold, the retained sparse pseudo-labels were still too noisy to provide accurate regression information. When the threshold t was increased to 0.3, 0.5, and 0.7, the model demonstrated a larger performance improvement. The model achieved the best mAP of 0.5854 when the threshold t was set to 0.5, which indicated a balance between selecting high-quality sparse pseudo-labels and retaining a sufficient number of them. Overall, the model was not sensitive to the setting of the threshold t, indicating the robustness of the proposed method.
5. Discussion
Even though the proposed CDPL was designed for remote sensing scenarios, we think that it would still work with other types of image data from other object detection tasks. The reasons are twofold. Firstly, the proposed CDPL follows a common framework of semi-supervised object detection in generalized scenarios (i.e., the teacher–student network based on pseudo-labels). On this basis, CDPL was designed with two main modules (i.e., FA-GDE and SPL-based RA) to make it more suitable for remote sensing scenarios, in which objects exhibit arbitrary angles, dense distributions, and large aspect ratios. Object detection in remote sensing scenarios is a rather hard task, whereas object detection in generalized scenarios is comparatively easier. Since the proposed CDPL deals well with the harder task, it is plausible that it could also be applied to object detection in generalized scenarios with other types of images.
Secondly, experimental results demonstrated its potential generalization capability through rigorous validation across diverse datasets. Evaluations on three distinct benchmarks, i.e., DOTA-v1.0, DOTA-v1.5, and DIOR-R, revealed consistent performance improvements despite their substantial distributional differences. For instance, DOTA-v1.5 introduces small targets (which contain no more than 10 pixels) absent in DOTA-v1.0, while DIOR-R exhibits more balanced category distributions and scale variations. Furthermore, comparative evaluations with general semi-supervised detection methods (e.g., Dense Teacher) highlighted the proposed CDPL’s superior robustness to dataset shifts.
We think that the main limitation of the proposed CDPL is that its evaluation is restricted to visible light remote sensing images due to dataset constraints. However, remote sensing object detection inherently faces challenges posed by multi-scale, multi-modal (e.g., optical, SAR, infrared), and multi-temporal data in complex scenarios. Given the recent emergence of high-quality large-scale SAR datasets, there is great potential in exploring semi-supervised object detection for SAR or multi-modal remote sensing images.
6. Conclusions
To deal with the feature inconsistency problem of dense pseudo-label selection and the regression object inconsistency of single-instance dense pseudo-labels, a novel semi-supervised learning object detection method based on dense pseudo-labels, called CDPL, was proposed in this paper. CDPL consists of two main components, that is, the feature-aligned dense pseudo-label selection and the sparse pseudo-label-based regression object alignment. Firstly, we analytically identified the consistency issues inherent in dense pseudo-labels. To resolve feature inconsistency during dense pseudo-label selection, we introduced feature-aligned global dynamic K-estimation and used a dedicated localization branch to assist pseudo-label selection. Additionally, addressing the often-neglected regression target inconsistency in single-instance dense pseudo-labels, we proposed a sparse pseudo-label guided regression alignment. By incorporating sparse pseudo-labels into dense pseudo-label supervision, we improved regression consistency. Extensive experimental results on typical remote sensing image datasets showed that the proposed method had satisfactory performance improvement over SOTA related work.