4.2. Implementation Details
We propose a semi-supervised object detection method for complex maritime scenes based on adaptive adversarial self-training, which can be trained end-to-end in a semi-supervised manner. All training data are fed into the detection model in batches to minimize the objective function using SGD with momentum. Due to GPU memory limitations, the batch size is set to 2, consisting of labeled and unlabeled data. The number of weak augmentations is also set to 2. The input images are resized so that the shorter side is 800 pixels. The number of training epochs is set to 30 to achieve convergence, and the first 5 epochs are dedicated to fully supervised training using only the labeled data to stabilize model training. The hyperparameters of the objective function in Equation (1) are set as follows:
the weight of the unsupervised loss is chosen to alleviate the pseudo-label bias in the early training stage, and the weight of the adaptive adversarial loss serves as a regularization term to reduce the distribution bias. The EMA hyperparameter in Equation (2) is set to 0.999. The hyperparameter of the sharpening function in Equation (12) is set to 0.5. The upper threshold in Equation (13) is experimentally set to 0.9, which is a sufficiently high confidence. Note that the proposed method and the compared methods use FPN [40] and the same data augmentations, including Cutout, random fog, random rain, random sun flare, and KeepAugment [44], to deal with the complex maritime scenes.
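For concreteness, the sketch below illustrates this training schedule in PyTorch-style code. It is an illustration under assumptions rather than the authors' implementation: `detector`, `teacher`, the three loss functions, `loader`, and the learning rate are placeholders; only the 5-epoch fully supervised burn-in, the EMA momentum of 0.999, and the sharpening value of 0.5 are taken from the text above.

```python
import torch

# Sketch of the training schedule; `detector`, `teacher`, the loss
# functions, `loader`, and the learning rate are placeholders.
optimizer = torch.optim.SGD(detector.parameters(), lr=0.01, momentum=0.9)

BURN_IN_EPOCHS = 5      # fully supervised warm-up for training stability
TOTAL_EPOCHS = 30
EMA_MOMENTUM = 0.999    # teacher update momentum, Equation (2)

def sharpen(p, T=0.5):
    # Temperature sharpening of class probabilities (cf. Equation (12));
    # T = 0.5 as reported above.
    p = p ** (1.0 / T)
    return p / p.sum(dim=-1, keepdim=True)

for epoch in range(TOTAL_EPOCHS):
    for labeled_batch, unlabeled_batch in loader:  # batch size 2 in total
        loss = supervised_loss(detector, labeled_batch)
        if epoch >= BURN_IN_EPOCHS:
            # add the unsupervised and adaptive adversarial terms of Equation (1)
            loss = loss + unsupervised_loss(detector, teacher, unlabeled_batch)
            loss = loss + adversarial_loss(detector, labeled_batch, unlabeled_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # EMA update of the teacher model (Equation (2))
        with torch.no_grad():
            for t, s in zip(teacher.parameters(), detector.parameters()):
                t.mul_(EMA_MOMENTUM).add_(s, alpha=1.0 - EMA_MOMENTUM)
```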
4.3. Experimental Results
To evaluate the proposed method in complex maritime scenes, three experiments are conducted: (i) ablation studies on ADD and LAA; (ii) performance with various data amounts; (iii) performance comparison with different methods.
(i) Ablation studies on ADD and LAA: In our semi-supervised training scheme, ADD and LAA are proposed to reduce the distribution bias and the pseudo-label bias, respectively. To analyze the influence of ADD and LAA, we conduct experiments by setting their respective weights in Equation (1) to zero. We use 10% of the training samples as the labeled data (1358 samples) and the remaining samples (12,226) as the unlabeled data. Experimental results are shown in Table 3: the proposed method with both ADD and LAA obtains the best mAPs of 92.9%, 88.7%, 92.7% and 93.8% in the corresponding maritime scenes. Over all maritime scenes, the fully supervised baseline method [15], the method with only LAA, and the method with only ADD obtain mAPs of 90.5%, 90.2% and 90.9%, respectively. The fact that LAA alone performs slightly below the baseline while ADD improves upon it demonstrates that the strong–weak augmentations for labeled and unlabeled data can produce a distribution bias, and that the proposed ADD helps to reduce this bias.
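Assuming Equation (1) is a weighted sum of the supervised, unsupervised, and adaptive adversarial losses (the weight names below are placeholders, not the paper's notation), the ablation amounts to zeroing one weight at a time:

```python
def objective(l_sup, l_unsup, l_adv, w_unsup=1.0, w_adv=1.0):
    # Overall objective as a weighted sum (cf. Equation (1)); the default
    # weight values are placeholders, not the paper's settings.
    return l_sup + w_unsup * l_unsup + w_adv * l_adv

# Ablation variants corresponding to Table 3:
loss_full     = objective(l_sup, l_unsup, l_adv)               # ADD + LAA
loss_only_laa = objective(l_sup, l_unsup, l_adv, w_adv=0.0)    # ADD removed
loss_only_add = objective(l_sup, l_unsup, l_adv, w_unsup=0.0)  # LAA removed
```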
To confirm that the distribution bias originates from the strong–weak augmentation, we reduce the dimensionality of the features of the original data and their augmented versions and visualize them. The features are extracted from the last layer of the backbone. As shown in Figure 4a, the distribution of the augmented data is offset from that of the original data, and some feature points of the augmented data form an isolated cluster at the bottom. In Figure 4b, the distribution of the augmented data matches that of the original data, laying a solid foundation for effective data augmentation.
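The paper does not name the dimensionality-reduction technique; the sketch below assumes t-SNE, with placeholder arrays `feats_orig` and `feats_aug` holding the last-layer backbone features of the original and augmented images:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# feats_orig, feats_aug: (N, D) arrays of last-layer backbone features
# for the original images and their augmented versions (placeholders).
emb = TSNE(n_components=2, random_state=0).fit_transform(
    np.concatenate([feats_orig, feats_aug], axis=0)
)

n = len(feats_orig)
plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="original")
plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="augmented")
plt.legend()
plt.show()
```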
To highlight the advantages of the proposed adaptive threshold, a comparison of different threshold strategies is shown in Table 4. The adaptive threshold strategy with LAA obtains the highest mAP of 92.9%, while the fixed threshold strategy requires a manual search for the optimal threshold, which may not be suitable across different datasets. The evolution of the adaptive threshold during training is shown in Figure 5. As seen in Figure 5a, the thresholds of all classes gradually converge to around 0.8, which partly explains why the fixed threshold of 0.8 achieves the second-highest mAP of 89.9%. Moreover, the proposed LAA provides a more appropriate threshold according to the learning status of the detection model, which reduces the pseudo-label bias and improves the performance of semi-supervised object detection.
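To illustrate the general idea of a threshold that follows the model's learning status (the exact LAA update is given by the paper's equations and is not reproduced here), a hypothetical per-class rule could track the mean pseudo-label confidence with an EMA, capped at the upper threshold of 0.9:

```python
import torch

NUM_CLASSES = 6                                # placeholder class count
TAU_UPPER = 0.9                                # upper threshold, Equation (13)
thresholds = torch.full((NUM_CLASSES,), 0.5)   # placeholder initial value

def update_thresholds(scores, labels, momentum=0.99):
    # Hypothetical illustration, not the paper's LAA rule: track each
    # class's mean pseudo-label confidence with an EMA, capped at 0.9.
    for c in labels.unique():
        mean_conf = scores[labels == c].mean()
        thresholds[c] = momentum * thresholds[c] + (1 - momentum) * mean_conf
    thresholds.clamp_(max=TAU_UPPER)
```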
(ii) Performance with various data amounts: To evaluate sensitivity to the amount of data, we conduct this experiment by changing the proportions of labeled and unlabeled samples. The labeled and unlabeled pools contain 1358 and 12,226 samples in total, respectively. We set four ratios (1%, 2%, 5% and 10%) for the labeled samples and three ratios (0%, 50% and 100%) for the unlabeled samples. Note that a 0% ratio of unlabeled samples denotes the fully supervised baseline. As shown in Table 5, the more labeled samples, the higher the mAP, which aligns with the common intuition in training detection models. Similarly, more unlabeled samples also improve the mAP in Table 5. However, the following two cases attract our attention. (1) Little improvement is achieved when training with unlabeled samples and only 1% labeled samples, which suggests that the number of labeled samples is too small to effectively leverage the unlabeled samples. (2) The mAP obtained by combining 2% of the labeled samples with all unlabeled samples is higher than that achieved with 5% of the labeled samples in a fully supervised manner. The same phenomenon occurs when combining 5% of the labeled samples with all unlabeled samples, whose mAP is higher than that with 10% of the labeled samples in a fully supervised manner. This demonstrates that a large number of unlabeled samples can bring greater performance gains than a limited number of additional labeled samples. More discussion is presented in Section 5.
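For reference, a minimal sketch of the subsampling protocol used in this experiment; `labeled_pool` and `unlabeled_pool` are placeholder lists of training samples:

```python
import random

def sample_ratio(pool, ratio, seed=0):
    # Randomly keep `ratio` of a data pool (1%, 2%, 5% or 10% of the
    # labeled pool; 0%, 50% or 100% of the unlabeled pool).
    rng = random.Random(seed)
    return rng.sample(pool, int(len(pool) * ratio))

labeled_subset = sample_ratio(labeled_pool, 0.05)     # e.g., 5% labeled
unlabeled_subset = sample_ratio(unlabeled_pool, 1.0)  # all unlabeled
```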
(iii) Performance of different methods: To highlight the advantage of semi-supervised learning, we compare the proposed method with other existing detection methods [10,15,38,48], using 5% of the labeled samples and all unlabeled data. All methods are trained with the same hyperparameters and tested on the same test set covering the complex maritime scenes of occlusion, scale variations and lighting variations. Faster R-CNN [15] is considered the baseline method, trained only with labeled samples.
Experimental results are listed in Table 6. The baseline method [15] achieves mAPs of 79.8%, 76.9%, 79.7% and 80.4% in the corresponding maritime scenes. YOLOv8 [48], which is specifically improved for small object detection, achieves mAPs of 85.8%, 83.3%, 85.1% and 87.6% in the corresponding scenes, slightly higher than STAC [38]. All the compared semi-supervised detection methods achieve higher mAPs than the baseline method, demonstrating the strength of semi-supervised detection and its potential to reduce labeling costs. Specifically, the semi-supervised detection method STAC [38] uses pseudo-labels to improve the detection performance and achieves mAPs of 85.6%, 85.5%, 86.9% and 89.5% in the corresponding maritime scenes. The other semi-supervised detection method, UT [10], which combines the mean teacher framework [29] with pseudo-labeling, achieves the second-best mAPs of 87.4%, 85.5%, 86.9% and 89.5%. The proposed method uses ADD and LAA to obtain the best mAPs of 91.4%, 89.3%, 91.0% and 92.6% in the corresponding maritime scenes. Our method aligns the feature distributions under strong–weak augmentation and reduces the pseudo-label bias, thereby improving detection performance in complex maritime scenes. Regarding ACE, the proposed method and YOLOv8 show the best values, indicating minimal center deviation and thus superior localization accuracy. Since the methods compared in this part are implemented in the same two-stage detection framework, their detection speeds are considered to be roughly the same. The FPS of the proposed method is 11.1, which meets the real-time requirement.
The visualization results in complex maritime scenes for the proposed method and UT [10] are shown in Figure 6. Different colors indicate different target classes. Different columns show the detection results of the different methods, and the ground truths (GT) are shown in the first column. Complex maritime scenes, including occlusion, scale variations and lighting variations, are presented in different rows. In Figure 6(1), the occlusion scene, the proposed method detects two targets with high confidence scores, while UT [10] produces a false positive of a bulk cargo carrier. In Figure 6(2), the scale-variation scene, the proposed method detects all the targets, while UT [10] misses some small targets in the distance. In Figure 6(3), the lighting-variation scene, the proposed method misses one hard target, while UT [10] generates false positives.