4.2. Implementation Details
We propose a semi-supervised object detection method for complex maritime scenes based on adaptive adversarial self-training, which can be trained end-to-end in a semi-supervised manner. All training data are fed into the detection model in batches to minimize the objective function using SGD with momentum. Due to GPU memory limitations, the batch size is set to 2, consisting of labeled and unlabeled data. The number of weak augmentations is also set to 2. The input images are resized so that the shorter side is 800 pixels. The number of training epochs is set to 30 to achieve convergence, and the first 5 epochs are dedicated to fully supervised training using only the labeled data to stabilize model training. The hyperparameters of the objective function in Equation (1) are set as follows:
the weight of the unsupervised loss is chosen to alleviate the pseudo-label bias in the early training stage, and the weight of the adaptive adversarial loss serves as a regularization term to reduce the distribution bias. The EMA hyperparameter in Equation (2) is set to 0.999. The hyperparameter of the sharpening function in Equation (12) is set to 0.5. The upper threshold in Equation (13) is experimentally set to 0.9, which is a sufficiently high confidence. Note that the proposed method and the compared methods use FPN [40] and the same data augmentations, including Cutout, random fog, random rain, random sun flare, and KeepAugment [44], to deal with the complex maritime scenes.
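For concreteness, the sketch below illustrates this training schedule in PyTorch-style code. It is an illustration under assumptions rather than the authors' implementation: `detector`, `teacher`, the three loss functions, `loader`, and the learning rate are placeholders; only the 5-epoch fully supervised burn-in, the EMA momentum of 0.999, and the sharpening value of 0.5 are taken from the text above.

```python
import torch

# Sketch of the training schedule; `detector`, `teacher`, the loss
# functions, `loader`, and the learning rate are placeholders.
optimizer = torch.optim.SGD(detector.parameters(), lr=0.01, momentum=0.9)

BURN_IN_EPOCHS = 5      # fully supervised warm-up for training stability
TOTAL_EPOCHS = 30
EMA_MOMENTUM = 0.999    # teacher update momentum, Equation (2)

def sharpen(p, T=0.5):
    # Temperature sharpening of class probabilities (cf. Equation (12));
    # T = 0.5 as reported above.
    p = p ** (1.0 / T)
    return p / p.sum(dim=-1, keepdim=True)

for epoch in range(TOTAL_EPOCHS):
    for labeled_batch, unlabeled_batch in loader:  # batch size 2 in total
        loss = supervised_loss(detector, labeled_batch)
        if epoch >= BURN_IN_EPOCHS:
            # add the unsupervised and adaptive adversarial terms of Equation (1)
            loss = loss + unsupervised_loss(detector, teacher, unlabeled_batch)
            loss = loss + adversarial_loss(detector, labeled_batch, unlabeled_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # EMA update of the teacher model (Equation (2))
        with torch.no_grad():
            for t, s in zip(teacher.parameters(), detector.parameters()):
                t.mul_(EMA_MOMENTUM).add_(s, alpha=1.0 - EMA_MOMENTUM)
```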
4.3. Experimental Results
To evaluate the proposed method in complex maritime scenes, three experiments are conducted: (i) ablation studies on ADD and LAA; (ii) performance with various data amounts; (iii) performance comparison with different methods.
(i) Ablation studies on ADD and LAA: In our semi-supervised training scheme, ADD and LAA are proposed to reduce the distribution bias and the pseudo-label bias, respectively. To analyze the influence of ADD and LAA, we conduct experiments by setting their respective weights in Equation (1) to zero. We use 10% of the training samples as the labeled data (1358 samples) and the remaining samples (12,226) as the unlabeled data. Experimental results are shown in Table 3: the proposed method with both ADD and LAA obtains the best mAPs of 92.9%, 88.7%, 92.7% and 93.8% in the corresponding maritime scenes. Over all maritime scenes, the fully supervised baseline method [15], the method with only LAA, and the method with only ADD obtain mAPs of 90.5%, 90.2% and 90.9%, respectively. The fact that LAA alone performs slightly below the baseline while ADD improves upon it demonstrates that the strong–weak augmentations for labeled and unlabeled data can produce a distribution bias, and that the proposed ADD helps to reduce this bias.
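Assuming Equation (1) is a weighted sum of the supervised, unsupervised, and adaptive adversarial losses (the weight names below are placeholders, not the paper's notation), the ablation amounts to zeroing one weight at a time:

```python
def objective(l_sup, l_unsup, l_adv, w_unsup=1.0, w_adv=1.0):
    # Overall objective as a weighted sum (cf. Equation (1)); the default
    # weight values are placeholders, not the paper's settings.
    return l_sup + w_unsup * l_unsup + w_adv * l_adv

# Ablation variants corresponding to Table 3:
loss_full     = objective(l_sup, l_unsup, l_adv)               # ADD + LAA
loss_only_laa = objective(l_sup, l_unsup, l_adv, w_adv=0.0)    # ADD removed
loss_only_add = objective(l_sup, l_unsup, l_adv, w_unsup=0.0)  # LAA removed
```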
To confirm that the distribution bias originates from the strong–weak augmentation, we reduce the dimensionality of the features of the original data and their augmented versions and visualize them. The features are extracted from the last layer of the backbone. As shown in Figure 4a, the distribution of the augmented data is offset from that of the original data, and some feature points of the augmented data form an isolated cluster at the bottom. In Figure 4b, the distribution of the augmented data matches that of the original data, laying a solid foundation for effective data augmentation.
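The paper does not name the dimensionality-reduction technique; the sketch below assumes t-SNE, with placeholder arrays `feats_orig` and `feats_aug` holding the last-layer backbone features of the original and augmented images:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# feats_orig, feats_aug: (N, D) arrays of last-layer backbone features
# for the original images and their augmented versions (placeholders).
emb = TSNE(n_components=2, random_state=0).fit_transform(
    np.concatenate([feats_orig, feats_aug], axis=0)
)

n = len(feats_orig)
plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="original")
plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="augmented")
plt.legend()
plt.show()
```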
To highlight the advantages of the proposed adaptive threshold, a comparison of different threshold strategies is shown in Table 4. The adaptive threshold strategy with LAA obtains the highest mAP of 92.9%, while the fixed threshold strategy requires a manual search for the optimal threshold, which may not be suitable across different datasets. The evolution of the adaptive threshold during training is shown in Figure 5. As seen in Figure 5a, the thresholds of all classes gradually converge to around 0.8, which partly explains why the fixed threshold of 0.8 achieves the second-highest mAP of 89.9%. Moreover, the proposed LAA provides a more appropriate threshold according to the learning status of the detection model, which reduces the pseudo-label bias and improves the performance of semi-supervised object detection.
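To illustrate the general idea of a threshold that follows the model's learning status (the exact LAA update is given by the paper's equations and is not reproduced here), a hypothetical per-class rule could track the mean pseudo-label confidence with an EMA, capped at the upper threshold of 0.9:

```python
import torch

NUM_CLASSES = 6                                # placeholder class count
TAU_UPPER = 0.9                                # upper threshold, Equation (13)
thresholds = torch.full((NUM_CLASSES,), 0.5)   # placeholder initial value

def update_thresholds(scores, labels, momentum=0.99):
    # Hypothetical illustration, not the paper's LAA rule: track each
    # class's mean pseudo-label confidence with an EMA, capped at 0.9.
    for c in labels.unique():
        mean_conf = scores[labels == c].mean()
        thresholds[c] = momentum * thresholds[c] + (1 - momentum) * mean_conf
    thresholds.clamp_(max=TAU_UPPER)
```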
(ii) Performance with various data amounts: To evaluate sensitivity to the amount of data, we conduct this experiment by changing the proportions of labeled and unlabeled samples. The labeled and unlabeled pools contain 1358 and 12,226 samples in total, respectively. We set four ratios (1%, 2%, 5% and 10%) for the labeled samples and three ratios (0%, 50% and 100%) for the unlabeled samples. Note that a 0% ratio of unlabeled samples denotes the fully supervised baseline. As shown in Table 5, the more labeled samples, the higher the mAP, which aligns with the common intuition in training detection models. Similarly, more unlabeled samples also improve the mAP in Table 5. However, the following two cases attract our attention. (1) Little improvement is achieved when training with unlabeled samples and only 1% labeled samples, which suggests that the number of labeled samples is too small to effectively leverage the unlabeled samples. (2) The mAP obtained by combining 2% of the labeled samples with all unlabeled samples is higher than that achieved with 5% of the labeled samples in a fully supervised manner. The same phenomenon occurs when combining 5% of the labeled samples with all unlabeled samples, whose mAP is higher than that with 10% of the labeled samples in a fully supervised manner. This demonstrates that a large number of unlabeled samples can bring greater performance gains than a limited number of additional labeled samples. More discussion is presented in Section 5.
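For reference, a minimal sketch of the subsampling protocol used in this experiment; `labeled_pool` and `unlabeled_pool` are placeholder lists of training samples:

```python
import random

def sample_ratio(pool, ratio, seed=0):
    # Randomly keep `ratio` of a data pool (1%, 2%, 5% or 10% of the
    # labeled pool; 0%, 50% or 100% of the unlabeled pool).
    rng = random.Random(seed)
    return rng.sample(pool, int(len(pool) * ratio))

labeled_subset = sample_ratio(labeled_pool, 0.05)     # e.g., 5% labeled
unlabeled_subset = sample_ratio(unlabeled_pool, 1.0)  # all unlabeled
```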
(iii) Performance of different methods: To highlight the advantage of semi-supervised learning, we compare the proposed method with other existing detection methods [10,15,38,48], using 5% of the labeled samples and all unlabeled data. All methods are trained with the same hyperparameters and tested on the same test set covering the complex maritime scenes of occlusion, scale variations and lighting variations. Faster R-CNN [15] is considered the baseline method, trained only with labeled samples.
Experimental results are listed in Table 6. The baseline method [15] achieves mAPs of 79.8%, 76.9%, 79.7% and 80.4% in the corresponding maritime scenes. YOLOv8 [48], which is specifically improved for small object detection, achieves mAPs of 85.8%, 83.3%, 85.1% and 87.6% in the corresponding scenes, slightly higher than STAC [38]. All the compared semi-supervised detection methods achieve higher mAPs than the baseline method, demonstrating the strength of semi-supervised detection and its potential to reduce labeling costs. Specifically, the semi-supervised detection method STAC [38] uses pseudo-labels to improve the detection performance and achieves mAPs of 85.6%, 85.5%, 86.9% and 89.5% in the corresponding maritime scenes. The other semi-supervised detection method, UT [10], which combines the mean teacher framework [29] with pseudo-labeling, achieves the second-best mAPs of 87.4%, 85.5%, 86.9% and 89.5%. The proposed method uses ADD and LAA to obtain the best mAPs of 91.4%, 89.3%, 91.0% and 92.6% in the corresponding maritime scenes. Our method aligns the feature distributions under strong–weak augmentation and reduces the pseudo-label bias, thereby improving detection performance in complex maritime scenes. Regarding ACE, the proposed method and YOLOv8 show the best values, indicating minimal center deviation and thus superior localization accuracy. Since the methods compared in this part are implemented in the same two-stage detection framework, their detection speeds are considered to be roughly the same. The FPS of the proposed method is 11.1, which meets the real-time requirement.
The visualization results in complex maritime scenes for the proposed method and UT [10] are shown in Figure 6. Different colors indicate different target classes. Different columns show the detection results of the different methods, and the ground truths (GT) are shown in the first column. Complex maritime scenes, including occlusion, scale variations and lighting variations, are presented in different rows. In Figure 6(1), the occlusion scene, the proposed method detects two targets with high confidence scores, while UT [10] produces a false positive of a bulk cargo carrier. In Figure 6(2), the scale-variation scene, the proposed method detects all the targets, while UT [10] misses some small targets in the distance. In Figure 6(3), the lighting-variation scene, the proposed method misses one hard target, while UT [10] generates false positives.