Article

A Semi-Supervised Object Detection Algorithm Based on Teacher-Student Models with Strong-Weak Heads

Xiaowei Cai, Fuyi Luo, Wei Qi and Hong Liu
1 School of Information and Electrical Engineering, Zhejiang University City College, Hangzhou 310015, China
2 College of Information and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
3 College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(23), 3849; https://doi.org/10.3390/electronics11233849
Submission received: 27 October 2022 / Revised: 17 November 2022 / Accepted: 21 November 2022 / Published: 22 November 2022

Abstract:
Semi-supervised object detection algorithms based on the self-training paradigm produce pseudo bounding boxes with unavoidable noise. We propose a semi-supervised object detection algorithm based on teacher-student models with strong-weak heads to cope with this problem. The strong and weak heads of the teacher model solve the quality measurement problem of pseudo label localization to obtain higher-quality pseudo labels. The strong and weak heads of the student model are decoupled to reduce the negative impact of pseudo label noise on classification and regression. We reach 52.5 mAP (+1.8) on the PASCAL visual object classes (PASCAL VOC) dataset and even up to 53.5 mAP (+3.2) by using Microsoft common objects in context (MS-COCO) train2017 as additional unlabeled data. On the MS-COCO dataset, our method also improves by about 1.0 mAP under the experimental configurations with 10% COCO and COCO-full as labeled data.

1. Introduction

Object detection has made significant progress with the development of deep convolutional neural networks (CNNs). Training an accurate object detector requires a large, well-annotated dataset. Semi-supervised learning (SSL) has produced many creative results for image classification, but comparatively few academic results exist for semi-supervised object detection (SSOD).
Pseudo-labeling methods [1,2,3,4] have recently gained considerable attention in SSOD. Specifically, a teacher model generates a set of prediction boxes on unlabeled images and filters out low-confidence predictions with a confidence threshold. The remaining boxes, named pseudo-boxes, are then used as targets to train student models. However, noisy pseudo labels lead to two critical problems. First, the confidence comes from the output of the classifier, so it can be used to filter the classification but should not be used to filter the localization. Second, mixing pseudo labels and ground truth labels in a single branch for training can cancel out the correct parameters learned from the ground truth data, leading to performance degradation.
To overcome the above-mentioned problems, we propose a semi-supervised object detection algorithm based on the FasterRCNN [5] framework using teacher-student models with strong-weak heads. At the framework level, we follow the Mean Teacher [6] paradigm, which works well with large datasets and improves both the speed of learning and the classification accuracy of the trained network; we migrate it from classification to object detection. Specifically, our teacher model generates pseudo annotations for unlabeled images and filters out high-quality pseudo labels. The student model uses these pseudo labels for semi-supervised training and updates its weights by training, while the teacher updates its weights from the student model by exponential moving average (EMA).
In more detail, we introduce dual heads for the teacher and student models, respectively. We regard the strong head inference results of the teacher model as pseudo labels and use the weak head inference results to filter them. Other methods [1,2,3,4] using self-training in SSOD have no way to calculate the intersection over union (IOU) of the pseudo boxes. In contrast, because there is a one-to-one correspondence between the prediction boxes of our teacher model's dual heads, the IOU between them can measure the localization quality of the pseudo labels. Compared with others, the two-head structure of our teacher model solves the localization quality measurement problem of pseudo labels.
The strong and weak heads of the student model can better utilize the pseudo and ground truth labels for training. Other self-training approaches [1,2,3,4] use a single branch to mix pseudo labels and ground truth data for training. Such a method lets pseudo label noise severely interfere with the proper weights learned from the ground truth data, especially in the fully connected layers that directly determine the final classification and regression output. Compared with others, our student model extracts features from pseudo and ground truth labels through the shared network and reduces the negative impact of noise in low-quality pseudo labels on classification and regression by decoupling the strong and weak heads. Furthermore, we introduce the convolutional block attention module (CBAM) [7] to enhance the feature extraction capability of the dual-head shared part in both the spatial and channel dimensions.
The main contributions of this paper are as follows:
  • We propose a novel semi-supervised object detection algorithm with competitive performance, validated by extensive experiments on the PASCAL VOC and MS-COCO datasets.
  • In the teacher model, the strong head inference results generate the pseudo labels, and the IOUs computed against the weak head inference results measure the localization quality of these pseudo labels, providing a basis for filtering the pseudo annotations.
  • In the student model, the strong head focuses on clean data (ground truth labels and high-quality pseudo labels), while the weak head can extract features from both clean data and dirty data (low-quality pseudo labels) without letting the noise in the dirty data pollute the strong head weights.
This paper is organized as follows: Section 2 illustrates the related works; Section 3 exhibits the details of our method; we present the experimental results and comparisons to other methods in Section 4; and, finally, Section 5 concludes the paper.

2. Related Work

2.1. Object Detection

Object detection [5,8,9,10,11,12,13,14,15] has received considerable attention in recent years. Object detection methods can be classified into two categories: two-stage and one-stage. Two-stage methods [5,12,14,16,17] divide the detection process into two steps: first generating regions of interest (ROI) through a region proposal network (RPN), and then classifying and localizing within the ROI. One-stage methods [10,11,13] do not need to generate regions of interest but directly produce classification probabilities and regression results in one step. Generally, two-stage algorithms are more accurate, while one-stage algorithms are faster. Relying on massive labeled data, these detectors perform well in pedestrian detection [18], vehicle detection [19] and other applications [20,21,22,23,24]. In our work, we develop our framework based on FasterRCNN [5], the classical two-stage object detector. Unlike previous approaches that train the model only on labeled data, we train our object detector on labeled and unlabeled data with our proposed semi-supervised learning strategy.

2.2. Semi-Supervised Image Classification

In the field of semi-supervised image classification, there are several recent research results, and two research directions have received much attention. One is self-training [25,26], also known as the pseudo-labeling method, a popular SSL approach in which a pre-trained model generates pseudo labels on unlabeled images to guide model training. The other is consistency regularization [27,28,29], whose core idea is to make the predicted outputs of the same image as consistent as possible under perturbations. These perturbations include model dithering [25], data augmentation [27], and feature-based augmentation [30]. Data augmentation is widely used and includes Mosaic, Mixup, CutOut, and reinforcement-learning-based strategies [31,32], in addition to rotation, panning, flipping, and color dithering. Rational data augmentation improves the robustness of deep neural networks [33] and dramatically enhances performance on semi-supervised image classification tasks. Mean teacher approaches [2,4,6,34,35,36] use a teacher-student dual-model structure, where the teacher updates its weights from the student's weights by EMA. We follow some of these ideas from SSL, but our core aim is to further address the localization problem in SSOD.

2.3. Semi-Supervised Object Detection

Semi-supervised object detection [1,2,3,4,37,38,39,40,41,42,43,44] applies semi-supervised learning methods to object detection by training detectors on labeled and unlabeled data. Recently, several papers have been published on SSOD. CSD [38] utilizes a consistency algorithm that encourages the outputs of the original image and the perturbed image to remain unchanged. STAC [1] is a self-training approach in which a pre-trained teacher model generates pseudo labels on unlabeled images to train a student model. STAC [1] lays down a paradigm for semi-supervised object detection research, but generating pseudo labels only once limits its performance. Instant-teaching [3] generates pseudo labels and designs a teacher-student model interaction learning algorithm. ISMT [4] considers historical pseudo labels, but its performance is not superior. All the self-training methods mentioned above use a confidence threshold to coarsely filter pseudo labels, which inevitably leads to poor pseudo label localization quality. Considering this, Unbiased Teacher [2] does not involve pseudo boxes in the regression training; however, this wastes the valid localization information in the noisy pseudo boxes. In this paper, we continue the research paradigm of STAC [1] and propose a new semi-supervised object detection algorithm. The teacher model dynamically generates pseudo labels for the student model, and the two-head prediction boxes of our teacher model can be used to calculate IOUs as a basis for localization filtering. The student model fully extracts features from pseudo labels with a CBAM [7] and decouples the strong and weak heads to reduce the negative impact of low-quality pseudo label noise on classification and regression.

3. Methodology

We want to solve the object detection problem under semi-supervised conditions, where $D_s = \{x_i^l, y_i^l\}_{i=1}^{N_l}$ denotes the labeled images and $D_u = \{x_i^u, y_i^u\}_{i=1}^{N_u}$ denotes the unlabeled images. $x_i^l$ and $x_i^u$ denote the $i$th labeled image and the $i$th unlabeled image, respectively; $y_i^l$ and $y_i^u$ denote the ground truth labels of the $i$th labeled image and the pseudo labels of the $i$th unlabeled image. Both ground truth labels and pseudo labels consist of classification and regression annotations; the difference is that the teacher model generates the pseudo labels. $N_l$ and $N_u$ denote the numbers of labeled and unlabeled images.

3.1. Method Overview

The framework of the algorithm is shown in Figure 1. First, the strong head of the pre-trained teacher model generates pseudo labels for the weakly augmented unlabeled images. Their classification confidences distinguish the classification quality of the pseudo labels against a classification threshold. Both the model structures and the input features of the two heads are the same; consequently, there is a one-to-one correspondence between their regression boxes, whose IOUs can be compared against an IOU threshold to determine the pseudo labels' localization quality. When training the student model, the strong head learns from clean data (ground truth labels and high-quality pseudo labels), while the weak head learns from both clean data and dirty data (low-quality pseudo labels).
Regarding the whole framework, our teacher-student model has two training stages: a pre-training stage and a semi-supervised training stage. In the pre-training stage, we train the teacher model with a small amount of ground truth data, and we copy the initialization parameters of the pre-trained teacher model to the student model before semi-supervised training begins. In STAC [1], a trained teacher model generates pseudo labels once and never updates them, so the training of the student model rests entirely on poor-quality, static pseudo labels, which limits its performance. In our approach, the teacher model dynamically generates pseudo labels for the student model, and the student model passes its parameters to the teacher model via EMA. During the semi-supervised training process, the teacher model's parameters are continuously updated, generating higher-quality pseudo labels that can better train the student model.
In terms of the internal structure of the student and teacher models, unlike the ordinary FasterRCNN [5] model, we design a decoupled two-head structure with a shared network, sketched below. This design differs both from using a single head to train on high-quality and low-quality data together, and from using two completely independent models to train on the two kinds of data separately and averaging their predictions. Compared with the former, decoupling the strong and weak heads in our scheme avoids the interference of low-quality pseudo-label noise with the classification and regression results; compared with the latter, our approach draws on the idea of multi-task learning to mitigate overfitting. Here, high-quality data refers to ground truth labels and high-quality pseudo labels, and low-quality data refers to low-quality pseudo labels, as elaborated later. In the shared part of the two heads, a CBAM [7] further extracts common features along both the channel and spatial dimensions.
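The following is a minimal PyTorch-style sketch of this decoupled two-head structure with a shared network. All class and module names (ROIHead, DualHeadDetector, the backbone/rpn interfaces) and the 7×7 ROI size are illustrative assumptions on our part, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ROIHead(nn.Module):
    """One box head: two FC layers followed by classification and regression."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                nn.Linear(1024, 1024), nn.ReLU())
        self.cls = nn.Linear(1024, num_classes + 1)   # +1 for background
        self.reg = nn.Linear(1024, num_classes * 4)

    def forward(self, x):                  # x: (N, C, 7, 7) ROI features
        x = self.fc(x.flatten(1))
        return self.cls(x), self.reg(x)

class DualHeadDetector(nn.Module):
    def __init__(self, backbone, rpn, cbam, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone   # shared feature extractor (ResNet-50 + FPN)
        self.rpn = rpn             # shared region proposal network (assumed interface)
        self.cbam = cbam           # shared attention module (Section 3.5)
        # Two structurally identical, separately initialized ROI heads.
        self.strong_head = ROIHead(feat_dim * 7 * 7, num_classes)  # clean data only
        self.weak_head = ROIHead(feat_dim * 7 * 7, num_classes)    # clean + dirty data

    def forward(self, images):
        feats = self.backbone(images)                        # (B, C, H, W)
        proposals = self.rpn(feats)                          # list of (L, 4) boxes
        rois = roi_align(feats, proposals, output_size=(7, 7))
        rois = self.cbam(rois)                               # refine shared ROI features
        # Both heads score the same proposals, so their output boxes are in
        # one-to-one correspondence (used for IOU filtering in Section 3.3).
        return self.strong_head(rois), self.weak_head(rois)
```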

3.2. Pre-Training Stage

We train the two-head model on a small amount of ground truth data in a fully supervised manner, with all supervised data going through both the strong and weak heads, which are randomly initialized separately. Specifically, we first use the available supervised data $D_s = \{x_i^l, y_i^l\}_{i=1}^{N_l}$ to optimize our model parameters with the supervised loss $L_{sup}$. The supervised loss for object detection, given in Equation (1), includes six terms: the RPN classification loss, the RPN regression loss, the ROI strong head classification loss, the ROI strong head regression loss, the ROI weak head classification loss, and the ROI weak head regression loss:
$L_{sup} = L_{cls}^{RPN} + L_{reg}^{RPN} + L_{cls}^{S} + L_{reg}^{S} + L_{cls}^{W} + L_{reg}^{W}$ (1)

3.3. Semi-Supervised Training Stage

After the pre-training phase, semi-supervised training begins.
In the first stage of semi-supervised training, the teacher model runs inference on the unlabeled images after weak data augmentation to generate and filter pseudo labels. Specifically, the teacher model's strong head generates candidate pseudo-label boxes $B^S = \{b_i^s\}_{i=1}^{N}$ with confidence scores $C^S = \{c_i^s\}_{i=1}^{N}$, and its weak head generates pseudo-label boxes $B^W = \{b_i^w\}_{i=1}^{N}$ with confidence scores $C^W = \{c_i^w\}_{i=1}^{N}$. The IOUs $IOU^{SW} = \{iou_i^{SW}\}_{i=1}^{N}$ between the paired boxes of $B^S$ and $B^W$ are calculated. We select the pseudo-label boxes $B^S$ and confidence scores $C^S$ obtained from the strong head as the pseudo labels for the next stage of semi-supervised training of the student model. $C^S$ is the basis for judging the classification quality of the pseudo labels, and $IOU^{SW}$ is the basis for judging the quality of the pseudo-label regression boxes, which solves the localization quality measurement problem left open in other papers [1,3,34], whose single-head model structures provide no way to compute such IOUs.
We filter the pseudo labels $B^S$ as follows. We first adopt non-maximum suppression (NMS) on the candidate pseudo-label boxes to avoid duplicates. Then, the confidence scores $C^S$ split the pseudo labels into high-quality and low-quality classification labels by the confidence threshold τ, and the IOUs $IOU^{SW}$ split them into high-quality and low-quality regression labels by the IOU threshold ε. When a set of pseudo labels consists of both high-quality classification and high-quality regression labels, we call them high-quality pseudo labels. When a set of pseudo labels consists of both low-quality classification and low-quality regression labels, it is excluded from the subsequent semi-supervised training. The remaining pseudo labels are called low-quality pseudo labels. This method completes the generation and quality grading of the pseudo labels.
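A sketch of this quality grading, assuming NMS has already been applied, is shown below. The thresholds follow Section 4.3 (τ = 0.7, ε = 0.6); the variable names are ours.

```python
import torch
from torchvision.ops import box_iou

def grade_pseudo_labels(boxes_s, scores_s, boxes_w, tau=0.7, eps=0.6):
    """boxes_s, boxes_w: (N, 4) paired strong-/weak-head boxes; scores_s: (N,)
    strong-head confidences. Returns boolean masks over the strong-head boxes."""
    # Paired boxes come from the same proposals, so only diagonal IOUs matter.
    ious = box_iou(boxes_s, boxes_w).diagonal()
    high_cls = scores_s >= tau            # high-quality classification labels
    high_reg = ious >= eps                # high-quality regression labels
    high_quality = high_cls & high_reg    # -> clean data for the strong head
    discarded = ~high_cls & ~high_reg     # low on both counts -> not used
    low_quality = ~(high_quality | discarded)  # -> dirty data, weak head only
    return high_quality, low_quality
```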
The second stage of semi-supervised learning trains the student model with clean and dirty data. We sample the ground truth and pseudo-labeled images equally and then perform strong data augmentation before delivering them to the student model for training. In the shared part of the network, the student model's RPN generates proposal boxes for the images and then extracts ROI features from these proposals using ROI Align [12]; the ROI features are refined further by CBAM [7]. In the two-head part of the student model, the features obtained in the previous step are fed into the strong-head and weak-head ROI networks, respectively: the clean data (ground truth data and high-quality pseudo-label data) is sent to the strong head only, while both clean and dirty data (low-quality pseudo-label data) are sent to the weak head, each to compute its loss. The total ROI loss during semi-supervised training comprises the classification and regression losses of the ROI strong head and the classification and regression losses of the ROI weak head:
$L_{semi}^{ROI} = L_{cls}^{ROI\text{-}S} + L_{reg}^{ROI\text{-}S} + L_{cls}^{ROI\text{-}W} + L_{reg}^{ROI\text{-}W}$ (2)
The total loss of the student model in the semi-supervised phase is:
$L_{semi} = L_{cls}^{RPN} + L_{reg}^{RPN} + L_{semi}^{ROI}$ (3)
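A hedged sketch of this loss routing (Equations (2)-(3)) follows: clean targets reach the strong head, while the weak head sees clean and dirty targets together. Here `roi_loss` stands in for the usual Faster R-CNN classification + regression loss, and the target dictionaries with a "quality" key are our assumptions.

```python
def semi_roi_loss(strong_head, weak_head, roi_feats, targets, roi_loss):
    # Clean = ground truth + high-quality pseudo labels; dirty = low-quality.
    clean = [t for t in targets if t["quality"] in ("gt", "high")]
    dirty = [t for t in targets if t["quality"] == "low"]
    loss_strong = roi_loss(strong_head, roi_feats, clean)       # L^ROI-S terms
    loss_weak = roi_loss(weak_head, roi_feats, clean + dirty)   # L^ROI-W terms
    return loss_strong + loss_weak                              # = L_semi^ROI
```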
At the end of each iteration of the student model, the teacher model parameters are updated by EMA from the students, and α is the update momentum:
$\theta_t \leftarrow (1 - \alpha)\,\theta_t + \alpha\,\theta_s$ (4)
After the parameter update, the teacher model generates new, higher-quality pseudo labels in the next iteration, and these pseudo labels are given to the student model for subsequent training. Unlike the static pseudo labels in STAC [1], our pseudo labels are updated dynamically.
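A direct implementation of the EMA update in Equation (4) is a one-liner per parameter; α = 1e-4 follows Section 4.3. Buffers (e.g., BatchNorm statistics) are omitted in this sketch.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=1e-4):
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(1.0 - alpha).add_(p_s, alpha=alpha)  # θ_t ← (1-α)θ_t + αθ_s
```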
The above is the complete training process, and for the inference process, we only use the strong head of the teacher model.

3.4. Strong-Weak Heads

Semi-supervised object detection methods [1,2,3,4] using pseudo labels cannot get around two key issues: how to measure pseudo label quality and how to train on pseudo labels rationally. In the previous section, we solved the pseudo label quality measurement problem by dividing pseudo labels into high-quality and low-quality ones based on the dual-head structure of the teacher model. Next, we use the two-head structure of the student model to solve the problem of training reasonably on pseudo labels and ground truth labels.
Pseudo labels differ from ground truth labels in that they are noisy and numerous. Treating ground truth and pseudo labels equally during training biases the gradient updates toward the noise of incorrect pseudo labels. Such an approach cancels out the correct parameters learned from the ground truth data, which leads to performance degradation. We design strong and weak dual heads to counter the negative effect of pseudo label noise on the network parameters.
We combine all the pseudo and ground truth labels and train a weak head on them. This weak head can learn some classification and localization information from the pseudo-label and ground truth data; however, the pseudo-label noise entering the network limits the performance of this coarse weak head. Therefore, we introduce a strong head that learns from the ground truth data and high-quality pseudo labels separately, avoiding contamination of the fully connected layer weights by noise in the low-quality pseudo labels.
Specifically, the strong and weak heads share the rest of the network except the ROI head. Pseudo labels and ground truth labels are sampled in a 1:1 ratio at each iteration, which is essentially an oversampling strategy for the ground truth data and partially solves the sample imbalance between pseudo and ground truth labels. The strong head of the student model trains only on clean data (ground truth data and high-quality pseudo data), and its weak head trains on both clean and dirty data (low-quality pseudo data). Compared with a single-branch structure, the ROI strong-weak dual-head network of the student model computes separate losses for clean and dirty data, preventing direct interference of low-quality pseudo-label noise with the weights of the strong head network. Compared with two completely independent model structures, the shared part of the student network can extract common valid information from clean and dirty data, eliminating the supervisory inconsistency of two independent models. Two completely independent models would also incur considerably higher computational cost than our method.

3.5. CBAM

To further enhance the feature extraction capability of the shared network part, we introduce CBAM [7] before the ROI head, as shown in Figure 2. We take an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ as the input, with height $H$, width $W$, and channel dimension $C$. The whole process can be expressed by the following equations:
$F' = M_C(F) \otimes F$ (5)
$F'' = M_S(F') \otimes F'$ (6)
The CBAM [7] sequentially yields a channel attention map $M_C \in \mathbb{R}^{C \times 1 \times 1}$ and a spatial attention map $M_S \in \mathbb{R}^{1 \times H \times W}$. $\otimes$ denotes element-wise multiplication; the attention maps are broadcast accordingly during multiplication, with channel attention values broadcast along the spatial dimensions and spatial attention values broadcast along the channel dimension. $F''$ is the final feature output.
As shown in Figure 3, the input feature map $F \in \mathbb{R}^{C \times H \times W}$ is pooled by a maximum pooling layer and an average pooling layer, respectively. The two pooling results are then sent through shared fully connected layers, added together, and activated by a sigmoid to obtain the channel attention map $M_C \in \mathbb{R}^{C \times 1 \times 1}$.
As shown in Figure 4, in the spatial attention module, $F'$ first goes through maximum pooling and average pooling along the channel dimension. The two pooling results are then concatenated and convolved. After sigmoid activation, the output is the spatial attention map $M_S \in \mathbb{R}^{1 \times H \times W}$.
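For reference, the module described by Equations (5)-(6) and Figures 3-4 can be rendered compactly in PyTorch as below. The layer sizes (reduction ratio 16, 7×7 convolution) follow the CBAM paper's defaults and are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Channel attention: shared MLP over max- and average-pooled features.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        # Spatial attention: 7x7 conv over channel-wise max/avg maps.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))       # average-pooled branch, (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))        # max-pooled branch, (B, C)
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * m_c                              # F' = M_C(F) ⊗ F, Equation (5)
        avg_s = x.mean(dim=1, keepdim=True)      # (B, 1, H, W)
        max_s = x.amax(dim=1, keepdim=True)
        m_s = torch.sigmoid(self.conv(torch.cat([avg_s, max_s], dim=1)))
        return x * m_s                           # F'' = M_S(F') ⊗ F', Equation (6)
```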

3.6. Data Augmentation

The images fed to the teacher model are weakly augmented, with only random horizontal flipping. The images fed to the student model are strongly augmented, with both single-image and multiple-image augmentation. Single-image augmentation operates within a single image and includes random horizontal flips, photometric distortion, random Gaussian blur, and random erasure. Multiple-image augmentation combines two or more images to enhance a single image and includes Mixup and Mosaic.
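A sketch of the weak/strong single-image pipelines using torchvision transforms is given below. The exact magnitudes are our assumptions (the paper does not list them); in a real detection pipeline the box coordinates must be transformed consistently with the flips, and Mixup and Mosaic act at the batch level, so they are omitted here.

```python
import torchvision.transforms as T

weak_aug = T.Compose([            # teacher input: random horizontal flip only
    T.RandomHorizontalFlip(p=0.5),
])

strong_aug = T.Compose([          # student input
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),                            # photometric distortion
    T.RandomApply([T.GaussianBlur(5, sigma=(0.1, 2.0))], p=0.5),  # random Gaussian blur
    T.ToTensor(),
    T.RandomErasing(p=0.5),                                       # random erasure
])
```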

4. Experiments and Results

4.1. Dataset

We conducted experiments on two datasets, PASCAL VOC and MS-COCO, with four experimental setups in total: (1) 1%/2%/5%/10% of MS-COCO train2017 randomly selected as labeled data and the rest as unlabeled data; (2) MS-COCO train2017 (~118 k images) as labeled data and MS-COCO additional (~123 k images) as unlabeled data, with performance in both setups evaluated on MS-COCO 2017 val; (3) VOC07 train (5011 images) as labeled data and VOC12 trainval (11,540 images) as unlabeled data; (4) VOC07 train (5011 images) as labeled data and VOC12 trainval (11,540 images) plus the 20 VOC classes of MS-COCO train2017 as unlabeled data, with performance in both setups evaluated on the VOC07 test set.

4.2. Evaluation Metrics

IOU thresholds are set from 0.5 to 0.95 at intervals of 0.05; they determine whether a prediction counts as a correct detection. The mean average precision (mAP) metric is used to evaluate the experimental results. The average precision (AP) is the area under the precision-recall curve, with recall on the horizontal axis and precision on the vertical axis. The mAP is the AP averaged over all categories, as shown in Equation (7):
$mAP = \frac{1}{Q}\sum_{q=1}^{Q} AP(q)$ (7)
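A literal reading of Equation (7) in code: mAP averages the per-class APs, each AP itself averaged over the IOU thresholds 0.5:0.05:0.95 (COCO-style AP50:95).

```python
def mean_average_precision(ap_per_class):
    """ap_per_class: mapping from category name to its AP value."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```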

4.3. Experimental Details

In this paper, following STAC [1], we use a Faster RCNN [5] with a feature pyramid network and a ResNet-50 [45] backbone initialized by ImageNet pre-training. We use the confidence threshold τ = 0.7 and the IOU threshold ε = 0.6 as the basis for pseudo-label filtering, use AP50:95 (denoted as mAP) as the evaluation metric, and run inference on the strong head of the teacher model. The batch size is 32, with 16 labeled and 16 unlabeled images; we set the learning rate to 0.02 and use SGD as the optimizer. The EMA coefficient is set to α = 0.0001, and 8 NVIDIA Tesla V100 GPUs are used for all experiments.
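For reference, the stated hyperparameters can be collected in one place; this is a plain summary of the values above, not configuration code from the authors.

```python
CONFIG = dict(
    backbone="ResNet-50 + FPN, ImageNet-pretrained",
    confidence_threshold=0.7,   # tau, classification filtering
    iou_threshold=0.6,          # epsilon, localization filtering
    batch_size=32,              # 16 labeled + 16 unlabeled images
    learning_rate=0.02,
    optimizer="SGD",
    ema_alpha=1e-4,             # teacher EMA momentum
    metric="AP50:95 (mAP)",
    gpus="8x NVIDIA Tesla V100",
)
```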

4.4. Experimental Results on COCO

Table 1 shows the mAP results at IOU = 0.5:0.95 evaluated on MS-COCO 2017 val at different data ratios. Each ratio indicates the fraction of labeled data, with the rest used as unlabeled data; COCO-full indicates the full MS-COCO train2017 as labeled data and MS-COCO additional as unlabeled data. The deviations in the table come from multiple runs with different random seeds. As the table shows, with 1%/2%/5%/10% labeled data, semi-supervised training improves about 10 mAP over the supervised baseline in every setting. With 2%/10% labeled data and COCO-full, our method achieves the best results. In summary, the experimental results on MS-COCO validate that our method is effective and is especially superior when labeled data accounts for a smaller share.

4.5. Experimental Results on VOC

As shown in Table 2, the superiority of our method is more significant in the two experimental settings. When the training set is composed of VOC07 train as labeled data and VOC12 trainval as unlabeled data, our method improves the AP from 49.7 to 52.5 compared to Unbiased Teacher. When the training set is composed of VOC07 train as labeled data and VOC12 trainval plus the 20 VOC classes of MS-COCO train2017 as unlabeled data, we improve the AP from 50.3 to 53.5.

4.6. Ablation Experiments

For each of the following studies of ablation experiments, we used 10% COCO as the data configuration.

4.6.1. Multi-Head Ablation Experiment

Our method produces four sets of predictions:
  • A strong head of the teacher model.
  • A weak head of the teacher model.
  • A strong head of the student model.
  • A weak head of the student model.
The prediction results of these heads are shown in Table 3. From this table, it is clear that both heads of the teacher model outperform the corresponding heads of the student model, and that the strong head outperforms the weak head for both the teacher and student models. This is why we use the strong head of the teacher model for inference. Table 3 also shows that CBAM brings a 0.2-0.3 point improvement for all heads.

4.6.2. Data Augmentation Ablation Experiment

We investigate a variety of data augmentations. As shown in Table 4, our method benefits from stronger augmentation. "Others" refers to random flips and Gaussian blur. Of all the data augmentation methods, Mixup is the most effective, improving performance from 31.3 mAP to 32.3 mAP, and Mosaic further improves it by 0.3 mAP.

4.7. Visualization

Figure 5 shows the visualization results. The red boxes represent predictions, and the blue boxes represent the ground truth. From top to bottom, Figure 5 shows the prediction results of the baseline (row 2, column 5 in Table 1), CSD (row 3, column 5 in Table 1), STAC (row 4, column 5 in Table 1), and our method (row 8, column 5 in Table 1). From the four sets of images, it is evident that our method's localization quality is higher than that of the other three methods, for example on the truck, the bus, and the fire hydrant. Compared with STAC [1] and CSD [38], we can improve pseudo label localization quality and train on these pseudo label boxes appropriately. As the third and fourth columns of the figure show, our method can detect densely distributed objects, such as the cars in the fourth column. However, our performance still needs improvement on tiny objects with few pixels, such as the tiny person in the fourth column.

5. Conclusions

The most important contribution of this paper is to address two core problems not tackled by other self-training methods [1,2,3,4] in SSOD. Other approaches filter pseudo labels only by classification confidence, so the localization quality of their pseudo labels is poor, and their strategy of mixing pseudo and ground truth labels for training further degrades training because of the noise in the pseudo labels. We propose a semi-supervised object detection algorithm using a teacher-student model with strong-weak heads to solve the pseudo label quality measurement and pseudo data training problems. The quality measurement problem of pseudo label localization is solved by the strong and weak heads of the teacher model. The shared part of the student model learns the common features of the pseudo and ground truth data, and the strong and weak heads of the student model are decoupled to reduce the negative impact of pseudo label noise on classification and regression. We have conducted extensive experiments on the PASCAL VOC and MS-COCO datasets and achieved superior performance: we reach 52.5 mAP (+1.8) on the VOC dataset and 53.5 mAP (+3.2) with MS-COCO train2017 as additional unlabeled data, and our method also improves by about 1.0 mAP on MS-COCO under the 10% COCO and COCO-full labeled-data configurations. The visualization results illustrate the superiority of our method's localization quality compared with others. In future work, we will focus on improving detection performance on tiny objects.

Author Contributions

Conceptualization, X.C. and W.Q.; funding acquisition, W.Q.; investigation, H.L.; methodology, X.C.; software, X.C.; writing—original draft, X.C. and W.Q.; writing—review and editing, X.C. and F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Scientific Research Foundation of Zhejiang University City College (No.X-202204, No.X-202106).

Data Availability Statement

In this paper, two datasets, PASCAL VOC and MS-COCO, are used for semi-supervised object detection. You can find them at the following links. MS-COCO: https://cocodataset.org/. PASCAL VOC: http://host.robots.ox.ac.uk/pascal/VOC/.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Sohn, K.; Zhang, Z.; Li, C.-L.; Zhang, H.; Lee, C.-Y.; Pfister, T. A simple semi-supervised learning framework for object detection. arXiv 2020, arXiv:2005.04757.
  2. Liu, Y.-C.; Ma, C.-Y.; He, Z.; Kuo, C.-W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased teacher for semi-supervised object detection. arXiv 2021, arXiv:2102.09480.
  3. Zhou, Q.; Yu, C.; Wang, Z.; Qian, Q.; Li, H. Instant-teaching: An end-to-end semi-supervised object detection framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4081–4090.
  4. Yang, Q.; Wei, X.; Wang, B.; Hua, X.-S.; Zhang, L. Interactive self-training with mean teachers for semi-supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5941–5950.
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; Volume 28.
  6. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30.
  7. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  8. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  9. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; Volume 29.
  10. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
  12. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
  13. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  14. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
  15. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.
  16. Hashimaa, S.M.; Mahmoud, I.I.; Elazm, A.A. Experimental comparison among Fast Block Matching Algorithms (FBMAs) for motion estimation and object tracking. In Proceedings of the 2011 28th National Radio Science Conference (NRSC), Cairo, Egypt, 26–28 April 2011; pp. 1–8.
  17. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  18. Lv, H.; Yan, H.; Liu, K.; Zhou, Z.; Jing, J. Yolov5-ac: Attention mechanism-based lightweight yolov5 for track pedestrian detection. Sensors 2022, 22, 5903.
  19. Yin, G.; Yu, M.; Wang, M.; Hu, Y.; Zhang, Y. Research on highway vehicle detection based on faster R-CNN and domain adaptation. Appl. Intell. 2022, 52, 3483–3498.
  20. Sumit, S.S.; Awang Rambli, D.R.; Mirjalili, S.; Ejaz, M.M.; Miah, M.S.U. Restinet: On improving the performance of tiny-yolo-based cnn architecture for applications in human detection. Appl. Sci. 2022, 12, 9331.
  21. Vecvanags, A.; Aktas, K.; Pavlovs, I.; Avots, E.; Filipovs, J.; Brauns, A.; Done, G.; Jakovels, D.; Anbarjafari, G. Ungulate Detection and Species Classification from Camera Trap Images Using RetinaNet and Faster R-CNN. Entropy 2022, 24, 353.
  22. Liu, W.; Ren, G.; Yu, R.; Guo, S.; Zhu, J.; Zhang, L. Image-adaptive YOLO for object detection in adverse weather conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 1792–1800.
  23. Wu, H.; Hu, Y.; Wang, W.; Mei, X.; Xian, J. Ship fire detection based on an improved YOLO algorithm with a lightweight convolutional neural network model. Sensors 2022, 22, 7420.
  24. Zhang, Y.; Xiao, D.; Liu, Y.; Wu, H. An algorithm for automatic identification of multiple developmental stages of rice spikes based on improved Faster R-CNN. Crop J. 2022, 10, 1323–1333.
  25. Bachman, P.; Alsharif, O.; Precup, D. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27.
  26. Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the Workshop on Challenges in Representation Learning, ICML, Atlanta, GA, USA, 16–21 June 2013; p. 896.
  27. Berthelot, D.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Sohn, K.; Zhang, H.; Raffel, C. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv 2019, arXiv:1911.09785.
  28. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2019; Volume 32.
  29. Miyato, T.; Maeda, S.-i.; Koyama, M.; Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1979–1993.
  30. Kuo, C.-W.; Ma, C.-Y.; Huang, J.-B.; Kira, Z. Featmatch: Feature-based augmentation for semi-supervised learning. In European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 479–495.
  31. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 113–123.
  32. Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.-Y.; Shlens, J.; Le, Q.V. Learning data augmentation strategies for object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 566–583.
  33. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
  34. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 3060–3069.
  35. Kim, J.-H.; Shim, H.-J.; Jung, J.-W.; Yu, H.-J. A Supervised Learning Method for Improving the Generalization of Speaker Verification Systems by Learning Metrics from a Mean Teacher. Appl. Sci. 2021, 12, 76.
  36. Xiong, F.; Tian, J.; Hao, Z.; He, Y.; Ren, X. SCMT: Self-Correction Mean Teacher for Semi-supervised Object Detection. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna, Austria, 23–29 July 2022; pp. 1488–1494.
  37. Gao, J.; Wang, J.; Dai, S.; Li, L.-J.; Nevatia, R. Note-rcnn: Noise tolerant ensemble rcnn for semi-supervised object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9508–9517.
  38. Jeong, J.; Lee, S.; Kim, J.; Kwak, N. Consistency-based semi-supervised learning for object detection. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2019; Volume 32.
  39. Jeong, J.; Verma, V.; Hyun, M.; Kannala, J.; Kwak, N. Interpolation-based semi-supervised learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11602–11611.
  40. Li, Y.; Huang, D.; Qin, D.; Wang, L.; Gong, B. Improving object detection with selective self-supervised self-training. In European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 589–607.
  41. Misra, I.; Shrivastava, A.; Hebert, M. Watch and learn: Semi-supervised learning for object detectors from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3593–3602.
  42. Tang, P.; Ramaiah, C.; Wang, Y.; Xu, R.; Xiong, C. Proposal learning for semi-supervised object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2291–2301.
  43. Tang, Y.; Wang, J.; Gao, B.; Dellandréa, E.; Gaizauskas, R.; Chen, L. Large scale semi-supervised object detection using visual and semantic knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2119–2128.
  44. Zheng, S.; Chen, C.; Cai, X.; Ye, T.; Tan, W. Dual Decoupling Training for Semi-Supervised Object Detection with Noise-Bypass Head. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22), Virtual, 22 February–1 March 2022.
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Figure 1. The overview of the proposed framework.
Figure 2. The structure of the CBAM.
Figure 3. The structure of the Channel Attention Module.
Figure 4. The structure of the Spatial Attention Module.
Figure 5. Visualization of prediction quality.
Table 1. Experimental results on the COCO dataset.
Method | 1% COCO | 2% COCO | 5% COCO | 10% COCO | COCO-Full
Baseline | 9.05 ± 0.16 | 12.70 ± 0.15 | 18.47 ± 0.22 | 23.86 ± 0.81 | 37.63
CSD [38] | 11.12 ± 0.15 | 14.15 ± 0.13 | 18.79 ± 0.13 | 22.76 ± 0.09 | 38.52
STAC [1] | 13.97 ± 0.35 | 18.25 ± 0.25 | 24.38 ± 0.12 | 28.64 ± 0.21 | 39.21
ISMT [4] | 18.88 ± 0.74 | 22.43 ± 0.56 | 26.37 ± 0.24 | 30.53 ± 0.52 | 39.64
Instant-teaching [3] | 18.05 ± 0.20 | 22.45 ± 0.15 | 26.75 ± 0.05 | 30.40 ± 0.05 | 40.20
Unbiased Teacher [2] | 20.75 ± 0.12 | 24.30 ± 0.07 | 28.27 ± 0.15 | 31.50 ± 0.10 | 41.30
Ours | 18.70 ± 0.32 | 24.55 ± 0.22 | 28.10 ± 0.20 | 32.50 ± 0.35 | 42.01
Table 2. Experimental results on the VOC dataset.
Method | Labeled | Unlabeled | AP | AP50
Baseline | VOC07 | - | 42.6 | 72.6
CSD [38] | VOC07 | VOC12 | 42.7 | 76.7
STAC [1] | VOC07 | VOC12 | 44.6 | 77.5
ISMT [4] | VOC07 | VOC12 | 46.2 | 77.2
Instant-teaching [3] | VOC07 | VOC12 | 48.7 | 78.3
Unbiased Teacher [2] | VOC07 | VOC12 | 49.7 | 77.4
Ours | VOC07 | VOC12 | 52.5 | 81.0
CSD [38] | VOC07 | VOC12+MSCOCO20cls | 43.6 | 77.1
STAC [1] | VOC07 | VOC12+MSCOCO20cls | 46.0 | 79.1
ISMT [4] | VOC07 | VOC12+MSCOCO20cls | 49.6 | 77.8
Instant-teaching [3] | VOC07 | VOC12+MSCOCO20cls | 50.8 | 78.8
Unbiased Teacher [2] | VOC07 | VOC12+MSCOCO20cls | 50.3 | 78.8
Ours | VOC07 | VOC12+MSCOCO20cls | 53.5 | 82.1
Table 3. Ablation experimental results of strong and weak heads for teacher-student models.
Head | mAP | mAP with CBAM
Teacher-Weak | 31.9 | 32.1
Teacher-Strong | 32.2 | 32.5
Student-Weak | 29.9 | 30.1
Student-Strong | 30.3 | 30.5
Table 4. Data augmentation ablation experiment results (each row adds the listed augmentation on top of the previous ones).
Method | AP
Others (random flips, Gaussian blur) | 31.1
+ CutOut | 31.3
+ Mixup | 32.3
+ Mosaic | 32.6
