Article

End-to-End Object Detection with Enhanced Positive Sample Filter

Xiaolin Song, Binghui Chen, Pengyu Li, Biao Wang and Honggang Zhang

1 School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Independent Researcher, Beijing 100000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(3), 1232; https://doi.org/10.3390/app13031232
Submission received: 11 December 2022 / Revised: 12 January 2023 / Accepted: 12 January 2023 / Published: 17 January 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract: Discarding Non-Maximum Suppression (NMS) post-processing and realizing fully end-to-end object detection is a recent research focus. Previous works have proved that the one-to-one label assignment strategy provides the chance to eliminate NMS during inference. However, this strategy might also result in multiple predictions with high scores due to the inconsistency of label assignment during training. Thus, how to adaptively identify only one positive sample as the final prediction for each Ground-Truth instance remains important. In this paper, we propose an Enhanced Positive Sample Filter (EPSF) to filter out the single positive sample for each Ground-Truth instance and lower the confidence of other negative samples. This is mainly achieved with two components: a Dual-stream Feature Enhancement module (DsFE) and a Disentangled Max Pooling Filter (DeMF). DsFE makes full use of representations trained with different targets so as to provide rich information clues for positive sample selection, while DeMF enhances feature discriminability in potential foreground regions with disentangled pooling. With the proposed methods, our end-to-end detector achieves better performance than existing NMS-free object detectors on the COCO, PASCAL VOC, CrowdHuman and Caltech datasets.

1. Introduction

Object detection is a fundamental computer vision task. Existing mainstream object detectors have achieved excellent performance, but most of them [1,2,3,4,5] rely on the non-maximum suppression (NMS) operation, which hinders the deployment of a fully end-to-end detection pipeline.
NMS is a post-processing algorithm that eliminates duplicate detection results. Since the mainstream training strategy generates multiple positive samples for each Ground-Truth (GT) object, resulting in dense predictions, NMS serves as a necessary component for obtaining sparse predictions during inference. However, NMS is essentially a manually designed algorithm that relies on hyper-parameters, which limits the model's generalization capacity. Especially in crowded scenes, close-by bounding boxes with high confidence scores are easily treated as duplicated false positives, which hurts detection accuracy. To handle this problem, Soft NMS [6], Learnable NMS [7] and other variants [8,9,10] have been proposed. However, in real-world industrial deployment, it is inconvenient to use any post-processing strategy, and a fully end-to-end detector is more attractive.
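For reference, the following is a minimal sketch of the greedy NMS procedure that this line of work seeks to remove (PyTorch is assumed here and in the later sketches); the IoU threshold `iou_thr` is exactly the kind of hand-tuned hyper-parameter discussed above:

```python
import torch
from torchvision.ops import box_iou

def greedy_nms(boxes, scores, iou_thr=0.5):
    """Classic greedy NMS: repeatedly keep the highest-scoring box and
    discard neighbours whose IoU with it exceeds iou_thr.
    boxes: (N, 4) in (x1, y1, x2, y2) format; scores: (N,)."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]])[0]
        order = order[1:][ious <= iou_thr]  # survivors compete in the next round
    return torch.tensor(keep, dtype=torch.long)
```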
Recently, many works [11,12,13,14] have made efforts to discard NMS post-processing and establish a fully end-to-end object detection pipeline. One type of method formulates object detection as a set prediction problem that directly generates sparse predictions, of which DETR [13] and Sparse RCNN [14] are the two most representative works. DETR introduces the transformer architecture to object detection, while Sparse RCNN establishes a thoroughly sparse detection pipeline based on the Faster RCNN [1] architecture. However, considering computation cost, training efficiency and deployment, this kind of set prediction method might not be the optimal choice for actual industrial applications.
The other type of method [11,12] finds that one-to-one label assignment is the key factor for solving the NMS-free problem. These works design similar one-to-one label assignment strategies that adaptively choose only one positive sample for each GT object, while the remaining samples are considered negative during training. Besides the one-to-one label assignment strategy, DeFCN [12] also designs a 3D max filtering (3DMF) module for positive sample selection and an auxiliary loss to facilitate the training of the classification task; it thus shows better performance than OneNet [11]. However, the input of the 3DMF module is the feature maps learned by the regression branch, which are highly related and sensitive to the scales of the input instances. When performing 3D max filtering, the responses of larger objects will suppress those of relatively smaller ones, and sometimes noise in the responses will corrupt the actual correct information, resulting in wrong predictions. Additionally, we notice that the core idea of 3DMF is to utilize features in adjacent levels to improve the discriminability of convolution in local regions. Specifically, for one feature point, the max pooling operator finds the maximum value among its adjacent feature points within a certain range to update its value. However, the maximum value may not be reliable since there exist false positives and outliers; in this way, some false predictions will have more severe negative effects in a local region. Both DeFCN and OneNet exploit a one-stage detection framework, which provides the chance to eliminate NMS in actual industrial deployment. We try to tackle the above-mentioned problems and make further steps towards better performance.
In this paper, we attempt to establish a high-performance end-to-end object detector that can be easily deployed in industrial applications. The proposed detector is built on top of FCOS [3], a state-of-the-art one-stage anchor-free detector. Meanwhile, we utilize the one-to-one label assignment strategy to enable the training of one-to-one classification. To provide richer information, we add another binary classification branch, regardless of category, that aims only at distinguishing foreground samples from backgrounds. This classification objective constrains the model to focus on the intrinsic high-level representations of objects instead of scale-aware representations. Thus, we can concatenate features from the regression branch and the newly added binary classification branch to produce a more robust feature representation, which provides compensatory clues for positive sample selection; we call this Dual-stream Feature Enhancement (DsFE). To enhance the feature distinction in potential foreground regions, we propose a Disentangled Max Pooling Filter (DeMF), where a residual stream, a first max pooling stream and a second max pooling stream are constructed to jointly generate a more reliable local feature representation. Specifically, the second max pooling stream introduces local secondary-maximum values to revise the activations in local regions. The proposed DsFE and DeMF are combined to provide an adaptive positive sample filter on the predicted C-d (C = 80 for COCO, C = 20 for PASCAL VOC, C = 1 for CrowdHuman and Caltech) classification logits to obtain the final NMS-free classification outputs. We call the combined module an Enhanced Positive Sample Filter (EPSF). Equipped with it, our end-to-end object detector outperforms existing one-stage end-to-end detectors without introducing much complexity. Our main contributions can be summarized as follows:
  • We propose an end-to-end object detector that eliminates NMS post-processing and realizes fully end-to-end detection. We specifically design an Enhanced Positive Sample Filter (EPSF) for the adaptive positive sample selection in local areas, which is realized by two components, i.e., DsFE and DeMF;
  • We design a Dual-stream Feature Enhancement module (DsFE) that extracts rich information from features learned with different targets for one-to-one classification;
  • We design a Disentangled Max Pooling Filter (DeMF) to enhance the feature discriminability in potential foreground areas via disentangled max pooling;
  • Our proposed end-to-end object detector achieves a competitive performance against many state-of-the-art detectors. Extensive experiments on COCO [15], PASCAL VOC [16], CrowdHuman [17] and Caltech [18] datasets validate the effectiveness of the proposed method.

2. Related Work

2.1. Object Detection

Object detection is one of the most challenging tasks in computer vision and has witnessed great progress in recent years [19,20,21,22,23]. Existing CNN-based object detectors can be roughly divided into anchor-based and anchor-free detectors. Anchor-based detectors [1,2,4,24,25,26] utilize a set of anchor boxes with different sizes and aspect ratios to predict object locations and categories, where the anchor boxes can be viewed as training samples. These anchor boxes are classified as positives or negatives by the classification task and regressed with offsets to obtain refined bounding box locations. Anchor-based detectors can be further classified into two-stage and one-stage frameworks according to their architecture designs. Two-stage detectors [1] generate coarse region proposals at the first stage and then pass cropped proposal features to a downstream network at the second stage to obtain refined object locations and categories. One-stage detectors [2,4,24], on the other hand, directly predict object categories and locations based on multi-scale features. They are usually more computationally efficient than two-stage ones while achieving competitive performance on mainstream benchmarks. However, the performance of these anchor-based detectors is sensitive to hyper-parameters, e.g., the sizes, aspect ratios and numbers of anchor boxes. This limits the generalization ability of the models, since detection tasks in different scenes may require different settings. Additionally, training efficiency suffers from the large computation involved in calculating the intersection-over-union (IoU) between anchor boxes and GT bounding boxes during the label assignment step. Moreover, anchor-based detectors are difficult to deploy in actual industrial applications. To this end, some works [3,27,28,29] attempt to discard pre-defined anchors in fully convolutional one-stage pipelines. Most of them perform pixel-wise prediction of object locations and categories, where center points or points inside GT regions are viewed as positive training samples, with extra predicted offsets to obtain the target bounding boxes. The most representative anchor-free detector might be FCOS [3], which is widely employed in industry.
Nevertheless, all the above detectors produce redundant detection results, which makes NMS a necessary post-processing step for duplicate removal. NMS hinders the pipeline from end-to-end training, which limits the generalization capacity. For industrial deployment, the new trend is to employ an end-to-end detection pipeline.

2.2. End-to-End Object Detector

Many approaches have been proposed to establish fully end-to-end detection pipelines. One stream of approaches formulates object detection as a set prediction problem; these are also called query-based approaches. DETR [13] first introduces the transformer architecture to object detection and makes the decoder directly output sparse detection results without any post-processing. The decoder takes as input a set of learnable object queries, which model the relations between each object and the global image context. Due to its dense information interaction manner, DETR [13] suffers from high computation complexity, slow convergence and relatively poor performance on small objects. Deformable DETR [30] was then proposed to limit the attention field of each query to a small set of sampling locations around a reference point. Additionally, some other variants [31,32,33,34] also make efforts to improve DETR.
However, these transformer-based methods require large amounts of training data when deployed in practical applications, which involves large development and computation costs. Meanwhile, Sparse-RCNN [14] proposes a sparse set of proposal boxes to replace the dense candidates from the RPN and makes the extracted RoI features interact with associated proposal features in dynamic heads for a final prediction; thus, the whole pipeline works in a purely sparse manner. Based on Sparse-RCNN, ref. [35] proposes a progressive prediction method to improve performance in crowded scenes. Though it achieves high performance, industrial applications still expect a simpler and more efficient detection pipeline. To this end, OneNet [11] and DeFCN [12] attempt to realize end-to-end detection in a one-stage framework without resorting to a self-attention mechanism or multi-stage refinement. They utilize one-to-one label assignment strategies on one-stage detectors, where each GT instance is assigned only one positive training sample according to the results of classification and regression simultaneously. Since one-to-one assignment may inevitably introduce ambiguity and challenge feature representation learning, DeFCN [12] proposes an auxiliary loss and 3D max filtering to help training, achieving better performance than OneNet [11].
These methods make it possible to perform end-to-end detection in one-stage frameworks, but there is still room for improvement. To meet the high efficiency requirements of industrial deployment, we built an end-to-end detector in a one-stage anchor-free manner. We utilized the one-to-one label assignment strategy as in DeFCN [12] to preliminarily discard NMS. To further improve the detection performance, we proposed an Enhanced Positive Sample Filter (EPSF) to conduct an adaptive selection among candidate predictions of each GT instance.

3. Method

In this section, we first introduce the overall detection pipeline. Next, we present the details of our proposed Enhanced Positive Sample Filter (EPSF), which consists of a Dual-stream Feature Enhancement module (DsFE) and a Disentangled Max Pooling Filter (DeMF).

3.1. Overall Pipeline

The overall detection pipeline is presented in Figure 1. Following FCOS [3], we first constructed a Feature Pyramid Network (FPN) [36] on a backbone network (ResNet50 [37]) to generate multi-scale feature maps. We denote the feature maps from the five pyramid levels as $P_3, P_4, P_5, P_6, P_7$, which have strides 8, 16, 32, 64 and 128, respectively, and are passed to a shared head for final detection. All feature maps from different pyramid levels have 256 channels. The shared head has three branches: a regression branch, an objectiveness branch and a classification branch. The regression and classification branches remain the same as in FCOS [3]. To provide credible and robust representations of objects, we constructed another binary classification branch called the objectiveness branch, which aims to identify whether the prediction at each location belongs to foreground or background. All three branches were constructed with four convolution layers plus another convolution layer for final outputs. The regression branch produces a 4-d vector $t_{x,y} = (l, t, r, b)$ for each location $(x, y)$ involved with a foreground sample, which encodes the distances from the current location to the four sides of the corresponding GT box. With these predictions, we can obtain the coordinates of the predicted bounding boxes. The objectiveness branch generates a 1-d vector for each location, representing the probability that the involved prediction is a foreground object. Note that the output of the objectiveness branch does not contribute to inference directly; it is only used for providing rich information. The classification branch generates C-d (C = 80 for COCO, C = 20 for PASCAL VOC, C = 1 for CrowdHuman and Caltech) probability logits for each location to denote the probabilities that the prediction belongs to each labeled category. The proposed Enhanced Positive Sample Filter generates a 1-d vector $m_{x,y}$ for each location, whose output serves as a selection mask to filter out the positive sample. The final detection results 'Dets' can be obtained as follows:
$$\mathrm{Dets} = \mathcal{H}(\Phi(I)) = \{\mathrm{Reg}, \mathrm{Obj}, \mathrm{Cls}, \mathrm{EPSF}\}(\Phi(I)) = \{\mathcal{B}, \mathcal{P}\}, \tag{1}$$
where $I$ represents the input image and $\Phi(\cdot)$ denotes the backbone network with FPN. The generated $\{\mathcal{B}, \mathcal{P}\}$ represents the set of predicted bounding boxes $\mathcal{B}$ and corresponding classification probabilities $\mathcal{P}$. $\mathcal{H}(\cdot)$ represents the shared detection head, which consists of four components, i.e., $\mathrm{Reg}(\cdot)$, $\mathrm{Obj}(\cdot)$, $\mathrm{Cls}(\cdot)$ and $\mathrm{EPSF}(\cdot)$, denoting the regression branch, objectiveness branch, classification branch and the proposed Enhanced Positive Sample Filter, respectively. They can be formulated as follows:
$$\begin{aligned}
\mathrm{Reg}(\Phi(I)) &= \{\mathcal{B}, f_{reg}\}, & \mathrm{Obj}(\Phi(I)) &= \{\mathcal{P}_{obj}, f_{obj}\}, & \mathrm{Cls}(\Phi(I)) &= \{\mathcal{P}_{cls}\}, \\
\mathrm{EPSF}(f_{reg}, f_{obj}) &= M_{epsf}, & \mathcal{P} &= \mathcal{P}_{cls} \cdot M_{epsf},
\end{aligned} \tag{2}$$
where $f_{reg}$ and $f_{obj}$ denote the feature maps from the fourth convolutional layers in the regression and objectiveness branches, respectively, and $M_{epsf}$ represents the selection mask generated by EPSF.
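To make the head structure concrete, the following is a hedged PyTorch sketch of Eq. (2); the layer names (`reg_subnet` etc.) and the 256-channel width follow FCOS-style heads rather than any released code, and `EPSF` refers to the module sketched in Section 3.3:

```python
import torch
import torch.nn as nn

def subnet(c=256, n=4):
    """Four 3x3 conv + GN + ReLU layers, as in FCOS-style heads."""
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(c, c, 3, padding=1), nn.GroupNorm(32, c), nn.ReLU()]
    return nn.Sequential(*layers)

class SharedHead(nn.Module):
    """Hypothetical layout of the shared head in Eq. (2)."""
    def __init__(self, num_classes=80, c=256):   # C = 80 for COCO
        super().__init__()
        self.reg_subnet, self.obj_subnet, self.cls_subnet = subnet(c), subnet(c), subnet(c)
        self.reg_out = nn.Conv2d(c, 4, 3, padding=1)            # (l, t, r, b) offsets
        self.obj_out = nn.Conv2d(c, 1, 3, padding=1)            # foreground logit
        self.cls_out = nn.Conv2d(c, num_classes, 3, padding=1)  # class logits
        self.epsf = EPSF(c)                                     # DsFE + DeMF, Section 3.3

    def forward(self, pyramid):                  # pyramid: list of FPN maps P3..P7
        f_reg = [self.reg_subnet(p) for p in pyramid]
        f_obj = [self.obj_subnet(p) for p in pyramid]
        boxes = [self.reg_out(f) for f in f_reg]
        p_obj = [self.obj_out(f).sigmoid() for f in f_obj]
        p_cls = [self.cls_out(self.cls_subnet(p)).sigmoid() for p in pyramid]
        masks = self.epsf(f_reg, f_obj)          # per-level selection masks M_epsf
        final = [c * m for c, m in zip(p_cls, masks)]  # P = P_cls * M_epsf
        return boxes, p_obj, p_cls, final
```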
The overall training objectives can be formulated as follows:
$$\mathcal{L} = \mathcal{L}_{cls}(\mathcal{P}_{cls}) + \mathcal{L}_{obj}(\mathcal{P}_{obj}) + \mathcal{L}_{reg}(\mathcal{B}) + \mathcal{L}_{oto}(\mathcal{P}), \tag{3}$$
where $\mathcal{L}_{cls}$, $\mathcal{L}_{obj}$ and $\mathcal{L}_{reg}$ are applied to the classification, objectiveness and regression branches, respectively, while $\mathcal{L}_{oto}$ is applied to the product of the outputs from the classification branch and the proposed Enhanced Positive Sample Filter. $\mathcal{L}_{cls}$, $\mathcal{L}_{obj}$ and $\mathcal{L}_{oto}$ are the focal loss of [2], and $\mathcal{L}_{reg}$ is the IoU loss of [38].
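For reference, here are minimal sketches of the two loss ingredients named above, written as our own implementations of the cited losses; target construction and normalization by the number of positives are omitted:

```python
import torch
from torchvision.ops import box_iou

def focal_loss(p, t, alpha=0.25, gamma=2.0):
    """Focal loss [2] on probabilities p with binary targets t (same shape)."""
    p = p.clamp(1e-6, 1 - 1e-6)
    ce = -(t * p.log() + (1 - t) * (1 - p).log())
    w = alpha * t * (1 - p) ** gamma + (1 - alpha) * (1 - t) * p ** gamma
    return (w * ce).sum()

def iou_loss(pred_boxes, gt_boxes):
    """IoU loss [38] on matched (positive) box pairs: -ln(IoU)."""
    iou = box_iou(pred_boxes, gt_boxes).diagonal().clamp(min=1e-6)
    return -iou.log().sum()
```

Eq. (3) is then the plain sum of `focal_loss` terms on $\mathcal{P}_{cls}$, $\mathcal{P}_{obj}$ and $\mathcal{P}$ (with one-to-many, one-to-many and one-to-one targets, respectively) plus `iou_loss` on the positive boxes.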

3.2. Label Assignment

Conventional NMS-based detection pipelines use the one-to-many label assignment strategy, which assigns multiple positive samples to each GT object instance during training and then relies on the NMS post-processing module to remove duplicated predictions during inference. In this work, we aimed to discard NMS post-processing and realize fully end-to-end detection. To achieve this goal, the model should learn to perform one-to-one classification, i.e., a single positive sample should be distinguished from several nearby candidates for each GT instance. The one-to-one label assignment strategy [11,12] has proven to be an effective solution to this challenge, assigning only one positive sample to each GT instance during training. However, it provides fewer positive training samples and may hurt representation learning.
To address this dilemma, we utilized both one-to-one and one-to-many label assignment rules during training. Concretely, for the classification and objectiveness branches, we offered sufficient supervision to enable the training of strong classifiers with robust feature representation capacities by using a one-to-many label assignment strategy to compute the training targets of each location on the feature maps, where each GT object instance is assigned multiple positive samples. The one-to-many label assignment facilitates representation learning, but the massive positive training samples may lead to duplicate predicted bounding boxes. To meet the final one-to-one classification target, we proposed an Enhanced Positive Sample Filter (EPSF) that selects a single positive sample for each instance among several potential true positives with high scores produced by the classification branch. For each location, the outputs of EPSF and the classification branch are multiplied to obtain the final predictions, whose training targets are produced by the one-to-one label assignment strategy. In this paper, we utilized the prediction-aware assignment strategy proposed by [12] for both one-to-one and one-to-many label assignment, which dynamically assigns samples according to the quality of classification and regression simultaneously. For different branches, we computed qualities based on different predictions; the details are presented in Figure 2.
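As a concrete illustration, here is a minimal sketch of a prediction-aware one-to-one assignment in the spirit of POTO [12]; the quality exponent `alpha` and the omission of the spatial prior (restricting candidates to points inside the GT box) are our simplifications, not the paper's exact settings:

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_iou

def one_to_one_assign(cls_prob, pred_boxes, gt_boxes, gt_labels, alpha=0.8):
    """Quality mixes the classification score of each GT's class with the
    IoU of the predicted box; bipartite matching then keeps exactly one
    positive sample per GT instance.
    cls_prob: (N, C) probabilities; pred_boxes: (N, 4);
    gt_boxes: (G, 4); gt_labels: (G,) long tensor."""
    iou = box_iou(gt_boxes, pred_boxes)           # (G, N)
    score = cls_prob[:, gt_labels].t()            # (G, N): score of each GT's class
    quality = score ** (1 - alpha) * iou ** alpha # geometric mixture
    gt_idx, pos_idx = linear_sum_assignment(-quality.detach().cpu().numpy())
    return gt_idx, pos_idx   # pos_idx[i] is the single positive for GT gt_idx[i]
```

The one-to-many variant used for the classification and objectiveness branches would instead keep the top-K samples per GT under the same quality measure.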

3.3. Enhanced Positive Sample Filter

To filter out the single best positive sample for each GT instance, we designed an Enhanced Positive Sample Filter (EPSF). First, we proposed a Dual-stream Feature Enhancement module (DsFE) to provide rich information clues so that the filter can make more confident decisions in different local regions. Then, we proposed a Disentangled Max Pooling Filter (DeMF) to enhance local features. Both modules construct parallel streams, where each stream produces complementary results; the ensemble of these streams is more reliable and robust to different situations. The process of EPSF can be formulated as follows:
$$\mathrm{EPSF}(f_{reg}, f_{obj}) = \mathrm{DeMF}(\mathrm{DsFE}(f_{reg}, f_{obj})) = M_{epsf}. \tag{4}$$
Details of DsFE and DeMF will be introduced in the following parts.

3.3.1. Dual-Stream Feature Enhancement Module

Reviewing the literature, one solution for positive sample selection in local regions is the 3DMF module [12], which takes feature maps from the regression branch (the $(H, W, 256)$ feature maps before the last convolution layer) as input. In other words, it makes positive sample selection decisions based on regression features that are supervised by the regression loss. However, the training targets of the regression branch are offsets from the four boundaries of the assigned GT box, where the target offsets of small objects are relatively smaller than those of larger ones. As a result, the feature maps from the regression branch may produce relatively lower responses in small-object regions, which could be suppressed by larger ones. Moreover, selecting only one positive sample for each instance is essentially a difficult problem due to the existence of inevitable ambiguity and noise. Therefore, we required stronger feature representations to provide richer information clues for positive sample selection. To this end, we constructed a Dual-stream Feature Enhancement module (DsFE) to help the downstream DeMF make more adaptive decisions in local regions; the diagram is depicted in Figure 1. One stream is extracted from the subnet of the regression branch, which captures localization information. The other stream comes from the newly constructed objectiveness branch. This new branch aims at distinguishing all foreground samples from backgrounds, where the binary classification objective makes it good at extracting high-level representations and encoding invariance to object sizes. The two streams, trained with different objectives, learn feature representations from different perspectives. We propose to combine the complementary features from the two streams by concatenation, which is a simple but effective operation and strikes a good balance between localization and classification clues. Intuitively, the resulting feature representation serves as a more robust basis for adaptive positive sample selection.
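A minimal sketch of DsFE at one pyramid level follows; the 1 × 1 fusion convolution is our assumption, since the paper only specifies concatenation:

```python
import torch
import torch.nn as nn

class DsFE(nn.Module):
    """Dual-stream Feature Enhancement sketch: concatenate features from the
    regression and objectiveness subnets, then fuse them back to the working
    channel width (the fusion conv is an assumption)."""
    def __init__(self, c=256):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, 1)

    def forward(self, f_reg, f_obj):   # both (N, 256, H, W) at one level
        return self.fuse(torch.cat([f_reg, f_obj], dim=1))
```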

3.3.2. Disentangled Max Pooling Filter

In addition to performing positive sample selection based on richer features, we also need to make full use of these features to generate adaptive predictions in different local regions. Our goal was to select only one positive sample for each GT instance. Concretely, the selected positive sample should achieve a higher classification score than the other samples involved with the same GT instance, and a larger score gap is better. In this work, we aim to produce a selection mask to adjust the classification scores generated by the classification branch, which was trained with the one-to-many assignment strategy. This is challenging for a fully convolutional network, since convolution is a linear operation with a translation invariance property and tends to produce similar features for predictions related to the same GT instance. The 3DMF module [12] attempts to utilize a rank-based non-linear 3D max filter to compensate for the discriminability of convolutions in local regions. Similar solutions were also used in [27,28]. In detail, 3DMF [12] enhances the features in each level by performing 3D max pooling on multi-scale adjacent features; its key idea is searching for the maximum value in a range of local regions across adjacent levels to update the value at the current location. Although it has been proved effective, it suffers from at least two drawbacks: (1) locations in feature maps with local maximum values may be false positives, and these errors may be broadcast via the max pooling operation, leading to more false positives; (2) some true positives with lower responses than false positives cannot be captured by the max pooling operation, even though they provide correct information in local regions.
To handle the above-mentioned drawbacks, we proposed a Disentangled Max Pooling filter (DeMF) as presented in Figure 3. The pipeline of DeMF contains three disentangled streams: residual, first max pooling and second max pooling streams. The predicted positive sample selection mask M e p s f can be obtained by:
$$M_{epsf} = \mathrm{Sigmoid}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{GN}(F_{res} + \mathrm{MP}_{1st} + \mathrm{MP}_{2nd})))), \tag{5}$$
where $F_{res}$, $\mathrm{MP}_{1st}$ and $\mathrm{MP}_{2nd}$ denote the outputs of the above-mentioned three streams, which are combined by element-wise addition followed by several operations (i.e., Group Normalization [39], a ReLU function, a convolutional layer and the Sigmoid function). Next, we introduce the three streams in detail.
The residual stream keeps the original information learned by the backbone convolutional neural network. It is a necessary component that preserves the chance for all samples to contribute to the optimization of the model in the fully end-to-end training pipeline. The first max pooling stream leverages local maximum values to enhance the distinction of the corresponding local regions. As mentioned above, local maximum values are sometimes unreliable due to outliers, noise or other false positives. To revise such potential errors to some extent, the second max pooling stream introduces the local secondary-maximum values so as to provide additional references for the activations in local regions. Moreover, since samples with secondary-maximum values are also likely to be true positives, they deserve more attention during training. The ensemble of the three streams provides more reliable and robust feature representations that support more confident positive sample selection.
Concretely, the residual stream was constructed by a shortcut connection from the input feature $X_l$ (the feature in the $l$th level of the FPN). As for the first max pooling stream, we first applied a convolution layer to the original feature to generate $x_l$ (the index $l$ denotes the $l$th level). Then, we employed the bilinear operator to interpolate the features $x_{l-1}$ from the $(l-1)$th level and $x_{l+1}$ from the $(l+1)$th level to the same size as $x_l$; we denote the interpolated adjacent features as $\hat{x}_{l-1}$ and $\hat{x}_{l+1}$. We then applied a 3D max pooling operator on $\{\hat{x}_{l-1}, x_l, \hat{x}_{l+1}\}$ to generate an enhanced feature $x'_l$ for the current level. The second max pooling stream takes the output of the first stage $x'_l$ as input: we found the peak locations picked up by the first max pooling stream and suppressed them by resetting their values to zero. In the same way as in the first max pooling stream, we then collected features of adjacent levels and applied a 3D max pooling operation again; we denote the resulting feature as $\tilde{x}_l$. Finally, the outputs of the three streams ($X_l$, $x'_l$ and $\tilde{x}_l$) were fused by element-wise addition, and several layers, including Group Normalization, ReLU, a convolution and a Sigmoid, were appended to generate the final selection mask of our proposed DeMF. The components of DeMF are all differentiable and lightweight, and can be easily implemented and embedded into industry projects.
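Putting the pieces together, here is a hedged sketch of DeMF and the full EPSF (Eqs. (4) and (5)); the peak-suppression rule shown (zeroing locations whose own value won the first pooling) is one plausible reading of the description above, and `DsFE` is the module sketched in Section 3.3.1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def max_filter_3d(levels, k=3):
    """3D max pooling across adjacent pyramid levels and a k x k spatial
    window; returns one filtered map per input level."""
    out = []
    for l, x in enumerate(levels):
        lo, hi = max(l - 1, 0), min(l + 1, len(levels) - 1)
        nbrs = [x if j == l else
                F.interpolate(levels[j], size=x.shape[-2:], mode="bilinear",
                              align_corners=False)
                for j in range(lo, hi + 1)]
        stack = torch.stack(nbrs, dim=2)                      # (N, C, D, H, W)
        pooled = F.max_pool3d(stack, (stack.size(2), k, k),
                              stride=1, padding=(0, k // 2, k // 2))
        out.append(pooled.squeeze(2))                         # back to (N, C, H, W)
    return out

class DeMF(nn.Module):
    """Disentangled Max Pooling Filter sketch, Eq. (5)."""
    def __init__(self, c=256):
        super().__init__()
        self.pre = nn.Conv2d(c, c, 1)        # produces x_l from X_l
        self.gn = nn.GroupNorm(32, c)
        self.post = nn.Conv2d(c, 1, 3, padding=1)

    def forward(self, feats):                # list of per-level maps X_l
        x = [self.pre(f) for f in feats]
        mp1 = max_filter_3d(x)               # first max pooling stream: x'_l
        # zero the peaks that won the first pooling, then pool again so the
        # secondary maxima surface in the second stream
        sup = [xi.masked_fill(xi == m1, 0.0) for xi, m1 in zip(x, mp1)]
        mp2 = max_filter_3d(sup)             # second stream: ~x_l
        fused = [f + m1 + m2 for f, m1, m2 in zip(feats, mp1, mp2)]  # + residual
        return [torch.sigmoid(self.post(F.relu(self.gn(f)))) for f in fused]

class EPSF(nn.Module):
    """EPSF(f_reg, f_obj) = DeMF(DsFE(f_reg, f_obj)), Eq. (4)."""
    def __init__(self, c=256):
        super().__init__()
        self.dsfe, self.demf = DsFE(c), DeMF(c)

    def forward(self, f_reg, f_obj):         # lists of per-level features
        return self.demf([self.dsfe(r, o) for r, o in zip(f_reg, f_obj)])
```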

4. Experiments

4.1. Datasets

To demonstrate the effectiveness of the proposed method, experiments were performed on four popular detection datasets: COCO [15], PASCAL VOC [16], CrowdHuman [17] and Caltech [18]. COCO contains 80 object categories collected from various scenes; all models were trained on the COCO train2017 set (118k images) and evaluated on the val2017 set (5k images). For PASCAL VOC, we evaluated our model on the PASCAL VOC2007 dataset, which consists of 5011 trainval images and 4952 test images over 20 object categories. To show the detection performance in crowded scenes, we also evaluated our methods on the CrowdHuman dataset, where models were trained on the standard training set (15,000 images) and evaluated on the validation set (4370 images). Caltech is also a popular detection dataset with a total of 10 hours of video recorded in urban driving environments; its training and test sets contain 42,500 and 4024 images, respectively.

4.2. Implementation Detail

The experiments were implemented with the cvpods toolbox [40]. For fair comparisons, we used a pre-trained ResNet50 [37] with FPN [36] as the backbone. For COCO and CrowdHuman, models were trained for 36 and 32 epochs, respectively. All experiments were trained with SGD using an initial learning rate of 0.01, momentum of 0.9, weight decay of $10^{-4}$ and a mini-batch size of 16, on a Tesla V100 GPU.
Detection Process during Inference. The general process of detecting objects during inference is as follows: given an input image, we forward it through the detection pipeline. For each location $(x, y)$ on the feature maps, we obtain the classification scores $p_{x,y}$ from the classification branch and the selection mask value $m_{x,y}$ from EPSF, while the coordinates of the predicted bounding box at each location are obtained from the regression branch. We then take $p_{x,y} \cdot m_{x,y}$ as the final score, which indicates the probability that the predicted box at this location contains an object of the concerned category. For evaluation, we directly use these predicted bounding boxes with their final scores, without NMS post-processing.
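A minimal decoding sketch under these definitions; `topk = 100` follows the usual COCO evaluation protocol and is our assumption here:

```python
import torch

def decode_without_nms(final_scores, boxes, topk=100):
    """NMS-free decoding sketch: final_scores (N, C) already hold the
    products p * m, so detection reduces to taking the top-k scoring
    (location, class) pairs. boxes: (N, 4) decoded from the regression branch."""
    n, c = final_scores.shape
    vals, idx = final_scores.flatten().topk(min(topk, n * c))
    return boxes[idx // c], vals, idx % c    # boxes, scores, class labels
```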

4.3. Comparisons with State-of-the-Arts

To demonstrate the superiority of our proposed methods, we present comparisons with the state-of-the-art methods in this section.
Comparisons on COCO: We compared the proposed detector with other state-of-the-art object detectors on the COCO val2017 set; Table 1 summarizes the comparisons. Our detector obtains superior performance compared to the most related NMS-free detectors, DeFCN and OneNet. Additionally, our proposed NMS-free detector outperforms NMS-based detectors, e.g., surpassing Faster RCNN by 1.8% AP, FCOS by 0.6% AP and RetinaNet by 3.3% AP. These results firmly demonstrate the effectiveness of our proposed methods.
Comparisons on PASCAL VOC: The proposed detector is compared with the state-of-the-art object detectors on the PASCAL VOC2007 test set in Table 2. For fair comparisons, all models were trained on the PASCAL VOC2007 trainval set and tested on the PASCAL VOC2007 test set. It can be seen that our proposed detector surpasses other mainstream object detectors by a remarkable margin, which further validates the effectiveness of our proposed methods.
Comparisons on CrowdHuman: In crowded scenarios, end-to-end detection without NMS becomes more challenging. Compared with COCO, the CrowdHuman dataset is much more crowded, containing approximately 22.6 pedestrians per image on average against 7.2 in COCO. To further show the performance of our proposed methods in crowd handling, we evaluated our model on the CrowdHuman dataset. As shown in Table 3, our end-to-end detector significantly outperforms several state-of-the-art detectors. Compared with the one-stage NMS-free detectors DeFCN and OneNet, our detector achieves {2.9% mMR, 1.7% AP, 1.2% Recall} and {2.2% mMR, 0.1% AP, 0.1% Recall} gains, respectively.
Comparisons on Caltech: To further evaluate the effectiveness and robustness of the proposed detector, it was compared with the state-of-the-art on the Caltech test set. Following the official setting, we present comparisons in terms of mMRs on three subsets with different difficulty levels. The reasonable set (R) and the reasonable heavy occlusion set (HO) contain pedestrian objects with visibility of [0.65, 1] and [0.2, 0.65], respectively. Their union set is denoted as R + HO. As shown in Table 4, our detector achieves the best performance on all subsets, which validates its great robustness in terms of handling samples in different situations.
Advantages. In summary, the proposed detector enjoys the following advantages:
  • The proposed detector shows superiority in performance against other NMS-free detectors and many mainstream NMS-based detectors. Thus, we think it is likely to serve as a strong alternative to current mainstream detectors;
  • The proposed detector shows consistently excellent performance on mainstream detection datasets covering different scenes, which demonstrates its robustness in handling various situations;
  • The proposed end-to-end detector is designed in a one-stage anchor-free manner without involving heavy attention mechanisms or an NMS post-processing step. The entire detection pipeline is simple and efficient, and can be easily deployed in real-world industrial applications.

4.4. Ablation Study

Components of Enhanced Positive Sample Filter. To validate the effectiveness of our proposed EPSF, we conducted an ablation study on its components, i.e., DsFE and DeMF. Table 5 summarizes the results. The first row shows the baseline results without DsFE and DeMF. We first built DeMF upon the baseline, where the input of DeMF comes from the regression branch; DeMF raises consistent performance gains across all metrics. We then further added DsFE and obtained further improvements. Our entire end-to-end object detector achieves a significant performance gain over the baseline, which confirms the effectiveness of EPSF.
Architectures of Dual-stream Feature Enhancement Module. To validate the effectiveness of our proposed DsFE, we compared five alternative selection schemes, whose diagrams are depicted in Figure 4. Table 6 summarizes the comparison results. One can observe that Figure 4e achieves the best performance by performing positive sample selection on concatenated features from the regression and objectiveness branches, showing that merging localization and objectiveness information helps produce more robust positive sample proposals.
Architectures of Objectiveness Branch. The objectiveness branch can be constructed in two ways, as shown in Figure 5. We conducted experiments on both architectures, and the corresponding results are presented in Table 7. Constructing the objectiveness branch as an independent branch directly on top of the FPN obtains better performance than attaching it to the regression subnet. This is because their learning targets are different and the expected learned knowledge has large discrepancies; as a result, two independent branches perform best.
Label Assignment in Objectiveness Branch. In order to learn good object representations via the objectiveness branch, a one-to-many label assignment strategy was employed. We explored its key hyper-parameter, i.e., the average number of assigned positive samples. A larger number provides more supervision to learn strong and robust representations, while a smaller number may facilitate the training of the proposed EPSF. As shown in Table 8, assigning 16 positive samples on average to each GT instance achieves the best overall performance.
Architectures of Disentangled Max Pooling Filter. We evaluated different architectures of DeMF and present the results in Table 9. First, we attached a third max pooling stream with the same architecture as the second one, and the performance declined severely. This may be because the extra third max pooling stream assigns higher values to diffuse regions and makes the final feature representation less discriminative in foreground local regions. Moreover, we replaced the 3D max pooling operation in the second max pooling stream with a 2D max pooling, which only considers features in the current level. Its performance is inferior to that of the 3D max pooling version, which shows that cross-level enhancement is effective, since different levels may learn complementary information.
Different Kernel Sizes of Max Pooling in Disentangled Max Pooling Filter. We also evaluated different kernel sizes for the 3D max pooling operation in the second max pooling stream of DeMF. As shown in Table 10, a 3 × 3 kernel obtains the highest performance, confirming that the secondary maximum value within a limited local range is helpful for feature enhancement.

4.5. Qualitative Results

We present some qualitative detection results of our proposed end-to-end object detector on the COCO val2017 set in Figure 6. Our detector generates satisfactory detection results without NMS post-processing: in different scenes, each object is detected uniquely and precisely, indicating that our model can make adaptive positive sample selection decisions in local areas. The selected best bounding box for each instance obtains a high confidence score, while other duplicated boxes receive lower confidence scores and are removed from the final detection results.

5. Conclusions

This paper presents an end-to-end object detector without heuristic NMS post-processing. We realize this by establishing an Enhanced Positive Sample Filter that filters out only one positive sample for each GT instance and lowers the confidences of other negative samples; it consists of two components, a Dual-stream Feature Enhancement Module (DsFE) and a Disentangled Max Pooling Filter (DeMF). We have experimentally validated that DsFE can provide rich and robust feature representations by combining knowledge learned with different targets; the enhanced representations serve as strong decision-making clues for positive sample selection in local areas. DeMF effectively enhances the feature discriminability in potential foreground regions by introducing disentangled max pooling streams. Based on these two components, the Enhanced Positive Sample Filter can make confident decisions about positive sample selection in local areas. Extensive experiments have validated the effectiveness of our proposed methods, and our end-to-end detector achieves a competitive performance against alternative methods. Moreover, the entire detection pipeline is designed in a one-stage anchor-free manner and can be easily deployed in industrial applications. End-to-end detection is also a new trend in other specific detection tasks, e.g., face detection and vehicle detection, which involve different new challenges; however, these are not the focus of this paper. We will explore the NMS problem in more specific detection tasks in future work, and we hope that the proposed methods can inspire researchers to handle other tasks.

Author Contributions

Conceptualization, X.S.; Investigation, X.S.; Methodology, X.S., B.C. and H.Z.; Software, X.S.; Supervision, H.Z.; Validation, B.C.; Writing—original draft, X.S.; Writing—review & editing, B.C., P.L. and B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China under Grant No. 62076034.

Acknowledgments

We give special thanks to Alibaba Group for their contribution to this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015.
  2. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  3. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
  4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  5. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  6. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569.
  7. Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4507–4515.
  8. He, Y.; Zhu, C.; Wang, J.; Savvides, M.; Zhang, X. Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2888–2897.
  9. Liu, S.; Huang, D.; Wang, Y. Adaptive NMS: Refining pedestrian detection in a crowd. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6459–6468.
  10. Huang, X.; Ge, Z.; Jie, Z.; Yoshie, O. NMS by representative region: Towards crowded pedestrian detection by proposal pairing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10750–10759.
  11. Sun, P.; Jiang, Y.; Xie, E.; Shao, W.; Yuan, Z.; Wang, C.; Luo, P. What makes for end-to-end object detection? In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 9934–9944.
  12. Wang, J.; Song, L.; Li, Z.; Sun, H.; Sun, J.; Zheng, N. End-to-end object detection with fully convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 15849–15858.
  13. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
  14. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14454–14463.
  15. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
  16. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Available online: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html (accessed on 7 April 2007).
  17. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. CrowdHuman: A benchmark for detecting human in a crowd. arXiv 2018, arXiv:1805.00123.
  18. Dollar, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 743–761.
  19. Yu, H.; Gong, J.; Chen, D. Object Detection Using Multi-Scale Balanced Sampling. Appl. Sci. 2020, 10, 6053.
  20. Zhang, Y.; Kong, J.; Qi, M.; Liu, Y.; Wang, J.; Lu, Y. Object Detection Based on Multiple Information Fusion Net. Appl. Sci. 2020, 10, 418.
  21. Jiang, J.; Xu, H.; Zhang, S.; Fang, Y. Object Detection Algorithm Based on Multiheaded Attention. Appl. Sci. 2019, 9, 1829.
  22. Wang, H.; Li, D.; Song, Y.; Gao, Q.; Wang, Z.; Liu, C. Single-Shot Object Detection with Split and Combine Blocks. Appl. Sci. 2020, 10, 6382.
  23. Liu, X.; Chen, H.X.; Liu, B.Y. Dynamic Anchor: A Feature-Guided Anchor Strategy for Object Detection. Appl. Sci. 2022, 12, 4897.
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
  25. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
  26. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
  27. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
  28. Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.
  29. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. FoveaBox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398.
  30. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
  31. Sun, Z.; Cao, S.; Yang, Y.; Kitani, K.M. Rethinking transformer-based set prediction for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtually, 11–17 October 2021; pp. 3611–3620.
  32. Dai, Z.; Cai, B.; Lin, Y.; Chen, J. UP-DETR: Unsupervised Pre-Training for Object Detection With Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1601–1610.
  33. Zheng, M.; Gao, P.; Zhang, R.; Li, K.; Wang, X.; Li, H.; Dong, H. End-to-end object detection with adaptive clustering transformer. arXiv 2020, arXiv:2011.09315.
  34. Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast convergence of DETR with spatially modulated co-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtually, 11–17 October 2021; pp. 3621–3630.
  35. Zheng, A.; Zhang, Y.; Zhang, X.; Qi, X.; Sun, J. Progressive End-to-End Object Detection in Crowded Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 857–866.
  36. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  38. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520.
  39. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  40. Zhu, B.; Wang, F.; Wang, J.; Yang, S.; Chen, J.; Li, Z. CVPODS: All-in-One Toolbox for Computer Vision Research. 2020. Available online: https://github.com/Megvii-BaseDetection/cvpods (accessed on 3 December 2022).
  41. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9759–9768.
  42. Zhang, S.; Yang, J.; Schiele, B. Occluded Pedestrian Detection Through Guided Attention in CNNs. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018.
  43. Cai, Z.; Fan, Q.; Feris, R.; Vasconcelos, N. A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016.
  44. Zhang, L.; Lin, L.; Liang, X.; He, K. Is Faster R-CNN Doing Well for Pedestrian Detection? In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016.
  45. Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A Diverse Dataset for Pedestrian Detection. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017.
  46. Pang, Y.; Xie, J.; Khan, M.H.; Anwer, R.M.; Khan, F.S.; Shao, L. Mask-Guided Attention Network for Occluded Pedestrian Detection. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019.
Figure 1. The overall network architecture, where C3 to C5 represent feature maps from the backbone network and P3 to P7 denote feature maps from the FPN that are passed to a shared detection head. The head contains three branches: a regression branch, an objectiveness branch and a classification branch. The proposed Enhanced Positive Sample Filter (EPSF) is applied to features from the regression and objectiveness branches so as to select a single positive sample for each GT instance. EPSF contains two modules, namely Dual-stream Feature Enhancement (DsFE) and the Disentangled Max Pooling Filter (DeMF).
Figure 2. The diagram of label assignments. 'POTO' and 'POTM' denote prediction-aware one-to-one and one-to-many label assignments, respectively, which are based on the qualities of both classification and regression. The green dashed lines connect predictions to 'POTO' or 'POTM', meaning the qualities are computed based on these predictions. The green solid lines connect 'POTO' or 'POTM' to predictions, meaning 'POTO' or 'POTM' generates training targets for these predictions. The 'obj subnet', 'reg subnet' and 'cls subnet' denote the four convolutional layers on top of the FPN feature maps in the objectiveness, regression and classification branches, respectively.
Figure 3. The diagram of the proposed Disentangled Max Pooling Filter (DeMF). 'GN' indicates group normalization [39].
Figure 4. (a–e) Diagrams of alternative architectures of the DsFE module. The 'reg subnet', 'cls subnet' and 'obj subnet' denote the four convolutional layers on top of the FPN feature maps in the regression, classification and objectiveness branches, respectively.
Figure 5. (a,b) Diagrams of alternative architectures of the objectiveness branch. The pink cubes represent convolutional layers.
Figure 6. Qualitative detection results of our proposed end-to-end object detector on the COCO val2017 set.
Table 1. Performance comparisons with the state-of-the-art object detectors on the COCO val2017 set.

| Methods | Epochs | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|
| NMS-based detectors: | | | | | | | |
| Faster RCNN [1] | 36 | 40.2 | 61.0 | 43.8 | 24.2 | 43.5 | 52.0 |
| RetinaNet [2] | 36 | 38.7 | 58.0 | 41.5 | 23.3 | 42.3 | 50.3 |
| FCOS [3] | 36 | 41.4 | 59.9 | 44.8 | 26.1 | 44.9 | 52.7 |
| NMS-free detectors: | | | | | | | |
| OneNet [11] | 36 | 38.9 | 57.3 | 42.3 | 23.9 | 41.9 | 49.5 |
| DeFCN [12] | 36 | 41.4 | 59.5 | 45.7 | 26.1 | 44.9 | 52.0 |
| Ours | 36 | 42.0 | 60.0 | 46.3 | 26.3 | 45.2 | 52.8 |
Table 2. Performance comparisons with the state-of-the-art object detectors on the PASCAL VOC2007 test set. The 'NMS' column indicates whether the method uses NMS.

| Methods | NMS | mAP |
|---|---|---|
| Faster RCNN [1] | ✓ | 73.1 |
| RetinaNet [2] | ✓ | 74.0 |
| SSD [24] | ✓ | 71.6 |
| FCOS [3] | ✓ | 73.0 |
| DeFCN [12] | × | 73.1 |
| Ours | × | 74.8 |
Table 3. Performance comparisons with the state-of-the-art object detectors on the CrowdHuman val set. Note that a lower mMR is better.

| Methods | Epochs | mMR↓ | AP_50 | Recall |
|---|---|---|---|---|
| NMS-based detectors: | | | | |
| Faster RCNN [1] | - | 50.4 | 85.0 | 90.2 |
| RetinaNet [2] | 32 | 57.6 | 81.7 | 88.6 |
| FCOS [3] | 32 | 54.9 | 86.1 | 94.2 |
| ATSS [41] | 32 | 49.7 | 87.2 | 94.0 |
| AdaptiveNMS [9] | - | 49.7 | 84.7 | 91.3 |
| NMS-free detectors: | | | | |
| DETR [13] | 300 | 80.1 | 72.8 | 82.7 |
| Deformable DETR [30] | 32 | 54.0 | 86.7 | 92.5 |
| OneNet [11] | 50 | 48.2 | 90.7 | 97.6 |
| DeFCN [12] | 32 | 48.9 | 89.1 | 96.5 |
| Ours | 32 | 46.0 | 90.8 | 97.7 |
Table 4. Performance comparisons on the Caltech test set. The 'NMS' column indicates whether the method uses NMS. Numbers are mMRs; note that a lower mMR is better.

| Method | NMS | R↓ | HO↓ | R + HO↓ |
|---|---|---|---|---|
| FasterRCNN + ATT [42] | ✓ | 10.3 | 45.2 | 18.2 |
| MS-CNN [43] | ✓ | 10.0 | 59.9 | 21.5 |
| RPN + BF [44] | ✓ | 9.6 | 74.4 | 24.0 |
| FasterRCNN [45] | ✓ | 9.2 | 57.6 | 20.0 |
| MGAN [46] | ✓ | 6.8 | 38.2 | 13.8 |
| FCOS [3] | ✓ | 6.9 | 34.1 | 14.2 |
| DeFCN [12] | × | 7.1 | 34.4 | 14.3 |
| Ours | × | 5.8 | 31.6 | 12.6 |
Table 5. The effect of the components of the proposed Enhanced Positive Sample Filter on the COCO val2017 set. 'DsFE' and 'DeMF' indicate using the Dual-stream Feature Enhancement and Disentangled Max Pooling Filter, respectively.

| DsFE | DeMF | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|
| × | × | 39.1 | 56.2 | 42.9 | 24.4 | 42.5 | 49.6 |
| × | ✓ | 41.5 | 59.6 | 45.7 | 25.1 | 44.9 | 52.0 |
| ✓ | ✓ | 42.0 | 60.0 | 46.3 | 26.3 | 45.2 | 52.8 |
Table 6. Results of the DsFE module with different architectures on the COCO val2017 set. The (a)–(e) indexes in the first column denote the corresponding architectures presented in Figure 4.

| Methods | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|
| (a) | 41.5 | 59.6 | 45.7 | 25.1 | 44.9 | 52.0 |
| (b) | 41.6 | 59.7 | 45.7 | 25.5 | 44.8 | 51.6 |
| (c) | 41.3 | 59.4 | 45.2 | 26.2 | 44.6 | 51.5 |
| (d) | 41.4 | 59.5 | 45.3 | 24.3 | 44.5 | 52.2 |
| (e) | 42.0 | 60.0 | 46.3 | 26.3 | 45.2 | 52.8 |
Table 7. Results of alternative architectures of the objectiveness branch on the COCO val2017 set. The (a) and (b) indexes in the first column denote the corresponding architectures presented in Figure 5.

| Methods | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|
| (a) | 41.4 | 59.6 | 45.7 | 25.9 | 44.6 | 52.7 |
| (b) | 42.0 | 60.0 | 46.3 | 26.3 | 45.2 | 52.8 |
Table 8. Results of different label assignment settings in the objectiveness branch on the COCO val2017 set. 'One-to-K' means assigning K positive samples on average to each GT instance.

| Label Assignment | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|
| one-to-9 | 41.8 | 59.7 | 46.0 | 25.6 | 45.1 | 53.2 |
| one-to-16 | 42.0 | 60.0 | 46.3 | 26.3 | 45.2 | 52.8 |
| one-to-25 | 41.7 | 59.7 | 45.9 | 25.5 | 45.0 | 52.8 |
Table 9. Results of DeMF with different architectures on the COCO val2017 set. 'DeMF-3MP' denotes adding an extra third max pooling stream with the same pipeline as the second one. 'DeMF-2nd2dMP' denotes replacing the 3D max pooling operation in the second max pooling stream with a 2D max pooling.

| Methods | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|
| DeMF | 42.0 | 60.0 | 46.3 | 26.3 | 45.2 | 52.8 |
| DeMF-3MP | 41.2 | 59.2 | 45.2 | 26.5 | 44.4 | 52.8 |
| DeMF-2nd2dMP | 41.6 | 59.8 | 45.8 | 25.7 | 44.7 | 52.8 |
Table 10. Results of different kernel sizes of the 3D max pooling operation in the second max pooling stream of DeMF on the COCO val2017 set.

| Kernel Size | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|
| 1 × 1 | 41.5 | 59.3 | 45.3 | 25.2 | 45.1 | 52.2 |
| 3 × 3 | 42.0 | 60.0 | 46.3 | 26.3 | 45.2 | 52.8 |
| 5 × 5 | 41.5 | 59.7 | 45.5 | 25.0 | 45.0 | 52.5 |