1. Introduction
Vision-based object detection in traffic scenes plays a crucial role in autonomous driving systems. With the rapid development of deep learning in recent years, the performance of deep learning-based object detection has improved significantly. Two-dimensional object detection in an autonomous driving system can automatically recognize and locate the spatial position and category of traffic objects by learning their features and perceiving the traffic scenario to ensure the safety of the vehicle. With the development of convolutional neural networks (CNNs), detectors can be divided into anchor-based and anchor-free approaches. In anchor-based detection, a set of anchors with preset sizes is placed, and the positions and sizes of the anchors are regressed based on the ground truth. By contrast, anchor-free detectors directly detect objects without manually set anchors, which gives them a performance advantage over anchor-based detectors.
However, single-stage (dense) detectors that include anchor-based and anchor-free detectors still have many problems that affect performance, such as the definition of positive and negative samples, and the design of the loss function, which lead to sample imbalance problems. Additionally, the authors found that, in autonomous driving, the aspect ratios of different objects have different distributions, as shown in
Figure 1. The inaccurate localization of objects can lead to incorrect distance estimates by the monocular ranging module, thus affecting the safe operation of autonomous driving. Anchor-based detectors (Faster R-CNN [1] and RetinaNet [2]) regress the object size using a large number of preset anchor boxes to make the network robust to aspect ratios. This mechanism has some drawbacks: (1) The performance of anchor-based methods is highly dependent on the selection of the number, size, and aspect ratio of the anchors, so anchor-based detectors need to fine-tune these hyperparameters. (2) In anchor-based detection, a set of anchor boxes must be densely placed; most anchors are redundant and have little effect on model training. (3) Different detection tasks have different detection objects with large scale and shape variations, corresponding to different distributions of anchor aspect ratios. This reduces the generalization ability of anchor-based methods, because different anchor boxes must be designed for new detection tasks. (4) In anchor-based detection, the computational cost increases due to the calculation of the intersection-over-union (IoU). Anchor-free detectors (CornerNet [3], CenterNet [4], and FCOS [5]) directly predict objects using key point or center point regression without preset anchor boxes. This mechanism eliminates the anchor-related hyperparameters, achieves performance similar to anchor-based detectors at a lower computational cost, and generalizes better. However, regression using key points or the center point lacks prior knowledge of the aspect ratio, which increases the training time required for the model to fit the distribution.
Additionally, ATSS [6] indicated that the performance gap between anchor-based and anchor-free detectors is mainly caused by the definition of positive and negative samples. Inappropriate sample definition and sampling methods aggravate the imbalance of positive and negative samples. In Figure 2, the yellow box is the ground truth, the red box is the positive sample, and the remaining gray boxes are negative samples. In anchor-based methods, such as YOLOv3 [7], each point on the feature map generates multiple anchor boxes. YOLOv3 only uses the anchor box with the maximum IoU with the ground truth as the positive sample and treats the remainder of the anchor boxes as negative samples. By contrast, anchor-free methods, such as CenterNet, only use the center point of the object as the positive sample and the remaining points as negative samples. The under-fitting of positive samples and over-fitting of negative samples prevent the model from thoroughly learning the features of objects, which degrades the detection performance.
In this paper, an aspect-aware anchor-free detector called AspectNet is proposed to solve the above problems. It supervises the object width and height by adding an aspect prediction head at the end of the detector, which can learn different distributions of aspect ratios between different objects. Simultaneously, a new sample definition method and a loss function are proposed to alleviate the problem of sample imbalance.
The contributions of this paper are as follows:
- (1)
A novel aspect-aware anchor-free detector is proposed to fit the distribution of the aspect ratios of different objects to alleviate the influence of variable scales on model robustness.
- (2)
The sample definition method is improved to alleviate the problem of positive and negative sample imbalance.
- (3)
A loss function is designed to strengthen the learning weight of the center point in the network, further alleviating the sample imbalance problem.
The remainder of this paper is structured as follows: In
Section 2, recent advances in object detection methods are discussed.
Section 3 describes the proposed method, including specific details and improvements. In
Section 4, the implementation of the proposed method and its comparison with existing approaches are discussed. In
Section 5, the proposed method is summarized, and the future research direction is presented.
2. Related Work
Traditional object detection uses HOG [8] or DPM [9] to extract image features, then feeds the features into a classifier, such as an SVM [10]. Because of its low performance, this pipeline was replaced by deep convolutional networks in the deep learning era. Generally, deep CNN detectors can be roughly categorized into two types: anchor-based approaches and anchor-free approaches.
2.1. Anchor-Based Approaches
Anchor-based approaches can be divided into two main types of pipelines: multi-stage detection and single-stage detection. In the first stage, multi-stage detection methods filter regions of interest in the image and extract the foreground area from preset dense candidates using region proposal algorithms [11,12]. The bounding boxes of objects are refined in the subsequent stages, as in the R-CNN series [1,11,12,13], which is time consuming; complex multi-stage models cannot meet the requirements of real-time detection. Different from multi-stage methods that use traditional sliding windows and proposals, single-stage detection methods inherit the idea of anchor boxes but directly detect objects and regress the bounding boxes in a single-shot manner without any region proposals. This avoids repeated computation of the feature map and places the anchor boxes directly on the feature map, which speeds up detection dramatically; examples include YOLOv3, RetinaNet, and SSD [14].
These methods need to preset a large number of anchors manually, which increases the computation in the network, and the excessive number of samples aggravates the imbalance of positive and negative samples, as described above. Additionally, the misalignment between anchors and features affects the network's performance. In previous studies, researchers often used the IoU to determine the samples. For example, if the IoU between an anchor box and a ground truth box is in [0.5, 1], the sample is positive, whereas if the IoU is in [0, 0.5), the sample is negative; the anchor with the maximum IoU with the ground truth is selected for matching. These hyperparameters need to be manually fine-tuned for different detection tasks, and detection performance is sensitive to the hyperparameter selection. The definition and matching of positive and negative samples has a significant impact on the model's performance.
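The IoU-based assignment rule described above can be sketched as follows. This is an illustrative implementation with a single 0.5 threshold, as in the example in the text; the helper names and the simplification to one ground truth box are assumptions, not the exact procedure of any cited detector.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_samples(anchors, gt, pos_thr=0.5):
    """Label each anchor against one ground truth box:
    1 = positive (IoU in [pos_thr, 1]), 0 = negative (IoU below pos_thr)."""
    return [1 if iou(a, gt) >= pos_thr else 0 for a in anchors]
```

Changing `pos_thr` changes how many anchors become positive, which is exactly the hyperparameter sensitivity the paragraph describes.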
2.2. Anchor-Free Approaches
Anchor-free approaches can directly detect objects without preset anchors and can be divided into two directions: key point-based methods and center-based methods. Most are single-stage because both are dense detectors. Key point-based methods use the joint expression of multiple key points to predict objects. For example, ExtremeNet [15] uses the four extreme points and the center point to make predictions. RepPoints [16] adaptively learns a set of points to represent boxes. FoveaBox [17] uses the center point and the upper left and lower right corners. PLN uses the center point and four corner points. CornerNet uses a pair of corner points to regress the bounding box. However, it requires an extra distance metric to group the pair of corner points that belong to the same object, which requires complicated post-processing. Compared with key point-based methods, center-based methods are structurally more concise and can achieve better performance [18]. Therefore, the proposed model AspectNet uses the center-based approach to detect objects. Center-based methods fit the center point to predict objects. YOLOv1 [19] uses points near the object's center to regress the bounding box instead of using anchor boxes. However, YOLOv1 exhibits high precision but low recall because it only uses a small number of positive samples to learn features, and it was quickly replaced by the anchor-based detector YOLOv2 [20]. CenterNet models bounding box regression as the problem of learning the center point and the distances from the center point to the corresponding bounding box, and directly predicts objects by center point regression without NMS post-processing. The structure is simple and effective. However, CenterNet defines only the center point of the ground truth as a positive sample and all remaining points as negative samples. This simple definition causes CenterNet to suffer from a severe imbalance of positive and negative samples, which also causes overfitting of the regression branch. FCOS is a pixel-wise object detection algorithm based on FCN [21], which alleviates the problem of positive and negative sample imbalance by adding a center-ness branch. It uses the feature pyramid network (FPN) for multi-scale prediction. FCOS defines the points inside the bounding box of an object as positive samples and the remaining points as negative samples. Compared with CenterNet, FCOS has a larger number of positive samples. However, FCOS does not define an ignored sample area, and the focal loss makes FCOS pay more attention to the noisy points around the bounding box. Additionally, positions on the bounding box edge are used as positive samples, which are difficult to predict accurately.
The excessive attention paid to difficult negative samples makes the loss unstable and difficult to decrease, and hence the model less robust. The proposed method alleviates this problem by redefining sample selection and redesigning the loss function. Additionally, the aspect ratio branch of the proposed method compensates for the lack of prior knowledge of the aspect ratio in anchor-free methods.
3. Proposed Method
The proposed structure of AspectNet is shown in
Figure 3. Different from other anchor-free models, an aspect-aware prediction head is designed and added to the end of the detector connecting the regression branch and classification branch to predict the aspect ratio of bounding boxes. The sample definition method and loss function of AspectNet are proposed to alleviate the sample imbalance problem.
3.1. Aspect Prediction for AspectNet
The unstable prediction of the aspect ratio of the bounding box can lead to wrong judgments by the autonomous driving decision-making module, as shown in
Figure 4. Different objects have different aspect ratio distributions. For example, the expected value of the aspect ratio of the bounding box of the pedestrian is larger than that of the bounding box at the rear of the vehicle in front. The under-fitting of the model to the aspect ratio distribution of different objects makes model prediction unstable.
An aspect-aware head, which concatenates the features of the classification and regression branches, is added at the end of the detector to predict the aspect ratio of objects. This can solve the robustness problem mentioned above, as shown in
Figure 3. The aspect-aware head improves the localization ability of the network; meanwhile, learning the distribution of different aspect ratios improves the model's classification ability.
Specifically, by labeling the aspect ratio of objects, the model can learn the aspect ratio distribution of each specific category. Pedestrians have different body sizes, and the mean of their aspect ratio distribution is higher. Cyclists have a large distribution variance due to their different riding postures. Although vehicles are seen from different viewing angles (e.g., the aspect ratios of the frontal and side views of a vehicle are different), their overall aspect ratio distribution is lower than that of pedestrians and cyclists. By distinguishing the aspect ratio distributions of different categories, the model can locate and classify objects more accurately. During training, the aspect prediction head is trained synchronously with the other two branches and indirectly improves model performance by backpropagating weight updates to these branches, so that the model can extract the aspect ratio features of the objects. During inference, the predicted aspect ratio perturbs the detection confidence of the predicted objects based on the existing predicted classification probability. The detection confidence score s is defined as

s = p · exp(−|â − a|),

where â is the aspect ratio of the bounding box predicted by the aspect-aware head, a is the aspect ratio calculated from the position information predicted by the regression branch, and p is the predicted classification probability. The larger the aspect ratio difference between the aspect-aware head and the regression branch, the smaller the confidence of the predicted bounding box.
The aspect ratio a is defined as

a = (y2 − y1) / (x2 − x1),

where (x2, y2) are the coordinates of the lower right corner point of the bounding box and (x1, y1) are the coordinates of its upper left corner point. The feature maps from the classification branch and the regression branch are concatenated. The entire aspect prediction head consists of only two convolution layers with ReLU [22] activations, which keeps it efficient.
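The two computations above can be sketched in a few lines. The height-to-width ratio follows the corner definition in the text; the exponential penalty in `confidence` is one plausible form consistent with the stated property (a larger disagreement between the two aspect ratios yields a smaller confidence), not necessarily the paper's exact formula.

```python
import math

def aspect_ratio(x1, y1, x2, y2):
    """Height-to-width ratio of a box with upper-left corner (x1, y1)
    and lower-right corner (x2, y2)."""
    return (y2 - y1) / (x2 - x1)

def confidence(p_cls, a_head, a_reg):
    """Down-weight the classification probability when the aspect-aware
    head (a_head) and the regression branch (a_reg) disagree.
    The exp(-|diff|) penalty is an illustrative assumption."""
    return p_cls * math.exp(-abs(a_head - a_reg))
```

When the two branches agree, the score reduces to the classification probability; any disagreement strictly lowers it.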
3.2. Sample Definition Method
The single-stage detector has always been plagued by the imbalance of positive and negative samples. The accuracy of two-stage detectors is higher than that of single-stage detectors because a two-stage detector uses an RPN or another selection method in the early stage to alleviate this imbalance. The performance gap between anchor-free and anchor-based detectors is caused by the difference in how positive and negative samples are defined. Therefore, selecting an appropriate method for determining positive and negative samples is key to designing object detectors. Most center-based anchor-free methods either select only the ground truth center point as the positive sample or select all points in the ground truth box as positive samples. The former causes serious imbalance problems, whereas the latter causes the model to focus too much on difficult samples, which makes the loss unstable and difficult to decrease. In this paper, a sample definition method based on a center-based anchor-free detector is proposed for object detection tasks in autonomous driving.
All ground truth center points p are mapped to a heatmap Y ∈ [0, 1]^(W/R × H/R × C), where W and H are the width and height of the input image I, respectively, R is a down-sampling factor, and C is the number of categories. The heatmap uses an aspect-aware Gaussian kernel, given as follows and demonstrated in Figure 5:

Y_xyc = exp( −(x − p̃_x)² / (2σ_w²) − (y − p̃_y)² / (2σ_h²) ),

where σ_w and σ_h are object size-adaptive standard deviations that scale with w and h, the width and height of the object bounding box, respectively, and p̃ = ⌊p/R⌋ is the low-resolution equivalent of each ground truth center point p. Y_xyc represents the degree of influence of the point (x, y) on the center point of the object: Y_xyc = 1 corresponds to the center point of the object, whereas Y_xyc = 0 is the background. Positive samples are defined as the points with Y_xyc ≥ τ1 and negative samples as the points with Y_xyc ≤ τ2. Points with τ2 < Y_xyc < τ1 are ignored sample points, where τ1 and τ2 are hyperparameters that control the number of positive samples; fixed values of τ1 and τ2 are chosen in this paper. By proposing the positive sample area, the ignored sample area, and the negative sample area, our model can effectively attend to the relationship between the features of the object itself and the background without paying too much attention to the edges of the object, which also avoids overfitting to the object bounding box edges. The definition of positive samples is also related to the aspect ratio of objects, which makes the sample definition method more efficient and concentrated. Thus, the problem of positive and negative sample imbalance can be alleviated.
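The heatmap construction and three-region sample split can be sketched as follows. The anisotropic Gaussian ties each axis's spread to the box width and height, which is the aspect-aware property described above; the scale factor `k` and the thresholds `t_pos`/`t_neg` are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def gaussian_heatmap(hm_h, hm_w, cx, cy, w, h, k=0.1):
    """Anisotropic Gaussian centred on (cx, cy): per-axis standard
    deviations scale with box width/height (k is an assumed factor)."""
    ys, xs = np.mgrid[0:hm_h, 0:hm_w]
    sx, sy = max(k * w, 1e-6), max(k * h, 1e-6)
    return np.exp(-((xs - cx) ** 2) / (2 * sx ** 2)
                  - ((ys - cy) ** 2) / (2 * sy ** 2))

def sample_masks(Y, t_pos=0.7, t_neg=0.3):
    """Split heatmap points into positive / ignored / negative regions."""
    pos = Y >= t_pos
    neg = Y <= t_neg
    ignored = ~(pos | neg)
    return pos, ignored, neg
```

The three masks are disjoint and cover the whole map; the ignored band between the thresholds is exactly the region excluded from the loss in the next subsection.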
3.3. Loss Function
The loss function of the proposed method can be divided into three parts that correspond to the three heads: the classification part, the regression part, and the aspect-aware part. The classification loss function is an aspect-aware logistic regression based on the focal loss. As given in the proposed sample definition method, Y_xyc is an aspect-aware Gaussian kernel that encodes the aspect ratio of the bounding box. The heatmap decreases pixel-wise from the center point of the bounding box toward the surrounding region, and the rate of decrease is related to the aspect ratio.
The classification loss function is defined as

L_cls = −(1/N) Σ_xyc  (1 − Ŷ_xyc)^α · log(Ŷ_xyc),                  if Y_xyc ≥ τ1,
L_cls = −(1/N) Σ_xyc  (1 − Y_xyc)^β · (Ŷ_xyc)^α · log(1 − Ŷ_xyc),  if Y_xyc ≤ τ2,

where Ŷ_xyc is the predicted heatmap, α and β are the hyperparameters of the focal loss, and N is the sum of the numbers of positive and negative samples. As shown in Figure 6, the value at the center point equals the maximum value 1. The positive samples follow the focal loss when Y_xyc ≥ τ1. The ignored area is defined as τ2 < Y_xyc < τ1; the model does not calculate the loss for this area and does not perform backpropagation there, so that the model does not pay too much attention to the edge of the bounding box. For negative samples, the controllable term (1 − Y_xyc)^β is added: for a negative sample point near the center point, the penalty is relatively large, but the loss weight (1 − Y_xyc)^β is relatively small, which indicates that the distinction between positive and negative at this position is more blurred.
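The classification loss above can be sketched as a CenterNet-style penalty-reduced focal loss with an ignored band. The exponents `alpha`/`beta` and the thresholds `t_pos`/`t_neg` are illustrative defaults, not the paper's settings.

```python
import numpy as np

def focal_cls_loss(pred, target, alpha=2.0, beta=4.0, t_pos=0.7, t_neg=0.3):
    """Penalty-reduced focal loss over a predicted heatmap `pred` and a
    Gaussian target heatmap `target`. Points with t_neg < target < t_pos
    fall in the ignored band and contribute no loss."""
    pred = np.clip(pred, 1e-6, 1 - 1e-6)
    pos = target >= t_pos
    neg = target <= t_neg
    # positive points: standard focal term on the predicted score
    pos_loss = -((1 - pred) ** alpha) * np.log(pred) * pos
    # negative points: penalty reduced by (1 - target)^beta near the centre
    neg_loss = -((1 - target) ** beta) * (pred ** alpha) * np.log(1 - pred) * neg
    n = max(pos.sum() + neg.sum(), 1)
    return (pos_loss.sum() + neg_loss.sum()) / n
```

A prediction that matches the target heatmap yields a near-zero loss, while confident wrong predictions are penalized heavily.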
The bounding boxes use CIoU [23] to regress the spatial coordinates. As illustrated in Figure 7, the regression objectives l, t, r, and b (the distances from a location to the left, top, right, and bottom sides of its bounding box) are computed for each location on all feature levels. Therefore, for a location (x, y), the predicted coordinates of the upper left and lower right corner points are represented as (x1, y1) = (x − l, y − t) and (x2, y2) = (x + r, y + b). The CIoU resolves the issue of no overlap between the bounding box and ground truth, which results in more stable bounding box regression because it considers the center distance, overlap rate, scale, and penalty terms. Additionally, this may help to avoid divergence throughout the training phase.
The CIoU loss function includes an impact term βv based on the DIoU [23], which takes the length-to-width ratio of the predicted and ground truth boxes into account:

L_CIoU = 1 − IoU + ρ²(b, b_gt) / c² + βv,
v = (4 / π²) · (arctan(w_gt / h_gt) − arctan(w / h))²,
β = v / ((1 − IoU) + v),

where β is a trade-off parameter and v measures the consistency of the aspect ratios. Additionally, ρ(·) denotes the distance between the center points b and b_gt of the predicted bounding box and the ground truth, and c denotes the diagonal length of the smallest enclosing box that covers both boxes.
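A self-contained sketch of the standard CIoU loss, combining the IoU term, the normalized center-distance term ρ²/c², and the aspect-ratio consistency term βv, for boxes in (x1, y1, x2, y2) form:

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss between a predicted box and a ground truth box."""
    # intersection and union
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    iou = inter / (wp * hp + wg * hg - inter)
    # squared centre distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((box_p[0] + box_p[2] - box_g[0] - box_g[2]) ** 2
            + (box_p[1] + box_p[3] - box_g[1] - box_g[3]) ** 2) / 4.0
    cw = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ch = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency v and trade-off beta
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    beta = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + beta * v
```

For identical boxes all three penalty terms vanish and the loss is zero; any displacement or aspect mismatch increases it.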
The binary cross-entropy loss is adopted for the aspect-aware head. â_i represents the predicted aspect ratio for each detected bounding box, and a_i is the aspect ratio of the corresponding ground truth box. During training, the gradients from the aspect-aware head loss function update the weights of the regression and classification branches through backpropagation, which promotes the detection performance of the proposed model. The loss function of the aspect-aware head is defined as

L_asp = (1/N_pos) Σ_i BCE( σ(â_i), σ(a_i) ),

where BCE is the binary cross-entropy loss, σ(·) is the sigmoid function, and N_pos is the number of positive samples. Additionally, the total loss function

L = L_cls + L_asp + L_CIoU

is obtained by adding the three sub-loss functions (L_cls: classification head, L_asp: aspect-aware head, and L_CIoU: regression head). L_CIoU and L_asp are calculated exclusively for positive samples.
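The per-sample BCE term and the additive combination of the three heads can be sketched as follows. The unit weights on the three sub-losses are an assumption; the paper may use different weighting coefficients.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, t):
    """Binary cross-entropy between a prediction p and target t in (0, 1)."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def total_loss(l_cls, l_reg, l_asp, w_cls=1.0, w_reg=1.0, w_asp=1.0):
    """Weighted sum of the three head losses (unit weights assumed)."""
    return w_cls * l_cls + w_reg * l_reg + w_asp * l_asp
```

The squashing by `sigmoid` is what lets unbounded aspect ratios be compared with BCE; the closer the prediction to the target, the smaller the term.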
3.4. Multi-Scale Detection
As shown in Figure 8, overlapping ground truths may produce ambiguity, where the same point represents the position of different objects, which is difficult to resolve during training. In this paper, a multi-scale prediction approach is demonstrated to solve this problem of semantic diversity. Following the FPN [24] and the pyramid attention network [25], a method for detecting objects of various sizes using different levels of feature layers is proposed. A pyramid is created using a five-scale feature map denoted by P3–P7. P3, P4, and P5 are extracted from the corresponding backbone feature maps, and top-down convolution is performed to mitigate the deterioration caused by increasing the depth of the convolutional layers. P6 and P7 are each produced using 3 × 3 convolutions with a stride of two from the preceding level, starting from P5. Objects of different scales are mapped onto different feature layers for prediction to avoid object occlusion and overlap. Multi-level detection distributes information throughout the feature layers, which may increase the efficiency of the feature maps and hence the detection performance.
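The scale-to-level mapping can be illustrated as follows. The size thresholds are assumed values in the spirit of FPN's size-based assignment, not the paper's configuration; the stride of level P_l is 2^l.

```python
def assign_level(box_w, box_h, thresholds=(64, 128, 256, 512)):
    """Map an object's size to a pyramid level P3-P7.
    Thresholds are illustrative: objects up to 64 px go to P3, etc."""
    size = max(box_w, box_h)
    for level, t in zip((3, 4, 5, 6), thresholds):
        if size <= t:
            return level
    return 7  # largest objects land on the coarsest level

def pyramid_shapes(input_h, input_w):
    """Spatial sizes of P3-P7 for strides 8 to 128."""
    return {f"P{l}": (input_h // 2 ** l, input_w // 2 ** l)
            for l in range(3, 8)}
```

Routing small objects to fine levels and large objects to coarse levels is what disambiguates the overlapping ground truths described above: two overlapping objects of different sizes are handled on different feature layers.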
5. Conclusions
In this paper, an anchor-free object detection method for autonomous driving was proposed. The main contribution is an aspect-aware prediction head added at the end of the detector; the detection accuracy on BCTSDB can be improved by adding this head to RetinaNet and FCOS. The improved sample definition method was used to alleviate the problem of sample imbalance; the comparison results on the BCTSDB dataset show that the detection accuracy can be improved by replacing the original sample definition method in the YOLOv3, FCOS, and RetinaNet models. The proposed loss function was added to strengthen the learning weight of the center point, and the validation experiment shows that this improves the detection accuracy. An overall validation on public datasets demonstrated that the proposed method achieves a significant improvement in detection accuracy. The AP50 and AP75 of the proposed method are 97.3% and 93.4% on BCTSDB, and the average accuracies for car, pedestrian, and cyclist are 92.7%, 77.4%, and 78.2% on KITTI, respectively, which indicates that the proposed method achieves better results than other methods. The proposed method improved detection accuracy, but it still encounters many challenges when applied to real traffic scenarios. Autonomous driving requires higher accuracy, which can be further improved in future work. The experiments in this paper were trained on public datasets, whereas real traffic scenes pose challenges with complex lighting and weather conditions. This issue can be addressed by considering Transformers and domain adaptation in future work.