We ran K-means clustering on the PASCAL VOC and KITTI datasets. We then proposed a mathematical derivation based on IOU to determine the number and the aspect ratio dimensions of the candidate anchor boxes for each scale of the improved network. Finally, to enhance detection performance, we improved the structure of YOLO V3.
3.1. Appropriate Size for Anchor Boxes
YOLO V3 adopts the idea of anchor boxes used in Faster R-CNN. Anchor boxes are a set of initial candidate boxes with a fixed width and height. The choice of the initial anchor boxes directly affects both detection accuracy and detection speed. Instead of choosing anchor boxes by hand, YOLO V3 runs K-means clustering on the dataset to find good priors automatically. The clusters generated by K-means reflect the distribution of the samples in each dataset, which makes it easier for the network to produce good predictions. In this paper, Avg IOU is used as the metric for the clustering analysis. The Avg IOU objective function of the clustering is as follows:
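$$\text{Avg IOU} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} I_{IOU}(B_j, C_i)$$
where
$B$ is the sample, namely the ground truth of the target.
$C$ is the center of the cluster.
$n_i$ is the number of samples in the $i$-th cluster.
$n$ is the total number of samples.
$k$ is the number of clusters.
$I_{IOU}(B, C)$ is the intersection over union of the cluster center and the sample.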
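The Avg IOU metric can be illustrated with a minimal sketch. It assumes that each box is reduced to a (width, height) pair and that the IOU is computed as if the sample and the cluster center shared the same center point, as is common when clustering anchor dimensions; the function names `wh_iou` and `avg_iou` are hypothetical and not taken from the original work.

```python
def wh_iou(box, center):
    """IOU of two (width, height) pairs, assuming both boxes share the same center point."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union


def avg_iou(boxes, centers):
    """Average, over all samples, of the IOU with the best-matching cluster center."""
    return sum(max(wh_iou(b, c) for c in centers) for b in boxes) / len(boxes)
```

Assigning each sample to the center with the highest IOU is what makes the inner sum of the objective run over the members of each cluster.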
We applied K-means clustering to the PASCAL VOC and KITTI datasets separately. Figure 3 shows the average IOU obtained for different numbers of clusters $k$. As $k$ increases, the objective function gradually levels off. Considering both the Avg IOU and the number of detection layers, we selected 12 anchor boxes. The widths and heights of the corresponding clusters on the PASCAL VOC and KITTI datasets are shown in Table 1.
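A minimal sketch of such a clustering step is given below. It assumes the usual IOU-based assignment (each sample joins the center with which it has the highest IOU) and mean-based center updates; the initialization, the stopping rule, and the names `kmeans_anchors`, `iters`, and `seed` are illustrative assumptions rather than the exact procedure used here.

```python
import random


def wh_iou(box, center):
    """IOU of two (width, height) pairs, assuming both boxes share the same center point."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors using IOU as the similarity."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)  # assumed initialization: k random samples
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: wh_iou(b, centers[i]))
            clusters[best].append(b)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centers.append((w, h))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:  # assignments no longer change the centers
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])
```

Running this for a range of values of `k` and recording the resulting Avg IOU gives a curve of the kind described above.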
After K-means clustering on the dataset, the cluster centers are obtained. YOLO V3 divides the clusters evenly across scales. Because this allocation is arbitrary, some clusters may be placed at inappropriate scales. It is therefore essential to determine which anchor box sizes are suitable for each scale of the network. Inspired by the proposal generation method of [33], we used a mathematical derivation based on IOU to select the appropriate anchor box sizes for each scale.
There are two extreme cases, as shown in Figure 4. The red box represents the anchor box, the black box represents the ground truth box, and the green box is the grid cell of the feature map. Let $S_{anchor}$ be the side length of an anchor box, $S_{gt}$ the side length of the ground truth box, and $S_{grid}$ the side length of the grid cell in the feature map, and let $n$ be the number of downsampling layers.
Consider the extreme case in Figure 4a. We assume that the anchor boxes and the ground truth boxes are square and that the anchor box is bigger than the ground truth box ($S_{anchor} > S_{gt}$). The IOU between the anchor box and the ground truth box can be defined as
$$IOU = \frac{|A \cap G|}{|A \cup G|} = \frac{S_{gt}^2}{S_{anchor}^2} \tag{3}$$
where $A$ is the anchor box and $G$ is the ground truth box. The common criterion for accepting a prediction is that the IOU is at least 0.5 ($IOU \ge 0.5$). Then we get the following result:
$$S_{gt} < S_{anchor} \le \sqrt{2}\, S_{gt}$$
Consider the extreme case in Figure 4b. The center of the anchor box is at the upper left corner of the grid cell of the feature map, and the center of the ground truth box is at the bottom right corner of the grid cell. We assume that half the side length of both the anchor box and the ground truth box is longer than the side length of the grid cell in the feature map.
The IOU between the anchor box and the ground truth box can be expressed by (6). When $S_{anchor} = S_{gt}$, equality holds in inequality (7). From (3), the IOU between the anchor box and the ground truth box in Figure 4a can then reach 1. Requiring the IOU in (6) to be greater than 0.5, we obtain the following inequality:
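The geometry of Figure 4b can be worked through as follows. This is a sketch under stated assumptions, not a reproduction of the original equations: the boxes are axis-aligned squares of side $S_{anchor}$ and $S_{gt}$, the grid cell has side $2^n$ in input-image pixels (one halving per downsampling layer), the two centers are offset by $2^n$ along each axis, and $|S_{anchor} - S_{gt}| \le 2^{n+1}$ so that the overlap is a square of side $w$. The symbol $w$ and the intermediate steps are illustrative.

```latex
\begin{aligned}
w &= \tfrac{1}{2}\left(S_{anchor} + S_{gt}\right) - 2^{n} \\[4pt]
IOU &= \frac{w^{2}}{S_{anchor}^{2} + S_{gt}^{2} - w^{2}}
  \;\le\; \frac{w^{2}}{\tfrac{1}{2}\left(S_{anchor} + S_{gt}\right)^{2} - w^{2}},
  \quad \text{with equality iff } S_{anchor} = S_{gt} \\[4pt]
\text{For } S_{anchor} = S_{gt} = S:\quad
IOU \ge \tfrac{1}{2}
  &\iff 3\left(S - 2^{n}\right)^{2} \ge 2S^{2}
  \iff S \ge \left(3 + \sqrt{6}\right) 2^{n} \approx 5.45 \cdot 2^{n}
\end{aligned}
```

The upper bound uses $S_{anchor}^{2} + S_{gt}^{2} \ge \tfrac{1}{2}(S_{anchor} + S_{gt})^{2}$. Under these assumptions, even a perfectly sized anchor needs a side of at least about $5.45 \cdot 2^{n}$ to keep the IOU above 0.5 at the worst-case offset.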
Then we can obtain the side length $S_{anchor}$ and the area of an anchor box as a function of the number of downsampling layers $n$. This is shown in Table 2.
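A table of this kind could be generated along the following lines. The sketch assumes the bound $S \ge (3 + \sqrt{6}) \cdot 2^n$, which follows from requiring $IOU \ge 0.5$ for two equal squares whose centers are offset by $2^n$ pixels along each axis; the constant `FACTOR` and the function names are assumptions made for illustration, and the printed values are not claimed to reproduce Table 2.

```python
import math

# Assumed constant: smallest ratio of side length to grid-cell size that keeps
# IOU >= 0.5 for two equal squares offset by one grid cell along each axis.
FACTOR = 3 + math.sqrt(6)


def min_anchor_side(n):
    """Minimum anchor side length (in pixels) after n downsampling layers."""
    return FACTOR * 2 ** n


def min_anchor_area(n):
    """Minimum anchor area (in square pixels) after n downsampling layers."""
    return min_anchor_side(n) ** 2


if __name__ == "__main__":
    for n in range(2, 6):  # 4x, 8x, 16x, 32x downsampling (assumed range)
        print(2 ** n, round(min_anchor_side(n), 1), round(min_anchor_area(n)))
```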
With the results in Table 1, we obtained the appropriate size and area of the anchor boxes for each scale of the output detection layer. However, the clusters generated by K-means on each dataset are not square. For the cluster (63, 32) generated by K-means on the KITTI dataset, for example, the width and the height of the anchor box differ greatly. Moreover, the height of the anchor meets the requirement of the output detection layer downsampled by 4×, whereas the width meets the requirement of the detection layer downsampled by 8×, according to Table 1. To solve this problem, let the height and the width of the anchor box be $h$ and $w$; we use the value of $h \times w$ to determine which scale is suitable for this anchor. The cluster (63, 32) is suitable for the output detection layer downsampled by 8×.
The principle for selecting suitable anchor boxes for each scale of the output detection layer can therefore be summarized as follows:
According to the principle above, we allocate the clusters on the PASCAL VOC dataset to their suitable scales. This is shown in Table 3.
The allocation of the clusters on the KITTI dataset is shown in Table 4.
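The area-based allocation described in this section can be sketched as follows. The sketch assumes that a cluster is assigned to the coarsest detection layer whose minimum anchor area it still reaches, with the minimum side taken as $(3 + \sqrt{6}) \cdot 2^n$; the constant `FACTOR`, the function `assign_stride`, and the default set of downsampling levels are illustrative assumptions rather than the exact rule used for Tables 3 and 4.

```python
import math

FACTOR = 3 + math.sqrt(6)  # assumed ratio of minimum anchor side to grid-cell size


def assign_stride(width, height, levels=(2, 3, 4, 5)):
    """Return the downsampling factor 2**n of the layer assigned to a (width, height) cluster.

    levels lists the assumed numbers of downsampling layers n, in increasing order.
    """
    area = width * height
    chosen = levels[0]  # clusters below every threshold go to the finest layer
    for n in levels:
        if area >= (FACTOR * 2 ** n) ** 2:
            chosen = n
    return 2 ** chosen


print(assign_stride(63, 32))  # the KITTI cluster discussed in the text
```

Under these assumptions the cluster (63, 32) is assigned to the layer downsampled by 8×, which agrees with the allocation stated above.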