**4. Benchmark Experiments**

In this section, we first introduce the benchmark dataset and the evaluation metrics. We then detail the model configuration and discuss the experimental results.

### *4.1. Dataset*

The VisDrone2019 dataset [39] consists of 288 videos with 261,908 frames and 10,209 static images that do not overlap with the video frames. The data were collected with unmanned aerial vehicles such as the DJI Mavic and Phantom series (3, 3A, 3SE, 3P, 4, 4A, 4P), at different times of day. The video frames have a maximum resolution of 3840 × 2160, while the still images have a maximum resolution of 2000 × 1500. Some images from the dataset are shown in Figure 3. VisDrone2019 includes ten predefined object categories: *pedestrian, people, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle.* Only the training data are released by the contest organizers. In this paper, we use 56 clips of VID-train, with 24,313 frames of the VisDrone2019 dataset, as training data, and 7 clips of VID-val, with 2860 frames, for model evaluation.

**Figure 3.** Some example frames from videos in the VisDrone2019 dataset [39].

The number of object instances per class is imbalanced. In VID-train: *pedestrian* (234,305), *people* (94,396), *bicycle* (40,255), *car* (505,301), *van* (46,940), *truck* (30,498), *tricycle* (28,338), *awning-tricycle* (13,011), *bus* (9653), and *motor* (102,819). In VID-val: *pedestrian* (32,404), *people* (17,908), *bicycle* (6842), *car* (31,821), *van* (6842), *truck* (1359), *tricycle* (3769), *awning-tricycle* (1718), *bus* (264), and *motor* (12,025). The class distribution of VisDrone VID is depicted in Figure 4. As a quick glimpse through the training set, Figure 5 shows the wide variation in weather, light-source direction, and time of day, as well as in drone motion.
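For reference, the sketch below shows one way such per-class instance counts can be gathered from VisDrone-style video annotation files. It assumes comma-separated lines in which one column holds the integer category id; the column index, folder path, and file layout are assumptions for illustration, not the official toolkit.

```python
import csv
from collections import Counter
from pathlib import Path

# Category ids mapped to names (ids assumed to follow the dataset documentation).
CATEGORY_NAMES = {
    1: "pedestrian", 2: "people", 3: "bicycle", 4: "car", 5: "van",
    6: "truck", 7: "tricycle", 8: "awning-tricycle", 9: "bus", 10: "motor",
}

def count_instances(annotation_dir: str, category_column: int = 7) -> Counter:
    """Count object instances per category over all annotation files in a folder."""
    counts = Counter()
    for ann_file in Path(annotation_dir).glob("*.txt"):
        with open(ann_file, newline="") as f:
            for row in csv.reader(f):
                if len(row) > category_column:
                    cat_id = int(row[category_column])
                    counts[CATEGORY_NAMES.get(cat_id, f"id_{cat_id}")] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical directory name; adjust to the local dataset layout.
    print(count_instances("VisDrone2019-VID-train/annotations"))
```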

**Figure 4.** The class distribution in the VisDrone VID-train and VID-val datasets.

**Figure 5.** Some example images of the challenge.

### *4.2. Evaluation Metrics*

In this work, we use the Average Precision (AP) measurement [3,40], the commonly used metric for assessing object detection accuracy. Given two bounding boxes, one for the ground truth (the actual class label) and one for the detection result (the predicted class label), we use the Intersection over Union (IoU) to measure the similarity between the two boxes and to score the predicted box. It is computed as the area of the intersection of the two boxes (*Si* and *Sj*) divided by the area of their union. An IoU threshold *η* indicates whether the prediction counts as a detected object. If the predicted class label matches the actual class label and IoU > *η*, the detection is considered a positive; otherwise, it is considered a negative.

$$IoU = \frac{S_i \cap S_j}{S_i \cup S_j} \tag{6}$$
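A minimal sketch of Equation (6) for axis-aligned boxes is given below. The (x1, y1, x2, y2) corner format is an assumption for illustration; VisDrone annotations use top-left/width/height coordinates and would need converting first.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = area_a + area_b - intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two heavily overlapping boxes.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```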

The AP averages the precision over recall values from 0 to 1. The mean Average Precision (mAP) is computed by averaging the AP over all classes. Precision is the proportion of predicted bounding boxes that match the ground truth. Recall is the proportion of ground-truth objects that are correctly detected. For object detection, we report the performance results with AP (IoU = 0.50) and AP (IoU = 0.75). The AP [3] summarizes the shape of the precision/recall curve and is defined as the mean precision at a set of 11 equally spaced recall levels [0, 0.1, . . . , 1]:

$$AP = \frac{1}{11} \times \sum\_{r \in \{0, 0.1, \dots, 1\}} p\_{interp}(r) \tag{7}$$

where:

$$p_{interp}(r) = \max_{\tilde{r}:\, \tilde{r} \geq r} p(\tilde{r}) \tag{8}$$
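Equations (7) and (8) can be sketched as follows, assuming the precision/recall pairs have already been computed from a ranked list of detections; this is a simplified illustration, not the full evaluation protocol.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated AP (Eqs. 7-8): mean of the interpolated precision
    at recall levels 0.0, 0.1, ..., 1.0."""
    ap = 0.0
    for step in range(11):
        r = step / 10.0
        # p_interp(r) = max precision over all points with recall >= r (0 if none).
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(candidates, default=0.0) / 11.0
    return ap

# Toy precision/recall curve for illustration.
print(eleven_point_ap([0.1, 0.4, 0.7, 0.9], [1.0, 0.8, 0.6, 0.4]))  # ~0.64
```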

### *4.3. Model Configuration*

To ensure a fair comparison, i.e., to run each detection model at its best, we adjust the models' parameters following the recommendations of [41–43]. The detailed configuration is provided below.


Table 1 summarizes the detailed configuration. All models are trained or fine-tuned on a GeForce RTX 2080 Ti 11 GB GPU running Ubuntu 16.04.5 LTS.


**Table 1.** Configuration of SNIPER, RetinaNet, YOLOv3 and SSD.

### *4.4. Results*

As mentioned above, we benchmark six state-of-the-art methods: Faster R-CNN, RFCN, and SSD with default parameters, and SNIPER, YOLOv3, and RetinaNet with adjusted parameters, which are presented in detail in Table 1. Table 2 shows the training time of the methods. RFCN and YOLOv3 take the least training time, whereas SSD requires a remarkably long time to train. Table 3 shows the runtime performance of the different methods. SSD and YOLOv3 achieve the fastest running times, while RFCN processes only 1.75 frames per second. One-stage object detectors clearly run faster than two-stage ones.
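As a note on how runtime figures such as those in Table 3 are typically obtained, the sketch below times inference over a set of frames and reports frames per second. The `run_detector` callable and the warm-up count are hypothetical stand-ins, not part of the benchmarked frameworks.

```python
import time

def measure_fps(run_detector, frames, warmup=5):
    """Average frames per second of a detector callable over a list of frames.
    `run_detector` is a hypothetical stand-in for any benchmarked model;
    assumes len(frames) > warmup."""
    for frame in frames[:warmup]:
        run_detector(frame)            # warm-up iterations are excluded from timing
    start = time.perf_counter()
    for frame in frames[warmup:]:
        run_detector(frame)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed
```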

**Table 2.** Training time of Faster R-CNN, RFCN, SNIPER, RetinaNet, YOLOv3, and SSD with VisDrone2019 on a GeForce RTX 2080 Ti GPU.


**Table 3.** Runtime performance of Faster R-CNN, RFCN, SSD, YOLO, SNIPER, and RetinaNet with VisDrone2019 on a GeForce RTX 2080 Ti GPU.


Figure 6 visualizes the detection results of the benchmarked methods. For a closer look, Tables 4 and 5 show the detailed per-class results and the average performance with the IoU threshold set to 0.5 and 0.75, respectively. According to Table 4 (IoU = 0.5), CenterNet, RetinaNet, and SNIPER are the only three algorithms achieving more than a 25% mAP score. YOLOv3 ranks fourth with more than a 20% mAP score. SSD performs the worst, producing only a 10.80% mAP score, and RFCN performs better than Faster R-CNN. However, SSD ranks third with 9.10% on the *bus* class. As seen in Table 5 (IoU = 0.75), SNIPER, CenterNet, and RetinaNet are the only three algorithms achieving more than an 11% mAP score. RFCN ranks fourth with more than an 8% mAP score. We observe that YOLOv3 performs the worst (3.20% mAP score), while Faster R-CNN performs better than SSD.

**Figure 6.** Visualization results of different object detection methods. Color legend: car, truck, bicycle, van, motor, pedestrian, people, bus, tricycle, awning-tricycle. Best viewed at 400% zoom.

Regarding performance, YOLOv3 performs well with 7.5 FPS and 25.08 mAP (IoU = 0.5), but its score drops rapidly to 3.2 mAP (IoU = 0.75) because YOLOv3 does not localize objects precisely. Instead, YOLOv3 is well known for its runtime performance. When running detection with YOLOv3, we observed small confidence scores for each object (<30%) due to the similarity of features. Therefore, we set a low confidence threshold when using YOLOv3 for object detection. Meanwhile, CenterNet, RetinaNet, and SNIPER achieve better detection results.
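To accommodate YOLOv3's low per-object confidence scores noted above, detections can simply be filtered with a deliberately low threshold before use; the detection tuple layout and threshold value below are assumptions for illustration, not YOLOv3's own API.

```python
def filter_detections(detections, conf_threshold=0.25):
    """Keep detections whose confidence exceeds the (deliberately low) threshold.
    Each detection is assumed to be (class_name, confidence, x1, y1, x2, y2)."""
    return [d for d in detections if d[1] >= conf_threshold]
```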

**Table 4.** The AP (IoU = 0.50) scores on the VisDrone2019 Validation set of each object category. The top three results are highlighted in red, blue and green fonts.



**Table 5.** The AP (IoU = 0.75) scores on the VisDrone2019 Validation set of each object category. The top three results are highlighted in red, blue and green fonts.

### *4.5. Analysis of Feature Maps Extraction*

Figure 7 depicts the feature maps of Faster R-CNN, RFCN, SSD, YOLO, SNIPER, RetinaNet, and CenterNet on a real aerial image. We extract the feature maps at the final convolutional layer. These feature maps offer deeper insight into how the different methods capture object features from the aerial viewpoint.
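A minimal PyTorch sketch of how such feature maps can be captured at a chosen convolutional layer with a forward hook is shown below. The ResNet-50 backbone, layer name, and input size are placeholders for illustration, not the exact layers of each benchmarked detector.

```python
import torch
import torchvision

# Placeholder backbone; in practice, hook each detector's own final conv layer.
# (Older torchvision versions use pretrained=False instead of weights=None.)
model = torchvision.models.resnet50(weights=None).eval()
captured = {}

def hook(module, inputs, output):
    captured["feature_map"] = output.detach()

# Register the hook on the last convolutional stage.
model.layer4.register_forward_hook(hook)

with torch.no_grad():
    model(torch.randn(1, 3, 512, 512))      # dummy aerial frame

fmap = captured["feature_map"]               # shape: (1, 2048, 16, 16)
heatmap = fmap.abs().mean(dim=1)             # channel-wise average for visualization
print(heatmap.shape)
```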

As seen in Figure 7, SNIPER, RetinaNet, and CenterNet produce the most informative feature maps. The object shapes are well captured with larger activation values, whereas the background receives smaller values. This is because the focal loss is adept at learning imbalanced classes (foreground/background), while chip mining relies on a proposal network trained with a short training schedule to identify regions where objects are likely to be present. At the same time, the keypoint estimation of CenterNet is the essential factor that facilitates finding center points and regressing all other object properties, such as size, location, and orientation.

Regarding YOLO and SSD: in the feature map obtained by SSD, the object shapes are not clear and the edges of objects are not preserved. This explains why the method detects aerial objects inaccurately, which we ascribe to the shallow layers used for prediction in the network. The feature maps extracted by YOLO are better than SSD's, with object regions standing out more from the background; however, the edges of the objects are still blurred.

As for the Faster R-CNN and RFCN feature maps, the object shapes are well preserved but are not clearly distinguished from the background. This is due to the variation in viewing angles of the images, which is an obstacle for early feature extractors.

**Figure 7.** Visualization of feature maps from different state-of-the-art object detection models ((**e**) SSD; (**f**) YOLOv3; (**g**) RetinaNet; (**h**) CenterNet).

### *4.6. Discussion*

SSD is a unified object detector that adopts a multi-scale approach. SSD uses a VGG16 network as the feature extractor and appends extra convolutional layers, which progressively reduce the spatial dimension and resolution. To detect multi-scale objects, SSD makes independent detections from multiple feature maps. Aspect ratios in SSD are used as anchor-box scaling factors, so we widen the ratio range to ensure that most objects can be captured. The higher-resolution feature maps are responsible for detecting small objects; the first layer used for detection is conv4\_3, which has a spatial dimension of 38 × 38, a considerable reduction from the original input image. Furthermore, small objects can only be detected in the earliest (highest-resolution) feature maps. However, those maps contain low-level features, such as edges or color patches, that are less informative for classification. Shallow layers in a neural network may not generate enough high-level features to predict small objects [27]. Therefore, SSD usually performs worse on small objects than the other detection methods. Although SSD ranks second to last overall, it achieves second place with 12.7% on the *bicycle* class, and third place with 5.5% on the *people* class, 9.1% on the *awning-tricycle* class, 9.10% on the *bus* class, and 4.6% on the *motor* class.
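A rough sketch of SSD-style default-box generation with a widened aspect-ratio set, as described above, is given below. The scale value and ratio set are illustrative assumptions, not the exact configuration reported in Table 1.

```python
import itertools
import math

def default_boxes(feature_size, scale, aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """Generate SSD-style default boxes (cx, cy, w, h), normalized to [0, 1],
    for one square feature map; the widened ratio set helps cover elongated objects."""
    boxes = []
    for i, j in itertools.product(range(feature_size), repeat=2):
        cx = (j + 0.5) / feature_size
        cy = (i + 0.5) / feature_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

# e.g., a conv4_3-like 38 x 38 map with a small scale for small objects.
print(len(default_boxes(38, scale=0.1)))  # 38 * 38 * 5 boxes
```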

SSD is competitive with Faster R-CNN and RFCN on larger objects but has poor performance on small objects. Faster R-CNN integrates a Region Proposal Network (RPN) into Fast R-CNN. The RPN produces box proposals based on the feature extractor; these proposals are used to crop features from the same intermediate feature map, which are then fed to the remainder of the feature extractor to predict a class and refine the box for each proposal. RFCN is similar to Faster R-CNN but crops features from the last feature layer before prediction to reduce the amount of computation, and it proposes a position-sensitive mechanism to retain translation variance in the localization representations. Faster R-CNN obtains a mAP of 17.51% and ranks third with 16.96% on the *truck* class. RFCN performs much better than Faster R-CNN and SSD, producing 19.55% AP. RFCN ranks first with 28.10% on the *truck* class and 13.28% on the *bus* class, and third with 19.45% on the *people* class and 15.21% on the *awning-tricycle* class.

As far as the outstanding detectors are concerned, CenterNet, RetinaNet, and SNIPER are the three algorithms that top the statistics at both IoU thresholds (0.5 and 0.75). CenterNet ranks first with 32.28%, followed by RetinaNet with 28.26%, when the IoU threshold is 0.5. Interestingly, when the IoU threshold is 0.75, which favors highly accurate localization, SNIPER overtakes CenterNet and achieves the best performance.

In particular, CenterNet achieves outstanding results on the benchmark dataset at both IoU thresholds. Note that the CenterNet detector builds on keypoint estimation networks: it finds object centers and regresses to their size. The experimental results show that CenterNet works well with the smaller IoU threshold of 0.5. Regarding SNIPER, boxes whose square root of area lies within the valid range of a scale are marked as valid at that scale. We therefore increase the valid range so that objects of various sizes (small, medium, and large) can be detected, as shown in Table 1. At the same time, chip mining plays an important role in eliminating regions that are likely to contain only background, and this measure adapts to each viewpoint, alleviating the drawback of diverse scales. As a result, these enhancements, together with the pyramid feature map, allow SNIPER to surpass other detectors in terms of average precision. In particular, at the 0.75 IoU threshold, SNIPER outperforms YOLOv3, with 16.96% and 3.2%, respectively. This is mainly because YOLOv3 is inferior in terms of localization. Regarding RetinaNet, changing the anchors yields an increase of 2.3% AP for the *people* class. The scale adjustment widens the variety of scaling factors used per anchor location, which improves detection of objects with diverse sizes. Concurrently, the focal loss is designed to address the severe imbalance between foreground and background classes during training; as a result, this approach can tackle the imbalance of the dataset, in which the number of training samples for the car and pedestrian classes outnumbers those of the other classes.
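For reference, a minimal PyTorch sketch of the focal loss mentioned above is given below for binary foreground/background classification; the α and γ values follow commonly used defaults, and this is an illustration rather than RetinaNet's full multi-class implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy (well-classified) examples so that
    training focuses on the rare, hard foreground samples."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1.0 - prob) * (1.0 - targets)        # prob of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)  # class balancing factor
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

# Toy usage: 8 anchors, mostly background (label 0), one foreground (label 1).
logits = torch.randn(8)
targets = torch.tensor([0., 0., 0., 1., 0., 0., 0., 0.])
print(focal_loss(logits, targets))
```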

### **5. Conclusion and Future Work**

In this paper, we evaluated state-of-the-art object detection methods, namely Faster R-CNN, RFCN, SSD, YOLO, SNIPER, RetinaNet, and CenterNet, on aerial images. Among them, CenterNet, SNIPER, and RetinaNet achieve the best performance in terms of average precision. Meanwhile, YOLO is the optimal choice for real-time object detection applications that require a high FPS and moderate detection precision. We note the main challenges of the problem, such as occlusion, scale, and class imbalance. From the aerial view, many objects are occluded and their sizes vary widely. We also observe the class imbalance issue during training; for example, most of the detectors perform much better on the car and pedestrian classes than on the awning-tricycle, tricycle, and bus classes because many more instances are collected for the car and pedestrian classes.

In the future, we would like to investigate the fusion of different object detectors to further boost the state-of-the-art performance. In addition, we are interested in the task of aerial image segmentation; the bounding boxes provided by object detectors are very useful for this task. We also consider adopting transfer learning [44] to improve training efficiency.

**Author Contributions:** K.N. is responsible for discussion, paper writing and revising. N.T.H. and P.C.N. focus on training model, benchmark evaluation, and paper writing. K.-D.N. and N.D.V. are in charge of ideas, evaluation and paper writing. T.V.N. is responsible for ideas, discussion, paper structure, paper writing and revising. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is funded by Vietnam National University HoChiMinh City (VNU-HCM) under grant number C2018-26-03.

**Acknowledgments:** We are grateful to the NVIDIA corporation for supporting our research in this area. The authors would like to thank the editors and the reviewers for their professional suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
