**1. Introduction**

Object detection is a fundamental yet difficult task in image processing and computer vision research, and it has been an important research topic for decades. Its development over the past two decades can be regarded as an epitome of computer vision history [1]. Since it plays a principal role in understanding the content of images, object detection is considered a prerequisite step that enables a computer to detect various objects. Given a test image, object detection localizes the objects by their coordinates and assigns each object a label according to its category, e.g., human, dog, or cat. The coordinates of a detected object represent the object's bounding box [2,3]. Object detection has many applications in robot vision, autonomous driving, human-computer interaction, and intelligent video surveillance. Deep-learning technology has brought significant breakthroughs in recent years; in particular, these techniques have produced remarkable progress in object detection. Object detection can target a specific instance, e.g., Obama's face, the Eiffel Tower, or the Golden Gate Bridge, or objects of specific categories, e.g., humans, cars, or bicycles. Historically, object detection has mainly focused on detecting a single category, for example, the person class [4]. In recent years, the research community has started moving towards categories beyond the well-known ones such as person, cat, or dog. Common challenges that object detectors face on aerial images include viewpoint changes, illumination changes, scale variations, perspective distortion, intra-class variations, low resolution, and occlusions. For example, the main challenges in pedestrian detection come from crowded scenes with heavy overlaps, occlusion, and low-resolution images.

Generic object detection has received significant attention, with many competition benchmarks, e.g., PASCAL-VOC [5,6], the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [2], MS-COCO [7], and VisDrone-DET [8]. There are also several notable studies on specific object detection tasks such as face detection [9], pedestrian detection [10], and vehicle detection [11]. Recently, the research community has focused on deep learning and its applications to object recognition/detection tasks. In the past few years, Convolutional Neural Networks (CNNs) [12,13] have brought breakthroughs in speech, audio, image, and video processing, and have driven notable progress in visual recognition and object detection. Many successful CNN architectures, e.g., OverFeat [14], R-CNN [15], Fast R-CNN [16], Faster R-CNN [17], SSD [18], RFCN [19], YOLO [20], YOLOv2 [21], RetinaNet [22], YOLOv3 [23], and SNIPER [24], have performed well on the task of object detection. For example, SNIPER, RetinaNet, and YOLOv3 are among the top models for object detection on the MS-COCO dataset [7], with mAPs of 46.1%, 40.8%, and 33.0%, respectively.

Object detection is now widely used in many practical scenarios [1], with use cases ranging from protecting personal security to boosting productivity in the workplace. While detection from normal viewpoints has received most of the attention, recent years have seen increasing interest in flying drones and their applications in healthcare, video surveillance, search-and-rescue, and agriculture. Drones are now common devices that enable us to record or capture scenes from a bird's-eye view. Visual object detection is an essential component of drone applications. However, object detection in this setting is very challenging, since video sequences and images captured by drones vary significantly in scale, perspective, and weather conditions. Aerial images are often noisy and blurred due to drone motion, and the ratio of object size to image resolution is small. Therefore, in this paper, we investigate the performance of state-of-the-art object detectors on aerial images. Please note that this paper is an extension of our earlier version, which received the best paper award at the MITA 2019 conference [25]. Our contributions are three-fold.


The remainder of the paper is organized as follows. In Section 2, the related works are presented. Section 3 and Section 4 present the benchmarked methods and the experimental results, respectively. Finally, Section 5 concludes our work.

### **2. Related Work**

### *2.1. CNN Models*

CNN-based architectures have been backbones in many detection frameworks. Popular architectures include AlexNet [12], ZFNet [26], VGGNet [27], GoogLeNet [28], ResNet [29], and DenseNet [30]. We briefly introduce these models as follows.

Known as a pioneering work, AlexNet [12] consists of eight layers: five convolutional layers (conv1, conv2, conv3, conv4, and conv5) and three fully connected layers (fc6, fc7, and fc8). The fc8 layer is a SoftMax classifier. Some convolutional layers are connected directly (conv3–conv4 and conv4–conv5), while others are connected via an overlapping max-pooling layer (conv1–conv2, conv2–conv3, and conv5–fc6). AlexNet uses 11 × 11, 5 × 5, and 3 × 3 kernels. Later, Reference [26] proposed ZFNet by modifying AlexNet; ZFNet uses 7 × 7 kernels in its first convolutional layer, since smaller kernels retain more information than larger ones. By proposing a deeper network, VGGNet [27] outperforms AlexNet in the image classification task. Similarly, GoogLeNet [28] won the ILSVRC2014 competition by increasing the number of layers; in particular, it includes 22 layers: 21 convolutional layers and one fully connected layer.

As the number of layers in CNN architectures increases, the networks require more computational resources, and several problems arise: gradient vanishing, gradient exploding, and degradation. Degradation occurs when adding more layers to a deep network: the accuracy becomes saturated and then decreases rapidly. To overcome this problem, ResNet [29] introduces residual blocks. In a residual block, each layer is fed directly to layers about 2–3 hops away using skip-connections.
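As an illustration, a residual block can be sketched as follows (a minimal PyTorch-style sketch with hypothetical layer sizes, not the exact block configuration used in ResNet-101 [29]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Minimal residual block: two 3x3 convolutions with a skip-connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # skip-connection: the input is added to the block output
```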

DenseNet [30] was proposed by Huang et al. in 2017. It consists of multiple dense blocks. A dense block consists of composite layers that are densely connected together: the input of one layer is the concatenation of the outputs of all previous layers, so feature information is shared and reused.
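A dense block can be sketched in the same spirit; every composite layer receives the concatenation of all preceding feature maps (again a simplified illustration, not the exact DenseNet configuration of [30]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    """Simplified dense block: every layer sees the outputs of all previous layers."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1)
            for i in range(num_layers))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = F.relu(layer(torch.cat(features, dim=1)))  # concatenate all previous outputs
            features.append(out)
        return torch.cat(features, dim=1)
```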

### *2.2. Object Detection Methods*

Object detection methods are mainly divided into one-stage and two-stage frameworks. Two-stage frameworks are generally more accurate, whereas one-stage frameworks usually achieve real-time detection. The two-stage approach includes two steps: the first stage creates region proposals and the second stage classifies them. The one-stage approach predicts object regions and object classes at the same time.

**CNN-based Two-Stage frameworks**: Two-stage frameworks mainly include R-CNN [15], SPP-Net [31], Fast R-CNN [16], Faster R-CNN [17], RFCN [19], Mask R-CNN [32], and SNIPER [24].

R-CNN [15] detects objects based on an ImageNet pre-trained model. R-CNN uses the Selective Search algorithm to generate region proposals; these regions are then warped and fed into the pre-trained model to extract high-level features. Finally, several SVM classifiers are trained on these features to identify the object classes. Fast R-CNN [16] was introduced to address some of R-CNN's limitations, particularly its computational cost. Fast R-CNN feeds the whole image into a ConvNet to create a convolutional feature map, instead of processing around 2000 regions separately as R-CNN does.

Faster R-CNN [17] proposes a Region Proposal Network (RPN) to generate region proposals instead of the Selective Search algorithm used in R-CNN and Fast R-CNN. Faster R-CNN is about 10× faster than Fast R-CNN and 250× faster than R-CNN in inference time. RFCN introduces position-sensitive score maps, which improve speed while remaining as accurate as Faster R-CNN. Mask R-CNN extends Faster R-CNN with instance segmentation and introduces ROI Align pooling.

**CNN-based one-stage frameworks**: The most common examples of one-stage detection frameworks are YOLO, YOLOv2, YOLOv3 [23], SSD [18], and RetinaNet. You Only Look Once (YOLO [20]) is one of the first one-stage detectors. Unlike the R-CNN family, YOLO does not use a region-proposal component; instead, it learns to regress bounding-box coordinates and class probabilities directly from image pixels, which significantly speeds up the detection process. The Single-Shot MultiBox Detector (SSD [18]) is also a one-stage detector aimed at high-speed object detection. Unlike YOLO, however, SSD adopts a multi-scale approach: it appends several convolutional layers whose feature maps decrease in size sequentially. This can be regarded as a pyramid representation of the image, in which the earlier levels contain feature maps that are useful for detecting small objects while the deeper levels are expected to detect larger objects. Each of these layers has a set of predefined anchor boxes (also known as default boxes or prior boxes) for every cell, and the model learns to predict the offsets with respect to the matching anchor boxes. This approach yields an efficient detector for objects of various sizes while maintaining a low inference time.

### **3. Benchmarked State-of-the-Art Object Detection Methods**

In this section, we provide further details of the benchmarked object detection methods. Figure 1 visualizes the methods in the chronological order. Meanwhile, Figure 2 depicts the framework structures of different state-of-the-art object detectors.

**Figure 1.** Timeline of state-of-the-art object detection methods. The benchmarked methods are marked in red and boldfaced font.

**Figure 2.** The benchmarked methods adopted in this paper: Faster R-CNN, SSD, RFCN, RetinaNet, SNIPER, YOLOv3, and CenterNet.

### *3.1. Faster R-CNN*

Faster R-CNN [17] is an extension of the R-CNN [15] and Fast R-CNN [16] methods for object detection. R-CNN requires a forward pass of the CNN for each of around 2000 region proposals (ROIs) for every single image. Fast R-CNN later alleviated this problem by sharing the convolutional computation among proposals through a common feature map. The detection process is sped up, but it still depends on external region-proposal methods such as Selective Search or Edge Boxes. To remove this bottleneck, Ren et al. [17] introduced the Region Proposal Network (RPN).

Faster R-CNN consists of two main components, namely the RPN and the Fast R-CNN detector. The RPN places reference boxes (anchors) of diverse scales and aspect ratios at each convolutional feature-map location. Each anchor is mapped to a feature vector, which is fed into two sibling fully connected layers: an object-category classification layer and a box-regression layer. Faster R-CNN enables highly efficient region-proposal computation because the RPN shares convolutional features with the Fast R-CNN detector. With an image of arbitrary size as input, the RPN is trained end-to-end to generate high-quality region proposals. The Fast R-CNN detector then uses the ROI pooling layer to extract features from each candidate box and performs object classification and bounding-box regression. The entire system is a single, unified network for object detection.
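To make the anchor mechanism concrete, the sketch below enumerates reference boxes over a feature map; the function name, scales, and ratios are illustrative assumptions rather than the exact configuration of [17]:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate reference boxes (x1, y1, x2, y2) at every feature-map location.

    `ratios` are height/width aspect ratios, so each anchor keeps roughly the same area.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # center in image coordinates
            for s in scales:
                for r in ratios:
                    h, w = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)  # shape: (feat_h * feat_w * 9, 4)

anchors = generate_anchors(38, 50)  # e.g., a ~600x800 image with stride 16
```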

### *3.2. RFCN: Region-Based Fully Convolutional Networks*

A limitation of Faster R-CNN is that it does not share computation after ROI pooling, although the amount of computation should be shared as much as possible. Faster R-CNN overcomes the limitations of Fast R-CNN, but it still contains several non-shared fully connected layers that must be computed for each of hundreds of proposals. The Region-based Fully Convolutional Network (RFCN) [19] was proposed as an improvement to Faster R-CNN. It adopts a shared, fully convolutional architecture. In RFCN, the fully connected layers after ROI pooling are removed, and all other layers are moved before the ROI pooling to generate the score maps. RFCN infers 2.5 to 20 times faster than Faster R-CNN while maintaining competitive accuracy.

**Backbone Architecture**. The incarnation of RFCN in [19] is based on the ImageNet pre-trained ResNet-101 model. To compute feature maps, the average-pooling layer and the *fc* layer are removed; RFCN only uses the convolutional layers and attaches a randomly initialized 1024-d 1 × 1 convolutional layer to reduce the dimension of the last convolutional block in ResNet-101. In addition, RFCN uses a $k^2(C + 1)$-channel convolutional layer to generate the score maps.

**Position-sensitive score maps and Position-sensitive ROI pooling**. RFCN regularly divides each ROI rectangle into $k \times k$ bins, so each bin has a size of approximately $\frac{w}{k} \times \frac{h}{k}$ for an ROI rectangle of size $w \times h$. For each category, the last convolutional layer produces $k^2$ score maps. Inside the $(i, j)$-th bin ($0 \le i, j \le k - 1$), a position-sensitive ROI pooling operation pools only over the $(i, j)$-th score map:

$$r_c(i, j \mid \Theta) = \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i,j,c}\left(x + x_0,\, y + y_0 \mid \Theta\right) / n \tag{1}$$

For the $c$-th category, $r_c(i, j)$ is the aggregated response in the $(i, j)$-th bin; $z_{i,j,c}$ is one score map among the $k^2(C + 1)$ maps; $(x_0, y_0)$ denotes the top-left corner of the ROI; $n$ is the number of pixels in the bin; $\Theta$ is the set of learnable parameters; and the $(i, j)$-th bin spans $\lfloor i\frac{w}{k} \rfloor \le x < \lceil (i + 1)\frac{w}{k} \rceil$ and $\lfloor j\frac{h}{k} \rfloor \le y < \lceil (j + 1)\frac{h}{k} \rceil$.

The $k^2$ position-sensitive scores are then averaged (voted) to obtain a $(C + 1)$-dimensional vector for each ROI, $r_c(\Theta) = \sum_{i,j} r_c(i, j \mid \Theta)$. The SoftMax responses across categories are computed as $s_c(\Theta) = e^{r_c(\Theta)} / \sum_{c'=0}^{C} e^{r_{c'}(\Theta)}$.

For bounding-box regression, RFCN follows Fast R-CNN: besides the $k^2(C + 1)$-d convolutional layer, a sibling $4k^2$-d convolutional layer is appended. Position-sensitive ROI pooling on this bank of $4k^2$ maps produces a $4k^2$-d vector for each ROI, which is aggregated by average voting into a 4-d vector parameterizing the bounding box as $t = (t_x, t_y, t_w, t_h)$.
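The following NumPy sketch illustrates position-sensitive ROI pooling (Equation (1)) and average voting for a single ROI and a single category; the function name and the bin-rounding details are simplifying assumptions:

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k):
    """Position-sensitive ROI pooling for one ROI and one category.

    score_maps: array of shape (k*k, H, W) -- the k^2 score maps of that category.
    roi: (x0, y0, w, h) in feature-map coordinates.
    Returns the k x k pooled responses and their average-voted class score.
    """
    x0, y0, w, h = roi
    pooled = np.zeros((k, k))
    for i in range(k):          # bin index along x
        for j in range(k):      # bin index along y
            # Bin (i, j) pools only over its own score map.
            xs = slice(int(np.floor(x0 + i * w / k)), int(np.ceil(x0 + (i + 1) * w / k)))
            ys = slice(int(np.floor(y0 + j * h / k)), int(np.ceil(y0 + (j + 1) * h / k)))
            values = score_maps[i * k + j][ys, xs]
            pooled[i, j] = values.mean() if values.size else 0.0  # divide by n = bin size
    return pooled, pooled.mean()  # average voting over the k^2 bins
```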

**Training**. The loss function is defined on each ROI and calculated as the sum of the cross-entropy loss and the box-regression loss:

$$L(s, t_{x,y,w,h}) = L_{cls}(s_{c^*}) + \lambda \left[c^* > 0\right] L_{reg}(t, t^*), \tag{2}$$

where $c^*$ is the ground-truth label of the ROI. For classification, $L_{cls}(s_{c^*}) = -\log(s_{c^*})$ is the cross-entropy loss, $L_{reg}$ is the bounding-box regression loss, and $t^*$ is the ground-truth box. The indicator $[c^* > 0]$ equals 1 if the argument is true and 0 otherwise.

**Inference**. Feature maps are computed on an image with a single scale of 600 and are shared between the RPN and the RFCN head (as shown in Figure 2). The RPN proposes ROIs, on which the RFCN part evaluates the score maps and regresses the bounding boxes.

### *3.3. SNIPER: Scale Normalization for Image Pyramids with Efficient Resampling*

SNIPER is an efficient multi-scale training method for instance-level recognition tasks such as object detection and instance segmentation [24]. Instead of processing every pixel of an image pyramid, as Scale Normalization (SN) does, SNIPER processes context regions around the ground-truth instances (called chips) at the appropriate scale. This greatly speeds up training because it operates on low-resolution chips. Thanks to its memory-efficient design, SNIPER benefits from batch normalization during training without having to synchronize batch-normalization statistics across GPUs.

### 3.3.1. Chip Generation

SNIPER generates chips $C_i$ at multiple scales $\{s_1, s_2, \ldots, s_i, \ldots, s_n\}$ of the image. For each scale, the image is first resized to width $W_i$ and height $H_i$. On this canvas, $K \times K$ pixel chips are placed at equal intervals of $d$ pixels.
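As a rough illustration, chip positions on one resized canvas could be enumerated as follows (a simplified sketch; the chip size K, stride d, and canvas size are hypothetical values, not the settings of [24]):

```python
import numpy as np

def generate_chips(width, height, K=512, d=32):
    """Place K x K chips at equal intervals of d pixels on a (width x height) canvas."""
    xs = np.arange(0, max(width - K, 0) + 1, d)
    ys = np.arange(0, max(height - K, 0) + 1, d)
    return [(x, y, x + K, y + K) for y in ys for x in xs]

chips = generate_chips(1400, 800)  # hypothetical canvas size for one pyramid scale
```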

### 3.3.2. Chip Selection

Positive chips are chosen greedily so as to cover the maximum number of valid ground-truth boxes; a ground-truth box is said to be covered if it is completely enclosed inside a chip. Although positive chips cover all the positive instances, a significant portion of the background is not covered by them. In a conventional multi-scale training architecture, every pixel of the image is processed at all scales. To sample background efficiently, SNIPER instead uses object proposals to identify areas where objects are likely to be present; a region without any region proposals is considered to be background.

### *3.4. SSD: Single-Shot Detector*

SSD [18] is a one-stage solution, which has tremendously reduced inference time and resulted in an accurate, high speed detector that can be used for real-time video processing.

### 3.4.1. Base Network VGG-16

SSD is built on top of the VGG-16 base network [27], which focuses on simplicity and depth. VGG-16 contains 16 weight layers (13 convolutional and 3 fully connected) and uses only 3 × 3 convolution filters to extract features. As the model goes deeper, the number of filters doubles after each max-pooling layer. Noticeably, convolutional layers of the same type are grouped together, as shown in Figure 2. At the end, there are three fully connected layers: the first two with 4096 channels each, and the last with 1000 channels, one per class, followed by a SoftMax layer that returns the classification results. The model works well on classification and localization tasks and achieved 89% mAP on the PASCAL-VOC 2007 dataset. Although VGG-16 is not as fast as newer models, its architecture has been reused in many detectors because of its valuable extracted features.

### 3.4.2. Model Architecture

SSD extends the pre-trained VGG-16 model (on ImageNet [10]) by adding new convolutional layers conv8\_2, conv9\_2, conv10\_2, and conv11\_2, in addition to using the modified conv4\_3 and fc\_7 layers, to extract useful features. Each layer is designed to detect objects at a certain scale using *k* anchor boxes per cell, where 4*k* offsets and *c* class probabilities are computed using 3 × 3 filters. Thus, given a feature map of size $m \times n$, the total number of outputs is $kmn(c + 4)$. The anchor boxes are chosen manually. Here, we use the original formula, as follows, to calculate the anchor-box scales at different levels.

$$s\_k = s\_{\rm min} + \frac{s\_{\rm max} - s\_{\rm min}}{m - 1}(k - 1) \tag{3}$$

for $k \in [1, m]$, where $s_{\min} = 0.2$ and $s_{\max} = 0.9$.
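As a quick check, the following snippet evaluates Equation (3); for example, with m = 6 feature maps the scales are spread evenly between 0.2 and 0.9:

```python
def anchor_scales(m, s_min=0.2, s_max=0.9):
    """Anchor-box scales for feature maps k = 1..m (Equation (3))."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print(anchor_scales(6))  # approximately [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```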

### *3.5. RetinaNet*

RetinaNet [22] is another one-stage detector. It aims to tackle the class-imbalance problem between foreground and background that remains in one-stage detectors. RetinaNet relies on two main techniques: an FPN backbone and the focal loss as its loss function. The FPN is built on top of a convolutional neural network and is responsible for extracting convolutional feature maps from the entire image. By using the focal loss, RetinaNet re-weights the loss so that training focuses on hard, misclassified examples, which improves prediction accuracy. With ResNet-FPN as a backbone for feature extraction and two dedicated subnetworks for classification and bounding-box regression, RetinaNet achieves state-of-the-art performance.

### 3.5.1. Class Imbalance

As a one-stage detector, RetinaNet evaluates a much larger set of candidate object locations, regularly sampled across an image (∼100 k locations) and densely covering spatial positions, scales, and aspect ratios. As a result, easily classified background examples dominate the training procedure. Bootstrapping or hard example mining is typically used to address this problem, but these techniques are not efficient enough. Instead, RetinaNet proposes a new loss function that adaptively tunes the contribution of easy and hard examples during training.

### 3.5.2. Focal Loss

Focal loss is computed by adding a modulating factor $(1 - p_i)^{\gamma}$ to the cross-entropy loss:

$$FL = -\sum_{i=1}^{k} \left( y_i (1 - p_i)^{\gamma} \log(p_i) + (1 - y_i)\, p_i^{\gamma} \log(1 - p_i) \right) \tag{4}$$
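A minimal sketch of the focal loss for binary labels is given below; it follows Equation (4) and omits the α balancing factor used in [22]:

```python
import torch

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by a modulating factor (Equation (4)).

    p: predicted probabilities in (0, 1); y: binary ground-truth labels of the same shape.
    Easy examples (p close to y) are down-weighted, so hard examples dominate training.
    """
    loss = -(y * (1 - p) ** gamma * torch.log(p)
             + (1 - y) * p ** gamma * torch.log(1 - p))
    return loss.sum()

p = torch.tensor([0.9, 0.1, 0.6])
y = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(p, y))  # the hard example (p = 0.6, y = 1) contributes most
```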

### 3.5.3. RetinaNet Detector Architecture

RetinaNet adopts ResNet for deep feature extraction and uses a Feature Pyramid Network (FPN) [33] on top of it to build a rich, multi-scale feature pyramid from a single-resolution input image. FPN combines low-resolution, semantically strong features with high-resolution, semantically weak features through a top-down pathway and lateral connections.

### *3.6. YOLO: You Only Look Once*

YOLO is an object detection system targeted at real-time processing. Recently, the third version, YOLOv3, was published; it is extremely fast and accurate. In terms of mAP measured at 0.5 IOU, YOLOv3 is on par with the focal-loss-based RetinaNet but about 4× faster.

YOLOv3 takes an input image and predicts 3D tensors at three different scales, with each scale dividing the image into N × N grid cells. During training, the grid cell that contains an object's center is responsible for detecting that object. Each grid cell is also assigned 3 prior (anchor) boxes of various sizes. Finally, non-maximum suppression is applied to select the best boxes.

### 3.6.1. Feature Extraction

YOLOv3 uses a variant of Darknet, which originally has a 53-layer network trained on ImageNet. According to [23], Darknet-53 is better than ResNet-101 and 1.5 × faster. Darknet-53 has a performance similar to ResNet-152 and is 2 × faster.

### 3.6.2. Detection at Three Scales

YOLOv3 differs from its predecessors in that it performs detection at three different scales. The detection is done by applying 1 × 1 detection kernels on feature maps of three different sizes at three different places in the network. These three scales are obtained by downsampling the dimensions of the input image by factors of 32, 16, and 8, respectively. Detecting at different layers helps address the problem of detecting small objects, a common issue in YOLOv2.
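For instance, assuming the common 416 × 416 input resolution and 80 object classes, the three output grids and tensor depths can be computed as follows (an illustrative sketch, not code from [23]):

```python
def yolo_grid_shapes(input_size=416, num_classes=80, anchors_per_scale=3):
    """Output grid sizes and tensor depths for YOLOv3's three detection scales."""
    shapes = []
    for stride in (32, 16, 8):  # coarse -> fine detection scales
        g = input_size // stride
        # Each cell predicts 3 boxes x (4 box offsets + 1 objectness + num_classes)
        shapes.append((g, g, anchors_per_scale * (5 + num_classes)))
    return shapes

print(yolo_grid_shapes())  # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
```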

### 3.6.3. Objective Score and Confidences

The objectness score represents the probability that an object is contained inside a bounding box; its value ranges from 0 to 1 and is computed with a sigmoid. The class confidences represent the probability that a detected object belongs to a particular class. In YOLOv3, Non-maximum Suppression (NMS) is applied to the class scores to alleviate the problem of multiple detections of the same object.
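A greedy NMS procedure can be sketched as follows (an illustrative NumPy implementation, not the exact variant used in YOLOv3):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-maximum Suppression: keep the best box, drop heavily overlapping ones.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidence scores.
    """
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the current best box and the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # keep only boxes that overlap little
    return keep
```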

### *3.7. CenterNet*

Anchor-based one-stage and two-stage detectors share a limitation: anchor boxes are designed with manually chosen proportions, are sensitive to the data, and remain fixed during training; they incur a high computational cost, and the anchors are not always accurate. To address this, a series of anchor-free methods [34–36] has recently been proposed. CenterNet [37] is a one-stage, anchor-free detector. Reference [37] proposed a new center-based framework built on a single Hourglass network without the FPN structure [38]. An object is represented by the center point of its bounding box, and other properties, such as object size, dimension, and pose, are obtained by regression.

### 3.7.1. Object as Points

CenterNet treats the center point of an object as the key to localizing its bounding box. Reference [37] uses a keypoint estimator $\hat{y}$ to predict all center points, together with a single size prediction shared by all object categories to reduce the computational burden.

### 3.7.2. From Points to Bounding Boxes

CenterNet first extracts the peaks of the heatmap: all locations whose value is greater than or equal to those of their 8-connected neighbors are detected, and the top 100 peaks are kept. Each keypoint location is given by an integer coordinate $(x_i, y_i)$, and the keypoint estimator value is used as a measure of its detection confidence. The bounding box is then produced as:

$$\left(\hat{x}_i + \delta\hat{x}_i - \hat{w}_i/2,\; \hat{y}_i + \delta\hat{y}_i - \hat{h}_i/2,\; \hat{x}_i + \delta\hat{x}_i + \hat{w}_i/2,\; \hat{y}_i + \delta\hat{y}_i + \hat{h}_i/2\right), \tag{5}$$

where $(\delta\hat{x}_i, \delta\hat{y}_i) = \hat{O}_{\hat{x}_i,\hat{y}_i}$ is the offset prediction and $(\hat{w}_i, \hat{h}_i) = \hat{S}_{\hat{x}_i,\hat{y}_i}$ is the size prediction [37].
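The decoding step of Equation (5) can be sketched as follows (an illustrative NumPy version with hypothetical array layouts):

```python
import numpy as np

def decode_centers(centers, offsets, sizes):
    """Convert predicted center points into bounding boxes (Equation (5)).

    centers: (N, 2) integer peak coordinates (x_i, y_i) from the heatmap.
    offsets: (N, 2) predicted sub-pixel offsets (dx_i, dy_i).
    sizes:   (N, 2) predicted box sizes (w_i, h_i).
    Returns an (N, 4) array of boxes as (x1, y1, x2, y2).
    """
    cx = centers[:, 0] + offsets[:, 0]
    cy = centers[:, 1] + offsets[:, 1]
    w, h = sizes[:, 0], sizes[:, 1]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
```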
