**1. Introduction**

Nowadays, as a fast-growing number of Unmanned Aerial Vehicles (UAVs) carry high-definition cameras, object detection technology in aerial images has been widely used in various practical applications, including agricultural planting [1,2], pedestrian tracking [3], urban security [4,5], building inspection [6], search and rescue [7], and rare plant monitoring [8]. These applications all require accurate object detection in visible or infrared images taken by onboard cameras. However, detecting objects in UAV images is nontrivial: the varying purposes of different applications and the limited computing power of UAVs pose challenges to this task. To address these problems, object detection based on Convolutional Neural Networks (CNNs) has gradually been applied to UAV detection tasks.

The object detection methods commonly used today are the YOLO series [9–12] and Faster-RCNN [13]. They have achieved good performance on large-scale datasets such as MS COCO [14], ImageNet [15], and VOC2007/2012 [16]. However, compared to these datasets, UAV images have the following features:

(1) UAV image datasets often provide higher resolution images, but the objects in these images are always in low resolution. For example, the image sizes in the general image datasets VOC2007/2012 and MS COCO are approximately 500 × 400 and 600 × 400, respectively. However, in the UAV image dataset VisDrone2021-DET [17], the image size is 2000 × 1500 while the object size is only about 50 × 50 pixels.


To solve these issues, many researchers have attempted to change the structure of the object detection network. Extended from YOLOv5 [19], an improved network named YOLOv5-TPH [20] added a transformer model with attention mechanisms to the detection head of YOLOv5. It trained a 1536 × 1536 high-resolution network on VisDrone2021-DET and achieved 35.74% mAP. However, the high-resolution network and transformer model consume huge computing resources.

Another common solution is to partition a UAV image into several uniform sub-regions and then detect each of them. However, directly conducting uniform or random cropping cannot guarantee effective improvement, as it cannot locate key sub-regions. On the VisDrone2021-DET dataset, 43.5% mAP50 and 30.3% mAP can be achieved when the sub-regions are obtained via sliding window search [21] and a CNN-based clustered sub-network [22], respectively. However, these approaches are either time consuming or training based. Although such detectors can achieve better performance, they are inefficient because they perform detection in every region: some regions only contain large-scale objects, and detecting them improves neither overall accuracy nor efficiency. The object-dense sub-regions need to be located first.

This paper proposes the Adaptive Region Selection Detection framework (ARSD), a novel two-stage detection model that combines the Region Detection Network and the Self-adaptive Intensive Region Selecting Algorithm. ARSD aims to significantly reduce computation resource consumption while maintaining high object detection accuracy. The first stage of ARSD uses an Overall Region Detection Network, which coarsely locates where the targets are. The model then applies a Self-adaptive Intensive Region Selecting Algorithm to generate object-dense sub-regions by clustering the objects detected in the first stage and sends them to the next stage for further fine-grained detection. The last stage is the Key Region Detection Network, which detects the object-dense sub-regions. Compared with the first stage, this network is extended with an additional small-object detection head on top of the original detection heads.

To sum up, the novelty of this paper is as follows:


In this way, the proposed framework can greatly reduce the computational complexity while maintaining high object detection accuracy.

The rest of the paper is organized as follows: Section 2 provides a comprehensive overview of the components of ARSD. The specific experimental details are given in Section 3, which demonstrates the performance of the proposed ARSD in various aspects, as well as comparisons with other works. Section 4 summarizes the experimental results. Finally, Section 5 concludes the paper and lists a collection of ongoing research and future work directions.

#### **2. Materials and Methods**


This section first introduces the overall framework of the proposed ARSD framework and then describes each module in detail. As shown in Figure 1, the ARSD framework consists of three parts. The first part is the Overall Region Detection Network (ORDN), which is used to roughly locate the objects. The Self-adaptive Intensive Region Selecting Algorithm (SIRSA), which consists of the Fixed Points Density-based Clustering Algorithm (FPDCA) and the Adaptive Sub-regions Selection Algorithm (ASSA), is then adopted to properly select the object-dense sub-regions. Finally, the Key Region Detection Network (KRDN) is responsible for detecting the objects in the sub-regions selected by SIRSA. The detected objects are combined with the results of ORDN. This framework has better detection accuracy for small targets in UAV images and reduces computing resources due to the filtering of sub-regions.



**Figure 1.** The pipeline of ARSD. The ORDN predicts overall bounding boxes from the original image, which are used for the subsequent region selection. FPDCA clusters the center points of the overall bounding boxes to obtain candidate sub-regions. ASSA then filters the sub-regions that need to be detected in KRDN. The light blue box represents the additional head of KRDN. Finally, the small object bounding boxes obtained in KRDN are merged with the overall bounding boxes by Weighted Boxes Fusion (WBF) [23] to generate the final result.

#### *2.1. Overall Region Detection Network (ORDN) and Key Region Detection Network (KRDN)*

The ORDN module predicts object bounding boxes on the whole image, while KRDN predicts objects on cropped object-dense sub-regions. After cropping sub-regions with SIRSA, KRDN can predict more accurate results in a high-resolution region. The results of ORDN and KRDN are merged by WBF to produce the final results.

It should be mentioned that although ORDN and KRDN are based on the same backbone structure, they differ in width and depth. Therefore, we can easily and inexpensively implement two detection networks with different precision and different time consumption. ORDN only needs to achieve a sufficient recall rate, so it is designed to be more lightweight to save computing power. On the contrary, KRDN is wider and deeper than ORDN. This enables KRDN to have greater accuracy than ORDN without changing the network structure.

For an anchor-based object detection network, anchor size is an important factor affecting accuracy. The targets in UAV images are on different scales, and most are small-scale targets. An additional detection head is added to KRDN based on the original detection heads in ORDN. Combined with the original detection heads, KRDN can more easily locate small objects and reduce the adverse influence caused by different object scales. The added detection head is generated from the low-level feature map of the backbone and the high-resolution feature map obtained by up-sampling in the FPN [24]. Generally, feature information is reduced as the feature map shrinks during CNN processing in the backbone. Since large-size feature maps lose less image feature information and retain more features of small targets, this detection head helps to detect small targets effectively.
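Since the paper describes this head only at the architecture level, the following PyTorch sketch illustrates one plausible form of such a fusion head. The module name, channel widths, anchor count, and the 10 VisDrone classes are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ExtraSmallObjectHead(nn.Module):
    """Hypothetical sketch: fuse an up-sampled FPN feature map with a
    low-level (stride-4) backbone feature map to obtain a higher-resolution
    feature for an additional small-object detection head."""

    def __init__(self, c_low=64, c_fpn=128, num_anchors=3, num_classes=10):
        super().__init__()
        self.reduce = nn.Conv2d(c_fpn, c_low, kernel_size=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(2 * c_low, c_low, kernel_size=3, padding=1)
        # Per anchor: 4 box coordinates + 1 objectness + class scores.
        self.pred = nn.Conv2d(c_low, num_anchors * (5 + num_classes), kernel_size=1)

    def forward(self, feat_low, feat_fpn):
        x = self.up(self.reduce(feat_fpn))        # match feat_low's resolution
        x = torch.cat([x, feat_low], dim=1)       # keep low-level spatial detail
        return self.pred(torch.relu(self.fuse(x)))

head = ExtraSmallObjectHead()
out = head(torch.randn(1, 64, 96, 96), torch.randn(1, 128, 48, 48))
print(out.shape)  # torch.Size([1, 45, 96, 96])
```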

The object detection is performed by KRDN after object-dense sub-regions are clustered and selected by SIRSA. KRDN detects the object-dense region and obtains more details about the objects. This paper combines the results from two stages using WBF [23]. *B<sup>o</sup>* and *B<sup>k</sup>* represent the coordinates of the bounding boxes obtained by ORDN and KRDN, respectively. The final result of the bounding box *B* is calculated by (1). The final confidence *C* is obtained by (2), where *C<sup>o</sup>* and *C<sup>k</sup>* represent the confidence values of the bounding boxes obtained by ORDN and KRDN, respectively.

$$B_{(x,y)} = \frac{B_{o(x,y)} \cdot C_o + B_{k(x,y)} \cdot C_k}{C_o + C_k} \tag{1}$$

$$C = \frac{C_o + C_k}{2} \tag{2}$$
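As a concrete reading of Equations (1) and (2), the snippet below fuses one matched pair of ORDN/KRDN boxes. Matching overlapping boxes across the two stages (done by IoU clustering in full WBF [23]) is assumed to have happened already; the function name and the (x1, y1, x2, y2) box format are illustrative.

```python
def fuse_pair(box_o, conf_o, box_k, conf_k):
    # Equation (1): confidence-weighted average of the two boxes.
    fused_box = tuple(
        (bo * conf_o + bk * conf_k) / (conf_o + conf_k)
        for bo, bk in zip(box_o, box_k)
    )
    # Equation (2): mean of the two confidences.
    fused_conf = (conf_o + conf_k) / 2
    return fused_box, fused_conf

# The higher-confidence KRDN box pulls the fused coordinates toward itself.
print(fuse_pair((10.0, 10.0, 50.0, 50.0), 0.6, (12.0, 12.0, 54.0, 54.0), 0.9))
```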

#### *2.2. Self-Adaptive Intensive Region Selecting Algorithm (SIRSA)*

SIRSA is composed of two parts: (1) the Fixed Points Density-based Clustering Algorithm (FPDCA) clusters the center points of the objects predicted by ORDN to obtain candidate sub-regions; (2) the Adaptive Sub-regions Selection Algorithm (ASSA) selects the object-dense sub-regions.

#### 2.2.1. Fixed Points Density-Based Clustering Algorithm (FPDCA)

This paper proposes a new clustering method named FPDCA, combining the advantages of two clustering algorithms: K-means [25] and Mean-Shift [26]. K-means is competent at generating a fixed number of clustering centers, but it cannot use density information. Mean-Shift cannot fix the number of clustering regions, which tends to generate either a large number of small-size sub-regions or only one large sub-region; it therefore consumes intensive computation power or cannot be applied to higher resolution images. However, Mean-Shift can perform region clustering based on object-density information, which is neglected by K-means. Combining K-means and Mean-Shift takes into account both the density information and the generation of a fixed number of sub-regions.

As shown in Algorithm 1, the set of bounding box centers is denoted as *Q*, with a point *q* in *Q* randomly chosen as the initial cluster center. Point *q* is updated by calculating the motion vector based on the points in the surrounding circle of radius *r*. The algorithm moves to the next point *q* and repeats the above step until all points in *Q* have been processed. If the number of cluster centers is greater than the required number of sub-regions *N*, K-means is used to regroup these clusters into *N* clusters and then output the *N* final candidate sub-regions. If the number of clusters does not exceed *N*, these clusters are discarded and K-means is applied directly to the original points in *Q* to regroup them into *N* clusters and obtain the corresponding *N* final sub-regions.

#### 2.2.2. Adaptive Sub-Regions Selection Algorithm (ASSA)

A large number of candidate sub-regions are produced by FPDCA. To save computing resources on hardware platforms, this paper proposes the Adaptive Sub-regions Selection Algorithm (ASSA) to filter the sub-regions that need to be detected in KRDN. Sub-regions with higher ASSA scores will be selected for further detection by KRDN, while the other sub-regions will be discarded.

Four criteria are defined in ASSA to evaluate whether a candidate sub-region *p* requires further detection by KRDN: regional target density, average confidence, the ratio of the bounding boxes' total area to the sub-region area, and the average area of all bounding boxes in the sub-region. Regional target density is considered in [27], as defined in (3), where *L* denotes the number of predicted boxes in *p* and *A* is the area of *p*. However, this definition is not necessarily accurate, in that the size of the original image should also be considered. As shown in Figure 2, when reducing the image size from 200 × 200 (left image) to 100 × 100 (right image), the regional target density should remain the same; however, the result computed using (3) does not. This paper extends (3) by considering the size of the original image *S* in (4).

$$D = \frac{L^2}{A} \tag{3}$$

$$D = \frac{L^2 \cdot S}{A} \tag{4}$$
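A quick numeric check of this argument, with illustrative numbers: halving the image side scales the sub-region area *A* and the image size *S* by the same factor, so (4) stays constant while (3) does not.

```python
# L predicted boxes in a sub-region covering a quarter of the image.
L = 8
A_big, S_big = 100 * 100, 200 * 200      # sub-region and image at 200 x 200
A_small, S_small = 50 * 50, 100 * 100    # the same scene at half resolution

print(L**2 / A_big, L**2 / A_small)                    # Eq. (3): 0.0064 vs 0.0256 -> changes
print(L**2 * S_big / A_big, L**2 * S_small / A_small)  # Eq. (4): 256.0 vs 256.0 -> invariant
```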

**Algorithm 1:** Fixed Points Density-based Clustering Algorithm

```
Input:  N: number of sub-regions, Q: the set of bounding box centers,
        r: search radius, ξ: threshold on the shift vector.
Output: the set of sub-regions S
 1: for q(x, y) in Q but not in M
 2:     M = M ∪ q(x, y)
 3:     for q(x_i, y_i) in Q but not in M
 4:         S_k(q) = {y : (x − x_i)^2 + (y − y_i)^2 < r^2}
 5:         M = M ∪ S_k(q)
 6:         C_i = C_i ∪ S_k(q)
 7:         V_shift = (1/k) Σ_{q ∈ S_k} (q_i − q)
 8:         q = q + V_shift
 9:         if |V_shift_new − V_shift_old| < ξ, put C_i in C
10:             break from line 3
11:         end if
12:     end for
13: end for
14: if length(C) > N
15:     S = Kmeans(N, C)
16: else
17:     S = Kmeans(N, Q) with random centers
18: end if
```
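For experimentation, the following is a minimal sketch of FPDCA's control flow using scikit-learn's MeanShift and KMeans as stand-ins for the loop above; the function name, the toy points, and the regrouping of Mean-Shift modes (rather than whole clusters) in the first branch are simplifying assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift

def fpdca(centers, n_regions, radius):
    """Sketch: a Mean-Shift pass (bandwidth = radius) finds density-based
    clusters; if it yields more than n_regions clusters, K-means regroups
    the cluster modes into n_regions groups (line 15); otherwise K-means
    runs directly on the raw centers (line 17)."""
    centers = np.asarray(centers, dtype=float)
    modes = MeanShift(bandwidth=radius).fit(centers).cluster_centers_
    if len(modes) > n_regions:
        km = KMeans(n_clusters=n_regions, n_init=10).fit(modes)
    else:
        km = KMeans(n_clusters=n_regions, n_init=10).fit(centers)
    return km.cluster_centers_

# Toy example: two dense groups of box centers plus one far-away outlier.
pts = [(10, 10), (12, 11), (11, 13), (80, 82), (82, 80), (50, 400)]
print(fpdca(pts, n_regions=2, radius=15.0))
```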

**Figure 2.** The left and right images should have the same target density.

The second criterion is the average confidence *M*, as defined in (5), where *score<sup>i</sup>* represents the confidence of bounding box *i*. A sub-region with low average confidence is considered to suffer from classification inaccuracy caused by the lightweight ORDN network; it is believed that the smaller the value of *M*, the greater the accuracy gain.



$$M = \frac{\sum_{i=1}^{L} score_i}{L} \tag{5}$$


ASSA then considers the ratio of the sum area of all bounding boxes to the sub-region area *R* as defined in (6). The larger the value of *R*, the less background in this sub-region and the larger the area that contains objects. In addition, it can also reflect the overlapping degree of the objects to a certain extent, e.g., *R* > 1 means that there are overlaps of the object's bounding boxes in this sub-region. Larger *R* values indicate that the detected object is big or that many objects are detected within the sub-region.

$$R = \frac{\sum_{i=1}^{L} area_i}{S} \tag{6}$$

Finally, the ratio of the sum area of all bounding boxes to the number of bounding boxes is defined in (7). It reflects the average size of the objects in the sub-region.

$$E = \frac{\sum_{i=1}^{L} area_i}{L} \tag{7}$$

As defined in (8), a final score *s<sup>i</sup>* is computed for each sub-region as the weighted sum of the above four indicators, with each weight calculated using information entropy; *w<sup>j</sup>* is the weight of indicator *j* and *p<sup>ij</sup>* is indicator *j* of sub-region *i*. The sub-regions with high scores are identified as object-dense sub-regions and sent to the KRDN stage.

$$s_i = \sum_{j=1}^{m} w_j \cdot p_{ij} \tag{8}$$

The weight of each indicator *w<sup>j</sup>* is calculated by information entropy *e<sup>j</sup>* as defined in (9); *w<sup>j</sup>* is then computed by (10) and (11).

$$e_j = -k \sum_{i=1}^{n} p_{ij} \ln(p_{ij}), \quad k = 1/\ln(n) > 0, \; e_j \ge 0 \tag{9}$$

$$d_j = 1 - e_j \tag{10}$$

$$w_j = \frac{d_j}{\sum_{j=1}^{m} d_j} \tag{11}$$

Each final indicator *j* of sub-region *i*, denoted as *p<sup>ij</sup>*, is calculated by (14) after normalizing the original indicators using (12) and (13). The higher the regional target density, the more likely the sub-region should be sent to KRDN and the higher its final score; such indicators are defined as extremely large indicators. On the contrary, the smaller the average confidence, the more likely the sub-region should be sent to KRDN; such indicators are defined as extremely small indicators. To standardize and normalize these indicators, extremely large indicators are normalized by (12), and extremely small indicators are converted by (13) into extremely large ones.

$$x'_{ij} = \frac{x_{ij} - \min\{x_{1j}, \dots, x_{nj}\}}{\max\{x_{1j}, \dots, x_{nj}\} - \min\{x_{1j}, \dots, x_{nj}\}} \tag{12}$$

$$x'_{ij} = \frac{\max\{x_{1j}, \dots, x_{nj}\} - x_{ij}}{\max\{x_{1j}, \dots, x_{nj}\} - \min\{x_{1j}, \dots, x_{nj}\}} \tag{13}$$

$$p_{ij} = \frac{x'_{ij}}{\sum_{i=1}^{n} x'_{ij}}, \quad i = 1, \dots, n, \; j = 1, \dots, m \tag{14}$$
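The entropy-weighted scoring of Equations (8)–(14) can be sketched in NumPy as follows; the column layout of the indicator matrix, the `small_is_better` flags, and the epsilon guarding log(0) are implementation assumptions rather than details from the paper.

```python
import numpy as np

def assa_scores(indicators, small_is_better):
    """indicators: (n sub-regions) x (m indicators) raw matrix of D, M, R, E;
    small_is_better: per-column flags (True only for average confidence M)."""
    x = np.asarray(indicators, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    x_norm = np.where(small_is_better,
                      (hi - x) / (hi - lo),   # Eq. (13): small -> large
                      (x - lo) / (hi - lo))   # Eq. (12): min-max normalization
    p = x_norm / x_norm.sum(axis=0)           # Eq. (14)
    n = x.shape[0]
    e = -(p * np.log(p + 1e-12)).sum(axis=0) / np.log(n)  # Eq. (9)
    d = 1.0 - e                                           # Eq. (10)
    w = d / d.sum()                                       # Eq. (11)
    return p @ w                                          # Eq. (8)

# Toy example: 3 sub-regions, 4 indicators (D, M, R, E).
raw = [[0.8, 0.9, 0.5, 40.0],
       [0.3, 0.4, 0.9, 25.0],
       [0.5, 0.7, 0.2, 60.0]]
print(assa_scores(raw, small_is_better=[False, True, False, False]))
```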

The results of SIRSA are shown in Figure 3. A white rectangle indicates a candidate sub-region, while the transparency of a sub-region indicates its ASSA score: the clearer the sub-region, the higher its final ASSA score and the more strongly it should be sent to KRDN.

**Figure 3.** Points of the same color indicate the centers of the bounding boxes belonging to the same sub-region. The more obscure the sub-region, the lower its ASSA score.

#### **3. Results**

#### *3.1. Datasets and Evaluation Metrics*

*Datasets.* The proposed approach is evaluated on the VisDrone2021-DET dataset. The VisDrone dataset is collected by drones in 14 different cities of China, at different heights and in different weather/light conditions. It contains a total of 10,209 images, which consists of 6471 training images, 548 validation images, and 3190 testing images. The objects in this dataset are mostly small and often clustered together. However, the dataset provides a high image resolution of approximately 2000 × 1500 pixels. Images are annotated with bounding boxes, including 10 predefined categories (pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle).

*Evaluation Metric.* The proposed method is evaluated using the same evaluation protocol as described for MS-COCO, which has 12 evaluation indicators for object detection. AP, AP50, and AP75 are selected as the criteria for state-of-the-art comparison. AP is the average over all categories and is generally referred to as mAP. AP50 and AP75 represent indicators with IoU thresholds of 0.5 and 0.75, respectively. This paper also adopts the widely used precision–recall (PR) curves as an evaluation metric. The method of [28] is used to calculate the criteria proposed above.
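As a sketch of this evaluation protocol, the snippet below runs the standard COCO bbox evaluation with pycocotools; both JSON paths are placeholders, and the VisDrone annotations must first be converted to COCO format.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("visdrone_val_coco.json")           # ground truth (placeholder path)
coco_dt = coco_gt.loadRes("arsd_detections.json")  # detections (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints the 12 COCO metrics, including AP, AP50, AP75
```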

*Implementation Details.* The proposed methods are implemented using PyTorch. The backbone is pretrained on MS COCO. Training is completed on a server with one NVIDIA GeForce RTX 3080Ti GPU. The experiment uses Adam [29] to train ORDN and KRDN for 150 epochs, with the learning rate starting at 0.001.
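The stated optimization setup maps onto PyTorch roughly as follows; the placeholder model and the absence of a learning-rate schedule are assumptions, since the paper only gives the optimizer, epoch count, and starting learning rate.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for ORDN or KRDN
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam [29], lr starts at 0.001

for epoch in range(150):  # 150 epochs, as stated
    ...  # forward pass, loss, optimizer.step() over VisDrone2021-DET
```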

#### *3.2. Model Scaling Scheme for Two-Stage Framework*

YOLOv5 has five networks of different sizes, namely YOLOv5x, YOLOv5l, YOLOv5m, YOLOv5s, and the recent YOLOv5n. Since it is the most notable and convenient one-stage detector with a flexible structure, the experiments in this paper choose YOLOv5 as the basic model for ORDN and KRDN.

We test YOLOv5n and YOLOv5m with different input sizes in Section 3.5. The experiment demonstrates that YOLOv5n with a large input size is more accurate than YOLOv5m with a small input size for the same execution time. Because ORDN is required to locate bounding boxes quickly, we balance time consumption and accuracy and set YOLOv5n as the basic model of ORDN. Considering the limited performance of the UAV hardware platform, YOLOv5x is unsuitable at this stage due to its high time consumption, despite having the highest accuracy. YOLOv5m balances performance and computational requirements and is therefore adopted as the KRDN base model. Finally, the input size of ORDN is set to 736 × 736 and the sub-regions are resized to 384 × 384 before being fed into KRDN.
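Assuming the stock Ultralytics hub models are acceptable stand-ins for the two base detectors (the modified KRDN head from Section 2.1 is not part of them), the setup can be loaded as follows; the image path is a placeholder.

```python
import torch

ordn = torch.hub.load("ultralytics/yolov5", "yolov5n", pretrained=True)  # lightweight ORDN base
krdn = torch.hub.load("ultralytics/yolov5", "yolov5m", pretrained=True)  # wider/deeper KRDN base

coarse = ordn("uav_image.jpg", size=736)  # full-image pass at 736 x 736
# SIRSA-selected crops would then be passed as: krdn(crop, size=384)
```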

#### *3.3. Qualitative Results*

To demonstrate the effectiveness of the proposed model ARSD, extensive experiments have been conducted to compare its results with the baseline method (YOLOv5) on the VisDrone2021 dataset, as shown in Figure 4. It can be noted that ARSD performs better in small-scale object detection, especially in object-dense sub-regions. In contrast, YOLOv5 misses many small-scale objects in object-dense regions. According to the partially enlarged image, almost no small-scale objects, such as pedestrians, are missed. This is mainly because the object-dense sub-regions are on a small scale and do not need to be resized dramatically before being sent to KRDN; thus, more features remain for more accurate detection.





**Figure 4.** The qualitative comparisons among the detection results from the baseline (**b**,**d**) and our method (**a**,**c**) on the validation set of the VisDrone2021-DET dataset.

Figure 5 shows the cluster results of the proposed FPDCA and Mean-Shift. The clustering number of Mean-Shift is dynamic; sometimes it produces only one cluster. If this single sub-region is sent to KRDN, it cannot achieve any improvement in accuracy. In addition, it sometimes produces too many clusters, which costs intensive computing resources.

**Figure 5.** The qualitative comparisons among the cluster results from different cluster methods on the test set of the VisDrone2021-DET dataset. For each testing image, we show the cluster results of the proposed method (**a**) and Mean-Shift (**b**).


#### *3.4. Quantitative Evaluation*

Since the traditional one-stage methods are easier to reproduce, we compare our method with them in each category; the comparison with state-of-the-art two-stage frameworks uses the results reported in their literature. Quantitative comparisons are conducted in two aspects on the VisDrone2021-DET dataset: (1) the detection accuracy in each category compared against different one-stage methods; (2) the detection accuracy and accuracy improvement compared against state-of-the-art two-stage framework methods.

Table 1 lists AP50 values for ARSD and other one-stage methods in the detection of 10 different categories of objects. The results of the first five methods are derived from [30], with the remaining results obtained from our own experiments. The proposed method, ARSD, consists of YOLOv5n + YOLOv5m-AH and performs much better than the other methods in each category, especially in small-scale categories such as Pedestrian and People.

**Table 1.** AP50 comparison of each class among ARSD and other one-stage methods.

| Method | Pedestrian | People | Bicycle | Car | Van | Truck | Tricycle | Awning-Tricycle | Bus | Motor | Average AP50 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv3 | 18.1 | 9.9 | 2 | 56.6 | 17.5 | 17.6 | 6.7 | 2.9 | 32.4 | 17 | 17.1 |
| SlimYOLOv3 [31] | 17.4 | 9.3 | 2.4 | 55.7 | 18.3 | 16.9 | 9.1 | 3 | 26.9 | 17 | 17.6 |
| Faster-RCNN | 21.7 | 12.7 | 11.5 | 63.2 | 37.8 | 29.9 | 22.5 | 12.3 | 50.6 | 28.4 | 29.1 |
| FPN | 33 | 25.8 | 13.9 | 69.4 | 40 | 34.3 | 27.4 | 13.4 | 49.1 | 37.6 | 35.6 |
| YOLOv5m | 45.2 | 35.7 | 13.7 | 77.8 | 41 | 37.9 | 20.7 | 10.8 | 50.9 | 30.1 | 36.8 |
| YOLOv5l-TPH | 53.5 | 29.7 | 25.9 | 87 | 55.3 | 61.5 | 34.9 | 31.2 | 73.5 | 50.6 | 50.3 |
| **ARSD** | **68.8** | **56.8** | **40.68** | **88.17** | **61.53** | **53.74** | **49** | **26.19** | **72.78** | **61.3** | **57.9** |


As shown in Table 2, the AP↗, AP50↗, and AP75↗ columns highlight the outperformance of ARSD over previous state-of-the-art detectors. Compared with GLASN and UCGNet, ARSD improves AP50 by 2.1% and 4.8%, respectively.

**Table 2.** Comparison with state-of-the-art object detection methods in the VisDrone2021-DET validation set.


#### *3.5. Ablation Study*

To validate the effectiveness of the additional detection head, different cluster methods, and a different number of sub-regions in detection tasks, this paper conducted extensive experiments on VisDrone2021-DET test-dev.


**Figure 6.** The comparison between the mAP of YOLOv5n, YOLOv5m, and YOLOv5m-AH for the same time consumption.

To improve processing speed in terms of FPS (frames per second), ASSA is used to discard 0, 1/3, or 1/2 of the candidate sub-regions. Considering the balance of computing resources and accuracy, ASSA discards 1/3 of the candidate sub-regions in the inference phase. The processing time is reduced by about a third after discarding 1/3 of the sub-regions.


**Figure 7.** The comparison of 768 × 768 YOLOv5m, 640 × 640 YOLOv5m-AH, and our method for each category.

**Table 3.** Results with different cluster methods, clusters, and remaining sub-regions. Note: (x) indicates the ratio of sub-regions being discarded.


| No. | Framework | Cluster Method | Clusters | Sub-Regions | Remaining Sub-Regions (Discarded) | AP | AP50 |
|---|---|---|---|---|---|---|---|
| 3 | ARSD | FPDCA | 2 | 3220 | 3220 (0) | 20.59 | 37.76 |
| 4 | ARSD | FPDCA | 2 | 3220 | 2146 (1/3) | 19.16 | 35.63 |
| 5 | ARSD | FPDCA | 2 | 3220 | 1610 (1/2) | 18.33 | 33.4 |
| 6 | ARSD | FPDCA | 3 | 4830 | 4830 (0) | 22.93 | 41.31 |
| 7 | ARSD | FPDCA | 3 | 4830 | 3220 (1/3) | 21.6 | 39.6 |
| 8 | ARSD | FPDCA | 3 | 4830 | 2415 (1/2) | 20.05 | 37.1 |
| 9 | ARSD | FPDCA | 4 | 6440 | 6440 (0) | 23.9 | 41.92 |
| 10 | ARSD | FPDCA | 4 | 6440 | 4293 (1/3) | 22.32 | 40.15 |

#### **4. Discussion**

The aim is to improve the object detection accuracy for high-resolution UAV images that contain many small targets. This paper proposes a two-stage object detection framework for UAV images, with extensive performance evaluations conducted. The qualitative results in Section 3.3 and the quantitative evaluations in Section 3.4 directly show that the proposed framework performs better than state-of-the-art detection methods. Compared with GLASN, it improves AP by 2.54% and AP50 by 2.1%. The ablation study includes four experiments. The first experiment performs tests on different network scales, and the results demonstrate that YOLOv5n is the most suitable for ORDN. The second experiment shows that the network with an additional head performs better than the others within the same processing time. The third experiment proves that this two-stage framework is more accurate in small-scale categories such as Pedestrian, People, and Bicycle. The final experiment analyses the influence of clustering algorithms and the number of clusters. The novelty mentioned in Section 1 has been verified by the above four experiments. The framework can improve the accuracy of object detection within the same processing time, especially for small-scale objects.

This method should balance the number of clusters and the ratio of the remaining sub-regions among all candidate sub-regions. If a higher number of clusters is set, the calculation time will undoubtedly increase, but the accuracy will also be improved. Our future work includes the study of selecting these parameters more precisely.

#### **5. Conclusions**

This paper proposes ARSD, an adaptive region selection detection framework that is more efficient and more effective for UAV image object detection. The main idea is to locate object-dense sub-regions in high-resolution UAV images and feed these sub-regions into the second-stage detector; finally, the results are merged by WBF. In addition, we developed an adaptive region selection algorithm that consists of the Fixed Points Density-based Clustering Algorithm and the Adaptive Sub-regions Selection Algorithm. FPDCA can generate a fixed number of sub-regions combined with density information, while ASSA can score each sub-region objectively. Additionally, a new detection head is added alongside the original detection heads for better small object detection. The evaluations on the VisDrone2021-DET dataset have demonstrated the effectiveness and adaptiveness of ARSD. We will extend this work by applying the proposed method to small-scale object detection in remote sensing and search and rescue.

**Author Contributions:** Conceptualization, Y.W., Y.H. (Yan Huang) and Y.H. (Yi Han); methodology, Y.W. and Y.Z.; software, Y.W., Y.H. (Yan Huang) and Q.Y.; validation, Y.W. and Y.H. (Yan Huang); formal analysis, Y.W., Y.Z. and Y.H. (Yan Huang); investigation, Y.W., Y.C., Z.L., Z.Y. and Q.L.; resources, Y.W. and Y.H. (Yan Huang); data curation, Y.W., Y.H. (Yi Han) and Z.L.; writing—original draft preparation, Y.W. and Y.H. (Yi Han); writing—review and editing, Y.W., Y.H. (Yi Han), Y.C., Z.L., Z.Y. and Q.L.; visualization, Y.W. and Q.Y.; supervision, Y.H. (Yi Han) and Z.Y.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by a grant from the National Natural Science Foundation of China (Grant No. 61801341). This work was also supported by the Research Project of Wuhan University of Technology Chongqing Research Institute (No. YF2021-06).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The simulation data used to support the findings of this study are available from the corresponding author upon request. The video of the experimental work can be found at the following link: https://drive.google.com/drive/folders/1eYFxNSaYYEhY_M0tdM55oybCA6sMMgzG?usp=sharing (accessed on 25 August 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

