*Article* **Predicting Arbitrary-Oriented Objects as Points in Remote Sensing Images**

**Jian Wang, Le Yang and Fan Li \***

School of Information and Communications Engineering, Xi'an Jiaotong University, Xi'an 710049, China; wj851329121@stu.xjtu.edu.cn (J.W.); yangle15@xjtu.edu.cn (L.Y.)

**\*** Correspondence: lifan@mail.xjtu.edu.cn

**Abstract:** To detect rotated objects in remote sensing images, researchers have proposed a series of arbitrary-oriented object detection methods, which place multiple anchors with different angles, scales, and aspect ratios on the images. However, a major difference between remote sensing images and natural images is the small probability of overlap between objects in the same category, so the anchor-based design can introduce much redundancy during the detection process. In this paper, we convert the detection problem to a center point prediction problem, where the pre-defined anchors can be discarded. By directly predicting the center point, orientation, and corresponding height and width of the object, our methods can simplify the design of the model and reduce the computations related to anchors. In order to further fuse the multi-level features and get accurate object centers, a deformable feature pyramid network is proposed, to detect objects under complex backgrounds and various orientations of rotated objects. Experiments and analysis on two remote sensing datasets, DOTA and HRSC2016, demonstrate the effectiveness of our approach. Our best model, equipped with Deformable-FPN, achieved 74.75% mAP on DOTA and 96.59% on HRSC2016 with a single-stage model, single-scale training, and testing. By detecting arbitrarily oriented objects from their centers, the proposed model performs competitively against oriented anchor-based methods.

**Keywords:** object detection; remote sensing image; anchor free; oriented bounding boxes; deformable convolution

#### **1. Introduction**

With the development of modern remote sensing technology, a large number of remote sensing images with higher spatial resolution and richer content have been produced [1–4]. Object detection in remote sensing images has broad application prospects in many fields, such as environmental monitoring [5–7], disaster control [8,9], infrared detection [10,11], and the military. Benefiting from deep convolutional neural networks, considerable results have been achieved for the object detection task in natural images. However, due to the complex background, variable object scales, arbitrary orientations and shooting angles, object detection in aerial images is still a hot topic in the field of computer vision [12–16].

Compared with natural image datasets [17,18], remote sensing image detection mainly faces the following differences and challenges (illustrated in Figure 1):


**Citation:** Wang, J.; Yang, L.; Li, F. Predicting Arbitrary-Oriented Objects as Points in Remote Sensing Images. *Remote Sens.* **2021**, *13*, 3731. https://doi.org/10.3390/rs13183731

Academic Editors: Jukka Heikkonen, Fahimeh Farahnakian and Pouya Jafarzadeh

Received: 10 August 2021 Accepted: 15 September 2021 Published: 17 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

orientations and aspect ratios than horizontal bounding boxes in remote sensing images. This not only requires the detector to correctly locate and classify the object of interest, but also to accurately predict its direction;

3. Complex background and Drastic scale changes. Compared to natural images, remote sensing images have higher resolution, with more complex and variable backgrounds. A lot of objects to be detected are easily submerged in the background, which requires the detector to be effectively focused on areas of interest. Meanwhile, the scales of objects vary drastically in remote sensing images; for example, some vehicles and bridges are only within a few pixels, while soccer fields can comprise thousands of pixels in aerial images.

**Figure 1.** Examples of Low overlap and Densely arranged (**Left**), Arbitrary orientations of objects (**Middle**), and Drastic scale changes (**Right**) in remote sensing images.

The above difficulties make remote sensing image detection more challenging and attractive, while requiring natural image object detection methods to be adapted to rotated objects. However, most rotated object detectors place multiple anchors per location to get a higher IoU between pre-set anchors and object bounding boxes. Dense anchors ensure the performance of the rotation detectors while having a higher computational burden. Can these anchors be discarded in the rotated object detection process, in order to improve the computational efficiency and simplify the design of the model? We find that one major difference between remote sensing images and natural images is the small probability of overlap between objects having the same category. So, the large overlap between adjacent objects per location is rare in this situation, especially when using oriented bounding boxes to represent the rotated objects. Therefore, we hope the network could directly predict the classification and regression information of the rotated object from the corresponding position, such as an object center, which can improve the overall efficiency of the detector and avoid the need for manual designs of the anchors. Meanwhile, the networks need to have robust feature extraction capabilities for objects with drastic scale changes and accurately predict the orientation of rotated objects.

To discard anchors in the detection process, we convert the rotation object detection problem into a center point prediction problem. First, we represent an oriented object by the center of its oriented bounding box. The network learns a center probability map to localize the object's center through use of a modulated focal loss. Then, inspired by [19], we use the circular smooth label to learn the object's direction, in order to accurately predict the angle of an object and avoid regression errors due to angular periodicity at the boundary. A parallel bounding-box height and width prediction branch is used to predict the object's size in a multi-task learning manner. Therefore, we can detect the oriented objects in an anchor-free way.

Further, to accurately localize the object center under drastic scale changes and various object orientations, a deformable feature pyramid network (Deformable-FPN) is proposed, in order to further fuse the multi-level features. Specifically, deformable convolution [20,21] is used to reduce the feature channels and project the features simultaneously. After mixing the adjacent-level features using an add operation, we perform another deformable convolution to reduce the aliasing effect of the add operation. By constructing the FPN in a deformable manner, the convolution kernel can be adaptively adjusted, according to the scale and direction of the object. Experiments show that our Deformable-FPN can bring significant improvements to detecting objects in remote sensing images, compared to FPN. In summary, the main contributions of this paper are as follows:


The remainder of this paper is organized as follows. Section 2 first describes the related works. Section 3 provides a detailed description of the proposed method, including center-point based arbitrary-oriented object detector and Deformable-FPN. The experiment results and settings are provided in Section 4 and discussed in Section 5. Finally, Section 6 summarizes this paper and presents our conclusions.

#### **2. Related Work**

#### *2.1. Object Detection in Natural Images*

In recent years, horizontal object detection algorithms in natural image datasets, such as MSCOCO [17] and PASCAL VOC [18], have achieved promising progress. We classify them as follows:

Anchor-based Horizontal Object Detectors: Most region-based two-stage methods [22–26] first generate category-agnostic region proposals from the original image, then use category-specific classifiers and regressors to classify and localize the objects from the proposals. Considering their efficiency, single-stage detectors have drawn more and more attention from researchers. Single-stage methods perform bounding box (bbox) regression and classification simultaneously, such as SSD [27], YOLO [28–30], RetinaNet [31], and so on [32–35]. The above methods densely place a series of prior boxes (anchors) with different scales and aspect ratios on the image. Multiple anchors per location are needed to cover the objects as much as possible, and classification and location refinement are performed based on these pre-set anchors.

Anchor-free Horizontal Object Detectors: Researchers have also designed some comparable detectors without complex pre-set anchors, which are inspiring to the detection process. CornerNet [36] detects an object bounding box as a pair of keypoints, demonstrating the effectiveness of anchor-free object detection. Further, CenterNet [37] models an object as a single point, then regresses the bbox parameters from this point. Based on RetinaNet [31], FCOS [38] abandoned the pre-set anchors and directly predicts the distance from a reference point to four bbox boundaries. All of these methods have achieved great performance and have avoided the use of hyper-parameters related to anchor boxes, as well as complicated calculations such as intersection over union (IoU) between bboxes during training.

#### *2.2. Object Detection in Remote Sensing Images*

Object detection also has a wide range of applications in remote sensing images. Reggiannini et al. [5] designed a sea surveillance system to detect and identify illegal maritime traffic. Almulihi et al. [7] propose a statistical framework based on gamma distributions and demonstrate the effectiveness for oil spill detection in SAR images. Zhang et al. [8] analyze the frequency properties of motions to detect living people in disaster areas. In [10], a difference maximum loss function is used to guide the learning directions of the networks for infrared and visible image object detection.

Because rotation detectors are needed for remote sensing images, many excellent rotated object detectors [19,39–46] have been developed from horizontal detection methods. RRPN [39] sets rotating anchors to obtain better region proposals. R-DFPN [47] proposes a rotation dense feature pyramid network to address the narrow width of ships, which can effectively detect ships in different scenes. Yang et al. [19] converted an angle regression problem to a classification problem and handled the periodicity of the angle by using circular smooth label (CSL). Due to complex backgrounds, drastic scale changes, and varied object orientations, multi-stage rotation detectors [41–43] have been widely used.

#### **3. Method**

In this section, we first introduce the overall architecture of our proposed center point-based arbitrary-oriented object detector. Then, we detail how to localize the object's center and predict the corresponding angle and size. Finally, the detailed structure of Deformable-FPN is introduced.

#### *3.1. Overall Architecture*

The overall architecture of our methods, based on [37], is illustrated in Figure 2. ResNet [48] is used as our backbone, in order to extract multi-level feature maps (denoted as *C*3, *C*4, *C*5). Then, these features are sent to deformable feature pyramid networks to obtain a high-resolution, strong semantic feature map, *P*2, which is responsible for the following detection task. Finally, four parallel sub-networks are used to predict the relevant parameters of the oriented bounding boxes. Specifically, the Center Heatmap branch is used to predict the center probability, for localizing the object's center. A refined position of the center is obtained from the Center offset branch. The Orientation branch is responsible for predicting the object's direction by using the Circular Smooth Label, and the corresponding height and width are obtained from the Object size branch.

**Figure 2.** Overall architecture of our proposed center-point based arbitrary-oriented object detector.

#### *3.2. Detecting Arbitrary-Oriented Object by Its Center Point*

#### 3.2.1. Center Point Localization

Let *W* and *H* be the width and height of the input image. We aim to let the network predict a category-specific center point heatmap $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, based on the features extracted from the backbone, where *R* is the stride between the input and feature *P*2 (as shown in Figure 2), and *C* is the number of object categories (*C* = 15 in DOTA, 1 in HRSC2016). *R* was set to four, following [37]. The predicted value $\hat{Y} = 1$ denotes a detected center point of the object, while $\hat{Y} = 0$ denotes background.

We followed [36,37] to train the center prediction networks. Specifically, for each object's center (*px*, *py*) of class *c*, a ground-truth positive location $(\tilde{p}_x, \tilde{p}_y) = (\lfloor \frac{p_x}{R} \rfloor, \lfloor \frac{p_y}{R} \rfloor)$ is responsible for predicting it, and all other locations are negative. During training, equally penalizing negative locations can severely degrade the performance of the network; this is because, if a negative location is close to the corresponding ground-truth positive location, it can still represent the center of the object within a certain error range. Thus, simply treating it as a negative sample will increase the difficulty of learning object centers. So, we alleviated the penalty for negative locations within a radius of the positive location. This radius, *r*, is determined by the object size in an adaptive manner: a pair of diagonal points within the radius can generate a bounding box exceeding a certain Intersection over Union (IoU) with the ground-truth box; the IoU threshold is set to 0.5 in this work. Finally, the ground-truth heatmap $Y \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ used to reduce the penalty is generated as follows: we place all ground-truth center points onto *Y* and pass them through the Gaussian kernel $K_{xyc}$:

$$K_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right) \tag{1}$$

$$
\sigma_p = r/3. \tag{2}
$$
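As an illustration of Equations (1) and (2), the following minimal sketch writes Gaussian peaks into a single-class heatmap and merges overlapping peaks of the same class with an element-wise maximum. The function name `draw_gaussian`, the list-of-rows heatmap layout, and the truncation of the kernel at the radius are assumptions made for this example; they are not taken from the authors' code.

```python
import math


def draw_gaussian(heatmap, center, radius):
    """Write one Gaussian peak into a 2D heatmap (list of rows), in place.

    sigma = radius / 3 follows Eq. (2). Overlapping peaks are merged with an
    element-wise maximum. Truncating the kernel at `radius` (radius >= 1) is
    an assumption of this sketch.
    """
    cx, cy = center
    sigma = radius / 3.0
    height, width = len(heatmap), len(heatmap[0])
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            value = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


# Two nearby centers of the same class on a 16 x 16 heatmap.
heatmap = [[0.0] * 16 for _ in range(16)]
draw_gaussian(heatmap, center=(5, 5), radius=3)
draw_gaussian(heatmap, center=(7, 5), radius=3)
```

Both centers keep the value 1, and the cell between them holds the larger of the two Gaussian responses rather than their sum.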

We use the element-wise maximum operation if two Gaussians of the same class overlap. The loss function for center point prediction is a variant of focal loss [31], formulated as:

$$L_{center} = -\frac{1}{N} \sum_{x, y, c} \begin{cases} (1 - \hat{Y}(x, y, c))^{\alpha} \log(\hat{Y}(x, y, c)) & \text{if } Y(x, y, c) = 1\\ (1 - Y(x, y, c))^{\beta} \, \hat{Y}(x, y, c)^{\alpha} \log(1 - \hat{Y}(x, y, c)) & \text{otherwise} \end{cases} \tag{3}$$

where *N* is the total number of objects in the image, and *α* and *β* are the hyperparameters controlling the contribution of each point (*α* = 2 and *β* = 4, by default, following [37]).
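The penalty-reduced focal loss of Equation (3) can be sketched in plain Python as follows. This is an illustrative version over flattened heatmaps, not the authors' implementation: the function name `center_focal_loss`, the clamping constant `eps`, and the use of the number of positive cells as *N* are assumptions of this example.

```python
import math


def center_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss over flattened heatmaps, as in Eq. (3).

    pred   -- predicted center probabilities
    target -- ground-truth heatmap values in [0, 1]; exactly 1.0 at centers
    The normaliser N is taken as the number of positive cells (assumption).
    """
    num_pos = 0
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if t == 1.0:
            total += (1.0 - p) ** alpha * math.log(p)
            num_pos += 1
        else:
            # (1 - t)^beta shrinks the penalty for cells close to a center.
            total += (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return -total / max(num_pos, 1)


# A false response next to a center (target 0.9) costs far less than the
# same response in pure background (target 0.0).
near_center = center_focal_loss([1.0, 0.5], [1.0, 0.9])
background = center_focal_loss([1.0, 0.5], [1.0, 0.0])
```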

As the predicted $\hat{Y}$ has a stride of *R* with respect to the input image, the center point position obtained from $\hat{Y}$ will inevitably have a quantization error. Thus, a Center offset branch was introduced to eliminate this error. The model predicts $\hat{o} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times 2}$, in order to refine the object's center. For each object's center *p* = (*px*, *py*), smooth L1 loss [26] is used during training:

$$L_{offset} = \frac{1}{N} \sum_{p} Smooth_{L1}\left(\hat{o}_{\tilde{p}}, \frac{p}{R} - \left\lfloor \frac{p}{R} \right\rfloor\right). \tag{4}$$

Then, combining $\hat{Y}$ and $\hat{o}$, we can accurately locate the object's center.
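To make the quantization error concrete, the sketch below computes the heatmap cell and the offset target of Equation (4) for one center, and then recovers the original position from the two. The helper names `center_targets` and `decode_center` are hypothetical and chosen only for this example.

```python
import math


def center_targets(px, py, stride=4):
    """Return the integer heatmap cell and the offset target for one center."""
    fx, fy = px / stride, py / stride
    cell = (math.floor(fx), math.floor(fy))
    offset = (fx - cell[0], fy - cell[1])  # p/R - floor(p/R), each in [0, 1)
    return cell, offset


def decode_center(cell, offset, stride=4):
    """Combine a heatmap peak and its predicted offset into image coordinates."""
    return ((cell[0] + offset[0]) * stride, (cell[1] + offset[1]) * stride)


cell, offset = center_targets(103.0, 58.0)
recovered = decode_center(cell, offset)
```

Without the offset, the center would be decoded as (100, 56), an error of up to *R* − 1 pixels per axis.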

#### 3.2.2. Angle Prediction for Oriented Objects

In this section, we first introduce the five-parameter long side-based representation for oriented objects and analyze the angular boundary discontinuity problem. Then, we detail the circular smooth label, in order to solve the boundary discontinuity problem and predict the angles of oriented objects.

**Representations for Oriented Objects.** As we discussed in Section 1, the use of oriented bounding boxes can better depict objects in remote sensing images. We use five-parameter long side-based methods to represent the oriented objects. As shown in Figure 3, five parameters (*Cx*, *Cy*, *h*, *w*, *θ*) were used to represent an OBB, where *h* represents the long side of the bounding box, the other side is referred to as *w*, and *θ* is the angle between the long side and x-axis, with a 180° range. Compared to the HBB, OBB needs an extra parameter, *θ*, to represent the direction information.

**Figure 3.** Five-parameter long side-based representation for oriented objects.

As there are generally various angles of an object in remote sensing images, accurately predicting the direction is important, especially for objects with large aspect ratios. Due to the periodicity of the angle, directly regressing the angle *θ* may lead to the boundary discontinuity problem, resulting in a large loss value during training. As illustrated in Figure 4, two oriented objects can have relatively similar directions while crossing the angular boundary, resulting in a large difference between regression values. This discontinuous boundary can interfere with the network's learning of the object direction and, thus, degrade the model's performance.

**Figure 4.** An example of discontinuous angular boundary based on the five-parameter long side representation.
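The long side-based representation and the boundary problem can be illustrated with a short sketch. It assumes that angles are measured in degrees and wrapped into [0, 180); the text only states that the range spans 180°, so the exact endpoints are an assumption, as are the helper names `to_long_side` and `circular_angle_diff`.

```python
def to_long_side(side_a, side_b, angle_a_deg):
    """Convert a box given by two sides and the angle of side_a to the x-axis
    into (long side h, short side w, theta), with theta wrapped to [0, 180)."""
    if side_a >= side_b:
        long_side, short_side, theta = side_a, side_b, angle_a_deg
    else:
        # The other side is the long one; it is perpendicular to side_a.
        long_side, short_side, theta = side_b, side_a, angle_a_deg + 90.0
    return long_side, short_side, theta % 180.0


def circular_angle_diff(theta_1, theta_2, period=180.0):
    """Smallest difference between two angles on a circle of length `period`."""
    d = abs(theta_1 - theta_2) % period
    return min(d, period - d)


# Two nearly parallel objects on either side of the angular boundary.
naive_diff = abs(179.0 - 1.0)
true_diff = circular_angle_diff(179.0, 1.0)
```

A regression loss sees the large `naive_diff`, although the two orientations differ only by `true_diff` degrees.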

**Circular Smooth Label.** Following [19], we convert the angle regression problem into a classification problem. As the five-parameter long side-based representation has a 180° angle range, each 1° interval is treated as a category, which results in 180 categories in total. Then, the one-hot angle label passes through a periodic function, followed by a Gaussian function to smooth the label, formulated as:

$$CSL(x) = \begin{cases} g(x) & \theta - r_{csl} < x < \theta + r_{csl} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where *g*(*x*) is the Gaussian function, which satisfies *g*(*x*) = *g*(*x* + *kT*), *k* ∈ ℕ, *T* = 180; and $r_{csl}$ is the radius of the Gaussian function, which controls the smoothing degree of the angle label. For example, when $r_{csl} = 0$, the Gaussian function becomes a pulse function and the CSL degrades into the one-hot label. We illustrate the CSL in Figure 5.

**Figure 5.** Visualization of the circular smooth label.
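A minimal sketch of Equation (5) is given below. It builds a 180-bin label whose Gaussian window wraps around the angular boundary. The Gaussian width is not specified in the text, so setting `sigma` equal to the radius by default is an assumption of this example, as is the function name `circular_smooth_label`.

```python
import math


def circular_smooth_label(theta, radius, num_bins=180, sigma=None):
    """Circular smooth label for an integer angle bin `theta`, as in Eq. (5).

    Bins whose circular distance to theta is below `radius` receive a
    Gaussian value; all other bins are 0. radius = 0 yields a one-hot label.
    sigma defaults to the radius (assumption; not stated in the text).
    """
    sigma = float(radius) if sigma is None else sigma
    label = [0.0] * num_bins
    for x in range(num_bins):
        d = abs(x - theta) % num_bins
        d = min(d, num_bins - d)  # periodic distance, period T = num_bins
        if d == 0:
            label[x] = 1.0
        elif d < radius:
            label[x] = math.exp(-d * d / (2.0 * sigma * sigma))
    return label


label = circular_smooth_label(theta=179, radius=4)
one_hot = circular_smooth_label(theta=90, radius=0)
```

For an object at 179°, the bins at 0° and 178° receive the same label value, so a prediction just across the boundary is penalized no more than one just inside it.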

The loss function for the CSL is not the commonly used Softmax Cross-Entropy loss; as we use a smooth label, Sigmoid Binary Cross-Entropy is used to train the angle prediction network. Specifically, the model predicts $\hat{\theta} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times 180}$ for an input image, and the loss function is:

$$L_{CSL} = \frac{1}{N} \sum_{p} BCE(\hat{\theta}_p, \theta_p), \tag{6}$$

where $\theta_p$ is the circular smooth label for object *p* in the image.

#### 3.2.3. Prediction of Object Size

The OBBs are represented by (*Cx*, *Cy*, *h*, *w*, *θ*); the center location and direction of each object are obtained as described in Sections 3.2.1 and 3.2.2. The remaining parameters (i.e., the long side *h* and short side *w*) are predicted through the Object size branch shown in Figure 2. The model outputs $\hat{S} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ for the object size. For each object *p*, with corresponding size label $s_p = (h_p, w_p)$, smooth L1 loss is used:

$$L_{size} = \frac{1}{N} \sum_{p} Smooth_{L1}\left(\hat{S}_p, \ln\left(\frac{s_p}{R}\right)\right). \tag{7}$$

Note that the smooth L1 loss used in this paper is (*δ* = 1/9 by default):

$$Smooth_{L1}(x) = \begin{cases} \frac{1}{2\delta} x^2 & \text{if } |x| < \delta \\ |x| - \frac{\delta}{2} & \text{otherwise.} \end{cases} \tag{8}$$
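The size regression can be sketched as follows, assuming the form of smooth L1 that is common in detection codebases: quadratic, x²/(2δ), for |x| < δ and linear, |x| − δ/2, otherwise, which makes the two branches meet continuously at |x| = δ. The function names and the choice to sum the loss over the two size components are assumptions of this example.

```python
import math


def smooth_l1(x, delta=1.0 / 9.0):
    """Smooth L1 with transition point delta (continuous at |x| = delta)."""
    ax = abs(x)
    if ax < delta:
        return 0.5 * ax * ax / delta
    return ax - 0.5 * delta


def size_loss(pred_log_size, size, stride=4, delta=1.0 / 9.0):
    """Size loss for one object, in the spirit of Eq. (7).

    pred_log_size -- network outputs for (h, w)
    size          -- ground-truth (h, w) in input-image pixels
    The regression target is ln(size / stride); summing over the two
    components is an assumption of this sketch.
    """
    targets = [math.log(s / stride) for s in size]
    return sum(smooth_l1(p - t, delta) for p, t in zip(pred_log_size, targets))


# A perfect prediction for a 40 x 20 pixel box at stride 4.
perfect = size_loss((math.log(10.0), math.log(5.0)), (40.0, 20.0))
```

Regressing the logarithm of the size keeps the targets for very small and very large objects in a comparable numeric range.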

The overall training objective for our arbitrary-oriented object detector is:

$$L = L_{center} + \lambda_{angle} L_{CSL} + \lambda_{size} L_{size} + \lambda_{offset} L_{offset} \tag{9}$$

where $\lambda_{angle}$, $\lambda_{size}$, and $\lambda_{offset}$ are used to balance the weighting between different tasks. In this paper, $\lambda_{angle}$, $\lambda_{size}$, and $\lambda_{offset}$ are set to 0.5, 1, and 1, respectively.

#### *3.3. Feature Enhancement by Deformable FPN*

We aim to better localize the object's center and corresponding direction by building a pyramidal feature hierarchy on the network's output features. The feature maps extracted by the backbone are referred to as *C*3, *C*4, and *C*5, shown in Figure 2. These feature maps have different spatial resolutions and large semantic gaps. Low-resolution maps have strong semantic information, which has great representational capacity for object detection, especially for large objects (e.g., soccer fields) in aerial images, while high-resolution maps have relatively low-level features but can provide more detailed information, which is very important for detecting small objects. Due to the various orientations and large scale differences of objects in remote sensing images, the standard FPN [25] used to fuse these feature maps may not work well in this situation. The standard convolution kernel samples a regular rectangular grid, which has the characteristic of translation invariance. Meanwhile, the resolutions of these feature maps differ, and the semantic information of objects is not strictly aligned across these feature maps. Therefore, using standard convolution to project these features before the add operation may harm the representation ability for oriented objects, which is essential to accurately localize the object's center and direction. In contrast, deformable convolution (DConv) can learn the position of convolution kernels adaptively, which can better project the features of oriented objects in the feature pyramid network. We detail the structure of Deformable FPN in the following, and demonstrate its effectiveness in Section 4.

#### 3.3.1. Structure of Deformable FPN

To verify the effectiveness of our method, we introduce three kinds of necks, including our Deformable FPN, to process backbone features to *P*2, which are subsequently sent to the detection head. Figure 6 shows detailed architectures of the three necks, using ResNet50 [48] as a backbone. A direct Top-down pathway is constructed without building the feature pyramid structure (Figure 6a) but, instead, using deformable convolutions, as originally used by [37] for ResNet. Our proposed Deformable FPN is shown in Figure 6b, while a commonly used FPN structure is shown in Figure 6c. We keep the same channels of features in each stage, which are 256, 128, and 64 for features with stride 16, 8, and 4, respectively.

**Figure 6.** Different kinds of necks to process the backbone features: (**a**) A direct Top-down pathway without the feature pyramid structure; (**b**) our proposed Deformable FPN; and (**c**) standard FPN.


Compared with the direct Top-down structure (DTS), our Deformable FPN reuses the shallow, high-resolution features of the backbone, which provide more detailed texture information to better localize the object center and detect small objects, such as vehicles and bridges, in remote sensing images. Compared with FPN, our Deformable FPN uses deformable convolution, which adaptively learns the position of convolution kernels, and can therefore better project the features of oriented objects. Moreover, applying transposed convolution, rather than nearest neighbor interpolation, to up-sample the features can help to better localize the centers.

#### 3.3.2. Deformable Groups

As we use deformable convolution in the feature pyramid structure, we discuss how larger Deformable groups in DConv can further enhance the representation power of the network in this section.

The deformable convolution used in this paper is DCNv2 [21]. For a convolutional kernel and *K* sampling locations, the deformable convolution operation can be formulated as follows:

$$y(p) = \sum_{k=1}^{K} \omega_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k, \tag{10}$$

where *x*(*p*) and *y*(*p*) denote the feature at location *p* on input feature map *x* and output feature map *y*, respectively; the pre-set convolution kernel location is denoted as $p_k$ and $\omega_k$ is the kernel weight; and $\Delta p_k$ and $\Delta m_k$ are the learnable kernel offset and scalar weight based on input feature, respectively. Take a 3 × 3 deformable convolutional kernel as an example: there are *K* = 9 sampling locations. For each location *k*, a two-dimensional vector ($\Delta p_k$) is used to determine the offsets in the x- and y-axes, and a one-dimensional tensor is used for the scalar weight ($\Delta m_k$). So, the network first predicts offset maps, which have 3*K* channels based on the input features, then uses the predicted offsets to find *K* convolution locations at each point *p*. Finally, Equation (10) is used to calculate the output feature maps. We illustrate this process in Figure 7a.

**Figure 7.** Illustration of 3 × 3 deformable convolution: (**a**) One deformable group; and (**b**) *n* deformable groups.
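Equation (10) can be made concrete with a small single-channel sketch that evaluates a 3 × 3 modulated deformable convolution at one output location. Fractional sampling positions are read with bilinear interpolation, as is usual for deformable convolution; zero padding outside the map and the helper names `bilinear` and `deform_conv_at` are assumptions of this example, which is not an efficient or complete implementation.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly sample a 2D list at fractional (y, x); zero outside the map."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value


def deform_conv_at(feature, p, weights, offsets, masks):
    """Evaluate Eq. (10) at location p = (row, col) for a 3 x 3 kernel.

    weights -- 9 kernel weights (omega_k)
    offsets -- 9 learned (dy, dx) offsets (Delta p_k)
    masks   -- 9 learned modulation scalars (Delta m_k)
    """
    py, px = p
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # fixed p_k
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        oy, ox = offsets[k]
        out += weights[k] * bilinear(feature, py + dy + oy, px + dx + ox) * masks[k]
    return out


feature = [[float(5 * r + c) for c in range(5)] for r in range(5)]
mean_kernel = [1.0 / 9.0] * 9
regular = deform_conv_at(feature, (2, 2), mean_kernel, [(0.0, 0.0)] * 9, [1.0] * 9)
shifted = deform_conv_at(feature, (2, 2), mean_kernel, [(0.5, 0.5)] * 9, [1.0] * 9)
```

With zero offsets and unit masks the operation reduces to a standard 3 × 3 convolution; non-zero offsets move every sampling point, and the masks scale each sample's contribution. The 2*K* offset values plus *K* mask values per location account for the 3*K* offset-map channels.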

Note that all channels in the input feature maps share one group of offsets when the number of deformable groups is set to 1 (as shown in Figure 7a). Input features share these common offsets to perform the deformable convolution. When the number of deformable groups is *n* (*n* > 1), the networks first output *n* × 3*K*-channel offset maps, the input feature (*C* channels) is divided into *n* groups, where each group of features has *C*/*n* channels, and the corresponding 3*K*-channel offset maps are used to calculate the kernel offsets (as shown in Figure 7b). Finally, the output feature will be obtained by deformable convolution on the input feature. Different from the groups in the standard convolutional operation, each channel in the output features will be calculated on the entire input features only, with different kernel offsets. Increasing the number of deformable groups can enhance the representation ability of DConv, as different groups of input channels use different kernel offsets, and the network can generate a unique offset for each group of features, according to the characteristics of the input features.
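The bookkeeping for deformable groups can be summarized in a few lines. The sketch below only counts offset-map channels and maps an input channel to its offset group; the function names are hypothetical, and contiguous grouping of input channels is an assumption of this example.

```python
def offset_channels(kernel_size=3, deformable_groups=1):
    """Channels of the predicted offset map: n groups x 3K values
    (2K offsets plus K modulation scalars per group)."""
    k = kernel_size * kernel_size
    return deformable_groups * 3 * k


def channel_to_group(channel, in_channels, deformable_groups):
    """Index of the offset group used by a given input channel, assuming
    contiguous groups of in_channels / deformable_groups channels."""
    return channel // (in_channels // deformable_groups)


single = offset_channels(3, 1)
grouped = offset_channels(3, 4)
```

With one group, all input channels sample at the same 9 locations; with four groups on a 256-channel input, each block of 64 channels gets its own set of sampling locations.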

#### **4. Experiments**

#### *4.1. Data Sets and Evaluation Metrics*

#### 4.1.1. DOTA

DOTA is a large-scale dataset for object detection in remote sensing images. The images are collected from different sensors and platforms. There are 2806 images, with scales from 800 × 800 to 4000 × 4000 pixels. The proportions of the training set, validation set, and testing set in DOTA are 1/2, 1/6, and 1/3, respectively. The DOTA dataset contains 15 common categories, with 188,282 instances in total. The full names (short names) for the categories are: Plane (PL), Baseball diamond (BD), Bridge (BR), Ground track field (GTF), Small vehicle (SV), Large vehicle (LV), Ship (SH), Tennis court (TC), Basketball court (BC), Storage tank (ST), Soccer-ball field (SBF), Roundabout (RA), Harbor (HA), Swimming pool (SP), and Helicopter (HC).

#### 4.1.2. HRSC2016

HRSC2016 is a dataset for ship detection in aerial images. The HRSC2016 dataset contains images of two scenarios, including ships at sea and ships inshore at six famous harbors. There are 436, 181, and 444 images for training, validation and testing, respectively. The ground sample distances of images are between 2 m and 0.4 m, and the image resolutions range from 300 × 300 to 1500 × 900.

#### 4.1.3. Evaluation Metrics

The Mean Average Precision (mAP) is commonly used to evaluate the performance of object detectors, where the AP is the area under the precision–recall curve for a specific category and takes values in [0, 1]. It is formulated as:

$$Precision = \frac{TP}{TP + FP} \tag{11}$$

$$Recall = \frac{TP}{TP + FN} \tag{12}$$

$$mAP = \frac{1}{C} \sum_{c=1}^{C} \int P_c(R_c) \, dR_c, \tag{13}$$

where *C* is the number of categories, $P_c$ and $R_c$ are the precision and recall for category *c*, and *TP*, *FP*, and *FN* represent the numbers of correctly detected objects, incorrectly detected objects, and missed objects, respectively.
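Equations (11)–(13) can be sketched for a single category as follows. The area under the precision–recall curve is computed here with all-point interpolation, which is one common convention; the text does not state which interpolation scheme the benchmark uses, so this choice, like the function names, is an assumption of the example. Matching detections to ground truth (deciding which detections are true positives) is assumed to have been done beforehand.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one category.

    scores           -- confidence of each detection
    is_true_positive -- whether each detection matched a ground-truth object
    num_ground_truth -- TP + FN, the number of annotated objects
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)   # Eq. (12)
        precisions.append(tp / (tp + fp))       # Eq. (11)
    # Make precision non-increasing in recall (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_category):
    """Average the per-category APs, as in Eq. (13)."""
    return sum(ap_per_category) / len(ap_per_category)


ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
```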

#### *4.2. Implementation Details*

#### 4.2.1. Image Pre-Processing

The images in the DOTA dataset generally have a high resolution. Training directly on the original high-resolution images is impractical, due to limited GPU memory. Therefore, we cropped the images into sub-images of size 1024 × 1024, with an overlap of 256 pixels, and obtained 14,560 labeled images for training. We introduce two methods for testing in this paper. In the first method, we crop the testing images using the same size as used in the training stage (1024 × 1024 pixels) and, after inference on all sub-images, the final detection results are obtained by splicing all sub-image results. This method is commonly used for inference on the test images in the DOTA dataset; however, it may generate some false results at the crop boundaries, leading to poor performance, especially for some categories with large sizes (e.g., Ground track field and Soccer-ball field). The second method involves cropping the testing images with a relatively high resolution (3200 pixels in this paper) during inference. We simply padded the images if the size of the original image was smaller than the crop size. By cropping the testing images at a relatively high resolution, a large number of images will not be cut and, so, the model can detect objects based on the complete instance, thus obtaining a more accurate evaluation result. Note that the only difference between the two methods is the crop size used for testing.
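The cropping scheme can be sketched along one image axis as follows. A crop size of 1024 with an overlap of 256 gives a stride of 768 pixels. Placing the last window flush with the image border is an assumption of this example, as is the function name `crop_origins`; images smaller than the crop size yield a single window, which would then be padded.

```python
def crop_origins(length, crop=1024, overlap=256):
    """Start coordinates of overlapping sliding windows along one axis."""
    stride = crop - overlap
    if length <= crop:
        return [0]  # a single (padded) window covers the whole axis
    origins = list(range(0, length - crop, stride))
    origins.append(length - crop)  # last window aligned to the border
    return origins


training_windows = crop_origins(4000)                        # 1024-pixel crops
testing_windows = crop_origins(4000, crop=3200, overlap=256)  # high-resolution crops
```

On a 4000-pixel axis, the 1024-pixel setting produces five windows, while the 3200-pixel setting produces two, so far fewer objects are cut at crop boundaries.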

For the HRSC2016 dataset, we resized the long side of images to 640 pixels and kept the same aspect ratio as the original images. Thus, the short side of each image was different and smaller than 640 pixels. Then, we uniformly padded the resized images to 640 × 640 pixels, both for training and testing.

#### 4.2.2. Experimental Settings

All experiments were implemented in PyTorch. ImageNet [49]-pretrained ResNets were used as our default backbone. We used the Adam [50] optimizer to optimize the overall networks for 140 epochs. We set a batch size of 12 for DOTA and 32 for HRSC2016. The initial learning rates were $1.25 \times 10^{-4}$ and $2 \times 10^{-4}$ for DOTA and HRSC2016, respectively, with the learning rate reduced by a factor of 10 at epochs 100 and 130. We used a single-scale training strategy with input resolution of 1024 for DOTA and 640 for HRSC2016, as mentioned before, and the stride *R* was set to 4. The Gaussian radii $r_{csl}$ for CSL were set to 4 and 6 for DOTA and HRSC2016, respectively. Our data augmentation methods included random horizontal and vertical flipping, random graying, and random rotation. We did not use multi-scale training and testing augmentations in our experiments.
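The step schedule described above can be written as a small function. This sketch assumes zero-indexed epochs and that the drop takes effect at the start of epochs 100 and 130; neither detail is stated in the text, and the function name `learning_rate` is chosen only for this example.

```python
def learning_rate(epoch, base_lr=1.25e-4, milestones=(100, 130), gamma=0.1):
    """Learning rate at a given epoch for a step schedule that multiplies
    the rate by `gamma` at each milestone (assumed zero-indexed epochs)."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


dota_schedule = [learning_rate(e) for e in (0, 99, 100, 129, 130, 139)]
hrsc_start = learning_rate(0, base_lr=2e-4)
```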

#### *4.3. Results*

#### 4.3.1. Effectiveness of Deformable FPN

Due to the wide variety of object scales, orientations and shapes, we chose DOTA as our main dataset for validation. We implemented a standard feature pyramid network (FPN), a direct Top-down structure (DTS), and our proposed Deformable FPN (De-FPN) as necks to process features from the ResNet50 backbone.

Results are shown in Table 1. We give the average precision of each category and the total mAP. HRT denotes the high-resolution testing discussed in Section 4.2.1. The detector built with FPN achieved 69.68% mAP, which is already a good performance on the DOTA dataset. However, the direct Top-down structure had 1.2% higher mAP than the FPN structure. Note that the DTS does not build a hierarchical feature structure inside the network, yet it performed better than FPN, indicating that deformable convolution can better project features for rotating objects. Furthermore, the interpolation operation used to up-sample the features may harm the representation power needed to predict object centers exactly.

Our Deformable FPN achieved a remarkable improvement of 1.23% higher mAP, compared with DTS, which indicates that Deformable FPN can better fuse the multi-level features and help the detector to accurately localize the rotating objects. Compared with FPN, the advantages of building a hierarchical feature structure in our way are evident: an improvement of 2.43% mAP was obtained through the use of deformable convolution and transposed convolution within the FPN structure. Further, by using high-resolution crops during testing, our detector could obtain a more accurate evaluation result. Specifically, the high-resolution test boosted the mAP by 1.79%, 2.39%, and 1.65% for FPN, DTS, and De-FPN, respectively.
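The mAP values implied by these increments can be checked with a few lines of arithmetic. The sketch below only combines figures stated in the text (the FPN baseline and the reported gains); the derived values are not new measurements, and Table 1 remains the authoritative source.

```python
fpn = 69.68          # mAP with the FPN neck, as stated in the text
dts = fpn + 1.20     # DTS is reported as 1.2% higher than FPN
de_fpn = dts + 1.23  # Deformable FPN is reported as 1.23% higher than DTS

base = {"FPN": fpn, "DTS": dts, "De-FPN": de_fpn}
hrt_gain = {"FPN": 1.79, "DTS": 2.39, "De-FPN": 1.65}  # reported HRT gains
with_hrt = {k: round(base[k] + hrt_gain[k], 2) for k in base}

for neck in base:
    print(f"{neck}: {base[neck]:.2f} -> {with_hrt[neck]:.2f} with HRT")
print(f"De-FPN over FPN: {de_fpn - fpn:.2f}")
```

The derived De-FPN value with HRT coincides with the ResNet50 result reported for CenterRot in Section 4.3.2, which suggests that the increments are mutually consistent.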

**Table 1.** Three kinds of necks are used to build arbitrary-oriented object detectors: Feature pyramid network (FPN), direct Top-down structure (DTS), and Deformable FPN (De-FPN). HRT denotes using High-Resolution crop during Testing. All models use ImageNet-pretrained ResNet50 as a backbone.


#### 4.3.2. Results on DOTA

We compared our results with other state-of-the-art methods on the DOTA dataset. We used ResNet50, ResNet101, and ResNet152 as backbones to construct our arbitrary-oriented, anchor-free object detector, denoted as CenterRot. The results are shown in Table 2. The DOTA dataset contains complex scenes, wherein object scales change drastically. Two-stage methods, such as ROI Transformer [42] and CAD-Net [51], are commonly used on DOTA in order to handle the imbalance between foregrounds and backgrounds in these complex scenes; they achieved 69.59% and 69.90% mAP, respectively, when using ResNet101 as a backbone. Meanwhile, extremely large and small objects can appear in one image (as shown in Figure 1), so multi-scale training and testing are often used to obtain a better performance, as in FADet [52], which obtained 73.28% mAP using ResNet101, and MFIAR-Net [53], which obtained 73.49% mAP using ResNet152 as the backbone. However, multi-scale settings need to infer one image multiple times at different sizes and merge all results after testing, which leads to a larger computational burden during inference.

Our CenterRot converts the oriented object detection problem to a center point localization problem. Because objects of the same category have a low probability of overlapping in remote sensing images, directly detecting an oriented object from its center can lead to performance comparable with that of oriented anchor-based methods. Specifically, CenterRot achieved 73.76% and 74.00% mAP on the OBB task of DOTA, when using ResNet50 and ResNet101 as the backbone, respectively. Due to the strong representation ability of our Deformable FPN for rotated objects, CenterRot, equipped with larger deformable groups (*n* = 16 in Deformable FPN), achieved the best performance (74.75% mAP) when using ResNet152 as the backbone, surpassing all published single-stage methods with single-scale training and testing. Detailed results for each category and method are provided in Table 2.

**Table 2.** State-of-the-art comparison with other methods in the oriented object detection task on the DOTA test set. AP for each category and overall mAP on DOTA are provided (the best result is highlighted in bold), where MS denotes multi-scale training and testing and \* denotes that larger deformable groups (*n* = 16 in Deformable FPN) were used.


#### 4.3.3. Results on HRSC2016

The HRSC2016 dataset contains a single category, ship, and many of the ships have large aspect ratios and various orientations. Therefore, it is still a challenge to detect ships in this dataset. The results are shown in Table 3, from which it can be seen that our CenterRot consistently achieved state-of-the-art performance without using a more complicated architecture than the other methods. Specifically, CenterRot achieved 90.20% and 96.59% for mAP 07 and mAP 12, respectively, where mAP 07 denotes using the 2007 evaluation metric, while mAP 12 denotes using the 2012 evaluation metric.
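The difference between the two metrics can be illustrated with a short sketch, assuming (as is conventional for HRSC2016) that they refer to the PASCAL VOC 2007 11-point interpolated AP and the area-under-curve AP used from VOC 2010 onward. The function names and the toy precision-recall points are hypothetical.

```python
def ap_voc07(recall, precision):
    """11-point interpolated AP: mean of the best precision at recall >= t."""
    ap = 0.0
    for i in range(11):
        t = i / 10
        candidates = [p for r, p in zip(recall, precision) if r >= t]
        ap += (max(candidates) if candidates else 0.0) / 11
    return ap


def ap_voc12(recall, precision):
    """Area under the monotonically decreasing precision envelope."""
    rec = [0.0] + list(recall) + [1.0]
    pre = [0.0] + list(precision) + [0.0]
    # Make precision non-increasing from right to left.
    for i in range(len(pre) - 2, -1, -1):
        pre[i] = max(pre[i], pre[i + 1])
    return sum((rec[i + 1] - rec[i]) * pre[i + 1] for i in range(len(rec) - 1))


# Toy precision-recall points, sorted by increasing recall.
recall = [0.5, 1.0]
precision = [1.0, 0.5]
print(ap_voc07(recall, precision), ap_voc12(recall, precision))
```

Because the first metric samples the curve at eleven recall levels and the second integrates it, the same detections generally receive different scores under the two protocols, so the mAP 07 and mAP 12 columns should not be compared with each other.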


**Table 3.** State-of-the-art comparison on HRSC2016. mAP 07(12) means using the 2007(2012) evaluation metric.

#### 4.3.4. Visualization

The visualization results are presented using our CenterRot. The results for DOTA are shown in Figure 8 and those for HRSC2016 are shown in Figure 9.

**Figure 8.** Visualization of detection results on DOTA.

**Figure 9.** Visualization of detection results on HRSC2016.

#### **5. Discussion**

The proposed CenterRot achieved prominent performance in detecting rotated objects on both the DOTA and HRSC2016 datasets. Objects of the same category have a low probability of overlapping each other, so directly detecting rotated objects from their centers is effective and efficient. We selected several categories in order to further analyze our method. As shown in Table 4, small vehicle, large vehicle, and ship were the most common rotated objects in DOTA, and they typically appeared in a densely arranged manner. Anchor-based methods operate by setting anchors with different angles, scales and aspect ratios per location, in order to cover the rotated objects as much as possible. However, it is impossible to assign appropriate anchors to every object, due to the various orientations in this situation. Our method performed especially well in these categories, because we converted the oriented bounding box regression problem into a center point localization problem. Less overlap between objects means fewer collisions between object centers, so the network can learn the positions of rotated objects from their centers more easily. We also visualized some predicted center heatmaps, as shown in Figure 10. Moreover, since the Deformable FPN can better project features for rotated objects and CSL is used to predict the object direction, our method still performed well for objects with large aspect ratios, such as harbors in DOTA and ships in HRSC2016.
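The notion of a center collision can be made concrete with a small sketch. It assumes one heatmap channel per category at the output stride *R* = 4, so that two objects collide only if they share both the category and the heatmap cell; the function names and the example coordinates are hypothetical.

```python
from collections import Counter


def center_cells(objects, stride=4):
    """Map (category, x, y) centers to (category, column, row) cells."""
    return [(c, int(x // stride), int(y // stride)) for c, x, y in objects]


def count_collisions(objects, stride=4):
    """Number of objects that share a cell with an earlier same-class object."""
    counts = Counter(center_cells(objects, stride))
    return sum(n - 1 for n in counts.values() if n > 1)


# Three densely parked vehicles whose centers are 10 and 30 pixels apart.
vehicles = [
    ("small-vehicle", 100.0, 200.0),
    ("small-vehicle", 110.0, 200.0),
    ("small-vehicle", 140.0, 200.0),
]
print(count_collisions(vehicles, stride=4))   # fine output grid
print(count_collisions(vehicles, stride=32))  # much coarser grid
```

At the fine stride, the three centers fall into distinct cells, while the coarser grid merges two of them. Objects that do not overlap therefore keep separable centers as long as the output stride is small relative to the object size.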

**Table 4.** Comparison of selected categories in DOTA. All methods use ResNet152 as a backbone.


However, as we cut the original images, some large objects, such as the soccer ball field, were incomplete during training, which may confuse our detector when localizing the exact center, resulting in relatively poor performance in these categories. In addition, we use the five-parameter long side-based representation for oriented objects, which creates some ambiguity when representing square-like objects (objects with a small aspect ratio). As a result, the model produces a large loss value when predicting the angle and size of these objects and performs poorly in such categories, for example roundabout. Other oriented representations, such as the five-parameter acute angle-based method [19], avoid this problem but suffer from EoE problems. Therefore, it is still worth studying how to better represent rotated objects.
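The ambiguity for square-like objects can be reproduced numerically. The sketch below assumes a long side-based parameterization in which the angle is measured from the x-axis to the long side and lies in [−90°, 90°); the function name `obb_corners` and the example boxes are hypothetical.

```python
import math


def obb_corners(cx, cy, long_side, short_side, angle_deg):
    """Sorted corners of an oriented box (angle: x-axis to the long side)."""
    t = math.radians(angle_deg)
    ux, uy = math.cos(t), math.sin(t)  # unit vector along the long side
    vx, vy = -uy, ux                   # unit vector along the short side
    hl, hs = long_side / 2, short_side / 2
    pts = [
        (cx + sl * hl * ux + ss * hs * vx, cy + sl * hl * uy + ss * hs * vy)
        for sl in (1, -1)
        for ss in (1, -1)
    ]
    # Round so that tiny floating-point differences do not affect comparison.
    return sorted((round(x, 6), round(y, 6)) for x, y in pts)


square_a = obb_corners(50, 50, 10, 10, 30)
square_b = obb_corners(50, 50, 10, 10, -60)
rect_a = obb_corners(50, 50, 20, 10, 30)
rect_b = obb_corners(50, 50, 20, 10, -60)

print(square_a == square_b)  # the same square under two angle labels
print(rect_a == rect_b)      # an elongated box under the same two angles
```

For the square, two angle labels that differ by 90° describe an identical box, so a prediction that is geometrically correct can still be penalized by the angle loss. For the elongated box, the two labels describe different boxes, and the ambiguity disappears.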

Future work will mainly involve improving the effectiveness and robustness of the proposed method in real-world applications. In contrast to the classical benchmark datasets, the objects in real input images can vary much more and can be affected by other conditions, such as the angle of insolation. Moreover, as cloudy weather is very common, clouds can occlude some objects. The anchor-free rotated object detection problem under such circumstances is also worth studying.

**Figure 10.** Visualization of predicted center heatmaps for some categories.

#### **6. Conclusions**

In this paper, we found that objects within the same category tend to have less overlap with each other in remote sensing images, and that setting multiple anchors per location to detect rotated objects may not be necessary. We proposed an anchor-free, arbitrary-oriented object detector that detects rotated objects from their centers and achieves strong performance without pre-set anchors, which avoids complex computations on anchors, such as IoU. To accurately localize the object center under complex backgrounds and the arbitrary orientations of rotated objects, we proposed a deformable feature pyramid network to fuse the multi-level features and obtained a better feature representation for detecting rotated objects. Experiments on DOTA showed that our Deformable FPN can better project the features of rotated objects than standard FPN. Our CenterRot achieved state-of-the-art performance, with 74.75% mAP on DOTA and 96.59% on HRSC2016, using a single-stage model with single-scale training and testing. Extensive experiments demonstrated that detecting arbitrary-oriented objects from their centers is, indeed, an effective baseline choice.

**Author Contributions:** Conceptualization, J.W., L.Y. and F.L.; methodology, J.W.; software, J.W.; validation, J.W. and L.Y.; formal analysis, J.W., L.Y. and F.L.; investigation, J.W.; resources, F.L.; data curation, J.W.; writing—original draft preparation, J.W.; writing—review and editing, J.W., L.Y. and F.L.; visualization, J.W.; supervision, L.Y. and F.L.; project administration, F.L.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China grant number U1903213.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The DOTA and HRSC2016 datasets used for this study can be accessed at https://captain-whu.github.io/DOTA/dataset.html and https://sites.google.com/site/hrsc2016/ (accessed on 10 August 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

