Article

A Novel Transformer-Based Adaptive Object Detection Method

Shuzhi Su, Runbin Chen, Xianjin Fang and Tian Zhang
1 School of Computer Science and Engineering, Anhui University of Science & Technology, Huainan 232001, China
2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(3), 478; https://doi.org/10.3390/electronics12030478
Submission received: 7 December 2022 / Revised: 10 January 2023 / Accepted: 13 January 2023 / Published: 17 January 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract

To accurately detect multi-scale remote sensing objects in complex backgrounds, we propose a novel transformer-based adaptive object detection method. The backbone network of the method is a dual attention vision transformer network that utilizes spatial window attention and channel group attention to capture feature interactions between different objects in complex scenes. We further design an adaptive path aggregation network. In the designed network, CBAM (Convolutional Block Attention Module) is utilized to suppress background information in the fusion paths of different-level feature maps, and new paths are introduced to fuse same-scale feature maps to increase the feature information of the feature maps. The designed network can provide more effective feature information and improve the feature representation capability. Experiments conducted on the three datasets of RSOD, NWPU VHR-10, and DIOR show that the mAP of our method is 96.9%, 96.6%, and 81.7%, respectively, outperforming the compared object detection methods. The experimental results show that our method can detect remote sensing objects more accurately.

1. Introduction

As remote sensing image acquisition technology develops, a large number of high-resolution remote sensing images are used for object detection; these images provide details about objects and help to detect them better. However, the complex backgrounds and multi-scale objects in remote sensing images make it difficult to accurately identify and locate remote sensing objects.
In the past few years, various object detection algorithms based on convolutional neural networks (CNN) [1] have been proposed and widely used in the remote sensing field. Faster R-CNN [2], which is a typical representative of two-stage object detection algorithms, has been employed for remote sensing object detection [3]. However, it suffers from the problem of low detection accuracy. To solve the problem, researchers have improved Faster R-CNN in several aspects. Ji et al. [4] designed a new feature fusion network and used ResNet50 as the backbone network, achieving better detection performance. Yan et al. [5] enhanced the feature extraction capability by replacing the backbone network of Faster R-CNN with ResNet101, which improved the detection accuracy. Dong et al. [6] used the Soft-NMS [7] algorithm to process the bounding boxes in the region proposal network, reducing missed detections. Zhang et al. [8] improved the anchors and used the Soft-NMS algorithm in post-processing, which enhanced the accuracy and recall. Faster R-CNN has higher detection accuracy, but it is time-consuming in the region selection stage, resulting in slower image processing. Researchers have improved various single-stage object detection algorithms to balance detection accuracy and speed. Xu et al. [9] used DenseNet [10] to replace Darknet53 in YOLOv3 [11]. Wang et al. [12] replaced the backbone network with ResNeXt. In addition, to improve the detection performance, Hong et al. [13] regenerated the anchors in YOLOv3 by the k-means++ algorithm and assigned four anchors for each scale of the feature maps. Cengil et al. [14] used the YOLOv5 algorithm with different parameters to detect the faces of cats and dogs and eventually achieved better results with a mAP of 94.1%. Zakria et al. [15] improved YOLOv4 [16] by setting different NMS algorithm thresholds for different classes of objects, reducing false and missed detections. Cengil et al. [17] used ShuffleNetv2 and YOLOv5 to detect hardhat wearing, achieving a recall value of 0.942 and an accuracy of 0.91. Additionally, a new anchor assignment strategy was proposed to improve the adaptability of anchors. The above-mentioned CNN-based methods have higher detection accuracy compared to traditional algorithms. However, CNN-based methods appear to be approaching a performance bottleneck.
Transformers have recently been employed in computer vision tasks [18,19,20,21,22,23] and achieve good performance. Applying the transformer model to the remote sensing field is therefore a worthwhile exploration. In order to identify and locate multi-scale remote sensing objects accurately in complex scenes, the backbone network needs the ability to capture long-range information. The weak inductive bias of the transformer means that it usually performs poorly on small datasets. Designing a good transformer model and making it work on downstream tasks such as object detection requires pre-training on large datasets, which is expensive. Therefore, selecting a transformer network with a pre-trained model as the backbone can be considered. In order to suppress background feature information in remote sensing images and fuse more effective feature information into the feature maps, it is necessary to use an attention module in the feature fusion path to process the feature information. Additionally, to further improve detection performance, a novel non-maximum suppression algorithm can be used in post-processing.
In order to improve the detection accuracy for remote sensing objects, we propose a novel transformer-based adaptive object detection method in this paper. Firstly, we select a novel transformer model called dual attention vision transformer (DaViT) [19] as the backbone network. DaViT uses spatial window attention and channel group attention to overcome the limitations of CNN. The combination of these two self-attention mechanisms can establish connections between all pixels, allowing the feature information of different objects to interact. In order to provide more effective feature information for the feature maps, we propose an adaptive path aggregation network (APAN) based on the path aggregation network (PAN) [24]. The convolutional block attention module (CBAM) [25] is introduced into the feature fusion paths of APAN to highlight object feature information and suppress background information. Additionally, a novel feature reuse strategy is used in APAN, which fuses same-level feature maps at different layers to enrich the feature representations. Finally, we use the weighted boxes fusion (WBF) [26] algorithm to process the candidate bounding boxes, which improves the positioning accuracy of the optimal bounding boxes.
The remainder of this paper is organized as follows. Section 2 briefly reviews the related work, including convolutional neural networks, transformers, feature fusion networks, and non-maximum suppression algorithms. Section 3 presents the proposed method and describes each of its components in detail. In Section 4, extensive experiments are conducted to prove the effectiveness of our method, and the experimental results are discussed and analyzed. Finally, the conclusion of the whole paper is given in Section 5.

2. Related Work

2.1. Convolutional Neural Network

CNN has become the backbone architecture in the era of deep learning and is widely applied to image classification, object detection, and instance segmentation. However, CNN is good at extracting local information by convolution operations [27,28] but not global information. In order to enable CNNs to capture global information, one approach is to make the network deeper by stacking convolutional layers [27,28], which often increases the number of parameters. Another approach expands the receptive field of CNN by increasing the size of the convolution kernel, which requires more destructive pooling operations [29]. Although both approaches improve the performance of CNN, directly using a self-attention mechanism may be more suitable for remote sensing object detection tasks.

2.2. Transformer

As a sequence encoder, the transformer was originally used in natural language processing. Unlike recurrent neural networks, which process sequence information recursively and only focus on local contextual information, the transformer can learn global relationships by attending to the complete sequence. Some researchers have applied transformers to computer vision tasks and achieved great success. Vision transformer (ViT) [18] is the first transformer model applied to image classification and has shown good performance on this task. Detection transformer (DETR) [20] is the first end-to-end object detector using a transformer encoder–decoder architecture; it achieves good detection accuracy in object detection tasks but has low computational efficiency. REGO-DETR [30] mitigates the training difficulty through region-of-interest (RoI)-based detection refinement, but the crucial components in these models are trained from scratch, which limits the robustness and generalization ability of the detection models. Deformable DETR [21] improves training efficiency by replacing dense attention with deformable attention, and it uses multi-scale features to improve detection performance, which leads to an increase in computational cost. Sparse DETR [31] lightens the attention complexity in the encoder by sparsifying the encoder tokens, and its total computational cost is significantly lower than that of Deformable DETR. Inspired by the above-mentioned transformer models, we adopted a transformer network as the backbone network of our method.

2.3. Feature Fusion Network

Rich feature information is required for predicting the positions and classes of remote-sensing objects at different levels. In general, high-level feature maps contain rich semantic information but cannot provide an accurate prediction of object positions because of the lack of detailed information. In order to solve the problem, some feature fusion networks were proposed.
Feature pyramid network (FPN) [32] delivers semantic information to the low-level feature maps by upsampling the high-level feature maps and fusing them with the low-level ones, so that the low-level feature maps also contain rich semantic information. However, FPN ignores the fact that the high-level feature maps contain little detailed information. To overcome this shortcoming, PAN [24] adds a new path on top of FPN for fusing feature maps at different levels. In this path, the low-level feature maps are downsampled to match the resolution of the high-level feature maps and then fused with them, which makes the high-level feature maps contain rich detailed information.

2.4. Non-Maximum Suppression Algorithm

The candidate bounding boxes obtained in the prediction phase are usually processed by the NMS algorithm. As a greedy algorithm, NMS selects the optimal bounding box according to the classification confidence and then uses it to suppress the other candidate bounding boxes. This process is repeated until all the candidate bounding boxes are processed. The classification confidence $S_f$ during the suppression process is calculated as follows:

$$
S_f=\begin{cases} S_i, & IoU(M,b_i)<N_t \\ 0, & IoU(M,b_i)\ge N_t \end{cases}
\tag{1}
$$

where $M$ represents the optimal bounding box, $b_i$ represents a candidate bounding box, and $N_t$ represents a manually set threshold. When the overlap between $M$ and $b_i$ is not less than $N_t$, the classification confidence of $b_i$ is set to 0.
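For concreteness, the greedy suppression procedure above can be written as the following minimal NumPy sketch; the function name, the [x1, y1, x2, y2] box layout, and the default threshold are illustrative choices rather than a reference implementation:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list:
    """Greedy NMS. boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # sort candidates by classification confidence
    keep = []
    while order.size > 0:
        i = order[0]                          # current optimal box M
        keep.append(int(i))
        # IoU between M and the remaining candidates b_i
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # candidates with IoU >= N_t have their confidence set to 0, i.e. are suppressed
        order = order[1:][iou < iou_threshold]
    return keep
```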

3. Overview of Our Method

The overall structure of our method is shown in Figure 1, which consists of three parts: the backbone network, the adaptive path aggregation network, and the bounding box prediction network. The three parts are described in detail in the following sections.

3.1. Backbone Network

Various object detection algorithms usually use CNN as the backbone network. However, it is difficult for CNN to capture long-range dependencies. In order to address this problem, DaViT is used as the backbone network, which is constructed mainly from dual attention blocks. It takes advantage of spatial window attention and channel group attention to build long-distance relationships. The network structure of DaViT is shown in Table 1 and is divided into 4 stages. Each stage contains one patch embedding layer and several dual attention blocks. The patch embedding layer is used to reduce the computational cost by compressing the feature information from the spatial dimension into the channel dimension. Table 1 shows the specific information of each patch embedding layer: the first patch embedding layer has a 7 × 7 convolution kernel with a stride of 4, and the other three patch embedding layers all have a 2 × 2 convolution kernel with a stride of 2. A dual attention block consists of a spatial window multi-head attention block, a channel group attention block, and two feed-forward networks, whose structure is shown in Figure 2a. A residual structure is introduced in these blocks to avoid vanishing gradients during training.
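As an illustration, the patch embedding layers described above (a 7 × 7 convolution with stride 4 in stage 1, a 2 × 2 convolution with stride 2 afterwards, cf. Table 1) can be sketched as follows; the LayerNorm placement and the channels-last output layout are readability assumptions rather than the exact DaViT implementation:

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Convolutional patch embedding: stage 1 uses a 7x7 conv with stride 4 (pad 3),
    later stages use a 2x2 conv with stride 2 (matching the settings in Table 1)."""
    def __init__(self, in_ch: int, out_ch: int, first_stage: bool):
        super().__init__()
        if first_stage:
            self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=4, padding=3)
        else:
            self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.norm = nn.LayerNorm(out_ch)       # normalization placement is an assumption

    def forward(self, x):                      # x: (B, C_in, H, W)
        x = self.proj(x)                       # (B, C_out, H', W'), e.g. 608 -> 152 in stage 1
        x = x.permute(0, 2, 3, 1)              # channels-last layout for the attention blocks
        return self.norm(x)
```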
The structure of the spatial window multi-head attention block is shown in Figure 2b. The feature map is padded in the spatial dimension and divided into multiple windows of the same size, each of size 7 × 7. After the windows are obtained, the attention operation is performed inside each window. The feature map contains $P$ patches and is divided into $N_w$ windows, with $P_w$ patches in each window, so $P = N_w \times P_w$. In order to obtain feature information in different subspaces, the spatial window attention block uses a multi-head structure with $N_h$ heads, each of which has $C_h$ channels. The operations performed for each window in the spatial window multi-head attention block are as follows:

$$
\begin{gathered}
\mathrm{Attention}_{window}(Q,K,V)=\left\{A(Q_i,K_i,V_i)\right\}_{i=0}^{N_w} \\
A(Q,K,V)=\mathrm{Concat}(head_1,\ldots,head_{N_h}),\quad \text{where } head_i=\mathrm{Attention}(Q_i,K_i,V_i)=\mathrm{Softmax}\!\left(\frac{Q_i (K_i)^{T}}{\sqrt{C_h}}\right)V_i
\end{gathered}
\tag{2}
$$

where $Q_i, K_i, V_i \in \mathbb{R}^{P_w \times C_h}$ denote the queries, keys, and values in each window. Compared with global self-attention, the computational complexity of spatial window attention is smaller, but since self-attention is only performed on the patches within a window, there is no interaction of global information. Channel group attention is a good remedy for this shortcoming. As shown in Figure 2c, the channel group attention divides the channels into $N_g$ groups, each group has $C_g$ channels, and the total number of channels is $C = N_g \times C_g$. The channel group attention is calculated as follows:

$$
\begin{gathered}
\mathrm{Attention}_{channel}(Q,K,V)=\left\{\mathrm{Attention}_{group}(Q_i,K_i,V_i)^{T}\right\}_{i=0}^{N_g} \\
\mathrm{Attention}_{group}(Q_i,K_i,V_i)=\mathrm{Softmax}\!\left(\frac{Q_i (K_i)^{T}}{\sqrt{C_h}}\right)V_i
\end{gathered}
\tag{3}
$$

where $Q_i, K_i, V_i \in \mathbb{R}^{P \times C_g}$ denote the queries, keys, and values obtained by projecting the patches. When performing the channel group attention operation, $Q_i, K_i, V_i$ are transposed to $\mathbb{R}^{C_g \times P}$ and then used to complete the interaction of feature information between channels. In addition, only one head is adopted.
The complementary spatial window attention and channel group attention allow DaViT to capture both local representations and global context while maintaining high computational efficiency. CNN extends the receptive field by stacking multiple convolutional layers to capture long-range dependencies, and ViT directly captures distant visual dependencies with a single self-attention layer at the cost of increased computational complexity; in contrast, the dual attention blocks model local and global information separately and significantly reduce the computational complexity. Ultimately, DaViT fully extracts global representations and local features and provides more effective feature information for the feature fusion network.
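A minimal PyTorch sketch of the two attention operations in Equations (2) and (3) is given below. For brevity, the learned query/key/value projections, output projections, and padding for non-divisible feature map sizes are omitted, so this illustrates the computation pattern rather than the official DaViT code:

```python
import torch
import torch.nn.functional as F

def window_attention(x: torch.Tensor, num_heads: int, window: int = 7) -> torch.Tensor:
    """Spatial window self-attention over x of shape (B, H, W, C).
    Assumes H and W are multiples of `window` and C is a multiple of num_heads."""
    B, H, W, C = x.shape
    ch = C // num_heads                                    # C_h: channels per head
    # partition the feature map into non-overlapping window x window regions
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)   # (B*N_w, P_w, C)
    q = k = v = x.view(-1, window * window, num_heads, ch).transpose(1, 2)  # projections omitted
    attn = F.softmax(q @ k.transpose(-2, -1) / ch ** 0.5, dim=-1)     # attention over patches
    out = (attn @ v).transpose(1, 2).reshape(-1, window * window, C)
    out = out.view(B, H // window, W // window, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

def channel_group_attention(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Channel group self-attention: attention is computed between the channels of each group."""
    B, H, W, C = x.shape
    cg = C // num_groups                                   # C_g: channels per group (= C_h here)
    q = k = v = x.view(B, H * W, num_groups, cg).permute(0, 2, 3, 1)  # (B, N_g, C_g, P)
    attn = F.softmax(q @ k.transpose(-2, -1) / cg ** 0.5, dim=-1)     # (B, N_g, C_g, C_g)
    return (attn @ v).permute(0, 3, 1, 2).reshape(B, H, W, C)
```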

3.2. Adaptive Path Aggregation Network

The feature fusion network is used to fuse feature maps at different levels, and how they are fused directly influences the performance of object detection. PAN fuses feature information at different levels so that the feature maps at every level contain rich semantic and detailed information at the same time. Figure 3a shows the structure of PAN, which fuses feature maps at different levels through two paths. In the top-down path, the high-level feature maps are upsampled and then fused with the low-level feature maps, conveying semantic information to the low-level feature maps. In the bottom-up path, the low-level feature maps are downsampled to reduce their resolution and fused with the high-level feature maps, which makes the high-level feature maps contain rich detailed information.
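The two fusion paths of PAN can be illustrated with the following minimal sketch over three feature levels; the channel widths and the 1 × 1 / 3 × 3 convolutions are illustrative assumptions rather than the exact layers used in PAN or APAN:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimplePAN(nn.Module):
    """Minimal PAN-style fusion over three feature levels (low to high)."""
    def __init__(self, channels=(192, 384, 768), width=256):   # channel counts are assumptions
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, width, 1) for c in channels])
        self.down = nn.ModuleList([nn.Conv2d(width, width, 3, stride=2, padding=1) for _ in range(2)])
        self.smooth = nn.ModuleList([nn.Conv2d(width, width, 3, padding=1) for _ in range(3)])

    def forward(self, c3, c4, c5):
        p3, p4, p5 = [l(c) for l, c in zip(self.lateral, (c3, c4, c5))]
        # top-down path: upsample high-level maps and pass semantic information downwards
        p4 = p4 + F.interpolate(p5, size=p4.shape[-2:], mode="nearest")
        p3 = p3 + F.interpolate(p4, size=p3.shape[-2:], mode="nearest")
        # bottom-up path: downsample low-level maps and pass detailed information upwards
        n3 = p3
        n4 = p4 + self.down[0](n3)
        n5 = p5 + self.down[1](n4)
        return [s(n) for s, n in zip(self.smooth, (n3, n4, n5))]
```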
In order to provide more effective information for the feature maps, we constructed an adaptive path aggregation network (APAN) based on PAN, whose network structure is shown in Figure 3b. In the top-down path of APAN, we use a convolutional block attention module (CBAM) to suppress background information in the feature map. As shown in Figure 4, CBAM consists of a channel attention module (CAM) and a spatial attention module (SAM). As shown in Figure 5, CAM is mainly made up of fully connected layers and pooling layers. CAM emphasizes important feature channels by weighting the feature map, and its calculation process is as follows:

$$
\begin{gathered}
F_{avg}^{c}=\mathrm{AvgPool}(F),\qquad F_{max}^{c}=\mathrm{MaxPool}(F) \\
M_{c}=\mathrm{Sigmoid}\left(\mathrm{MLP}(F_{avg}^{c})+\mathrm{MLP}(F_{max}^{c})\right) \\
F'=M_{c}\otimes F
\end{gathered}
\tag{4}
$$
where $F$ denotes the input feature map, and the average pooling operation and the maximum pooling operation are applied to $F$ to obtain $F_{avg}^{c}$ and $F_{max}^{c}$, respectively. Then, $F_{avg}^{c}$ and $F_{max}^{c}$ are fed into the shared fully connected layers, and the resulting feature maps are added together. Finally, the sigmoid activation function is applied to obtain the channel attention map $M_{c}$, which is multiplied with the feature map $F$ to obtain the feature map $F'$. Figure 6 shows that SAM mainly includes a pooling layer and a convolution layer. SAM provides more accurate location information for the high-level feature maps by paying attention to effective information at spatial locations, and its formula is as follows:

$$
\begin{gathered}
F_{avg}^{s}=\mathrm{AvgPool}(F'),\qquad F_{max}^{s}=\mathrm{MaxPool}(F') \\
M_{s}=\mathrm{Sigmoid}\left(\mathrm{Conv}([F_{avg}^{s};F_{max}^{s}])\right) \\
F''=M_{s}\otimes F'
\end{gathered}
\tag{5}
$$

Firstly, the average pooling operation and the maximum pooling operation are performed on the feature map $F'$ to obtain the feature maps $F_{avg}^{s}$ and $F_{max}^{s}$. Then, they are concatenated, and a convolution operation produces the spatial attention map $M_{s}$. Finally, the feature map $F''$ is obtained by element-wise multiplying $M_{s}$ and $F'$, and it is used for the subsequent feature information fusion. By paying attention to both the channel and spatial dimensions of the feature map, CBAM can effectively suppress background information and provide more effective feature information for the feature maps. In addition, a same-level feature map enhancement strategy is used in APAN. In Figure 3b, there are five paths used to fuse same-level feature maps in different layers separately. A feature reuse strategy can improve the performance of the network with almost no increase in the number of parameters and computational complexity [33], so we introduced these additional feature fusion paths in APAN to enrich the feature information at little extra computational cost. Ultimately, APAN makes the feature maps contain more effective feature information. Ablation experiments on APAN are reported in Section 4.5 to confirm its effectiveness.
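A compact PyTorch sketch of CBAM as used in the fusion paths, following Equations (4) and (5), is shown below; the reduction ratio of 16 and the 7 × 7 spatial convolution follow the defaults of the original CBAM paper and are assumptions with respect to our exact configuration:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM (Equation (4)): pooled descriptors pass a shared MLP, are summed, then sigmoid-gated."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))     # MLP(F_avg^c)
        mx = self.mlp(x.amax(dim=(2, 3)))      # MLP(F_max^c)
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """SAM (Equation (5)): channel-wise average/max maps are concatenated, convolved, sigmoid-gated."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # F_avg^s
        mx = x.amax(dim=1, keepdim=True)       # F_max^s
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """CBAM applies channel attention followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention()

    def forward(self, x):
        return self.sam(self.cam(x))
```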

3.3. Bounding Boxes Prediction

Each cell of the feature maps at different scales predicts the 4 coordinates of a bounding box, a confidence, and the class probabilities. The cell in which the center of an object falls is responsible for predicting that bounding box. The prediction process is shown in Figure 7, where the blue box indicates the predicted box, $t_x, t_y, t_w, t_h$ denote the 4 predicted coordinates, and $p_w, p_h$ denote the width and height of the bounding box prior, respectively. The predictions correspond to the following:

$$
\begin{gathered}
b_x=\sigma(t_x)+c_x,\qquad b_y=\sigma(t_y)+c_y \\
b_w=p_w e^{t_w},\qquad b_h=p_h e^{t_h}
\end{gathered}
\tag{6}
$$

where $\sigma$ denotes the sigmoid activation function, $\sigma(t_x), \sigma(t_y)$ denote the offsets of the box center from the top-left corner of the cell, $(c_x, c_y)$ denotes the coordinates of the top-left corner of the cell, $(b_x, b_y)$ denotes the coordinates of the center point of the predicted box, and $(b_w, b_h)$ denotes the width and height of the predicted box.
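The decoding of Equation (6) for a single cell can be written directly as the short helper below; the function is purely illustrative (grid-unit coordinates are assumed):

```python
import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw predictions (t_x, t_y, t_w, t_h) into box center and size per Equation (6).
    (c_x, c_y) is the cell's top-left corner in grid units; (p_w, p_h) is the prior size."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh
```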
During the bounding box prediction stage, a large number of bounding boxes are obtained, with a large amount of redundancy among them. The NMS algorithm is usually used to process these predicted bounding boxes, but classification confidence cannot completely represent localization accuracy [34], so the detection boxes retained by the NMS algorithm do not necessarily have good localization accuracy. In order to cope with this problem, the WBF algorithm is used. First, the WBF algorithm selects the optimal bounding box based on the classification confidence and finds its surrounding bounding boxes:

$$
IoU(M,b_i)\ge P,\quad i\in\{1,2,3,\ldots,n\}
\tag{7}
$$

where $b_i$ represents a candidate bounding box whose IoU with $M$ is not less than the threshold $P$, and $M$ represents the optimal bounding box. Then, the coordinates of these bounding boxes are weighted with their classification confidences to obtain the final coordinates of the optimal bounding box:

$$
X_a=\frac{\sum_{i=1}^{Z}C_i\times X_{ai}}{\sum_{i=1}^{Z}C_i},\quad
Y_a=\frac{\sum_{i=1}^{Z}C_i\times Y_{ai}}{\sum_{i=1}^{Z}C_i},\quad
X_b=\frac{\sum_{i=1}^{Z}C_i\times X_{bi}}{\sum_{i=1}^{Z}C_i},\quad
Y_b=\frac{\sum_{i=1}^{Z}C_i\times Y_{bi}}{\sum_{i=1}^{Z}C_i}
\tag{8}
$$

In formula (8), $X_{ai}, Y_{ai}, X_{bi}, Y_{bi}$ represent the horizontal and vertical coordinates of the top-left and bottom-right corners of the $i$-th bounding box, and $X_a, Y_a, X_b, Y_b$ represent the horizontal and vertical coordinates of the top-left and bottom-right corners of the final optimal bounding box. The number of bounding boxes is denoted by $Z$, and $C_i$ represents the classification confidence of the $i$-th box.
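The fusion step of Equations (7) and (8) can be sketched as follows. This simplified single-class version clusters boxes greedily around the highest-confidence box and averages their coordinates with confidence weights; it omits the score rescaling used in the full WBF algorithm [26], and the IoU threshold of 0.55 is an illustrative choice:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    xx1 = np.maximum(box[0], boxes[:, 0]); yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], boxes[:, 2]); yy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Take the highest-confidence box, collect candidates with IoU >= iou_thr (Equation (7)),
    and replace the cluster by the confidence-weighted average of its coordinates (Equation (8))."""
    order = scores.argsort()[::-1]
    boxes, scores = boxes[order], scores[order]
    fused_boxes, fused_scores = [], []
    used = np.zeros(len(boxes), dtype=bool)
    for i in range(len(boxes)):
        if used[i]:
            continue
        cluster = ~used & (iou(boxes[i], boxes) >= iou_thr)
        cluster[i] = True
        w = scores[cluster][:, None]
        fused_boxes.append((w * boxes[cluster]).sum(0) / w.sum())   # weighted coordinates
        fused_scores.append(scores[cluster].mean())                 # simplified fused score
        used |= cluster
    return np.array(fused_boxes), np.array(fused_scores)
```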
Figure 8 shows the optimal bounding boxes obtained by the NMS algorithm and the WBF algorithm, respectively. The predicted bounding box is represented by the green box, and the ground-truth bounding box is represented by the red box. The NMS algorithm selects the optimal bounding box from the candidate bounding boxes based only on the classification confidence. However, the classification confidence does not fully represent the localization accuracy, so the optimal bounding box may not enclose the object well. To solve this problem, the WBF algorithm is used to process the candidate bounding boxes; it makes full use of the coordinates of the surrounding bounding boxes so that the optimal bounding box encloses the object well.

4. Experimental Results and Discussion

4.1. Datasets

4.1.1. RSOD Dataset

As a well-known public dataset in the remote sensing field, the RSOD dataset [35] has four classes: aircraft, oil tank, playground, and overpass, and there are 976 images and 6950 objects. The RSOD dataset is divided into a training set and a test set according to the ratio of 6:4, and there are 585 images in the training set and 391 images in the test set. In this dataset, the scale of aircraft and oil tanks is small, and the scale of playgrounds, as well as overpasses, is large. The scale differences between the classes make the detection task difficult.

4.1.2. NWPU VHR-10 Dataset

There are 10 classes in the NWPU VHR-10 dataset [35]: airplane, storage tank, tennis court, ground track field, baseball diamond, basketball court, vehicle, harbor, bridge, and ship. The dataset includes 150 images without objects and 650 images with objects. We divided the 650 images containing objects into a training set of 520 images and a test set of 130 images, and then added the 150 images without any objects to the test set.

4.1.3. DIOR Dataset

DIOR [36] consists of a large number of 800 × 800 resolution images and has 20 classes and 23,463 images, in which 190,288 objects are labeled. These objects cover a wide range of scales, which makes the detection task difficult. We divide the dataset according to the official recommendations, with 11,725 images in the training set and 11,738 images in the test set.

4.2. Implementation Details

We used an AMD Ryzen 9 3900X CPU and a GeForce RTX 2080Ti GPU to train the models under Ubuntu 18.04, and the method is implemented in Python with PyTorch. The image resolution is set to 608 × 608. The models are trained for 150 epochs with a pre-trained backbone, and the weight decay is 5 × 10−4. The batch size is set to 8 and the learning rate to 1 × 10−3 during the first 75 epochs, and the backbone network is frozen to retain the feature representation capability learned from the classification dataset. During the last 75 epochs, we set the batch size to 4 and the learning rate to 1 × 10−4.
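The two-phase schedule described above can be summarized in the following illustrative helper; the `model.backbone` attribute and the choice of SGD are assumptions, since the optimizer is not specified here:

```python
import torch

def configure_two_phase(model, phase: str):
    """Phase 'freeze': backbone frozen, batch size 8, lr 1e-3 (first 75 epochs).
    Phase 'finetune': backbone unfrozen, batch size 4, lr 1e-4 (last 75 epochs).
    Weight decay of 5e-4 follows the settings above; SGD is only an assumed choice."""
    freeze = phase == "freeze"
    for p in model.backbone.parameters():      # 'backbone' attribute name is hypothetical
        p.requires_grad = not freeze
    lr, batch_size = (1e-3, 8) if freeze else (1e-4, 4)
    optimizer = torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad),
        lr=lr, weight_decay=5e-4)
    return optimizer, batch_size
```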

4.3. Evaluation Metrics

Four metrics are used to evaluate the model accuracy: precision, recall, F1-score, and mean average precision (mAP). Precision measures the proportion of correctly predicted positive samples among all samples predicted as positive:

$$
Precision=\frac{TP}{TP+FP}
\tag{9}
$$

Recall measures the proportion of correctly predicted positive samples among all true positive samples:

$$
Recall=\frac{TP}{TP+FN}
\tag{10}
$$

The F1-score measures precision and recall comprehensively, and its formula is as follows:

$$
F1=\frac{2\times Precision\times Recall}{Precision+Recall}
\tag{11}
$$

As shown in formula (12), AP is the area under the precision–recall curve at an IoU threshold of 0.5:

$$
AP=\int_{0}^{1}P(R)\,dR
\tag{12}
$$

As the mean of the AP over all classes, mAP is used to evaluate the model performance comprehensively:

$$
mAP=\frac{\sum_{i=1}^{n}AP_i}{n}
\tag{13}
$$
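For reference, Equations (9)–(13) can be computed as in the following sketch; the all-point interpolation used to approximate the AP integral is a common implementation choice and is assumed here:

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from TP/FP/FN counts (Equations (9)-(11))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """All-point AP: area under the precision-recall curve using the precision envelope
    (Equation (12)); `recalls` must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class: list) -> float:
    """mAP: mean of per-class AP values (Equation (13))."""
    return float(np.mean(ap_per_class))
```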

4.4. Experimental Results and Analysis

4.4.1. Experiments on the RSOD Dataset

In this section, we compare the detection performance of different models on the RSOD dataset; the experimental results are shown in Table 2. The mAP of our method is 96.9%, which is better than the other algorithms. Faster R-CNN usually has good detection performance on most object detection tasks, but it does not handle small-scale objects well. As a result, Faster R-CNN achieves only 55.6% AP on aircraft. RetinaNet, SSD, YOLOv3, YOLOv4, YOLOv5, Xu et al. [9], and Wang et al. [12] take advantage of multi-level feature fusion to detect multi-scale objects accurately. The mAP of Xu et al. [9], Wang et al. [12], YOLOv3, YOLOv4, YOLOv5, SSD, and RetinaNet is 93.1%, 93.8%, 91.7%, 92.5%, 94.5%, 90.8%, and 89.8%, respectively, while the mAP of Faster R-CNN is 83.0%. Compared with Faster R-CNN, the above algorithms have better object detection performance. CNN is good at capturing local information, but its poor ability to model global information makes it unable to accurately detect objects at different scales. As the backbone network of our method, DaViT uses spatial window attention to refine the local representations and then uses channel group attention to capture long-range dependencies, which allows this backbone network to fully extract local and global information. Additionally, we propose a novel feature fusion network to provide more effective feature information for different levels of feature maps. To further improve detection accuracy, a new non-maximum suppression algorithm is used in our method, which makes full use of the coordinate locations of the surrounding bounding boxes to adjust the location of the optimal bounding box. Finally, our method achieves good detection accuracy, and the AP on aircraft, oil tank, playground, and overpass is 94.0%, 99.0%, 94.6%, and 100%, respectively, which matches or exceeds the other algorithms. Figure 9 shows the detection results of our method in scenes containing aircraft, oil tanks, overpasses, and playgrounds, respectively, and all the objects in Figure 9 are detected.

4.4.2. Experiments on the NWPU VHR-10 Dataset

In this section, we compare the detection performance of different models on the NWPU VHR-10 dataset; the experimental results are shown in Table 3. The mAP of our method is 96.6%. Compared with the other algorithms, our method improves the mAP by 1.1–15.7%. This is because our backbone network integrates the benefits of spatial window attention and channel group attention in extracting feature information. Their complementary advantages enable DaViT to fully extract feature information, improving the detection accuracy for objects at different scales. Additionally, we propose a novel feature fusion network to reduce the interference of complex backgrounds and further enhance detection accuracy. The feature fusion network uses CBAM to highlight the objects' feature information and introduces additional paths to fuse feature maps at the same scale to increase the feature information in the feature maps. In addition, the WBF algorithm is used in post-processing to process the predicted bounding boxes. Consequently, our method has better detection accuracy, especially for large-scale objects such as tennis courts and harbors. The detection results of our method on the NWPU VHR-10 dataset are shown in Figure 10, and all multi-scale objects in the figure are detected.

4.4.3. Experiments on the DIOR Dataset

In this section, we compare the detection performance of different models on the DIOR dataset. Table 4 shows the experimental results of the different methods on the DIOR dataset. Compared with the other algorithms, the mAP of our method is improved by 1–23.1%. Especially for large-scale objects, such as baseball fields and overpasses, our method has the best detection accuracy. CNN usually establishes a global receptive field by stacking multiple convolutional layers when extracting image features, but feature information is lost in this process, so CNN cannot adequately model global information. Inspired by various transformer models that use self-attention to capture long-range dependencies, DaViT is constructed by combining spatial window attention and channel group attention. As the backbone network of our method, DaViT fully extracts local and global information. Consequently, our method has good detection accuracy on the DIOR dataset. In addition, our method further improves the performance on the remote sensing datasets by enhancing the feature representation capability of the multi-level feature maps and improving the localization accuracy of the bounding boxes. As shown in Figure 11, our method accurately detects multi-scale objects on the DIOR dataset.

4.5. Ablation Experiments

In order to evaluate the effectiveness of DaViT, APAN, and WBF on the performance of our method, ablation experiments are conducted. The ablation results on the RSOD dataset, NWPU VHR-10 dataset, and DIOR dataset are shown in Table 5, Table 6 and Table 7, respectively. As the result of No.5 shows, using ViT as the backbone network leads to performance degradation: compared to No.6, the precision, recall, F1-score, and mAP of No.5 decreased by 0.2%, 0.3%, 0.2%, and 0.6%, respectively. Conversely, after using DaViT as the backbone network, the precision, recall, F1-score, and mAP of No.2 improved by 1.1%, 1.5%, 1.3%, and 1.3% over No.1, respectively. This demonstrates that DaViT, as a novel backbone network, takes advantage of spatial window attention and channel group attention, allowing it to fully extract feature information and thus improve detection performance. In order to provide more effective information for different levels of feature maps, a novel feature fusion network called APAN is proposed. APAN is constructed based on PAN. In the fusion paths for different-level feature maps, we used CBAM to highlight object feature information and then fused it with other feature maps. In addition, we used a new feature reuse strategy that fuses feature maps of the same level at different layers. This strategy can improve the feature representation with almost no increase in computational cost. As shown in Table 5, No.4 achieved better results than No.2, with an improvement of 0.6%, 0.5%, 0.5%, and 1.5% in precision, recall, F1-score, and mAP, respectively, indicating that APAN performs better than PAN. No.6 improved by 0.3%, 0.9%, 0.6%, and 0.7% in precision, recall, F1-score, and mAP compared to No.4, respectively. This result shows that the optimal bounding boxes obtained by the WBF algorithm enclose the objects well, and the precision and recall are improved as a result. Table 6 and Table 7 also show that our method has fairly good detection performance on the NWPU VHR-10 dataset and the DIOR dataset.
Overall, the combined use of DaViT, APAN, and WBF effectively improves the object detection performance on remote sensing images. As shown in Table 5, Table 6 and Table 7, the detection performance of No.6 outperforms the other settings on most of the metrics. These results show that our method achieves better detection performance in remote sensing image object detection.

5. Conclusions

In order to improve remote sensing object detection accuracy, a novel transformer-based adaptive object detection method is proposed in this paper. Firstly, to overcome the difficulty CNN has in establishing long-range dependencies on feature information, our method uses DaViT as the backbone network, which uses spatial window attention and channel group attention to capture both local and global features. Then, we propose an adaptive feature fusion network, APAN, based on PAN. APAN uses CBAM to suppress background information and highlight the objects' feature information. Furthermore, the network adds new fusion paths to fuse same-level feature maps, which enhances the feature representation. Finally, with the help of the WBF algorithm, our method further improves detection accuracy. Ablation experiments are conducted to verify the effectiveness of DaViT, APAN, and the WBF algorithm. The experimental results show that combining them provides the best detection performance. Encouraging experimental results on the RSOD dataset, NWPU VHR-10 dataset, and DIOR dataset demonstrate that our method has better detection accuracy than several popular object detection algorithms. In future research, we will further optimize some of the parameters of the method to achieve a lightweight model.

Author Contributions

Methodology, S.S., R.C. and T.Z.; conceptualization, R.C.; software, R.C. and T.Z.; validation, S.S. and X.F.; writing—original draft preparation, S.S., T.Z. and R.C.; writing—review and editing, T.Z., X.F. and R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Research Project of Colleges and Universities in Anhui Province (Grant No. 2022AH040113), the University Synergy Innovation Program of Anhui Province (No. GXXT-2021-006), the National Natural Science Foundation of China (No. 61806006), and the China Postdoctoral Science Foundation (No. 2019M660149).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377. [Google Scholar] [CrossRef] [Green Version]
  2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Zheng, J.; Li, W.; Xia, M.; Dong, R.; Fu, H.; Yuan, S. Large-scale oil palm tree detection from high-resolution remote sensing images using faster-rcnn. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1422–1425. [Google Scholar]
  4. Ji, H.; Gao, Z.; Mei, T.; Li, Y. Improved faster R-CNN with multiscale feature fusion and homography augmentation for vehicle detection in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1761–1765. [Google Scholar] [CrossRef]
  5. Yan, D.; Li, G.; Li, X.; Zhang, H.; Lei, H.; Lu, K.; Cheng, M.; Zhu, F. An improved faster R-CNN method to detect tailings ponds from high-resolution remote sensing images. Remote Sens. 2021, 13, 2052. [Google Scholar] [CrossRef]
  6. Dong, R.; Xu, D.; Zhao, J.; Jiao, L.; An, J. Sig-NMS-based faster R-CNN combining transfer learning for small target detection in VHR optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8534–8545. [Google Scholar] [CrossRef]
  7. Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition, Hong Kong, China, 20–24 August 2006; pp. 850–855. [Google Scholar]
  8. Zhang, Y.; Song, C.; Zhang, D. Small-scale aircraft detection in remote sensing images based on Faster-RCNN. Multimed. Tools Appl. 2022, 81, 18091–18103. [Google Scholar] [CrossRef]
  9. Xu, D.; Wu, Y. Improved YOLO-V3 with DenseNet for multi-scale remote sensing target detection. Sensors 2020, 20, 4276. [Google Scholar] [CrossRef]
  10. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  11. Tong, W.; Chen, W.; Han, W.; Li, X.; Wang, L. Channel-attention-based DenseNet network for remote sensing image scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4121–4132. [Google Scholar] [CrossRef]
  12. Wang, C.; Wang, Q.; Wu, H.; Zhao, C.; Teng, G.; Li, J. Low-altitude remote sensing opium poppy image detection based on modified yolov3. Remote Sens. 2021, 13, 2130. [Google Scholar] [CrossRef]
  13. Hong, Z.; Yang, T.; Tong, X.; Zhang, Y.; Jiang, S.; Zhou, R.; Han, Y.; Wang, J.; Yang, S.; Liu, S. Multi-scale ship detection from SAR and optical imagery via a more accurate YOLOv3. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6083–6101. [Google Scholar] [CrossRef]
  14. Cengil, E.; Çinar, A.; Yildirim, M. A Case Study: Cat-Dog Face Detector Based on YOLOv5. In Proceedings of the 2021 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), Virtual, 29–30 September 2021; pp. 149–153. [Google Scholar]
  15. Zakria, Z.; Deng, J.; Kumar, R.; Khokhar, M.S.; Cai, J.; Kumar, J. Multiscale and direction target detecting in remote sensing images via modified YOLO-v4. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1039–1048. [Google Scholar] [CrossRef]
  16. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  17. Cengil, E.; Çinar, A.; Yildirim, M. An efficient and fast lightweight-model with ShuffleNetv2 based on YOLOv5 for detection of hardhat-wearing. Rev. Comput. Eng. Stud. 2022, 9, 116–123. [Google Scholar] [CrossRef]
  18. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  19. Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; Yuan, L. DaViT: Dual Attention Vision Transformers. arXiv 2022, arXiv:2204.03645. [Google Scholar]
  20. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  21. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  22. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 7, 1–15. [Google Scholar] [CrossRef]
  23. Ji, Y.; Zhang, R.; Wang, H.; Li, Z.; Wu, L.; Zhang, S.; Luo, P. Multi-compound transformer for accurate biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 326–336. [Google Scholar]
  24. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  25. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–9. [Google Scholar]
  26. Solovyev, R.; Wang, W.; Gabruseva, T. Weighted boxes fusion: Ensembling boxes from different object detection models. Image Vision Comput. 2021, 107, 104117. [Google Scholar] [CrossRef]
  27. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 11–12 June 2015; pp. 1–9. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  30. Chen, Z.; Zhang, J.; Tao, D. Recurrent glimpse-based decoder for detection with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5260–5269. [Google Scholar]
  31. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse detr: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330. [Google Scholar]
  32. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  33. Su, S.; Chen, R.; Fang, X.; Zhu, Y.; Zhang, T.; Xu, Z. A Novel Lightweight Grape Detection Method. Agriculture 2022, 12, 1364. [Google Scholar] [CrossRef]
  34. Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 784–799. [Google Scholar]
  35. Körez, A.; Barışçı, N.; Çetin, A.; Ergün, U. Weighted ensemble object detection with optimized coefficients for remote sensing images. ISPRS Int. J. Geo-Inf. 2020, 9, 370. [Google Scholar] [CrossRef]
  36. Li, Y.; Huang, Q.; Pei, X.; Chen, Y.; Jiao, L.; Shang, R. Cross-layer attention network for small object detection in remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 2148–2161. [Google Scholar] [CrossRef]
Figure 1. The whole structure of our method.
Figure 2. The network structure of the dual attention block is mainly made up of the channel group attention block and spatial window multi-head attention block: (a) dual attention block; (b) spatial window multi-head attention block; (c) channel group attention block. P represents the number of total patches, C denotes the number of total channels, N_w is the number of windows, N_h represents the number of heads, N_g is the number of channel groups, P_w denotes patches per window, C_h is channels per head, and C_g represents channels per group.
Figure 3. The structure of PAN and APAN: (a) PAN; (b) APAN.
Figure 4. The structure of CBAM.
Figure 5. The network structure of CAM.
Figure 6. The network structure of SAM.
Figure 7. The prediction process for bounding boxes.
Figure 8. Detection results were obtained by using different non-maximum suppression algorithms.
Figure 9. The detection results of our method on the RSOD dataset.
Figure 10. The detection results of our method on the NWPU VHR-10 dataset.
Figure 11. The detection results of our method on the DIOR dataset.
Table 1. Details about backbone network.
| Stage | Output | Layer Name | Operation |
| --- | --- | --- | --- |
| stage1 | 152 × 152 × 96 | Patch Embedding | kernel 7, stride 4, pad 3, C_1 = 96 |
| stage1 | 152 × 152 × 96 | Dual Attention Block | winsz = 7 × 7, P_w = 49, N_h1 = N_g1 = 3, C_h1 = C_g1 = 32, × 1 |
| stage2 | 76 × 76 × 192 | Patch Embedding | kernel 2, stride 2, pad 0, C_2 = 192 |
| stage2 | 76 × 76 × 192 | Dual Attention Block | winsz = 7 × 7, P_w = 49, N_h2 = N_g2 = 6, C_h2 = C_g2 = 32, × 1 |
| stage3 | 38 × 38 × 384 | Patch Embedding | kernel 2, stride 2, pad 0, C_3 = 384 |
| stage3 | 38 × 38 × 384 | Dual Attention Block | winsz = 7 × 7, P_w = 49, N_h3 = N_g3 = 12, C_h3 = C_g3 = 32, × 9 |
| stage4 | 19 × 19 × 768 | Patch Embedding | kernel 2, stride 2, pad 0, C_4 = 768 |
| stage4 | 19 × 19 × 768 | Dual Attention Block | winsz = 7 × 7, P_w = 49, N_h4 = N_g4 = 24, C_h4 = C_g4 = 32, × 1 |
Table 2. The experimental results on the RSOD dataset.
| Classes | Our Method | YOLOv3 | YOLOv4 | YOLOv5 | Faster R-CNN | SSD | RetinaNet | Xu et al. [9] | Wang et al. [12] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| aircraft | 94.0% | 91.8% | 89.5% | 93.4% | 55.6% | 85.7% | 73.6% | 91.3% | 92.6% |
| oil tank | 99.0% | 98.1% | 98.3% | 99.0% | 94.9% | 98.2% | 97.9% | 98.0% | 98.7% |
| playground | 94.6% | 76.7% | 82.3% | 85.8% | 82.3% | 83.8% | 88.1% | 83.9% | 84.1% |
| overpass | 100% | 100% | 100% | 99.6% | 99.2% | 95.6% | 99.7% | 99.3% | 99.9% |
| mAP | 96.9% | 91.7% | 92.5% | 94.5% | 83.0% | 90.8% | 89.8% | 93.1% | 93.8% |
Table 3. The experimental results on the NWPU VHR-10 dataset.
| Classes | Our Method | YOLOv3 | YOLOv4 | YOLOv5 | Faster R-CNN | SSD | RetinaNet | Xu et al. [9] | Wang et al. [12] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| airplane | 100% | 100% | 100% | 100% | 94.6% | 99.9% | 98.6% | 99.8% | 100% |
| storage tank | 96.2% | 96.5% | 99.2% | 97.9% | 66.6% | 99.0% | 89.7% | 99.5% | 97.3% |
| tennis court | 99.1% | 99.4% | 99.4% | 99.6% | 87.7% | 90.9% | 87.9% | 99.3% | 99.5% |
| ground track field | 100% | 99.9% | 99.8% | 99.2% | 98.6% | 97.8% | 97.2% | 99.8% | 99.4% |
| baseball diamond | 97.2% | 98.1% | 98.1% | 97.9% | 95.6% | 95.8% | 99.1% | 98.2% | 97.5% |
| basketball court | 100% | 98.8% | 99.5% | 98.5% | 93.9% | 96.1% | 86.0% | 99.6% | 98.0% |
| vehicle | 93.7% | 92.6% | 92.9% | 96.1% | 48.9% | 91.0% | 67.4% | 93.0% | 95.9% |
| harbor | 100% | 95.3% | 97.0% | 100% | 87.5% | 93.9% | 98.3% | 96.9% | 99.7% |
| bridge | 95.8% | 81.0% | 79.5% | 81.5% | 71.9% | 80.7% | 86.7% | 79.9% | 81.1% |
| ship | 83.5% | 83.5% | 84.7% | 84.7% | 64.0% | 82.0% | 75.2% | 84.5% | 85.0% |
| mAP | 96.6% | 94.5% | 95.0% | 95.5% | 80.9% | 92.7% | 88.6% | 95.1% | 95.3% |
Table 4. The experimental results on the DIOR dataset.
| Classes | Our Method | YOLOv3 | YOLOv4 | YOLOv5 | Faster R-CNN | SSD | RetinaNet | Xu et al. [9] | Wang et al. [12] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| airplane | 90.7% | 82.5% | 88.6% | 90.6% | 54.1% | 59.5% | 53.7% | 86.5% | 87.6% |
| airport | 87.5% | 83.0% | 85.1% | 85.9% | 71.4% | 72.7% | 77.3% | 84.0% | 84.9% |
| baseball field | 91.5% | 86.0% | 89.0% | 89.7% | 63.3% | 72.4% | 69.0% | 87.2% | 88.8% |
| basketball court | 93.8% | 91.5% | 92.5% | 92.8% | 81.0% | 75.7% | 81.3% | 91.8% | 92.6% |
| bridge | 59.3% | 52.2% | 56.4% | 57.0% | 42.6% | 29.7% | 44.1% | 54.3% | 56.0% |
| chimney | 85.0% | 79.4% | 84.4% | 84.0% | 72.5% | 65.8% | 72.3% | 83.4% | 83.5% |
| dam | 70.6% | 73.0% | 71.9% | 70.6% | 57.5% | 56.6% | 62.5% | 72.0% | 72.9% |
| expressway service area | 94.4% | 90.3% | 91.8% | 92.2% | 68.7% | 63.5% | 76.2% | 91.6% | 90.8% |
| expressway toll station | 87.9% | 82.2% | 86.3% | 86.6% | 62.1% | 53.1% | 66.0% | 84.5% | 85.3% |
| golf field | 81.7% | 82.4% | 77.4% | 80.3% | 73.1% | 65.3% | 77.7% | 80.4% | 78.4% |
| ground track field | 88.0% | 82.4% | 87.3% | 87.8% | 76.5% | 68.6% | 74.2% | 85.1% | 86.3% |
| harbor | 67.0% | 62.6% | 64.3% | 65.1% | 42.8% | 49.4% | 50.7% | 63.8% | 64.2% |
| overpass | 69.7% | 65.2% | 68.1% | 67.9% | 56.0% | 48.1% | 59.6% | 67.4% | 67.1% |
| ship | 91.2% | 89.7% | 90.9% | 91.1% | 71.8% | 59.2% | 71.2% | 90.7% | 90.0% |
| stadium | 79.5% | 69.9% | 71.8% | 79.2% | 57.0% | 61.0% | 69.3% | 70.1% | 72.8% |
| storage tank | 79.1% | 71.9% | 78.6% | 78.7% | 53.5% | 46.6% | 44.8% | 76.7% | 75.6% |
| tennis court | 94.4% | 91.9% | 94.0% | 94.1% | 81.2% | 76.3% | 81.3% | 93.4% | 93.2% |
| train station | 69.4% | 66.1% | 67.2% | 68.2% | 53.0% | 55.1% | 54.2% | 66.4% | 67.5% |
| vehicle | 59.2% | 53.1% | 58.6% | 59.3% | 43.1% | 27.4% | 45.1% | 55.2% | 58.4% |
| windmill | 93.2% | 90.8% | 92.1% | 92.3% | 80.9% | 65.7% | 83.4% | 91.5% | 91.4% |
| mAP | 81.7% | 77.3% | 79.8% | 80.7% | 63.1% | 58.6% | 65.7% | 78.8% | 79.4% |
Table 5. Ablation experimental results on the RSOD dataset.
| No. | Models | Precision | Recall | F1 | mAP |
| --- | --- | --- | --- | --- | --- |
| 1 | ViT + PAN + NMS | 93.9% | 92.3% | 93.1% | 93.4% |
| 2 | DaViT + PAN + NMS | 95.0% | 93.8% | 94.4% | 94.7% |
| 3 | ViT + APAN + NMS | 95.3% | 93.7% | 94.5% | 95.9% |
| 4 | DaViT + APAN + NMS | 95.6% | 94.3% | 94.9% | 96.2% |
| 5 | ViT + APAN + WBF | 95.7% | 94.9% | 95.3% | 96.3% |
| 6 | DaViT + APAN + WBF | 95.9% | 95.2% | 95.5% | 96.9% |
Table 6. Ablation experimental results on the NWPU VHR-10 dataset.
| No. | Models | Precision | Recall | F1 | mAP |
| --- | --- | --- | --- | --- | --- |
| 1 | ViT + PAN + NMS | 91.2% | 92.5% | 91.8% | 95.4% |
| 2 | DaViT + PAN + NMS | 92.0% | 92.4% | 92.2% | 95.6% |
| 3 | ViT + APAN + NMS | 92.3% | 92.9% | 92.6% | 95.9% |
| 4 | DaViT + APAN + NMS | 92.5% | 93.0% | 92.7% | 96.0% |
| 5 | ViT + APAN + WBF | 93.0% | 92.8% | 92.9% | 96.2% |
| 6 | DaViT + APAN + WBF | 93.2% | 93.5% | 93.3% | 96.6% |
Table 7. Ablation experimental results on the DIOR dataset.
| No. | Models | Precision | Recall | F1 | mAP |
| --- | --- | --- | --- | --- | --- |
| 1 | ViT + PAN + NMS | 88.9% | 70.6% | 78.7% | 80.3% |
| 2 | DaViT + PAN + NMS | 89.3% | 72.3% | 79.9% | 80.9% |
| 3 | ViT + APAN + NMS | 89.3% | 72.8% | 80.2% | 81.0% |
| 4 | DaViT + APAN + NMS | 89.5% | 72.9% | 80.1% | 81.4% |
| 5 | ViT + APAN + WBF | 89.4% | 72.9% | 80.3% | 81.6% |
| 6 | DaViT + APAN + WBF | 90.0% | 73.0% | 80.6% | 81.7% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
