Article

Enhancing Remote Sensing Object Detection with K-CBST YOLO: Integrating CBAM and Swin-Transformer

1 National Engineering Research Center of Surveying and Mapping, China TopRS Technology Company Limited, Beijing 100039, China
2 Beijing Low-Altitude Remote Sensing Engineering Technology Research Center, Beijing 100039, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2885; https://doi.org/10.3390/rs16162885
Submission received: 5 July 2024 / Revised: 31 July 2024 / Accepted: 6 August 2024 / Published: 7 August 2024

Abstract
Object detection via remote sensing encounters significant challenges due to factors such as small target sizes, uneven target distribution, and complex backgrounds. This paper introduces the K-CBST YOLO algorithm, which is designed to address these challenges. It features a novel architecture that integrates the Convolutional Block Attention Module (CBAM) and Swin-Transformer to enhance global semantic understanding of feature maps and maximize the utilization of contextual information. Such integration significantly improves the accuracy with which small targets are detected against complex backgrounds. Additionally, we propose an improved detection network that combines the improved K-Means algorithm with a smooth Non-Maximum Suppression (NMS) algorithm. This network employs an adaptive dynamic K-Means clustering algorithm to pinpoint target areas of concentration in remote sensing images that feature varied distributions and uses a smooth NMS algorithm to suppress the confidence of overlapping candidate boxes, thereby minimizing their interference with subsequent detection results. The enhanced algorithm substantially bolsters the model’s robustness in handling multi-scale target distributions, preserves more potentially valid information, and diminishes the likelihood of missed detections. This study involved experiments performed on the publicly available DIOR remote sensing image dataset and the DOTA aerial image dataset. Our experimental results demonstrate that, compared with other advanced detection algorithms, K-CBST YOLO outperforms all its counterparts in handling both datasets. It achieved a 68.3% mean Average Precision (mAP) on the DIOR dataset and a 78.4% mAP on the DOTA dataset.

1. Introduction

The fundamental purpose of object detection is to determine the location of and classify objects of interest in images and videos. The representation of the target object in an image directly influences its detection. In natural image datasets, targets are usually concentrated in the center of the image; they are evenly distributed, of moderate size, and have relatively simple backgrounds. However, in remote sensing image datasets, due to their long shooting distances, target objects appear smaller, backgrounds are more complex, and the distribution of targets within the images varies, sometimes being dense and at other times sparse. In Figure 1, the first two rows depict natural images, while the last two rows feature remote sensing and aerial images.
As shown in Figure 1, there are significant visual differences between natural images and remote sensing or aerial images. The first and second rows in Figure 1 feature samples randomly drawn from the COCO and VOC datasets, in which all targets are centrally located, relatively large, and feature prominent main objects. To date, detection algorithms intended for natural images [1,2,3,4,5,6], including both one-stage and two-stage methods, have matured significantly, achieving excellent results using public benchmark datasets such as COCO. However, the third and fourth rows in Figure 1 display samples randomly drawn from the DIOR remote sensing dataset and the DOTA aerial dataset, respectively. These images feature complex backgrounds, undefined main subjects, and irregular distribution. Compared with natural images, the presentation of objects in remote sensing or aerial images demonstrates greater diversity and complexity. Consequently, detection algorithms suited to natural scenes are not appropriate for the detection tasks inherent in remote sensing or aerial imagery. This study provides an in-depth analysis of detection algorithms used to identify unevenly distributed small targets in remote sensing or aerial contexts.
Most existing object detection algorithms are based on the principles of convolutional neural networks. These algorithms are categorized into two types based on their working principles: one-stage and two-stage detection. One-stage detection algorithms simultaneously perform classification and localization, generating predicted candidate boxes for each potential object’s position, all the while calculating classification probabilities and bounding box regression losses. These algorithms sacrifice detection accuracy to increase speed, making them suitable for real-time application scenarios. Representative algorithms include SSD [3], YOLO [4], and RetinaNet [5]. YOLO is an end-to-end algorithm that predicts multiple bounding boxes and class probabilities in a single forward pass, thus significantly speeding up the process. The SSD algorithm uses multi-scale feature maps to detect objects of varying sizes by generating candidate boxes of various sizes and aspect ratios to better match target objects. The YOLO algorithm has undergone multiple updates and iterations, eventually evolving into YOLOv10. The authors of RetinaNet attribute the lower detection accuracy of one-stage algorithms to the imbalanced distribution of samples. Hence, RetinaNet introduces the focal loss scheme, increasing the weight of positive and hard-to-classify samples within the loss function in order to balance the distribution of samples, thereby enhancing the detection accuracy. Two-stage detection algorithms segment the detection process into two distinct stages. The first stage involves the generation of candidate regions from the input feature map, predicting potential object-containing locations. In the second stage, further adjustments and classifications are made to the bounding boxes of each candidate region. These algorithms have higher detection accuracy but are relatively slower, making them more suitable for scenarios requiring higher detection accuracy. Representative algorithms include R-CNN [1], Fast R-CNN [2], and Faster R-CNN [6]. R-CNN initially identifies potential regions of interest from the image, extracts features from each via a convolutional neural network, classifies them using a support vector machine (SVM), and adjusts their positions using bounding box regression to achieve object detection. Fast R-CNN and Faster R-CNN are improvements upon R-CNN and are designed to enhance the speed and accuracy of object detection.
The detection algorithms mentioned above leverage the translational invariance and local sensitivity of convolutional kernels to hierarchically extract abstract features from images. However, convolution operations cannot capture the contextual relationships between feature maps, thus hindering the construction of global semantic relationships between their features. Therefore, detection algorithms that rely solely on CNNs are ill-suited to image detection tasks involving complex backgrounds.
Initially, Transformers were primarily used in the field of natural language processing, in which they demonstrated powerful modeling capabilities and advantages in parallel computation. The proposal of DETR in 2020 brought Transformers into the realm of object detection. In object detection tasks, Transformers divide the input image into serialized patches and then model them using encoders. The core principle of the Transformer is to allow sufficient interaction between each patch, thereby establishing global dependencies within the feature map. Currently, mainstream algorithms based on Transformers include DETR [7], Swin-Transformer [8], and Twins [9]. Transformer modules are highly independent and flexible, making them transferable to different detection algorithms. Although Transformer-based detection algorithms can construct global semantic relationships and enhance a model’s overall understanding of images, thereby improving the accuracy with which small targets are detected against complex backgrounds, they have not effectively resolved the issues encountered when engaging in densely distributed or visually overlapping multi-object detection.
In response to the challenges of remote sensing image detection, this study adopts the network structure of YOLOv5 as the benchmark framework and proposes the K-CBST YOLO algorithm to enhance the detection accuracy of remote sensing targets. The main contributions of this study are outlined as follows:
(1)
This study proposes a module that integrates the CBAM mechanism with the Swin-Transformer to serve as a backbone for feature extraction. In this module, the Swin-Transformer performs global modeling of the feature map to obtain comprehensive background information, and the CBAM further extracts the features of key areas, thereby enhancing the model’s focus on critical regions. The combined use of both mechanisms effectively enhances the accuracy with which small targets are detected against complex backgrounds.
(2)
This study proposes a detection network based on the K-Means algorithm. During the detection stage, an adaptive method for adjusting the K value is designed to enhance the model’s generalization ability, thus enabling effective handling of images with uneven distributions. The enhanced K-Means algorithm identifies target distribution areas within images, thereby enhancing the accuracy with which multi-target remote sensing images with uneven distributions are detected.
(3)
This study proposes a smooth Non-Maximum Suppression algorithm. The core principle of this algorithm is to smoothly suppress the confidence scores of overlapping candidate boxes, thereby retaining more potentially useful information. This approach enhances the detection accuracy and recall rate of dense or overlapping targets, thus reducing the risk of missed detections due to excessive suppression.

2. Related Works

In recent years, trends in object detection have shifted from detecting large objects to focusing on small ones. Object detection via remote sensing has been widely applied in various fields, including military reconnaissance, crop monitoring, intelligent transportation, and urban planning. This section will discuss the advances made from natural image object detection algorithms to remote sensing object detection algorithms.
In natural image detection tasks, YOLO and R-CNN exemplify typical one-stage and two-stage detection algorithms, respectively. The introduction of YOLOv1 enabled classification and regression predictions in a single step. The YOLO model structure typically consists of three parts: the backbone, the neck, and the head. The backbone, responsible for extracting features from the input image, directly impacts the quality of feature extraction and overall model performance; commonly used backbone networks include Darknet and ResNet. The neck further optimizes the output feature maps from the backbone, enhancing the expressive power of features through the construction of a Feature Pyramid Network (FPN) or a Path Aggregation Network (PAN). The head usually contains multiple convolutional layers for generating predictions at different scales in order to accommodate targets of varying sizes. Unlike YOLO, the R-CNN detection algorithm introduced the principle of Region Proposals. R-CNN first uses the Selective Search method, which combines factors such as color, texture, size, and shape to generate around 2000 candidate regions that may contain targets. Subsequently, features are extracted from each preprocessed candidate region and finally classified using a support vector machine (SVM). The candidate region's bounding boxes are adjusted using linear regression to optimize the detection frame's alignment with the actual target location, thereby enhancing detection accuracy. In contrast, the DETR detection algorithm, introduced in 2020, was the first to fully apply the Transformer architecture to object detection tasks, achieving end-to-end detection. DETR adopts the approach of simultaneously performing classification and regression tasks. The most significant difference between DETR and the previously mentioned detection algorithms is that it uses the encoder–decoder structure of the Transformer, which helps focus on key parts of the image. Additionally, the Transformer's self-attention mechanism can capture long-range dependencies, better utilizing global information to obtain contextual relationships within the image.
However, in comparison to natural images, these algorithms do not perform ideally in detecting objects via remote sensing. Starting with YOLOv4 [10], CSPDarknet53 was proposed as the backbone, introducing the Cross-Stage Partial (CSP) structure to enhance feature propagation and fusion. CSP divides the feature map into two parts: one for feature extraction and the other to be directly input into subsequent stages without processing, thus preserving the higher-resolution features that are crucial for small object detection. Additionally, YOLOv4 introduced a PAN for feature fusion to enhance the semantic strength of lower-level features, thereby improving the detection accuracy of remote sensing targets. Faster R-CNN represents an improvement upon R-CNN and replaces the Selective Search algorithm with a Region Proposal Network (RPN), significantly speeding up the process of generating candidate regions. In Faster R-CNN, the RPN uses anchor boxes of different scales and ratios, enhancing the model’s robustness to multi-scale targets. Moreover, the RPN also shares the same feature map with the detection network, allowing the model to utilize these high-quality features in order to improve small object detection accuracy. In remote sensing or aerial images, targets often appear at different angles, and the detection boxes generated by Faster R-CNN are all horizontal; thus, they may fail to precisely cover each target’s location. Therefore, the ROI Transformer [11] detection algorithm introduces the concept of rotated boxes to the RPN, generating candidate boxes with rotation angles. This algorithm can more accurately describe targets in a rotated or inclined state. Compared with Faster R-CNN, this method is more suitable for tasks involving the detection of objects via remote sensing. The SCRDet [12] algorithm proposes the SF-Net (Finer Sampling and Feature Fusion Network), which fuses multiple layer features and selects effective candidate boxes as samples that are then used to enhance the model’s sensitivity to small targets. Additionally, SCRDet introduces the MDA-Net (Multi-Dimensional Attention Network) to suppress noise in the RPN, highlighting the features of target objects in order to improve small object detection accuracy. Moreover, this algorithm introduces an IoU constant factor into the smooth L1 loss function to optimize the position of the detection box, ensuring a closer fit to the actual target’s bounding box. Compared to YOLOv4 and Faster R-CNN, the improvements brought about by the ROI Transformer and SCRDet algorithms better meet the requirements of remote sensing targets’ detection. Recent studies, including the Multistage Enhancement Network [13] and advanced learning strategies [14,15], have proposed various approaches to enhance the detection accuracy of small objects in remote sensing images. These methods have been specifically optimized for the challenges associated with detecting remote sensing targets, aiming to enhance the precision of object detection in complex remote sensing environments.
Due to the complex characteristics of remote sensing images—including multiple scales, multiple angles, uneven distribution, and complex backgrounds—detection tasks involve significant challenges. Although existing object detection algorithms have made significant progress in remote sensing or aerial image detection tasks, many difficulties remain unresolved. Considering the above issues, this paper proposes a detection algorithm suitable for detecting densely packed small targets in complex backgrounds.

3. Materials and Methods

3.1. Overall Framework

In response to issues encountered in remote sensing imagery, this study analyzes the performance of existing object detection algorithms based on a CNN and Transformer. Building upon this analysis and using YOLOv5 as a foundational framework, this study proposes the K-CBST YOLO remote sensing object detection algorithm. K-CBST YOLO consists of three components: the backbone network, neck network, and detection network. Addressing the challenge of complex backgrounds in remote sensing imagery, this paper introduces the CBST module, which integrates Swin-Transformer and CBAM mechanisms as core feature extraction methods. The Swin-Transformer can extract rich background information and learn contextual relationships within feature maps. The CBAM extracts detailed features of effective regions from global information, thereby reducing the impact of redundant information on detection results. The CBST module significantly enhances the accuracy with which images with complex backgrounds are detected. To address the issue of detecting densely distributed or multiple overlapping targets in images, this paper introduces the K-Means clustering algorithm into the detection network, forming the K-Detector network. In the K-Detector, this study proposes an adaptive dynamic adjustment of the K-Means clustering algorithm and designs a smooth Non-Maximum Suppression algorithm. The clustering results provide a rough estimation of the area in which the targets for detection are distributed. Dynamic adjustment of K values aims to enhance the model’s robustness, enabling it to adapt to the variable distribution of remote sensing targets. When filtering redundant candidate boxes, a smooth Non-Maximum Suppression algorithm avoids excessive removal of valid information. This algorithm smoothly reduces the confidence scores of redundant candidate boxes, thereby preserving more potentially valid information. The K-Detector network effectively reduces the likelihood that targets are missed, improving the accuracy and recall in dense or overlapping multi-target detection. The structure of K-CBST YOLO is shown in Figure 2, and the design of the CBST module and K-Detector are described in detail in subsequent sections.

3.2. CBST Module

In the backbone network of K-CBST YOLO, the CBST module is integrated in order to extract the detailed features of key regions from rich global information, thereby enhancing performance in the detection of remote sensing targets. The CBST module comprises two components, Swin-Transformer and CBAM, which are connected in series. The Swin-Transformer [8] employs two self-attention mechanisms: W-MSA (Windows Multi-Head Self-Attention) and SW-MSA (Shifted Windows Multi-Head Self-Attention). The structure of the Swin-Transformer is shown in Figure 3.
The working principle of the Swin-Transformer is as follows.
First, the W-MSA module uniformly segments the input feature maps, which are M × M in size, into 2 × 2 small windows, each M / 2 × M / 2 in size, as illustrated in Figure 4 (among them, A, B, C, and D represent the four small windows obtained after segmentation). Next, MSA attention calculations are conducted within each small window to capture local details.
Subsequently, SW-MSA performs window shift operations on the segmented image from W-MSA. The SW-MSA translates the input tensor along the spatial dimension, merging feature information from different windows into a single shifted window. This process facilitates the interaction and fusion of features across windows, thereby achieving the cross-window capture of global information. As shown in Figure 5, Figure 5a illustrates the four small windows, A, B, C, and D, obtained through uniform segmentation using W-MSA. After SW-MSA's shift operation, Figure 5d is obtained, comprising the new small windows A′, B′, C′, and D′.
In the Swin-Transformer module, the LN (Layer Normalization) component performs normalization to accelerate the model’s convergence and improve training stability. The MLP (Multi-Layer Perceptron) component extracts and fuses features learned by MSA before introducing nonlinear transformations to the Swin-Transformer, thereby enhancing the model’s ability to handle complex data.
The Swin-Transformer module rearranges and combines the input feature maps using various segmentation methods to obtain the global background information of the feature maps. This module can to some extent reduce the model’s dependence on training data, thus improving the model’s generalization ability and robustness.
Compared to CNN-based feature extraction algorithms, the Swin-Transformer module does not alter the spatial resolution of the feature maps. CNN-based detection algorithms typically use convolutional kernels to increase receptive fields, which adds significant computational overhead during training and inference. In addition, excessively large convolutional kernels can decrease the spatial resolution of feature maps, resulting in the loss of fine-grained details and affecting the detection accuracy. Although stacking multiple small-sized convolutional kernels increases the receptive field, this approach significantly deepens network depth, complicates model architecture, and poses challenges for training and inference. However, the Swin-Transformer segments the input feature maps into multiple small windows, restricting the receptive field of each output feature to its respective small window. Figure 6 illustrates the segmentation of feature maps at different levels. As the hierarchy deepens, the size of the small window continues to shrink, gradually building a hierarchical structure. This module achieves the goals of expanding the receptive field, reducing the risk of information loss, and improving the accuracy of remote sensing objects’ detection without altering the spatial resolution of the feature map.
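To make the W-MSA/SW-MSA mechanics concrete, the following PyTorch sketch shows the window partitioning used by W-MSA and the cyclic shift that precedes SW-MSA. It is an illustrative fragment rather than the authors' implementation; the window size, tensor layout, and channel count are arbitrary choices made for the example.

```python
import torch

def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping win x win windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    # -> (num_windows * B, win, win, C); attention is then computed inside each window
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win, win, C)

def shifted_window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Cyclically shift the map by half a window before partitioning (SW-MSA),
    so that features from neighbouring W-MSA windows meet in the same window."""
    shifted = torch.roll(x, shifts=(-win // 2, -win // 2), dims=(1, 2))
    return window_partition(shifted, win)

feat = torch.randn(1, 8, 8, 96)                   # toy map: H = W = 8, C = 96
local = window_partition(feat, win=4)             # W-MSA: 4 windows of 4 x 4
crossed = shifted_window_partition(feat, win=4)   # SW-MSA: shifted windows
print(local.shape, crossed.shape)                 # both torch.Size([4, 4, 4, 96])
```

Because the shift merely rearranges which positions share a window, the spatial resolution of the feature map is left unchanged, which is the property emphasized in the next paragraph.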
The Swin-Transformer module can extract global information and enhance the correlation between the background and the object. However, during the training and inference stages, excessive redundant information can affect the model’s judgment of the target. Therefore, this article proposes integrating the CBAM mechanism [16] following the Swin-Transformer module. The CBAM enhances the model’s ability to capture detailed features of key areas.
The CBAM comprises a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism evaluates the importance of each channel and assigns corresponding weights, while the spatial attention mechanism identifies the positions of significant regions in the spatial domain. The structure of the CBAM is shown in Figure 7.
The upper part of Figure 7 shows the workflow of the channel attention module. First, the module performs average pooling and maximum pooling operations on the input feature map to compress the spatial dimensions, reducing a feature map F of shape (B, C, H, W) to two feature maps of shape (B, C, 1, 1), namely F_avg^c and F_max^c. Next, the pooling results are fed into a two-layer MLP to analyze the features of each channel and their respective importance levels. Finally, corresponding weight coefficients are assigned to each channel via the Sigmoid function. The calculation of the channel attention module proceeds as follows:
F_{avg}^{c} = \mathrm{AvgPool}(F) \qquad (1)
F_{max}^{c} = \mathrm{MaxPool}(F) \qquad (2)
M_{c}(F) = \sigma\big(\mathrm{MLP}(F_{avg}^{c}) + \mathrm{MLP}(F_{max}^{c})\big) = \sigma\big(W_{1}(W_{0}(F_{avg}^{c})) + W_{1}(W_{0}(F_{max}^{c}))\big) \qquad (3)
The channel attention module adaptively enhances channels containing important information and suppresses those containing irrelevant or redundant information. It reduces the model’s focus on invalid information and enhances its learning of critical information.
The lower part of Figure 7 illustrates the workflow of the spatial attention module. This module applies average and maximum pooling along the channel dimension of the input feature map, producing two maps of shape (B, 1, H, W). A convolutional layer is then used to extract important features from their concatenation. Finally, corresponding weight coefficients are assigned to each spatial position using the Sigmoid function. The calculation proceeds as follows:
F' = M_{c}(F) \otimes F \qquad (4)
F_{avg}^{s} = \mathrm{AvgPool}(F') \qquad (5)
F_{max}^{s} = \mathrm{MaxPool}(F') \qquad (6)
M_{s}(F') = \sigma\big(f^{7 \times 7}(\mathrm{Concat}(F_{avg}^{s}, F_{max}^{s}))\big) \qquad (7)
The spatial attention module employs a 7 × 7 convolutional filter to fuse the results of average pooling and maximum pooling. Building on the output results of the channel attention module, it further assists the model in capturing the spatial positions of key features.
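The two branches can be summarized in a short PyTorch sketch. This is a generic re-implementation of the CBAM equations above (channel attention via a shared two-layer MLP on pooled descriptors, spatial attention via a 7 × 7 convolution), not the code used in this paper; the reduction ratio r = 16 in the shared MLP is an assumption.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared two-layer MLP (W0, W1) of the channel attention branch
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels))
        # 7x7 convolution of the spatial attention branch
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        # Channel attention: average- and max-pool the spatial dims to (B, C)
        f_avg = f.mean(dim=(2, 3))
        f_max = f.amax(dim=(2, 3))
        mc = torch.sigmoid(self.mlp(f_avg) + self.mlp(f_max)).view(b, c, 1, 1)
        f1 = mc * f                                  # F' = Mc(F) x F
        # Spatial attention: pool along the channel dim, then 7x7 conv
        s_avg = f1.mean(dim=1, keepdim=True)         # (B, 1, H, W)
        s_max = f1.amax(dim=1, keepdim=True)
        ms = torch.sigmoid(self.conv(torch.cat([s_avg, s_max], dim=1)))
        return ms * f1                               # refined feature map

x = torch.randn(2, 64, 32, 32)
print(CBAM(64)(x).shape)                             # torch.Size([2, 64, 32, 32])
```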
Understanding the correlation between the target and its surrounding background is crucial for effective detection of remote sensing targets. Consequently, global information and local details are critical outcomes of feature extraction. In the CBST module, the Swin-Transformer captures the global semantic relationships of images through cross-window interaction. The CBAM extracts key detailed features from global information by evaluating both channel and spatial dimensions. Integrating this module into the backbone network enhances feature extraction capabilities, enabling the model to focus on key areas while capturing the overall semantic structure of the image, thereby improving the accuracy with which remote sensing targets are detected.

3.3. K-Detector

In the detection network, in order to solve the problem of detecting small targets with dense distribution, this article proposes a K-Detector that combines an adaptive dynamic K-Means clustering algorithm with a smooth Non-Maximum Suppression algorithm. This detector effectively reduces the likelihood that dense or overlapping small targets are missed and improves the detection accuracy. Detailed descriptions of the adaptive dynamic K-Means clustering algorithm and the smooth Non-Maximum Suppression algorithm are provided in Section 3.3.1 and Section 3.3.2, respectively.

3.3.1. Adaptive Dynamic K-Means Clustering Algorithm

The K-Means algorithm is a distance-based clustering algorithm. This method iteratively partitions the data into K clusters, ensuring that data points within each cluster are as close as possible to their cluster center and that data points in different clusters are as far apart as possible. The specific calculation steps proceed as follows.
(1)
K data points are randomly selected as initial cluster centers.
(2)
Each data point in the dataset is assigned to the cluster whose center is closest to it.
(3)
The mean of all data points within each cluster is calculated, and the cluster center’s position is updated accordingly.
(4)
Steps 2 and 3 are repeated until the cluster centers no longer change significantly or until the predetermined number of iterations is reached.
Within the K-Means algorithm, the initial value of k is manually set, meaning it lacks flexibility in practical applications. Due to the uneven distribution of targets in remote sensing images (as such targets may be densely clustered or sparsely distributed) and the presence of uncertainties, the manual setting of the value of k is an unsuitable approach for handling remote sensing images. To address this, this paper proposes an adaptive dynamic adjustment K-Means clustering algorithm. The calculation process for k is as follows:
k_{i} = \min\big(k_{init} + \lfloor i/T \rfloor + 1,\ k_{max}\big) \qquad (8)
k_{i} = k_{max}\big(1 - e^{-(i/T)}\big) \qquad (9)
In the above formulas, k_init and k_max represent the initial and maximum values of k, respectively, and i denotes the number of training iterations. T is a preset threshold indicating the number of iterations that must pass before the value of k is changed. The larger the value of k, the more detailed the partitioning of the target regions. However, an excessively large value of k can result in unclear boundaries between clusters, degrading the clustering performance. Therefore, an upper limit is set for the value of k. As k gradually approaches the maximum value k_max, an exponential decay function is used to smooth its convergence to this upper limit. The improved K-Means algorithm demonstrates strong robustness to the diverse distributions found in remote sensing images, effectively preventing overfitting in the clustering results.
In the object detection stage, each image generates 2500 predicted candidate boxes. This paper uses the coordinate positions of the center points of these candidate boxes as a dataset and applies the adaptive dynamic K-Means clustering algorithm to cluster all predicted candidate boxes. This algorithm can identify the areas in which targets are distributed within the image, as shown in Figure 8. Figure 8a illustrates a scenario with a small number of sparsely distributed targets, while Figure 8b,c depict scenarios with a larger number of densely distributed targets.
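The sketch below illustrates one way to implement the adaptive schedule and apply it to the candidate-box centers. The exact form of the smoothing term, the hyperparameter values (k_init, k_max, T), and the use of scikit-learn's KMeans are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_k(i: int, k_init: int = 2, k_max: int = 8, T: int = 20) -> int:
    """Grow k stepwise every T training iterations, capping it smoothly at k_max."""
    k = min(k_init + i // T + 1, k_max)
    if k >= k_max:
        # smooth the approach to the upper limit with an exponential term
        k = max(k_init, int(round(k_max * (1.0 - np.exp(-i / T)))))
    return k

def cluster_box_centres(centres: np.ndarray, iteration: int) -> np.ndarray:
    """Assign each predicted candidate-box centre in an (N, 2) array to one of k clusters."""
    k = adaptive_k(iteration)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(centres)

centres = np.random.rand(2500, 2) * 800       # e.g. 2500 predicted centres in an 800 x 800 image
labels = cluster_box_centres(centres, iteration=120)
print(adaptive_k(120), np.bincount(labels))   # cluster index per candidate box
```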
This algorithm effectively handles unevenly distributed remote sensing images, mitigating negative consequences and improving the detection accuracy. Although the clustering results of the adaptive dynamic K-Means algorithm provide valuable reference information for processing candidate boxes, they do not represent a resolution to the problem of the missed detection of densely packed or overlapping targets. Therefore, this paper proposes a smooth Non-Maximum Suppression algorithm to reduce the likelihood of missed detections.

3.3.2. Smooth Non-Maximum Suppression Algorithm

Non-Maximum Suppression (NMS) is a post-processing technique commonly used in object detection [17]. Its core principle is the suppression of overlapping candidate boxes based on their confidence scores, which reduces redundancy and enhances both detection accuracy and efficiency. The NMS algorithm proceeds as follows.
First, all predicted candidate boxes are sorted by their confidence scores in descending order. Next, for each class, the Intersection over Union (IoU) between the highest-confidence candidate box and the other candidate boxes is computed. Finally, a predefined threshold N_t is applied to filter out overlapping candidate boxes whose IoU exceeds it, retaining only the highest-confidence candidate box as the result of object detection. The formula used for this calculation is
S_{i} = \begin{cases} S_{i}, & \mathrm{IoU}(M, b_{i}) < N_{t} \\ 0, & \mathrm{IoU}(M, b_{i}) \geq N_{t} \end{cases} \qquad (10)
In Formula (10), S_i represents the confidence score of the i-th predicted candidate box, M denotes the candidate box with the highest confidence score, N_t represents the predefined overlap threshold, and b_i denotes a redundant candidate box. In the NMS algorithm, candidate boxes whose overlap exceeds the threshold have their confidence scores set to zero. Although this algorithm effectively suppresses the influence of redundant candidate boxes on the detection results and improves the model's detection efficiency, it is overly aggressive when handling densely distributed or visually overlapping targets. This may lead to the deletion of potentially valid candidate boxes, which can result in missed detections.
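For reference, the hard suppression rule of Formula (10) can be written as the following compact sketch; boxes are assumed to be in (x1, y1, x2, y2) form, and the implementation is illustrative rather than the one used by the authors.

```python
import numpy as np

def iou(a, b):
    """Plain IoU between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def hard_nms(boxes: np.ndarray, scores: np.ndarray, nt: float = 0.5):
    """Keep the highest-scoring box; discard every box whose IoU with it is >= nt."""
    order = list(scores.argsort()[::-1])
    keep = []
    while order:
        m = order.pop(0)
        keep.append(m)
        # boxes overlapping M beyond the threshold have their scores set to zero (discarded)
        order = [j for j in order if iou(boxes[m], boxes[j]) < nt]
    return keep
```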
To address the shortcomings of the NMS algorithm, this paper proposes a smooth Non-Maximum Suppression algorithm based on Euclidean distance. This algorithm smoothly reduces the confidence scores of overlapping candidate boxes, minimizing their interference with the detection results and preventing objects from being missed due to excessive removal. The calculation of this algorithm is shown in Figure 9, where M and b_i represent the highest-confidence candidate box and a relatively lower-confidence overlapping candidate box within a certain category, respectively.
Firstly, the clustering results of adaptive dynamic K-Means are divided into the following two scenarios.
(1)
If the center points of candidate boxes M and b_i belong to different clusters, i.e., M ∈ A_i and b_i ∈ A_j with i ≠ j, then M and b_i are candidate boxes generated from predictions of different targets. Therefore, the overlapping candidate box b_i is retained, meaning its confidence value remains unchanged, as illustrated in Formula (11).
S_{i} = S_{i}, \quad \mathrm{IoU}(M, b_{i}) < N_{t} \ \vee\ A_{M} \neq A_{b_{i}} \qquad (11)
(2)
If the center points of candidate boxes M and b_i belong to the same cluster, i.e., M, b_i ∈ A_i, candidate boxes M and b_i may be prediction results for the same target, requiring the confidence score of the overlapping candidate box b_i to be re-adjusted. However, when assessing targets that are too densely distributed or visually highly overlapping, the clustering results from adaptive dynamic K-Means alone cannot directly determine whether M and b_i are predictions of the same target. Therefore, this paper employs a smooth Non-Maximum Suppression algorithm to reduce the confidence score of candidate box b_i. This algorithm calculates the ratio of the Euclidean distance between the center points of candidate boxes M and b_i to the diagonal length of their minimum enclosing bounding box. This ratio is then used as the weight coefficient for b_i to suppress its confidence score and thereby reduce its interference with the detection results. The core principle of this algorithm is the gradual reduction in the importance of overlapping candidate boxes during detection, which allows the brute-force deletion approach of the traditional NMS algorithm to be circumvented. This approach allows more potentially valid candidate boxes to be retained during the detection stage. The specific calculation process is as follows:
Dis = \sqrt{(x_{M} - x_{b_{i}})^{2} + (y_{M} - y_{b_{i}})^{2}}, \quad \mathrm{IoU}(M, b_{i}) \geq N_{t} \ \wedge\ A_{M} = A_{b_{i}} \qquad (12)
\delta = \dfrac{Dis}{L_{M, b_{i}}} \qquad (13)
S_{i} = \delta \times S_{i} \qquad (14)
In the above formulas, (x_M, y_M) and (x_{b_i}, y_{b_i}) represent the center coordinates of candidate boxes M and b_i, respectively. L_{M,b_i} represents the diagonal length of the minimum bounding box that encloses both M and b_i; its calculation is detailed in GIoU (Generalized IoU) and is not repeated here. The smooth Non-Maximum Suppression algorithm effectively prevents dense or overlapping targets from being missed, thus improving the detection accuracy and recall rate of remote sensing targets.
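The two cases could be implemented roughly as follows. This sketch is our reading of Formulas (11)-(14) with an inline IoU helper; it is not the authors' released code, and the cluster labels are assumed to come from the adaptive K-Means step above.

```python
import numpy as np

def _iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def smooth_suppress(score_i, box_m, box_i, cluster_m, cluster_i, nt=0.5):
    """Adjusted confidence of candidate box b_i with respect to the top box M."""
    if _iou(box_m, box_i) < nt or cluster_m != cluster_i:
        return score_i                       # Formula (11): keep the score unchanged
    # Formula (12): Euclidean distance between the two box centres
    cm = np.array([(box_m[0] + box_m[2]) / 2.0, (box_m[1] + box_m[3]) / 2.0])
    ci = np.array([(box_i[0] + box_i[2]) / 2.0, (box_i[1] + box_i[3]) / 2.0])
    dis = float(np.linalg.norm(cm - ci))
    # Diagonal of the minimum box enclosing both M and b_i (as in GIoU)
    ex1, ey1 = min(box_m[0], box_i[0]), min(box_m[1], box_i[1])
    ex2, ey2 = max(box_m[2], box_i[2]), max(box_m[3], box_i[3])
    diag = float(np.hypot(ex2 - ex1, ey2 - ey1))
    delta = dis / (diag + 1e-9)              # Formula (13): weight in [0, 1)
    return delta * score_i                   # Formula (14): smoothly suppressed score
```

Because the weight shrinks with the distance between box centers, near-duplicate predictions of the same object are strongly down-weighted while neighbouring objects in a dense cluster keep most of their confidence.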

4. Experimental Evaluations

To illustrate the performance of the K-CBST YOLO model in object detection via remote sensing, this section details our experimental methodology. Section 4.1 describes the dataset used in the experiments, Section 4.2 elaborates on the selected experimental parameters and evaluation metrics, and Section 4.3 discusses the performance of K-CBST YOLO through ablation and comparative experiments.

4.1. Experimental Dataset

DIOR [18] is a large-scale benchmark dataset specifically designed for image object detection via remote sensing; it was introduced by Northwestern Polytechnical University in 2019. This dataset comprises 23,463 remote sensing images and 192,472 object instances, with each image having dimensions of 800 × 800. The DIOR dataset encompasses 20 categories, including airplane (APL), airport (APO), baseball field (BF), basketball court (BC), bridge (BR), chimney (CH), dam (DAM), expressway service area (ESA), expressway toll station (ETS), golf field (GF), ground track field (GTF), harbor (HA), overpass (OP), ship (SH), stadium (STA), storage tank (STO), tennis court (TC), train station (TS), vehicle (VE), and windmill (WM). The DIOR dataset is rich and diverse, making it an important resource when developing technologies for object detection via remote sensing. This paper conducts experiments based on the DIOR dataset to ensure that our results are robust.
DOTA [19] is a dataset specifically designed for aerial image object detection. It comprises 2806 aerial images and 188,282 object instances, with image sizes ranging from small (800 × 800) to very large (4000 × 4000). The images are sourced from diverse origins and feature various backgrounds and scenes, demonstrating a wide range of scales, orientations, and shapes. Three versions of the DOTA dataset have been released, and this paper utilizes DOTA-v1.0. It encompasses 15 common categories, such as plane (PL), baseball diamond (BD), bridge (BR), and others. The DOTA dataset is eminently suitable for developing algorithms that can detect various targets in aerial images.

4.2. Experimental Details and Evaluation Indicators

All experiments described in this study were conducted within the PyTorch framework on an NVIDIA GeForce RTX 4060 Laptop GPU. YOLOv5s served as the baseline framework. The AdamW optimizer was utilized; this optimizer separately handles weight decay and gradient updates to prevent model overfitting and enhance the model’s generalization ability. The initial learning rate was set at 0.001, utilizing a Cosine Annealing LR strategy to gradually reduce the learning rate throughout the training epochs until a minimum value was reached via periodic repetitions of this process. The weight decay coefficient was set to 0.001, momentum to 0.6, and batch size to 16, and training was conducted over 200 epochs. Additionally, owing to the multi-scale characteristics of the DOTA dataset, the DOTA-v1.0 dataset was cropped into a series of 1024 × 1024 small image blocks, featuring a step of 824 and an overlap of 200 pixels for training and testing. During the training phase, a horizontal flip was performed on the entire dataset, with a probability of 50%.
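The optimizer and schedule described above can be set up roughly as follows. The model and data pipeline are placeholders, the minimum learning rate is an assumption, and mapping the stated momentum of 0.6 to AdamW's first beta coefficient is our interpretation.

```python
import torch
from torchvision import transforms

model = torch.nn.Conv2d(3, 16, 3)        # placeholder for the K-CBST YOLO network
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.6, 0.999), weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-6)  # cosine decay over the 200 training epochs

augment = transforms.RandomHorizontalFlip(p=0.5)   # 50% horizontal flip augmentation

for epoch in range(200):
    # ... iterate over batches of 16 images, compute the detection loss,
    #     call loss.backward() and optimizer.step() ...
    scheduler.step()                     # anneal the learning rate once per epoch
```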
Precision (P), recall (R), mAP50, and mAP50:95 are used as experimental evaluation metrics. Precision reflects the percentage of true targets among all detected targets. Recall reflects the percentage of detected true targets among all true targets. mAP stands for mean Average Precision. mAP50 reflects the overall performance of the model when the detection overlap threshold is greater than 0.5, while mAP50:95 measures the average performance of the model across an overlap threshold range from 0.5 to 0.95. The calculation of the evaluation metrics proceeds as follows:
P = \dfrac{TP}{TP + FP}, \qquad R = \dfrac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} P \, \mathrm{d}R, \qquad mAP = \dfrac{1}{m}\sum_{i} AP_{i} \qquad (15)
In Formula (15), TP denotes the number of positive samples correctly predicted as positive, FP denotes the number of negative samples incorrectly predicted as positive, and FN denotes the number of positive samples incorrectly predicted as negative.
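As a concrete illustration of Formula (15), the snippet below computes precision, recall, and an all-points AP from per-detection true-positive flags sorted by confidence; the input format and the step-wise integration scheme are our own choices for the example.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from detection counts."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

def average_precision(tp_flags: np.ndarray, num_gt: int) -> float:
    """AP = integral of precision over recall; tp_flags are 1/0 per detection,
    ordered by descending confidence, num_gt is the number of ground-truth boxes."""
    tp_cum = np.cumsum(tp_flags)
    fp_cum = np.cumsum(1 - tp_flags)
    recall = tp_cum / (num_gt + 1e-9)
    precision = tp_cum / (tp_cum + fp_cum + 1e-9)
    # step-wise integration of P over R
    ap = recall[0] * precision[0] + np.sum((recall[1:] - recall[:-1]) * precision[1:])
    return float(ap)

flags = np.array([1, 1, 0, 1, 0])            # toy example: 3 TPs among 5 detections
print(average_precision(flags, num_gt=4))    # mAP averages this AP over the m classes
```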

4.3. Experimental Results

To more effectively validate the impact of the methods proposed in this paper on the performance of remote sensing object detection, both ablation and comparative experiments were designed. Ablation experiments aim to analyze the influence of different modules on the model’s performance. Comparative experiments assess the overall performance of K-CBST YOLO by comparing it with other mainstream algorithms.
(1)
Ablation experiments
To validate the positive effects of the CBST and K-Detector modules on remote sensing object detection, as proposed in this paper, a series of ablation experiments was conducted. The experiments utilized the DIOR dataset with a variety of configurations, including the Baseline model, Baseline + ST, Baseline + CBAM, Baseline + K-Detector, Baseline + ST + K-Detector, Baseline + CBAM + K-Detector, Baseline + CBST, and Baseline + CBST + K-Detector, all with the same hardware and software setup. The experimental results are presented in Table 1, where P_best and R_best denote the average detection precision and recall achieved by the models under optimal conditions, respectively. It should be noted that 'Baseline +' is abbreviated to '+' in the table.
Table 1 shows that the baseline model has relatively low values for P_best, R_best, mAP, and mAP50:95, indicating that the baseline model is not well suited for remote sensing image detection. Compared to the baseline model, the highest average detection precision and recall for Baseline + ST reached 73.0% and 61.7%, respectively, with an increase of 1.6% and 0.1% in mAP and mAP50:95, respectively. Baseline + CBAM reached a maximum average detection precision and recall of 71.8% and 60.8%, respectively, but mAP and mAP50:95 decreased by 1.6% and 2.9%. Baseline + K-Detector showed a maximum average detection precision and recall of 70.2% and 64.3%, with mAP increasing by 3.7% and mAP50:95 decreasing by 0.2%. Baseline + ST + K-Detector achieved a maximum average detection precision and recall of 74.3% and 65.6%, with mAP and mAP50:95 increasing by 1.2% and 1.0%, respectively. Baseline + CBAM + K-Detector achieved a maximum average detection precision and recall of 71.4% and 63.9%, but mAP and mAP50:95 decreased by 1.7% and 2.7%. Baseline + CBST reached a maximum average detection precision and recall of 75.2% and 62.9%, with mAP and mAP50:95 improving by 1.2% and 2.3%, respectively, compared to the baseline model. K-CBST YOLO (Baseline + CBST + K-Detector) achieved the best results in maximum average detection precision, recall, and mAP at 75.4%, 66.2%, and 68.3%, respectively, and the mAP 50:95 was increased by 1.8% over the baseline. Analysis of R_best values in Table 1 shows that models integrating the K-Detector have higher recall rates than models without the K-Detector, indicating that the K-Detector plays a positive role in reducing the probability of missing dense, small targets. Compared to other models integrating the K-Detector, Baseline + CBST + K-Detector achieved higher R_best scores, indicating that the synergy between the CBST module and K-Detector is particularly effective in improving detection precision and minimizing missed detections. In the ablation experiments, models incorporating various module configurations were evaluated, robustly validating the beneficial effects of the ST, CBAM, CBST, and K-Detector on target detection.
As shown in the last three rows of Table 1, we analyzed the effects of the different model architectures on training time, testing time, and model parameters (the times recorded in Table 1 refer to the average time needed to process a single sample). The baseline model has the shortest training and testing times, requiring 614 ms and 4.86 ms, respectively, and its memory footprint is also the smallest, occupying only 13.61 MB. Compared with the baseline model, the other configurations showed increases in training time, testing time, and model size to varying degrees. Among them, the average time taken by K-CBST YOLO to train on a single sample reached 671 ms, an increase of 57 ms over the baseline. The testing time reached 7.37 ms, an increase of 2.51 ms over the baseline, and the model memory increased by 1.83 MB. Although the training and testing times and model size increased, these changes were not qualitatively significant. More importantly, the detection accuracy of K-CBST YOLO significantly improved relative to the baseline model. Overall, the methodology proposed in this paper achieves a reasonable balance between accuracy and efficiency, demonstrating the suitability of the algorithm for remote sensing object detection tasks.
Additionally, to further validate the impact of the ST, CBAM, and CBST modules on the model’s feature extraction capabilities, in this section, we randomly select seven diverse types of remote sensing images (including densely packed ships, sparsely located airplanes, playgrounds with complex backgrounds, regularly arranged vehicles, and oil tanks, among others). The feature extraction process is illustrated in Figure 10.
Based on the above results, we can confirm that the K-CBST YOLO algorithm proposed in this paper effectively enhances the accuracy with which remote sensing targets are detected, demonstrating robustness and the ability to generalize. Even against complex backgrounds, this algorithm can detect targets accurately, and it also achieved notable results in detecting unevenly distributed remote sensing targets.
(2)
Comparative experiments
To comprehensively assess the advantages of the K-CBST YOLO algorithm, this paper compares its effectiveness with that of mainstream algorithms such as Faster R-CNN, YOLOv4, RetinaNet, R-DFPN [20], CFC-Net [21], ROI Transformer, and QPDet [22] on the DIOR remote sensing dataset. On the DOTA aerial dataset, comparisons are made between our algorithm and Faster R-CNN, ICN [23], ROI Transformer, R-DFPN, QPDet, and SCRDet. The experimental results for the two datasets are presented in Table 2 and Table 3, respectively.
From Table 2, it is apparent that the mAP (mean Average Precision) of K-CBST YOLO reached 68.3%, outperforming the other mainstream detection algorithms. In eight categories (APL, BF, OP, SH, STA, TC, VE, and WM), the K-CBST YOLO algorithm achieved the highest accuracy. In the detection of the DAM and GF categories, K-CBST YOLO's detection accuracy was 15.8% and 22% lower than that of RetinaNet, respectively. In the detection of the APO and BC categories, K-CBST YOLO's detection accuracy was 39.3% and 12.2% lower, respectively, than that of R-DFPN. In the detection of the BR, CH, HA, and STO categories, K-CBST YOLO's detection accuracy was 22.1%, 2.4%, 10.7%, and 2.6% lower, respectively, than that of CFC-Net. In the detection of the ESA, ETS, GTF, and TS categories, the QPDet detection algorithm exhibited optimal performance; K-CBST YOLO's detection accuracy was 8.4%, 18%, 8.6%, and 24.3% lower, respectively, than that of QPDet.
From Table 3, it is apparent that, in the detection of the DOTA aerial dataset, the K-CBST YOLO algorithm achieved the highest detection accuracy in eight categories: BR, SV, LV, SBF, RA, HA, SP, and HC. In the detection of the GTF category, K-CBST YOLO’s detection accuracy was 8.3% lower than that of the ROI Transformer. For the BD, SH, and BC categories, K-CBST YOLO’s detection accuracy decreased by 0.4%, 7.4%, and 22.3%, respectively, compared to the QPDet algorithm. In the detection of the TC category, both QPDet and SCRDet achieved a performance of 90.9%, which was 25.4% higher than that of K-CBST YOLO. In the detection of PL and ST categories, K-CBST YOLO’s detection accuracy was 0.7% and 2.8% lower, respectively, than that of SCRDet. The average detection accuracy of K-CBST YOLO reached 78.4%, outperforming other detection algorithms.
In conclusion, compared to other mainstream algorithms, K-CBST YOLO demonstrated the best overall performance. It achieved the highest scores in the detection of most image categories. To further validate the detection performance of K-CBST YOLO, in this section, we randomly select several images from the DIOR and DOTA datasets, as shown in Figure 11 and Figure 12.
Figure 11 displays the DIOR dataset, while Figure 12 presents the DOTA dataset. From the detection results depicted in these images, it is apparent that K-CBST YOLO is particularly suited to the detection of small objects in remote sensing and aerial imagery. Whether dealing with images containing complex backgrounds, with densely distributed objects, or with visually overlapping images, K-CBST YOLO demonstrates robust performance across both datasets.

5. Conclusions

Addressing remote sensing object detection challenges, this paper proposes the K-CBST YOLO detection algorithm, which is based on the YOLO framework. The algorithm incorporates a CBST module that combines Swin-Transformer and CBAM mechanisms as the core of the backbone network. This module constructs the global semantic relations of the feature map while refining key features without altering the spatial resolution, thereby reducing the interference of complex backgrounds in detection results. For the detection network, the algorithm introduces the K-Detector. The K-Detector employs an adaptive dynamic K-Means clustering algorithm to provide target distribution regions based on clustering results. Additionally, a smooth Non-Maximum Suppression algorithm is designed to mitigate the impact of overlapping candidate boxes on detection results. The K-Detector retains more potentially useful information for the model, thereby reducing the risk of missing densely distributed or overlapping targets. Our experimental results indicate that the K-CBST YOLO algorithm significantly outperforms other mainstream detection algorithms, rendering it particularly suitable for object detection via remote sensing.

Author Contributions

Funding acquisition, J.X. and Y.L.; writing—original draft, A.C.; writing—review and editing, J.L., Y.R. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Key Research and Development Program of China (No. 2022YFC3320802 and No. 2023YFB3905704), and Central Guiding Local Technology Development (No. 226Z5901G).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Although the authors Aonan Cheng, Jincheng Xiao, Yingcheng Li, Yiming Sun, Yafeng Ren, and Jianli Liu are employed by China TopRS Technology Company Limited, the funding for this research was provided by the National Key R&D Program of China’s Ministry of Science and Technology. Additionally, this research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  2. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  3. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; Volume 14, pp. 21–37. [Google Scholar]
  4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  5. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  7. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  8. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  9. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
  10. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  11. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  12. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
  13. Zhang, T.; Zhang, X.; Zhu, X.; Wang, G.; Han, X.; Tang, X.; Jiao, L. Multistage enhancement network for tiny object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  14. Biswas, D.; Tešić, J. Domain adaptation with contrastive learning for object detection in satellite imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  15. Yang, X.; Jiao, L.; Li, Y.; Liu, X.; Liu, F.; Li, L.; Yang, S. Relation Learning Reasoning Meets Tiny Object Tracking in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  16. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  17. Hosang, J.; Benenson, R.; Schiele, B. A convnet for non-maximum suppression. In Proceedings of the Pattern Recognition: 38th German Conference, GCPR 2016, Hannover, Germany, 12–15 September 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 192–204. [Google Scholar]
  18. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  19. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  20. Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sens. 2018, 10, 132. [Google Scholar] [CrossRef]
  21. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A critical feature capturing network for arbitrary-oriented object detection in remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  22. Yao, Y.; Cheng, G.; Wang, G.; Li, S.; Zhou, P.; Xie, X.; Han, J. On improving bounding box representations for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–11. [Google Scholar] [CrossRef]
  23. Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards multi-class object detection in unconstrained remote sensing imagery. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 150–165. [Google Scholar]
Figure 1. Comparison of natural images and remote sensing/aerial images.
Figure 2. K-CBST YOLO structure.
Figure 3. Swin-Transformer module.
Figure 4. W-MSA.
Figure 5. Window shift operation. (a) is an image composed of four small windows obtained through uniform segmentation by W-MSA. (b,c) are images showing different arrangements of the small windows obtained during the processing of SW-MSA’s shifting operations. (d) is the image obtained after the completion of SW-MSA’s shifting operation.
Figure 6. Swin-Transformer segmentation feature map.
Figure 7. CBAM structure diagram.
Figure 8. Dynamic clustering results. (a) shows a remote sensing image with sparse targets and its target clustering results. (b,c) show remote sensing images with densely distributed targets and their corresponding multi-target clustering results.
Figure 9. Smooth Non-Maximum Suppression algorithm.
Figure 10. Grad-CAM visualizations. (a,f) depict densely packed ships docked at the harbor and vehicles parked alongside the road. The attention regions of the Baseline + CBST comprehensively cover all targets, while heatmaps from the other models exhibit varying degrees of omission. Particularly in (f), the other models largely overlooked the vehicles on the road, with only Baseline + CBST accurately pinpointing the areas in which all targets are located. In images such as (b,d,e), which feature remote sensing targets arranged in an orderly manner, no models missed detections, but Baseline and Baseline + ST excessively focused on areas without targets. (c) portrays a playground with a complex background in which, aside from the baseline model excessively focusing on irrelevant areas, other models accurately identified the target locations. (g) displays scattered and diminutive ships; only Baseline + ST and Baseline + CBST precisely focused on all targets.
Figure 11. Detection effects on DIOR dataset.
Figure 12. Detection effects on DOTA dataset.
Table 1. Results of ablation experiments.

Evaluation Indicator | Baseline | +ST | +CBAM | +K-Detector | +ST + K-Detector | +CBAM + K-Detector | +CBST | +CBST + K-Detector
P_best | 0.648 | 0.730 | 0.718 | 0.702 | 0.743 | 0.714 | 0.752 | 0.754
R_best | 0.554 | 0.617 | 0.608 | 0.643 | 0.656 | 0.639 | 0.629 | 0.662
mAP | 0.648 | 0.664 | 0.632 | 0.685 | 0.660 | 0.631 | 0.660 | 0.683
mAP50:95 | 0.400 | 0.401 | 0.371 | 0.398 | 0.410 | 0.373 | 0.423 | 0.418
Train speed | 614 ms | 632 ms | 618 ms | 654 ms | 659 ms | 657 ms | 645 ms | 671 ms
Test speed | 4.86 ms | 5.27 ms | 4.97 ms | 7.15 ms | 7.35 ms | 6.95 ms | 5.89 ms | 7.37 ms
Param | 13.61 MB | 13.77 MB | 13.87 MB | 14.30 MB | 14.68 MB | 14.96 MB | 13.97 MB | 15.44 MB
Table 2. Results of comparison experiments using DIOR dataset.

Category | Faster R-CNN | YOLOv4 | RetinaNet | R-DFPN | CFC-Net | ROI Transformer | QPDet | K-CBST YOLO
APL | 0.638 | 0.814 | 0.626 | 0.783 | 0.923 | 0.633 | 0.632 | 0.926
APO | 0.616 | 0.493 | 0.722 | 0.809 | 0.489 | 0.379 | 0.414 | 0.416
BF | 0.669 | 0.671 | 0.682 | 0.766 | 0.716 | 0.712 | 0.720 | 0.923
BC | 0.846 | 0.726 | 0.848 | 0.903 | 0.812 | 0.875 | 0.886 | 0.781
BR | 0.280 | 0.395 | 0.505 | 0.338 | 0.589 | 0.407 | 0.412 | 0.368
CH | 0.730 | 0.726 | 0.767 | 0.678 | 0.909 | 0.726 | 0.726 | 0.885
DAM | 0.445 | 0.366 | 0.545 | 0.513 | 0.463 | 0.269 | 0.288 | 0.387
ESA | 0.526 | 0.568 | 0.564 | 0.579 | 0.505 | 0.681 | 0.690 | 0.606
ETS | 0.423 | 0.599 | 0.471 | 0.509 | 0.574 | 0.787 | 0.789 | 0.609
GF | 0.712 | 0.602 | 0.749 | 0.736 | 0.688 | 0.690 | 0.701 | 0.529
GTF | 0.659 | 0.661 | 0.677 | 0.674 | 0.661 | 0.827 | 0.830 | 0.744
HA | 0.486 | 0.523 | 0.427 | 0.501 | 0.623 | 0.477 | 0.478 | 0.516
OP | 0.501 | 0.518 | 0.526 | 0.558 | 0.540 | 0.556 | 0.555 | 0.564
SH | 0.716 | 0.802 | 0.680 | 0.743 | 0.724 | 0.812 | 0.812 | 0.919
STA | 0.340 | 0.654 | 0.466 | 0.584 | 0.836 | 0.782 | 0.722 | 0.899
STO | 0.634 | 0.707 | 0.471 | 0.629 | 0.838 | 0.703 | 0.627 | 0.812
TC | 0.775 | 0.866 | 0.774 | 0.726 | 0.883 | 0.816 | 0.891 | 0.922
TS | 0.403 | 0.434 | 0.400 | 0.487 | 0.398 | 0.549 | 0.581 | 0.338
VE | 0.459 | 0.501 | 0.371 | 0.482 | 0.552 | 0.433 | 0.434 | 0.726
WM | 0.698 | 0.735 | 0.719 | 0.707 | 0.754 | 0.655 | 0.654 | 0.786
mAP | 0.578 | 0.603 | 0.590 | 0.623 | 0.660 | 0.639 | 0.642 | 0.683
Table 3. Results of comparison experiments using DOTA dataset.

Category | Faster R-CNN | ICN | ROI Transformer | R-DFPN | QPDet | SCRDet | K-CBST YOLO
PL | 0.884 | 0.814 | 0.886 | 0.809 | 0.896 | 0.902 | 0.895
BD | 0.731 | 0.743 | 0.785 | 0.658 | 0.837 | 0.819 | 0.833
BR | 0.449 | 0.477 | 0.434 | 0.338 | 0.541 | 0.553 | 0.659
GTF | 0.591 | 0.703 | 0.759 | 0.589 | 0.739 | 0.733 | 0.676
SV | 0.733 | 0.649 | 0.688 | 0.558 | 0.789 | 0.721 | 0.801
LV | 0.715 | 0.678 | 0.737 | 0.509 | 0.831 | 0.776 | 0.912
SH | 0.771 | 0.700 | 0.836 | 0.548 | 0.883 | 0.781 | 0.809
TC | 0.908 | 0.908 | 0.907 | 0.903 | 0.909 | 0.909 | 0.655
BC | 0.789 | 0.791 | 0.773 | 0.663 | 0.866 | 0.824 | 0.643
ST | 0.839 | 0.782 | 0.815 | 0.687 | 0.848 | 0.864 | 0.836
SBF | 0.486 | 0.536 | 0.584 | 0.487 | 0.620 | 0.645 | 0.933
RA | 0.630 | 0.629 | 0.535 | 0.518 | 0.655 | 0.634 | 0.818
HA | 0.622 | 0.670 | 0.628 | 0.551 | 0.742 | 0.758 | 0.772
SP | 0.650 | 0.642 | 0.589 | 0.513 | 0.701 | 0.782 | 0.809
HC | 0.562 | 0.502 | 0.477 | 0.359 | 0.582 | 0.601 | 0.695
mAP | 0.691 | 0.682 | 0.696 | 0.579 | 0.762 | 0.753 | 0.784