Article

SE-CBAM-YOLOv7: An Improved Lightweight Attention Mechanism-Based YOLOv7 for Real-Time Detection of Small Aircraft Targets in Microsatellite Remote Sensing Imaging

Department of Electronic and Optical Engineering, Space Engineering University, Beijing 101416, China
* Author to whom correspondence should be addressed.
Aerospace 2024, 11(8), 605; https://doi.org/10.3390/aerospace11080605
Submission received: 13 June 2024 / Revised: 16 July 2024 / Accepted: 22 July 2024 / Published: 24 July 2024
(This article belongs to the Special Issue On-Board Systems Design for Aerospace Vehicles (2nd Edition))

Abstract

Real-time aircraft target detection in microsatellite-based visible light remote sensing video imaging must contend with limited imaging payload resolution, complex ground backgrounds, and relative positional changes between the platform and the aircraft. These factors produce multi-scale variations in aircraft targets, making high-precision real-time detection of small targets in complex backgrounds a significant challenge for detection algorithms. Hence, this paper introduces a real-time aircraft target detection algorithm for remote sensing imaging based on an improved lightweight attention mechanism built on the You Only Look Once version 7 (YOLOv7) framework (SE-CBAM-YOLOv7). The proposed algorithm replaces the standard convolution (Conv) with a lightweight squeeze-and-excitation convolution (SEConv) to reduce the computational parameters and accelerate the detection of small aircraft targets, thus enhancing real-time onboard processing capability. In addition, the SEConv-based spatial pyramid pooling and connected spatial pyramid convolution (SPPCSPC) module extracts image features and improves detection accuracy, while the feature fusion section integrates the convolutional block attention module (CBAM) hybrid attention network to form the convolutional block attention module Concat (CBAMCAT) module, which optimizes small aircraft target features in the channel and spatial dimensions and improves the model’s feature fusion capability. Experiments on public remote sensing datasets show that the proposed SE-CBAM-YOLOv7 improves detection accuracy by 0.5% and the mAP value by 1.7% compared with YOLOv7, significantly enhancing the detection of small aircraft targets in satellite remote sensing imaging.

1. Introduction

Detecting small targets within complex backgrounds is a research hotspot, typified by the real-time detection of small aircraft targets in satellite-based visible light remote sensing video imaging. The term “complex background” refers to a scenario in which the background image is rich in information and large in extent, exhibiting a significant spatial scale ratio difference from the target object. In satellite remote sensing images, aircraft targets may be embedded within intricate environments such as clouds, mountains, or snow-covered regions, posing considerable challenges for their detection and classification. Remote sensing images contain complex background texture information [1,2], and aircraft targets appear as multi-scale variation targets influenced by the satellite platform’s imaging resolution and relative positional changes. High-precision detection is therefore extremely challenging, especially when small targets are superimposed on complex background interference. It is thus necessary to investigate the real-time detection of small aircraft targets in satellite remote-sensing imaging.
Object detection is a crucial task in computer vision, aiming to automatically identify and locate specific objects in images or videos using computer algorithms and techniques [3,4]. In recent years, the continuous development of aerospace, computer, sensor, and data processing technology has pushed the boundaries of object detection, demonstrating significant capabilities in military, civilian, and intelligent applications [5]. Traditional object detection algorithms comprise region selection, feature extraction, and classification stages. Deep learning-based object detection algorithms outperform traditional algorithms in complex scenes, and convolutional neural network (CNN)-based detectors are divided into two-stage and single-stage algorithms. Two-stage detection algorithms first generate proposal boxes for candidate regions. They afford high detection accuracy but have a complex structure and long training times. Classic two-stage detection algorithms include R-CNN [6], Fast R-CNN [7], Faster R-CNN [8], and Mask R-CNN [9]. Single-stage detection algorithms directly generate class probabilities and position coordinates for objects. They have a simple structure and shorter training times but slightly lower accuracy. Classic single-stage detection algorithms include SSD [10] and YOLO [11].
Despite the high level of current technology, there is still room for improvement in detecting small objects. In particular, detecting small aircraft targets in aerial images is challenging because their small size often causes them to be filtered out by the pooling layers of CNNs. Hence, this paper addresses this challenge by proposing a new deep learning algorithm, SE-CBAM-YOLOv7, which improves traditional convolution techniques and incorporates state-of-the-art attention mechanisms. The main contributions of this paper are as follows:
  • Replacing the standard convolution (Conv) process with a new lightweight convolution (SEConv) to reduce the network’s computational parameters and speed up the detection process for small aircraft targets;
  • Designing the SESPPCSPC module that integrates the channel attention mechanism network SENet. This achieves multi-scale spatial pyramid pooling on the input feature maps, enhances the model’s receptive field and feature expression capabilities, and improves the network’s feature extraction capability;
  • Introducing CBAMCAT, a new feature fusion layer that sequentially infers attention maps along two independent dimensions (channel and spatial). The attention maps are multiplied with the input feature maps for adaptive optimization, improving the model’s feature fusion capability.
This paper is organized as follows. Section 2 discusses the existing work on the YOLO algorithm for target detection in remote sensing images. Section 3 overviews the proposed SE-CBAM-YOLOv7 network. Section 4 presents the experiments conducted, tests the algorithm’s performance on a small aircraft target dataset, and analyzes the results. Finally, Section 5 concludes this paper.

2. Related Work

In 2016, inspired by the GoogLeNet architecture [12], Redmon et al. introduced the YOLO (You Only Look Once) structure in the CVPR paper “You Only Look Once: Unified, Real-Time Object Detection” [11]. They replaced the inception modules of GoogLeNet with 1 × 1 convolutions followed by 3 × 3 convolution filters. The main feature of YOLO is that it integrates object localization and classification into a single neural network model, thereby achieving fast object detection and recognition with high accuracy. Since the introduction of YOLOv1 in 2016, the YOLO algorithm has undergone continuous updates and optimizations, with each subsequent version introducing architectural innovations that increase the speed and accuracy of object detection.
In recent years, YOLO has been widely applied to target detection in remote sensing images. However, real-time detection of small objects in remote sensing images captured by drones is challenging, as the varying drone shooting angles present the target objects at different scales, densities, and shapes. Hence, Zhang et al. focused on real-time small vehicle detection in drone-captured remote sensing images and proposed a depth-wise attention mechanism network (DAGN) based on YOLOv3. This method combines feature concatenation and attention modules, allowing the model to distinguish between important and unimportant features, thereby improving vehicle detection and promoting real-time detection of small objects in drone imagery [13]. Nevertheless, current research on aircraft target detection and classification in remote sensing images suffers from data sample imbalance, significant variations in target scales and backgrounds, and target occlusion, leading to low average precision and slow detection speeds. Spurred by these concerns, Liu et al. proposed the YOLO-Extract model to detect small, dense, and occluded targets. Specifically, they optimized the Mish activation function and the Conv module using representative batch normalization. Furthermore, they improved the classification loss function using a VariFocal loss to overcome the precision issues caused by data sample imbalance. Finally, they designed the RepVGG module in the Backbone, further enhancing the model’s detection accuracy [14]. Sun et al. introduced a ship target detection algorithm based on YOLOv5, achieving ship target detection in remote sensing images with complex backgrounds. Their method relied on an improved K-means clustering method, with the experimental results demonstrating that the enhanced network significantly improved performance on images with densely distributed small targets over the original YOLOv5 network at the expense of a slightly reduced detection speed [15]. Zeng et al. proposed YOLOv7 UAV, a real-time small object detection algorithm designed specifically for drone-captured aerial images. They removed the second sampling layer and the deepest detection head and replaced them with the DPSPPF module, which uses cascaded small-sized max pooling layers and depth-separable convolutions to extract feature information at different scales more effectively. The experimental results showed that the real-time detection speed of YOLOv7 UAV was 27% faster than that of YOLOv7 [16].

3. Method

The YOLO algorithm is currently one of the most widely used object detection and recognition algorithms, adopting a single-stage deep learning approach [16]. Unlike standard two-stage object recognition algorithms, YOLO does not require selecting candidate regions for classification. Instead, it relies on a unified architecture that predicts bounding boxes and class probabilities simultaneously, allowing end-to-end object detection at high speed. Its main features are real-time performance and accuracy. YOLO transforms the object detection task into a regression problem, in which a single neural network simultaneously predicts object classes and bounding boxes. The core idea of YOLO is to divide the input image into a fixed-size grid and predict multiple bounding boxes within each grid cell using convolutional layers; each predicted bounding box carries the contained object’s class and position. Using a CNN for feature extraction and prediction allows YOLO to perform object detection and classification in a single forward pass.
The proposed SE-CBAM-YOLOv7 algorithm is introduced in this study for remote-sensing aircraft target detection, based on the YOLOv7 framework. The network model comprises the head, neck, and backbone components, as illustrated in Figure 1. Specifically, after resizing and normalizing the remote-sensing aircraft images, these are input into the proposed SE-CBAM-YOLOv7 network model. The backbone network first extracts feature information, which is input into the neck network for feature fusion, producing three feature maps of different sizes (large, medium, and small). Finally, the fused feature maps are processed by the head network, which has three detection heads that output the predicted bounding boxes and class information. The backbone network of SE-CBAM-YOLOv7 comprises convolution modules, cross-stage partial network block (CBS), an efficient layer aggregation network (ELAN) module, a mixed-precision convolution (MPConv) module, and an SPPCSPC module. This study introduces the SENet attention mechanism at the feature layer located at the backbone’s output to enhance feature extraction capability and increase the critical information of small aircraft targets on the feature map. The neck network of SE-CBAM-YOLOv7 adopts a path aggregation feature pyramid network (PAFPN) structure, performing feature fusion for the upsampling and downsampling parts. Furthermore, this study incorporates a lightweight CBAM network in the fusion module to comprehensively capture critical information on small aircraft targets in channel and spatial dimensions. The head network of SE-CBAM-YOLOv7 incorporates three sizes of identity detection (IDetect) detection heads, which detect and recognize small aircraft targets based on the critical feature information output by the neck network. The following sections detail the principles of network optimization.
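For illustration only, the following PyTorch-style sketch outlines this three-part composition (backbone for feature extraction, neck for fusion, three detection heads). The Backbone, Neck, and Head placeholders and their interfaces are assumptions made for exposition and do not reproduce the authors’ implementation.

```python
import torch
import torch.nn as nn

class SECBAMYOLOv7(nn.Module):
    """Structural sketch: backbone -> neck (PAFPN-style fusion) -> three detection heads."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, heads: nn.ModuleList):
        super().__init__()
        self.backbone = backbone  # feature extraction (Conv/ELAN/MPConv/SESPPCSPC as described above)
        self.neck = neck          # feature fusion with CBAMCAT layers, yielding three map sizes
        self.heads = heads        # three detection heads for large / medium / small feature maps

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)            # multi-scale feature maps from the backbone
        fused = self.neck(feats)            # three fused feature maps of different sizes
        # each head outputs predicted bounding boxes and class information at its scale
        return [head(f) for head, f in zip(self.heads, fused)]
```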

3.1. SEConv

As the number of network layers and learnable parameters grows, neural network models accumulate ever more information that must be stored during training, which leads to information overload. Typical solutions employ attention networks, which help the model focus on the information most relevant to the training task, reduce the attention given to less important information, and filter out irrelevant data. This strategy effectively mitigates information overload and improves the model’s classification accuracy in the later stages. Regarding SENet (squeeze-and-excitation network), it first performs global average pooling on the input features across the spatial dimensions. It then captures the inter-channel dependencies through a fully connected network structure, generating C weights between 0 and 1 via the Sigmoid function; the importance of each channel’s features is expressed by these weight coefficients. These weights are multiplied with the input feature signals, automatically reweighting each channel [17]. Figure 2 depicts SENet.
SENet comprises four steps. The first is the transformation $F_{tr}$, which maps the input $X$ to the feature map $U$; the calculation is shown in Equation (1):
$U_c = V_c * X$ (1)
where $X$ denotes the input feature map, $U \in \mathbb{R}^{H \times W \times C}$ denotes the transformed output feature map, $V_c$ denotes the parameters of the $c$-th filter, and $*$ denotes the convolution operation.
The squeeze operation then compresses the $H \times W \times C$ feature map into a $1 \times 1 \times C$ feature vector containing global information, defined in Equation (2):
$z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$ (2)
where $z_c$ is the $c$-th element of $z$ and $u_c$ is the $c$-th channel of $U$.
This is followed by the excitation operation, in which the features are adaptively recalibrated through two fully connected layers to obtain the weight coefficients $s$, as given in Equation (3):
$s = F_{ex}(z, W)$ (3)
Finally, the scale operation multiplies each channel of the feature map $U$ by its corresponding weight to obtain the final output of the SE module, $\tilde{X}$, as in Equation (4):
$\tilde{X}_c = F_{scale}(u_c, s_c)$ (4)
Figure 3 illustrates the proposed lightweight SEConv. The convolutional layer is followed by batch normalization, which alleviates the vanishing gradient problem to some extent, and then by a SiLU activation function. SiLU combines ReLU and the sigmoid function and can be regarded as a smooth ReLU; it avoids ReLU’s drawback of outputting zero for negative inputs and does not suffer from gradient dispersion. Finally, the SENet block is attached. The expression for SiLU is given in Equation (5):
$\mathrm{SiLU}(x) = x \cdot \sigma(x)$ (5)
where $\sigma(x)$ denotes the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$.
Using SEConv in place of the standard convolution Conv reduces the computational parameters of the network. Additionally, SEConv optimizes the small aircraft target features output by the ELAN module, thus accelerating the model’s detection speed.
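As a minimal illustration, the SE block and the SEConv unit described above (Conv → BatchNorm → SiLU → SENet) can be sketched in PyTorch as follows; the reduction ratio r = 16 and the exact hyperparameters are assumptions for exposition, not values taken from this paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pool -> two FC layers -> sigmoid channel weights."""
    def __init__(self, channels: int, reduction: int = 16):  # r = 16 is an assumed default
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                   # squeeze: H x W x C -> 1 x 1 x C
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),                                     # excitation: weights s in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * s                                          # scale: reweight each channel

class SEConv(nn.Module):
    """Conv -> BatchNorm -> SiLU -> SE, replacing the standard Conv block."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()                                  # SiLU(x) = x * sigmoid(x)
        self.se = SEBlock(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.se(self.act(self.bn(self.conv(x))))
```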

3.2. SESPPCSPC

YOLOv7 incorporates the SPPCSPC (spatial pyramid pooling and connected spatial pyramid convolution) module. SPPCSPC, which was initially presented in YOLOv5 [18], extracts image features and enhances target detection performance. This module combines the spatial pyramid pooling (SPP) and channel pyramid convolution (CSPC) techniques. SPP performs pooling on feature maps at different scales to capture contextual information about targets of various sizes, thereby improving detection accuracy. It does not change the size of the feature map; instead, it pools the feature map with kernels of different sizes and concatenates the pooled results into a fixed-length feature vector. This retains multi-scale information and enables the network to better adapt to objects of different sizes. CSPC performs pyramid convolution along the channel dimension, introducing additional nonlinear transformations to extract richer feature representations. It also splits the input feature map into two parts, applies pyramid convolution to one part, and then concatenates it with the other, which increases the network’s width and enhances its expressive capability. The SPPCSPC module combines SPP and CSPC to improve target detection performance: the input feature map is first sent to the SPP layer for spatial pyramid pooling, yielding a fixed-length feature vector; this vector then undergoes several convolution operations, including channel pyramid convolution, to extract more discriminative feature representations; finally, the resulting feature map is used for the target detection task.
This paper introduces the SENet attention mechanism into this module, as depicted in Figure 4. In addition, adding parallel maximum pooling (MaxPool) layers to the continuous convolution layers partially addresses the image distortion caused by image preprocessing. MaxPool selects the maximum value in each region of an image or feature map, reducing the dimensionality of the data while retaining the most important features and helping to prevent overfitting. Instead of computing detail at the pixel level, it scans the input with a fixed-size window and takes the maximum value, and it is commonly used in feature extraction and image processing. The resulting SESPPCSPC module performs multi-scale feature fusion on small aircraft target images and fine-tunes the features at each scale, effectively capturing information at different scales. This strategy significantly enhances the model’s ability to detect small objects.
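The following PyTorch sketch shows one plausible SESPPCSPC layout consistent with the description above, using a CSP-style split, parallel MaxPool branches, and an SE block on the pooled branch. The pooling kernel sizes (5, 9, 13), channel widths, and exact placement of the SE block are assumptions, not the authors’ exact design.

```python
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out, k=1, s=1):
    """Convolution followed by batch normalization and SiLU activation."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class SEBlock(nn.Module):
    """Compact SE block (same idea as the Section 3.1 sketch), using 1x1 convolutions."""
    def __init__(self, c, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // r, c, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.fc(x)

class SESPPCSPC(nn.Module):
    """SPPCSPC-style block with an SE attention stage on the pooled branch (illustrative)."""
    def __init__(self, c_in: int, c_out: int, pool_sizes=(5, 9, 13)):
        super().__init__()
        c_hid = c_out // 2
        self.branch_a = nn.Sequential(                      # branch carrying the pyramid pooling
            conv_bn_silu(c_in, c_hid), conv_bn_silu(c_hid, c_hid, 3), conv_bn_silu(c_hid, c_hid)
        )
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes]
        )
        self.se = SEBlock(c_hid * (len(pool_sizes) + 1))    # SE over the concatenated pooled maps
        self.fuse_a = nn.Sequential(
            conv_bn_silu(c_hid * (len(pool_sizes) + 1), c_hid), conv_bn_silu(c_hid, c_hid, 3)
        )
        self.branch_b = conv_bn_silu(c_in, c_hid)           # CSP shortcut branch
        self.fuse = conv_bn_silu(2 * c_hid, c_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.branch_a(x)
        a = torch.cat([a] + [p(a) for p in self.pools], dim=1)  # multi-scale pooling, size kept
        a = self.fuse_a(self.se(a))
        b = self.branch_b(x)
        return self.fuse(torch.cat([a, b], dim=1))
```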

3.3. CBAMCAT

The CBAM network [19] is a lightweight attention mechanism network comprising the channel attention module (CAM) and the spatial attention module (SAM). These modules extract attention maps from the input feature map in the channel and spatial dimensions, respectively, which are then multiplied with the input feature map for adaptive feature refinement; the structure is presented in Figure 5. Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, CBAM sequentially infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{1 \times 1 \times C}$ and a two-dimensional spatial attention map $M_s \in \mathbb{R}^{H \times W \times 1}$. The whole attention process can be summarized as:
$X' = M_c(X) \otimes X,$
$X'' = M_s(X') \otimes X'$
where $\otimes$ denotes element-wise multiplication, $X'$ is the channel-refined intermediate result, and $X''$ is the final output.
The innovation of CAM lies in the use of average-pooling and max-pooling operations to aggregate the spatial information of feature maps. The CAM sub-module compresses the input feature map along the spatial dimension, performing max pooling and average pooling, and feeds the results into a shared network to form the channel attention map, which is then multiplied with the feature map to highlight the essential target features. The shared network consists of a multilayer perceptron (MLP). The calculation is given in Equation (7):
$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$ (7)
where F represents the input feature map, AvgPool denotes global average pooling, MaxPool denotes max pooling, MLP stands for multi-layer perceptron, and σ is the Sigmoid activation function.
SAM aggregates the channel information of the feature map through two pooling operations to generate two 2D maps. The SAM sub-module takes the channel-refined feature map as input, compresses it along the channel dimension with max pooling and average pooling, concatenates the results along the channel axis, and applies a 7 × 7 convolution to form the spatial attention map, which is then multiplied with the feature map to emphasize important positional information about the target. The calculation is given in Equation (8):
$M_s(F) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]))$ (8)
where $f^{7 \times 7}$ denotes a convolution with a 7 × 7 kernel.
This study integrates the CBAM attention mechanism into the fusion layer of the neck network in the proposed SE-CBAM-YOLOv7, replacing the ordinary concatenation (CAT) layer with a CBAMCAT layer. This integration suppresses complex background noise and optimizes and fuses critical features of small aircraft targets in both channel and spatial dimensions.
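For illustration, a compact PyTorch sketch of CBAM following Equations (7) and (8) is given below, together with a hypothetical CBAMCAT layer that concatenates incoming feature maps and refines the result with CBAM. The fusion ordering (concatenate first, then apply CBAM) and the reduction ratio are assumptions made for this sketch, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), as in Equation (7)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                        # shared MLP applied to both pooled vectors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(f7x7([AvgPool(F); MaxPool(F)])), as in Equation (8)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)         # channel-wise average pooling
        mx, _ = torch.max(x, dim=1, keepdim=True)        # channel-wise max pooling
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """X' = M_c(X) * X, then X'' = M_s(X') * X'."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

class CBAMCAT(nn.Module):
    """Hypothetical fusion layer: concatenate along channels, then refine with CBAM."""
    def __init__(self, total_channels: int):
        super().__init__()
        self.cbam = CBAM(total_channels)

    def forward(self, feats):                            # feats: list of maps with equal H x W
        return self.cbam(torch.cat(feats, dim=1))
```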

4. Experiments

4.1. Experimental Data

The data used in this study were sourced from publicly available satellite remote sensing datasets, comprising 1012 images with a resolution of 1283 × 521 pixels. During the experiment, the data were split into a training set and a testing set in a ratio of 7:3.

4.2. Space-Based Intelligent Processing Platform

In line with industry trends and the evolution of onboard architectures, the satellite-borne intelligent computing platform in this paper adopts a super-heterogeneous architecture comprising an ECU unit (containing an FPGA module) and an SCC unit (containing a GPU module). The platform is connected to a satellite data simulator, executes algorithms based on the mission commands and data transmitted by the simulator, and transmits the processing results back. The platform adopts the ZYNQ chip, as shown in Figure 6a, and the prototype of the satellite-borne intelligent computing platform is shown in Figure 6b. Table 1 lists the configuration parameters of the satellite-borne intelligent computing platform.
The ECU unit is mainly composed of an FPGA SoC containing an FPGA unit and an ARM core. The FPGA unit is mainly responsible for interface adaptation and for receiving external instructions and data, while the ARM side is mainly responsible for the functional management of the payload and can flexibly schedule functions based on the task instructions.
The SCC unit mainly consists of a GPU SoC, which contains an ARM core and a GPU processing unit. The ARM side mainly carries out task management and can schedule different algorithm models according to different task requirements to serve different applications. As the main processing unit, the GPU provides a large number of general computing and AI computing algorithms to achieve efficient general image processing and AI processing. With the GPU’s highly parallel processing architecture and mature GPU acceleration tool libraries, fast algorithm processing can be achieved with low energy consumption; the SE-CBAM-YOLOv7 algorithm was developed and deployed on this unit.

4.3. Ground Link Experiment System

The flowchart of the ground link experiment in this study is shown in Figure 7, and the experimental platform is shown in Figure 8. The experimental platform includes four main components: a satellite ground simulator, a satellite data simulator, a satellite-borne intelligent computing platform prototype, and an upper computer. The satellite data simulator simulates the camera payload of a real satellite platform, generating data that mimic the output of onboard cameras and transmitting these data to the computing payload. The satellite-borne intelligent computing platform prototype comprises an ECU (electronic control unit) and an SCC (satellite control center). The intelligent computing platform connects to the satellite data simulator, executes algorithms based on the task instructions and data transmitted by the simulator, and then returns the processed results. The upper computer connects to the satellite data simulator, simulating a ground station’s command, control, and scheduling functions. Table 2 reports the hardware setup used in the satellite data simulator. The input image size was set to 640 × 640 pixels, and the model was trained for 200 epochs with a batch size of 16 and an initial learning rate of 0.01.
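For reference, the training settings stated above can be collected into a simple configuration dictionary; this is an illustrative summary only, and any settings not reported in the paper (optimizer, augmentation, learning-rate schedule) are deliberately omitted.

```python
# Training configuration stated in Section 4.3; all other hyperparameters are
# unspecified in the paper and therefore not listed here.
train_cfg = {
    "img_size": (640, 640),   # input images resized to 640 x 640 pixels
    "epochs": 200,            # total training epochs
    "batch_size": 16,
    "lr0": 0.01,              # initial learning rate
}
```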

4.4. Evaluation Metrics

Several evaluation metrics were used to analyze and assess the model’s prediction results, thus determining the performance of various models from multiple perspectives. The performance of SE-CBAM-YOLOv7 was evaluated based on Precision [20], Recall [21], F1-Score [22], and mean average precision (mAP) [23,24]. All models were evaluated on the same training and testing datasets.
Precision is the ratio of correctly predicted positive samples to the total predicted positive samples, as formulated in Equation (9). Here, TP (true positive) denotes the case where the actual and predicted values are both positive, and FP (false positive) denotes the case where the actual value is negative but the predicted value is positive.
Recall is the ratio of correctly predicted positive samples to the actual positive samples and is calculated as in Equation (10). Here, FN (false negative) denotes the case where the actual value is positive but the predicted value is negative.
The F1 score is a comprehensive evaluation metric that considers both precision and recall and is mathematically formulated in Equation (11).
The P-R curve (precision–recall curve) visualizes the relationship between precision and recall, with precision on the y-axis and recall on the x-axis. The area under the P-R curve is called the AP value (see Equation (12)). A higher AP value indicates better model performance.
The mAP (mean average precision) represents the mean of AP values across different categories and is calculated using Equation (13).
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (9)
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (10)
$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (11)
$\mathrm{AP} = \int_{0}^{1} P(R)\,dR$ (12)
$\mathrm{mAP} = \frac{1}{n}\sum_{i=1}^{n} AP_i$ (13)
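As an illustrative sketch (not the evaluation code used in this study), the metrics defined in Equations (9)–(13) can be computed as follows; trapezoidal integration of the P-R curve is one simple way to approximate the integral in Equation (12).

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Equations (9)-(11): precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Equation (12): area under the P-R curve, integrated over recall in [0, 1]."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    order = np.argsort(recalls)                      # sort points by increasing recall
    r = np.concatenate(([0.0], recalls[order], [1.0]))
    p = np.concatenate(([precisions[order][0]], precisions[order], [0.0]))
    return float(np.trapz(p, r))                     # numerical integral of P(R) dR

def mean_average_precision(ap_per_class):
    """Equation (13): mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```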

4.5. Experimental Results and Analysis

During network operation, achieving high precision and fast convergence is crucial for ensuring the robustness and stability of the model. In this experiment, we trained and tested YOLOv7, SE-YOLOv7, and SE-CBAM-YOLOv7 models for 200 epochs to compare their convergence curves. Figure 9 illustrates the convergence curves of bounding box losses and target detection losses for these three models. The vertical axis represents the loss value during network training, while the horizontal axis represents the iteration rounds of the network. The experimental results demonstrate that all three models initially exhibited higher loss values in the early training stages.
However, within the first 25 epochs, the loss values decreased rapidly across all models, indicating effective learning from the training data. As training progressed, the network loss values continued to decrease, signifying an improved fit to the training data. Notably, among these models, SE-CBAM-YOLOv7 maintained low loss values while converging faster, showing superior robustness and stability.
The comparison of test results in Table 3 reveals that the SE-CBAM-YOLOv7 model performs best in identifying small aircraft targets. Specifically, the SE-CBAM-YOLOv7 model achieved a recognition accuracy 0.5% higher than that of YOLOv7, a recall rate 0.5% higher than that of YOLOv7, and an mAP value 1.7% higher than that of YOLOv7. By independently learning weight coefficients through the SENet and CBAM attention mechanism networks, our approach employs dynamic weighting to enhance crucial features while suppressing irrelevant feature information. This targeted learning strategy strengthens the deep learning model’s ability to recognize small targets and improves its recognition sensitivity under complex backgrounds in remote sensing aircraft detection missions, demonstrating the algorithm’s efficiency and reliability and providing robust technical support for future missions.
During model training, we saved the weights of the best-performing model. When validating the algorithm on the space-based intelligent processing platform, we used the optimal model weights obtained during training to detect 23 remote sensing images containing small aircraft targets. The entire detection process took 3.3 s, for an average detection time of 0.14 s per image, demonstrating excellent real-time performance.

5. Discussion

The output results of the YOLOv7 model on the test set are presented in Figure 10, while Figure 11 illustrates the output results of the SE-CBAM-YOLOv7 model on the same test set. As depicted in Figure 10b, when confronted with a complex background, the YOLOv7 model struggles to recognize small aircraft targets. This difficulty primarily stems from inadequate feature extraction of small aircraft targets in the early stages of the model, resulting in missed detections. For instance, as shown in Figure 10c, buildings are erroneously identified as small aircraft targets. Furthermore, in Figure 10d, a raised portion of a mountain is also misclassified as a small aircraft target. These false detections indicate that the YOLOv7 model has limited capability to handle complex backgrounds and detect small targets.
In contrast, the SE-CBAM-YOLOv7 model exhibits significant advantages in target detection by accurately identifying smaller aircraft objects with higher sensitivity. This superiority can be primarily attributed to the incorporation of SENet and CBAM attention mechanism networks, which effectively enhance the model’s capability for recognizing low-density targets. Specifically, SENet and CBAM augment feature extraction, enabling the model to focus more precisely on key features within small targets and complex backgrounds, thereby reducing instances of missed detections and false positives. Overall, these findings demonstrate that introducing advanced attention mechanisms can substantially improve the performance of object detection models, rendering them more reliable and accurate for practical applications.

6. Conclusions

This paper proposes the SE-CBAM-YOLOv7 optimization algorithm to facilitate the real-time detection of small aircraft targets in remote sensing videos with complex backgrounds. Specifically, we introduce SENet, a lightweight attention mechanism, and design the SESPPCSPC module to improve the model’s efficiency in feature extraction. Additionally, the hybrid attention mechanism CBAM is introduced and the CBAMCAT module is designed to effectively suppress complex background noise and enhance the model’s ability to integrate important feature information of small aircraft targets. The SE-CBAM-YOLOv7 model was tested on remote sensing datasets, and its detection accuracy reached 91.2%. This lays the algorithmic foundation for subsequent deployment in satellite missions.

Author Contributions

Formal analysis, Y.L.; investigation, Z.K. and Z.L.; software, S.D. and H.L.; validation, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Du, C.; Cui, J.; Wang, D.; Li, G.; Lu, H.; Tian, Z.; Zhao, C.; Li, M.; Zhang, L. Prediction of aquatic vegetation growth under ecological recharge based on machine learning and remote sensing. J. Clean. Prod. 2024, 452, 142054. [Google Scholar] [CrossRef]
  2. Yang, F.; Men, X.; Liu, Y.; Mao, H.; Wang, Y.; Wang, L.; Zhou, X.; Niu, C.; Xie, X. Estimation of Landslide and Mudslide Susceptibility with Multi-Modal Remote Sensing Data and Semantics: The Case of Yunnan Mountain Area. Land 2023, 12, 1949. [Google Scholar] [CrossRef]
  3. Braun, A.; Warth, G.; Bachofer, F.; Schultz, M.; Hochschild, V. Mapping Urban Structure Types Based on Remote Sensing Data—A Universal and Adaptable Framework for Spatial Analyses of Cities. Land 2023, 12, 1885. [Google Scholar] [CrossRef]
  4. Reyes, J.A.; Cowardin, H.M.; Velez-Reyes, M. Analysis of Spacecraft Materials Discrimination Using Color Indices for Remote Sensing for Space Situational Awareness. J. Astronaut. Sci. 2023, 70, 33. [Google Scholar] [CrossRef]
  5. Bai, L.; Ding, X.; Chang, L. Remote Sensing Target Detection Algorithm based on CBAM-YOLOv5. Front. Comput. Intell. Syst. 2023, 5, 12–15. [Google Scholar] [CrossRef]
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  7. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  9. Johnson, J.W. Adapting Mask-RCNN for Automatic Nucleus Segmentation. arXiv 2018, arXiv:1805.00500. [Google Scholar]
  10. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Amsterdam, The Netherlands, 2016; pp. 21–37. [Google Scholar]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  12. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  13. Zhang, Z.Y.; Liu, Y.P.; Liu, T.C.; Lin, Z.; Wang, S. DAGN: A real-time UAV remote sensing image vehicle detection framework. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1884–1888. [Google Scholar] [CrossRef]
  14. Liu, Z.; Gao, Y.; Du, Q. YOLO-Class: Detection and Classification of Aircraft Targets in Satellite Remote Sensing Images Based on YOLO-Extract. IEEE Access 2024, 11, 109179–109188. [Google Scholar] [CrossRef]
  15. Sun, X.M.; Zhang, Y.J.; Wang, H.; Du, Y.X. Research on ship detection of optical remote sensing image based on Yolo V5. J. Phys. Conf. Ser. 2022, 2215, 012027. [Google Scholar] [CrossRef]
  16. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  17. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  18. Huang, Y.; Fan, J.Y.; Hu, Y.; Guo, J.; Zhu, Y. TBi-YOLOv5: A surface defect detection model for crane wire with Bottleneck Transformer and small target detection layer. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2024, 238, 2425–2438. [Google Scholar] [CrossRef]
  19. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  20. Wang, C.; Bi, F.; Zhang, W.; Chen, L. An Intensity-Space Domain CFAR Method for Ship Detection in HR SAR Images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 529–533. [Google Scholar] [CrossRef]
  21. Ai, J.; Luo, Q.; Yang, X.; Yin, Z.; Xu, H. Outliers-Robust CFAR Detector of Gaussian Clutter Based on the Truncated-Maximum-Likelihood-Estimator in SAR Imagery. IEEE Trans. Intell. Transp. Syst. 2020, 21, 2039–2049. [Google Scholar] [CrossRef]
  22. Karvonen, J.; Gegiuc, A.; Niskanen, T.; Montonen, A.; Buus-Hinkler, J.; Rinne, E. Iceberg Detection in Dual-Polarized C-Band SAR Imagery by Segmentation and Nonparametric CFAR (SnP-CFAR). IEEE Trans. Geosci. Remote Sens. 2021, 60, 4300812. [Google Scholar] [CrossRef]
  23. Hou, X.; Ao, W.; Song, Q.; Lai, J.; Wang, H.; Xu, F. FUSAR-Ship: Building a high-resolution SAR-AIS matchup dataset of Gaofen-3 for ship detection and recognition. Sci. China Inf. Sci. 2020, 63, 140303. [Google Scholar] [CrossRef]
  24. Ao, W.; Xu, F.; Li, Y.; Wang, H. Detection and Discrimination of Ship Targets in Complex Background from Spaceborne ALOS-2 SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 536–550. [Google Scholar] [CrossRef]
Figure 1. SE-CBAM-YOLOv7 model structure.
Figure 2. SENet model.
Figure 3. SEConv structure.
Figure 4. SESPPCSPC structure.
Figure 5. (a) CBAM model, (b) CAM, (c) SAM.
Figure 6. On-board intelligent computing simulation platform. (a) ZYNQ Chip. (b) Satellite-borne intelligent computing platform prototype.
Figure 7. Ground link experiment flow chart.
Figure 8. Experimental platform; (a) satellite ground simulator; (b) satellite-borne intelligent computing platform prototype.
Figure 9. Convergence curves; (a) box loss; (b) object detection loss.
Figure 10. Test results of YOLOv7 in different scenarios.
Figure 11. Test results of SE-CBAM-YOLOv7 in different scenarios.
Table 1. Configuration parameters of the satellite-borne intelligent computing platform.
Basic parameters: volume 208 × 125 × 55 mm ± 5 mm; weight 1.5 kg ± 0.2 kg; power supply 28 ± 3 V.
ECU module (central control unit): chip ZYNQ 7100; main frequency 766 MHz (dual-core); RAM 512 MB × 2, DDR3, 1066 MHz; storage 32 GB eMMC × 2.
SCC module (central computing unit): chip Jetson AGX Xavier; main frequency CPU 2.0 GHz (8-core), GPU 1.2 GHz; RAM 32 GB, LPDDR4x, 136.5 GB/s; storage 1 TB SSD; computing power 30 TOPS.
Table 2. Configuration parameters of the satellite data simulator.
Basic parameters: volume 208 × 125 × 55 mm ± 5 mm; weight 1.5 kg ± 0.2 kg; power supply 28 ± 3 V.
OBC (on-board computing unit): chip ZYNQ 7100.
Storage module: storage 1 TB SSD.
Table 3. Experimental results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 (%)
YOLOv7 | 90.7 | 85.2 | 84.9 | 87.86
SE-YOLOv7 | 84.5 | 85.7 | 83.4 | 85.10
SE-CBAM-YOLOv7 | 91.2 | 85.7 | 86.6 | 88.36