1. Introduction
Military target detection technology is key to improving battlefield situation generation, reconnaissance, surveillance, and command decision-making, and is an essential factor in winning modern warfare. Real-time, accurate detection of battlefield targets helps commanders grasp the battlefield environment faster, search for and track enemy units, and understand the enemy's movements so as to seize the initiative and gain a dominant position in the war [1,2,3].
Influenced by the development of artificial intelligence, battlefield targets in modern combat are characterized by large data volumes, rapid changes, and strong camouflage [4,5]. Most traditional visual target detection technologies rely on hand-designed features, which makes it challenging to obtain target information comprehensively, quickly, and accurately in a complex battlefield environment.
Computer vision technology has become widely used in various industries, including video surveillance, drone piloting, and military intelligence analysis, owing to the rapid growth of deep learning [2]. Current deep learning-based target detection algorithms can be divided into candidate-box-based and regression-based algorithms. The former is represented by the Region-based Convolutional Neural Network (R-CNN) [6], Fast R-CNN [7], and Faster R-CNN [8]. The latter mainly includes the You Only Look Once (YOLO) series [9,10,11,12] and the SSD family [13,14,15]. To achieve higher detection accuracy, candidate-box-based algorithms first generate candidate boxes on the feature map and then refine them to obtain the detection result. However, they suffer from drawbacks such as high memory consumption and slow speed. Regression-based algorithms are end-to-end detection methods that obtain targets by direct regression on the feature map, so their detection speed is significantly higher, although their accuracy is slightly lower than that of candidate-box-based algorithms.
Several scholars have recently applied deep learning-based methods to military target detection. For instance, [16] proposed a neural network-based military vehicle detection method, attaining a recognition rate of 97.36%. In [17], the authors proposed an improved Fast R-CNN algorithm for small tank target detection. This algorithm is superior to the Faster R-CNN algorithm in detection speed and accuracy but suffers from missed detections for occluded targets. The work of [18] proposed a tank military robot with target detection and tracking functions, effectively improving combat capability on the battlefield. Reference [19] proposed a remote sensing image selection and searching method to solve the potential hot-spot detection problem in large-scale remote sensing images and improve the detection accuracy of overlapping targets; however, this method improves detection accuracy without considering the model's space complexity. In [20], the authors fully integrated polarization imaging and deep learning to quickly detect camouflaged artificial targets under normal and low illumination conditions. Reference [21] developed a new military target detection algorithm that introduced the GhostNet module to improve detection accuracy and speed and then improved the loss function to further enhance accuracy; however, the experimental results show that the model has about three times as many parameters as the YOLOv5 model. Furthermore, reference [22] addressed the defect of the DIOU loss when the centers of the bounding boxes coincide, which is conducive to the efficient deployment of detection algorithms in resource-constrained environments. Reference [23] proposed an armored target detection algorithm named GCD-YOLOv5 that utilizes a LIDAR array in complex environments; this algorithm has strong detection ability, but its network structure is complex and thus difficult to port to embedded terminals.
As the above research shows, network models continue to improve in performance, but the accompanying growth in parameters and computation restricts their deployment on resource-constrained weapons and equipment. To meet the requirements of military target detection under the limited hardware resources of weapon platforms, this paper proposes an improved YOLOv5 algorithm (SMCA-α-YOLOv5), which is evaluated and compared through ablation experiments. The results show that, compared with YOLOv5s, the mean average precision is increased by 1.9%, the number of model parameters is decreased by 85.7%, and the amount of computation is decreased by 95.9%. The main contributions of this paper can be summarized as follows:
The Stem block is used to replace the Focus module, and the multi-channel information fusion improves the feature expression ability, reducing the model’s parameters and computation complexity.
The coordinate attention module is embedded in the MobileNetV3 block structure to redesign the backbone network of YOLOv5. This strategy reduces the network’s parameters and computation complexity and improves its detection performance.
Considering the defects of the CIOU loss, we propose a power-parameter loss optimized by combining the EIOU loss and the Focal loss. The experiments show that it converges faster and has a lower regression error.
The remainder of this paper is organized as follows: Section 2 introduces the construction of the military target dataset. Section 3 introduces the work related to the YOLOv5s structure, the MobileNetV3 block module, the coordinate attention mechanism, and loss metrics in object detection. Section 4 introduces the improved YOLOv5 algorithm. Section 5 analyzes and discusses the experimental results. Finally, Section 6 presents the conclusion and future work.
4. Approach
This section details the improvement methods of YOLOv5, including the introduction of the Stem block, the design of the MNtV3-CA module, the optimization of the loss function, and the overall structure design of the network.
4.1. Introduction of Stem Block
Military target detection not only places higher requirements on detection accuracy and speed but is also constrained by the limited memory and computing resources of the weapon equipment platform. The Focus module of the YOLOv5 algorithm improves the model's detection speed to a certain extent but greatly increases the amount of computation and the number of parameters.
Therefore, it is very important to design a military target detection algorithm with a small memory footprint and low computation. To meet these requirements, this paper introduces the Stem block structure, shown in Figure 5. This structure has achieved good results in real-time detection algorithms on mobile devices, such as PELEE [45], PP-LCNet [46], and YOLO5Face [47]. The design of the Stem block is inspired by Inception-v4 and the Deeply Supervised Object Detector. By replacing the large convolution module with one of smaller computational cost and fewer parameters, the Stem block improves the feature expression ability with almost no increase in computation or parameters.
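To make the parameter argument concrete, the sketch below counts the weights of a Focus-style entry (space-to-depth slicing of a 3-channel image into 12 channels, followed by one 3×3 convolution) against a PeleeNet-style stem (a 3×3 stride-2 convolution, a two-branch 1×1/3×3 plus max-pool stage, and a 1×1 fusion convolution). The channel widths here are illustrative assumptions, not the exact configuration of YOLOv5s or of the paper's network.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a single k x k convolution layer."""
    return c_in * c_out * k * k + c_out

# Focus-style entry: slice the 3-channel input into 12 channels,
# then apply one 3x3 convolution (the slicing itself has no weights).
focus = conv_params(12, 32, 3)

# PeleeNet-style stem (illustrative widths): entry 3x3 stride-2 conv,
# a 1x1 -> 3x3 stride-2 branch concatenated with a parameter-free
# max-pool branch, then a 1x1 fusion conv over the 16 + 16 channels.
stem = (conv_params(3, 16, 3)      # entry 3x3 stride-2 conv
        + conv_params(16, 8, 1)    # branch: 1x1 bottleneck
        + conv_params(8, 16, 3)    # branch: 3x3 stride-2 conv
        + conv_params(32, 32, 1))  # 1x1 fusion after concatenation

print(focus, stem)  # the stem reaches the same output width with fewer weights
```

Under these assumed widths, the stem spends its budget on several small convolutions instead of one wide one, which is the design idea the text describes.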
4.2. MNtV3-CA Block Structure
The backbone network of the YOLOv5 algorithm adopts the traditional residual structure, which solves the network degradation problem caused by increasing the depth of the network structure and converges faster for the same number of network layers [48]. Residual networks have been widely used in deep neural networks, improving performance by increasing network depth. However, this substantially increases the number of network parameters, making the model difficult to train and hard to deploy on weapons platforms with limited computing power and memory resources. Therefore, this paper designs a lightweight MNtV3-CA structure to redesign the backbone network of the YOLOv5 algorithm, as shown in Figure 6. This structure is based on the MobileNetV3 block and integrates the lightweight coordinate attention module, enhancing the model's detection performance while keeping the network structure light.
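As a shape-level illustration of the coordinate attention computation (directional average pooling along height and width, a shared bottleneck transform, and two sigmoid-gated reweighting maps), the NumPy sketch below uses random matrices in place of the learned 1×1 convolutions; it shows the data flow only, not trained behavior, and the reduction ratio `r` is an illustrative choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, r=8, rng=None):
    """Sketch of coordinate attention on a (C, H, W) feature map.

    Random weights stand in for the learned 1x1 convolutions, so only
    shapes and data flow are meaningful here.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    c, h, w = x.shape
    mid = max(c // r, 4)
    # Directional pooling: encode positions along H and along W separately.
    z_h = x.mean(axis=2)                      # (C, H)
    z_w = x.mean(axis=1)                      # (C, W)
    z = np.concatenate([z_h, z_w], axis=1)    # (C, H + W)
    w1 = rng.standard_normal((mid, c)) * 0.1
    f = np.maximum(w1 @ z, 0.0)               # shared transform + ReLU
    f_h, f_w = f[:, :h], f[:, h:]             # split back into two directions
    wh = rng.standard_normal((c, mid)) * 0.1
    ww = rng.standard_normal((c, mid)) * 0.1
    a_h = sigmoid(wh @ f_h)                   # (C, H) attention along height
    a_w = sigmoid(ww @ f_w)                   # (C, W) attention along width
    # Reweight the input with both directional attention maps.
    return x * a_h[:, :, None] * a_w[:, None, :]
```

Because both attention maps lie in (0, 1), the module rescales each position by where it sits along the height and width axes, which is how positional information enters the channel reweighting.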
4.3. Optimization of Loss Function
The IOU is the most commonly used evaluation metric in the field of target detection; it measures the overlap between the target box and the predicted box. The formula is as follows:

$$\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ represents the area of the target box and $B$ represents the area of the predicted box.
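The definition above translates directly into code; the small function below computes the IOU of two axis-aligned boxes given in (x1, y1, x2, y2) form.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For two 2×2 boxes offset by one unit in each direction, the intersection is 1 and the union is 7, giving an IOU of 1/7.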
The YOLOv5 algorithm uses the CIOU loss [38], which considers three important geometric factors: the overlap between the predicted box and the target box, the distance between their center points, and the aspect ratio. Its disadvantage is that the aspect-ratio term $v$ in the formula only reflects the difference in aspect ratio rather than the true differences between the widths and heights, which can sometimes hinder optimization; moreover, the CIOU loss does not consider the balance of hard and easy samples [39].
To address the shortcomings of the CIOU loss, this paper introduces the EIOU loss [39], which improves on the CIOU loss by discarding the aspect-ratio penalty term and instead using the predicted width and height directly to guide the loss convergence. The EIOU loss is formulated as:

$$L_{\mathrm{EIOU}} = L_{\mathrm{IOU}} + L_{\mathrm{dis}} + L_{\mathrm{asp}} = 1 - \mathrm{IOU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{C_w^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_h^{2}} \quad (8)$$

where $C_w$ and $C_h$ are the width and height of the minimum enclosing rectangle of the predicted box and the target box, respectively, $c$ is the diagonal length of that rectangle, $\rho(\cdot)$ denotes the Euclidean distance, $b$ and $b^{gt}$ are the center points of the predicted box and the target box, respectively, $w$ and $h$ are the width and height of the predicted box, respectively, and $w^{gt}$ and $h^{gt}$ are the width and height of the target box, respectively.
Equation (8) reveals that the EIOU loss is divided into three parts: the IOU loss $L_{\mathrm{IOU}}$, the distance loss $L_{\mathrm{dis}}$, and the aspect loss $L_{\mathrm{asp}}$. The EIOU loss not only retains the characteristics of the CIOU loss but also directly reduces the difference between the width and height of the target box and the anchor box, affording faster model convergence and higher accuracy. Inspired by Alpha-IoU [40], this paper generalizes the EIOU loss to a loss function with power terms, defined as the α-EIOU loss and formulated as:

$$L_{\alpha\text{-}\mathrm{EIOU}} = 1 - \mathrm{IOU}^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + \frac{\rho^{2\alpha}(w, w^{gt})}{C_w^{2\alpha}} + \frac{\rho^{2\alpha}(h, h^{gt})}{C_h^{2\alpha}} \quad (9)$$

where α is the power parameter.
The Focal-EIOU loss cannot flexibly achieve different levels of bounding-box regression accuracy, while Alpha-IoU does not consider the balance of hard and easy samples. Therefore, this paper combines the Focal loss with the α-EIOU loss, using $\mathrm{IOU}^{\alpha\gamma}$ to weight the α-EIOU loss. This scheme is defined as the Focal-α-EIOU loss and is formulated as:

$$L_{\mathrm{Focal}\text{-}\alpha\text{-}\mathrm{EIOU}} = \mathrm{IOU}^{\alpha\gamma} \, L_{\alpha\text{-}\mathrm{EIOU}} \quad (10)$$

When α = 1, Equation (10) reduces to the Focal-EIOU loss $L_{\mathrm{Focal}\text{-}\mathrm{EIOU}} = \mathrm{IOU}^{\gamma} L_{\mathrm{EIOU}}$, where γ is a parameter that controls the degree of outlier suppression.
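Putting the pieces above together, the sketch below evaluates the Focal-α-EIOU loss for a single predicted/target box pair in (x1, y1, x2, y2) form, taking the focusing weight as IOU raised to αγ in line with Focal-EIOU's IOU^γ weighting. The α and γ defaults are illustrative placeholders, not the paper's tuned settings, and both boxes are assumed to have positive area and some overlap.

```python
def focal_alpha_eiou(pred, target, alpha=3.0, gamma=0.5):
    """Sketch of the Focal-alpha-EIOU loss for one box pair.

    Boxes are (x1, y1, x2, y2); alpha/gamma defaults are illustrative.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    # IOU of the two boxes.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union
    # Minimum enclosing rectangle: width, height, squared diagonal.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw * cw + ch * ch
    # Squared distance between the box centers.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    # Squared width and height differences.
    dw2 = ((px2 - px1) - (tx2 - tx1)) ** 2
    dh2 = ((py2 - py1) - (ty2 - ty1)) ** 2
    # alpha-EIOU: each EIOU term raised to the power alpha.
    l_alpha_eiou = (1 - iou ** alpha
                    + (rho2 / c2) ** alpha
                    + (dw2 / cw ** 2) ** alpha
                    + (dh2 / ch ** 2) ** alpha)
    # Focal weighting: low-IOU (easy negative) pairs are down-weighted.
    return iou ** (alpha * gamma) * l_alpha_eiou
```

A perfectly aligned pair yields a loss of zero, and any center offset or size mismatch produces a positive loss, matching the behavior Equations (8)–(10) describe.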
In summary, the proposed Focal-α-EIOU loss has the following advantages: (1) adjusting α gives the detector more flexibility to achieve different levels of box regression accuracy; (2) it considers the balance of hard and easy samples; and (3) it achieves a lower regression loss and a faster convergence speed.
4.4. Network Structure of SMCA-α-YOLOv5
Regarding the SMCA-α-YOLOv5 network structure, Section 4.1 introduced the Stem block, and Section 4.2 presented the MNtV3-CA block, which is used to build the backbone network of YOLOv5. Additionally, the second to twelfth layers of the MobileNetV3-Small [34] specification are used for reference. Finally, the loss function is optimized; the improved structure is illustrated in Figure 7.
6. Conclusions
Aiming at the difficulty of deploying military target detection algorithms on embedded platforms with limited resources, a lightweight military target detection method based on an improved YOLOv5 is proposed. This method redesigns the backbone network of YOLOv5 by introducing the Stem block and the MobileNetV3 block to reduce the number of parameters and the amount of computation of the model. To further improve the feature expression ability of the network, a coordinate attention module is embedded in the MobileNetV3 block structure, which improves the model's detection performance for military targets. Based on the EIOU loss and the Focal loss, a loss with a power parameter α is designed to optimize the CIOU loss, which gives the detector more flexibility and achieves different levels of bounding-box regression accuracy. The experimental results show that the proposed algorithm ensures real-time performance and detection accuracy and can meet the needs of military target detection under the limited resources of weapon equipment platforms.
The experimental results also show that the average inference time of the proposed algorithm has increased; the next step is to use a pruning algorithm to compress the backbone network composed of the Stem block and the MNtV3-CA block to improve the average detection speed. At the same time, the algorithm will be deployed on embedded devices with limited hardware resources to verify its applicability.