1. Introduction
Optical imaging detection technology has become an integral component of many advanced weapon systems, especially in military applications. In the United States it plays a critical role in missile defense systems, early warning detection systems, and ground-based midcourse defense systems with kinetic energy interceptors, and it is also a key technology in the THAAD (Terminal High Altitude Area Defense) system, air defense missiles, and air-to-air missiles. These weapon systems rely heavily on object detection algorithms, which process imaging data to determine whether a target is present within the detection field and where it is located [1,2,3]. As such, optical imaging detection technology forms the backbone of many high-performance guidance systems, ensuring precise and reliable identification of threats in dynamic environments.
The increasing complexity of modern battlefield environments, however, has created new challenges for optical terminal guidance systems, particularly those used in precision-guided weapons. The complex environment refers to the confluence of various electromagnetic activities, natural phenomena, and multi-target scenarios that can disrupt or degrade the ability of optical imaging systems to accurately detect and track targets. These factors are as follows: (1) Electromagnetic Environmental Elements: Electromagnetic interference can degrade the signal quality and increase the difficulty of target identification. (2) Natural Environmental Elements: Adverse weather conditions such as fog, rain, or dust storms can obscure the imaging sensor’s view, negatively impacting detection accuracy. (3) Multi-target Environmental Elements: The presence of multiple targets within a scene can create confusion, leading to misidentification or misclassification.
These environmental factors place significant demands on optical terminal guidance systems and impact their ability to track and engage targets effectively. This, in turn, influences the overall combat effectiveness of optical imaging-based precision-guided systems [4].
Traditional object detection and recognition methods were designed to address some of these challenges. Typically, these methods follow a multi-stage process: generating candidate regions, extracting features, classifying regions, and performing post-processing to refine detection results. However, these traditional approaches are limited by several factors, particularly their reliance on manually designed low-level visual features. These features are often handcrafted based on empirical knowledge and are typically limited to specific categories, rendering them less versatile in handling complex object transformations, occlusions, or variations across different scenarios. The lack of deep semantic understanding and the inability to generalize across diverse environments significantly hinder the performance of traditional models.
The introduction of convolutional neural networks (CNNs) has revolutionized the field of object detection by enabling models to learn hierarchical feature representations directly from raw input data. CNN-based approaches have significantly outperformed traditional methods by eliminating the need for handcrafted features and providing more accurate and robust detection capabilities. Among these CNN-based models, the YOLO (You Only Look Once) series has emerged as one of the most popular and effective solutions for real-time object detection [5]. YOLO combines high detection accuracy with impressive speed, making it ideal for real-time applications such as autonomous vehicles, surveillance systems, and military guidance systems. Since its inception in 2016, YOLO has undergone several iterations, from YOLOv1 to YOLOv11 [6,7,8,9], with each version introducing improvements in speed and accuracy. YOLOv7, introduced in 2022, surpassed the other real-time object detectors of its time by achieving an unprecedented balance between speed and accuracy: at 30 FPS (frames per second) on a V100 GPU, it reached 56.8% AP (average precision), outperforming other real-time detectors in both detection accuracy and processing speed.
Despite its superior performance on high-performance GPUs, YOLOv7 faces challenges when deployed on resource-constrained platforms such as smartphones, drones, and embedded systems. These devices often lack the computational power and memory capacity needed to run large, complex models efficiently. To address this issue, various model compression and acceleration techniques have been developed to make deep learning models more suitable for deployment on these platforms [10,11,12,13]. Common methods for accelerating neural networks include lightweight networks, weight quantization [14], neural network pruning [15], and knowledge distillation [16].
Weight quantization is a widely used technique in model compression, in which the precision of the model's weights is reduced to lower-bit representations [17,18]. This approach significantly reduces storage requirements and accelerates computation, but it often comes at the cost of reduced accuracy. This is particularly problematic for object detection models, where small errors in weight quantization can result in large deviations in the predicted bounding box coordinates, affecting the overall detection quality. Binary and ternary quantization have been explored as means to further reduce the model size, but these approaches often introduce substantial accuracy losses, especially in the detection of small or occluded objects [19].
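As a concrete illustration of the general idea (not the specific quantization schemes of the works cited above), the following minimal PyTorch sketch applies symmetric per-tensor 8-bit quantization to a weight tensor and shows the round-trip error that underlies the accuracy loss discussed here; the function names are hypothetical.

```python
import torch

def quantize_weights_int8(w: torch.Tensor):
    """Symmetric per-tensor 8-bit quantization (illustrative sketch)."""
    scale = w.abs().max() / 127.0                      # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_weights(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Approximate reconstruction of the original weights."""
    return q.float() * scale

w = torch.randn(64, 32, 3, 3)                          # e.g., a convolutional weight tensor
q, scale = quantize_weights_int8(w)
w_hat = dequantize_weights(q, scale)
print((w - w_hat).abs().max())                         # quantization error that can shift box predictions
```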
Neural network pruning, another common technique, involves removing less important weights or entire neurons from the model to reduce its size. While pruning can effectively reduce a model's complexity, it also poses challenges when it comes to maintaining the model's performance, particularly when large portions of the network are removed. Song Han's pruning method [20], which combines pruning with weight sharing and Huffman encoding, has been proposed to address this challenge by further compressing the model while attempting to retain its predictive power [21,22,23]. However, as the network grows more complex, the implementation of such techniques becomes increasingly difficult. There are also studies that use early exit for acceleration [24,25].
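For intuition, the sketch below applies simple global magnitude pruning to a PyTorch module, zeroing out the smallest-magnitude weights; it is only a schematic illustration of the pruning idea, not the full pipeline (weight sharing plus Huffman coding) of [20].

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude conv/linear weights (illustrative sketch)."""
    weights = [m.weight.data for m in model.modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))]
    all_w = torch.cat([w.abs().flatten() for w in weights])
    threshold = torch.quantile(all_w, sparsity)        # global magnitude threshold
    for w in weights:
        w.mul_((w.abs() > threshold).float())          # in-place masking of small weights

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
magnitude_prune(model, sparsity=0.7)
```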
In recent years, lightweight networks have emerged as a promising solution for model acceleration, especially on mobile and embedded devices. Lightweight networks such as MobileNet [14], SqueezeNet [15], and ShuffleNet [16,26] have been specifically designed to reduce the number of parameters and the computational cost without sacrificing model accuracy. These networks utilize novel architectural techniques such as depthwise separable convolutions, which significantly reduce the number of operations required for convolutional layers while maintaining high performance. Depthwise separable convolutions decompose the standard convolution operation into two steps: a depthwise convolution followed by a pointwise convolution, which reduces both the amount of computation and the memory usage. This makes lightweight networks particularly effective for real-time object detection on devices with limited resources.
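As a minimal PyTorch sketch of this factorization (assuming a 3×3 kernel and ignoring normalization and activation layers), a standard convolution can be replaced by a depthwise convolution followed by a 1×1 pointwise convolution:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Weight-count comparison for 128 -> 256 channels with a 3x3 kernel:
#   standard convolution:   128 * 256 * 3 * 3 = 294,912 weights
#   depthwise separable:    128 * 3 * 3 + 128 * 256 = 33,920 weights
```

The roughly 8–9× reduction in weights for this single layer illustrates why depthwise separable convolutions are attractive on resource-constrained hardware.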
In this paper, we explore the application of lightweight network principles to the YOLOv7 object detection model to improve its performance on resource-constrained hardware. By incorporating depthwise separable convolutions into the YOLOv7 architecture, we aim to accelerate detection speeds without sacrificing detection accuracy. We validate the effectiveness of this acceleration method using a self-built dataset, demonstrating that the modified YOLOv7 model can achieve real-time performance while maintaining competitive accuracy.
The remainder of this manuscript is organized as follows. Section 2 presents related works, describing the YOLOv7 and ShuffleNet architectures. Section 3 introduces the enhancements made through depthwise separable convolutions. Section 4 describes the experimental setup, including the dataset, evaluation metrics, and the results obtained. Section 5 provides a discussion of the results, interpreting the findings and comparing them with existing models. Finally, Section 6 concludes the paper with a summary of the contributions, limitations, and directions for future work.
2. Related Works
2.1. YOLOv7
YOLOv7 is a one-stage deep neural network model for object detection. Compared to two-stage detection models such as the Faster R-CNN series [7], YOLOv7 achieves a better balance between detection speed and accuracy. The backbone network structure of YOLOv7 is shown in Figure 1.
The backbone network of the YOLO series models extracts features from images at different levels. In the YOLOv7 backbone network, C1, C2, and C3 are obtained as downsampled feature maps at 8×, 16×, and 32×, respectively. Among these, C1 represents the shallow features of the image, which contain more spatial information about the target, while C3 represents the deep features, which contain more semantic information. After feature extraction by the backbone network, YOLOv7 uses the PaFPN (Path Aggregation Feature Pyramid Network) structure and coupled heads to output the predicted values of the target.
For each anchor, the predicted output of YOLOv7 includes the four bounding-box values (x, y, w, h), the target confidence value c, and the category prediction values.
The loss function consists of three components: the bounding-box (IoU, Intersection over Union) loss of the predicted box, the confidence loss, and the category loss.
Equation (1) is the loss value of the prediction box:
$$L_{\mathrm{box}}=\sum_{i=0}^{K\times K}\sum_{j=0}^{M}\mathbb{1}_{ij}^{\mathrm{obj}}\left[1-\mathrm{IoU}\!\left(b_{ij},\hat{b}_{ij}\right)\right]\quad(1)$$
In Equation (1), $b_{ij}$ represents the predicted box and $\hat{b}_{ij}$ represents the true label value. $\mathbb{1}_{ij}^{\mathrm{obj}}$ is the indicator function, which equals 1 when the $j$-th anchor of the $i$-th grid cell is assigned to a target and 0 otherwise; $K$ represents the grid size, and $M$ represents the number of anchors per grid.
Equation (2) is the category loss:
$$L_{\mathrm{cls}}=-\sum_{i=0}^{K\times K}\sum_{j=0}^{M}\mathbb{1}_{ij}^{\mathrm{obj}}\sum_{c\in\mathrm{classes}}\left[\hat{p}_{ij}(c)\log p_{ij}(c)+\left(1-\hat{p}_{ij}(c)\right)\log\left(1-p_{ij}(c)\right)\right]\quad(2)$$
In Equation (2), $p_{ij}(c)$ represents the predicted category probability value and $\hat{p}_{ij}(c)$ represents the true label value.
Equation (3) is the confidence loss:
$$L_{\mathrm{obj}}=-\sum_{i=0}^{K\times K}\sum_{j=0}^{M}\left[\hat{c}_{ij}\log c_{ij}+\left(1-\hat{c}_{ij}\right)\log\left(1-c_{ij}\right)\right]\quad(3)$$
In Equation (3), $c_{ij}$ represents the predicted confidence value and $\hat{c}_{ij}$ represents the true label value.
The loss function value of YOLOv7 is obtained by summing the three parts, as shown in Equation (4):
$$L=L_{\mathrm{box}}+L_{\mathrm{obj}}+L_{\mathrm{cls}}\quad(4)$$
The loss function represents the distance between the predicted values of the network and the true target values. Using the loss function as a feedback signal, the loss is backpropagated through an optimizer to update the weights of each layer and train the network. The training process of the neural network model is shown in Figure 2.
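To make the role of the loss as a feedback signal concrete, the following PyTorch sketch shows one generic training step in which the three loss terms of Equations (1)–(3) are summed as in Equation (4) and backpropagated through an optimizer; the model, optimizer, and `compute_losses` objects are placeholders supplied by the caller, not the actual YOLOv7 training code.

```python
import torch

def train_step(model, optimizer, compute_losses, images, targets) -> float:
    """One generic training step: sum the three loss terms and backpropagate."""
    loss_box, loss_obj, loss_cls = compute_losses(model(images), targets)
    loss = loss_box + loss_obj + loss_cls       # Equation (4)
    optimizer.zero_grad()
    loss.backward()                             # loss acts as the feedback signal
    optimizer.step()                            # update the weights of each layer
    return loss.item()

# Hypothetical wiring:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
# loss_value = train_step(model, optimizer, yolo_losses, images, targets)
```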
2.2. Lightweight Network ShuffleNet V2
ShuffleNet v2 is a lightweight network [26]. To reduce the number of parameters and improve computational efficiency, the design of the network structure adheres to the following principles: (1) When the number of input and output channels of a convolutional layer is equal, memory access cost is minimized and the model runs at its fastest. (2) Excessive use of group convolution increases memory access cost, so the number of groups should not be made too large. (3) An increase in the number of network branches reduces the degree of parallelism and thus computational efficiency, so the number of branches in the model needs to be minimized. (4) Although element-wise operations require few parameters and computations, they incur non-negligible memory access cost at runtime, so these operations should be reduced.
Based on these principles, the basic structural units of ShuffleNet v2 are as shown in Figure 3. When ShuffleNet v2 requires downsampling and doubling of the number of channels, the channel split operation is removed, and downsampling is achieved using a depthwise separable convolution (DWConv) with a stride of 2. This variant is shown as Unit 2 in Figure 3.
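The sketch below is a simplified PyTorch rendering of the two units described above (channel split plus channel shuffle in the stride-1 unit, and a stride-2 depthwise branch on both paths for downsampling); it omits batch normalization, activations, and other details of the original implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across groups so information mixes between branches."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleUnit(nn.Module):
    """Simplified ShuffleNet v2 unit (BN and activations omitted for brevity)."""
    def __init__(self, channels: int, stride: int = 1):
        super().__init__()
        self.stride = stride
        branch_ch = channels // 2 if stride == 1 else channels
        self.branch = nn.Sequential(
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
            nn.Conv2d(branch_ch, branch_ch, 3, stride=stride, padding=1,
                      groups=branch_ch, bias=False),      # depthwise
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
        )
        if stride == 2:  # Unit 2: the shortcut also downsamples, channels double
            self.shortcut = nn.Sequential(
                nn.Conv2d(channels, channels, 3, stride=2, padding=1,
                          groups=channels, bias=False),   # depthwise, stride 2
                nn.Conv2d(channels, channels, 1, bias=False),
            )

    def forward(self, x):
        if self.stride == 1:                  # Unit 1: split, process, shuffle
            x1, x2 = x.chunk(2, dim=1)
            out = torch.cat((x1, self.branch(x2)), dim=1)
        else:                                 # Unit 2: no split, both branches stride 2
            out = torch.cat((self.shortcut(x), self.branch(x)), dim=1)
        return channel_shuffle(out, groups=2)
```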
Adopting this lightweight design concept, we use the ShuffleNet v2 convolutional neural network as the feature extraction (backbone) network to build a lightweight version of the YOLOv7 object detection model.
3. Method of Acceleration
In deep neural network models, several factors affect the accuracy, including the network’s downsampling rate, depth, and receptive field (RF). A lightweight model, such as one that uses ShuffleNet v2 as the backbone, reduces the depth of the network, which can decrease detection accuracy. While increasing the depth of the network and reducing the downsampling rate can improve accuracy, these modifications also increase computational complexity, which hinders the acceleration of the object detection model.
To address this, our method focuses on improving the receptive field of the network without significantly increasing complexity. The receptive field is a crucial factor in object detection because it defines the area of the image that influences each prediction. By enlarging the receptive field, the network can better capture the content of the entire image and understand the context surrounding the object. This improvement allows the model to make more informed predictions, especially in challenging environments with cluttered or occluded objects. In essence, the larger receptive field increases the number of connections between the feature extraction points and the input pixels, improving the model’s ability to detect objects even in complex scenarios. The receptive field of a convolutional neural network is primarily influenced by the convolutional layers and the downsampling layers, as described in Equations (5) and (6) for calculating the receptive field of convolutional networks.
$$RF_{l}=RF_{l-1}+\left(k_{l}-1\right)\times S_{l-1}\quad(5)$$
$$S_{l-1}=\prod_{i=1}^{l-1}s_{i}\quad(6)$$
In these formulas, the receptive field of the feature map at the $l$-th layer is denoted as $RF_{l}$, $k_{l}$ is the kernel size of the convolutional kernel at the $l$-th layer, and $S_{l-1}$ is the product of the strides $s_{i}$ of all network layers preceding the feature map at the $l$-th layer.
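As a small worked example of Equations (5) and (6), the following Python sketch computes the receptive field layer by layer for a list of (kernel size, stride) pairs; the layer configuration shown is illustrative, not the actual YOLOv7 or ShuffleNet v2 backbone.

```python
def receptive_field(layers) -> int:
    """layers: list of (kernel_size, stride) tuples, input layer first."""
    rf, jump = 1, 1                 # RF_0 = 1, cumulative stride product S_0 = 1
    for k, s in layers:
        rf += (k - 1) * jump        # Equation (5)
        jump *= s                   # Equation (6), accumulated incrementally
    return rf

# Illustrative stack: 3x3 stride-2 convs interleaved with 3x3 stride-1 convs.
print(receptive_field([(3, 2), (3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]))  # -> 43
```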
The backbone network of YOLOv7 has a receptive field of on the last layer feature map, while the ShuffleNet v2 network has a receptive field of on the last layer feature map.
To increase the receptive field of the lightweight network model, a depthwise separable convolution (DWConv) is added before each stage in the main network of ShuffleNet v2. This addition does not change the size of the output feature map but increases the receptive field. The receptive field of the modified backbone network on the last layer feature map is .
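The sketch below illustrates one way such a modification could be expressed in PyTorch: a stride-1 depthwise separable convolution is placed in front of each backbone stage so that spatial resolution is preserved while the receptive field grows. The stage modules and channel counts are placeholders, not the exact configuration of our modified backbone.

```python
import torch.nn as nn

def rf_dwconv(ch: int) -> nn.Sequential:
    """Stride-1 depthwise separable conv: keeps the feature-map size, enlarges the RF."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, stride=1, padding=1, groups=ch, bias=False),  # depthwise 3x3
        nn.Conv2d(ch, ch, 1, bias=False),                                  # pointwise 1x1
    )

def add_rf_dwconv(stages, channels) -> nn.Sequential:
    """Prepend a stride-1 DWConv block to each backbone stage (placeholder stages)."""
    return nn.Sequential(*[nn.Sequential(rf_dwconv(ch), stage)
                           for stage, ch in zip(stages, channels)])

# Hypothetical usage with three ShuffleNet v2-style stages and channel counts:
# backbone = add_rf_dwconv([stage2, stage3, stage4], channels=[116, 232, 464])
```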
In real-world applications, object detection often takes place in noisy, cluttered, or occluded environments, which can reduce accuracy. By adding depthwise separable convolutions to enlarge the receptive field, each prediction is informed by more of the surrounding context, which helps the model detect objects even when they are partially blocked or surrounded by background clutter and makes it easier to suppress noise, improving accuracy in these challenging conditions.
5. Discussion
In this paper, we addressed the challenge that slow detection speeds pose for hardware platforms with limited computing resources. The hardware used in the testing environment is an NVIDIA Jetson TX2, and the deep learning framework used is PyTorch 1.8.0. By incorporating depthwise separable convolutional layers into the YOLOv7 backbone network, we were able to significantly increase the detection speed while retaining a high level of detection accuracy. Depthwise separable convolutions effectively reduce computational complexity by decoupling spatial filtering (the depthwise convolution) from cross-channel combination (the pointwise convolution), leading to faster image processing. The improvements to the receptive field allowed the model to maintain detection performance, even in resource-constrained environments.
Compared to the original YOLOv7, the improved model showed a remarkable increase in detection speed, achieving 3.77 times faster performance while reducing detection accuracy by only 3.84%. This demonstrates that the model can be effectively deployed for real-time applications in scenarios where hardware limitations are a key concern. Our results indicate that lightweight YOLOv7 with these optimizations is particularly well suited for applications in fields such as autonomous vehicles, robotics, surveillance, and healthcare.
Despite the improvements, there are opportunities to further enhance the model’s performance. Future work could focus on optimizing the model further for edge devices, such as mobile phones or IoT systems, which would benefit from reduced model size and faster inference times. Additionally, integrating advanced techniques such as knowledge distillation or quantization could lead to even greater improvements in efficiency without compromising performance.