1. Introduction
Optical imaging detection technology has become an integral component of many advanced weapon systems, especially in military applications. In the United States it plays a critical role in missile defense systems, early warning detection systems, and ground-based midcourse defense systems with kinetic energy interceptors, and it is also a key technology in the THAAD (Terminal High Altitude Area Defense) system, air defense missiles, and air-to-air missiles. These weapon systems rely heavily on object detection algorithms, which process imaging data to determine whether a target is present within the detection field and where it is located [1,2,3]. As such, optical imaging detection technology forms the backbone of many high-performance guidance systems, ensuring precise and reliable identification of threats in dynamic environments.
The increasing complexity of modern battlefield environments, however, has created new challenges for optical terminal guidance systems, particularly those used in precision-guided weapons. The complex environment refers to the confluence of various electromagnetic activities, natural phenomena, and multi-target scenarios that can disrupt or degrade the ability of optical imaging systems to accurately detect and track targets. These factors are as follows: (1) Electromagnetic Environmental Elements: Electromagnetic interference can degrade the signal quality and increase the difficulty of target identification. (2) Natural Environmental Elements: Adverse weather conditions such as fog, rain, or dust storms can obscure the imaging sensor’s view, negatively impacting detection accuracy. (3) Multi-target Environmental Elements: The presence of multiple targets within a scene can create confusion, leading to misidentification or misclassification.
These environmental factors place significant demands on optical terminal guidance systems and impact their ability to track and engage targets effectively. This, in turn, influences the overall combat effectiveness of optical imaging-based precision-guided systems [4].
Traditional object detection and recognition methods were designed to address some of these challenges. Typically, these methods follow a multi-stage process: generating candidate regions, extracting features, classifying regions, and performing post-processing to refine detection results. However, these traditional approaches are limited by several factors, particularly their reliance on manually designed low-level visual features. These features are often handcrafted based on empirical knowledge and are typically limited to specific categories, rendering them less versatile in handling complex object transformations, occlusions, or variations across different scenarios. The lack of deep semantic understanding and the inability to generalize across diverse environments significantly hinder the performance of traditional models.
The introduction of convolutional neural networks (CNNs) has revolutionized the field of object detection by enabling models to learn hierarchical feature representations directly from raw input data. CNN-based approaches have significantly outperformed traditional methods by eliminating the need for handcrafted features and providing more accurate and robust detection capabilities. Among these CNN-based models, the YOLO (You Only Look Once) series has emerged as one of the most popular and effective solutions for real-time object detection [5]. YOLO combines high detection accuracy with impressive speed, making it ideal for real-time applications such as autonomous vehicles, surveillance systems, and military guidance systems. Since its inception in 2016, YOLO has undergone several iterations, from YOLOv1 to YOLOv11 [6,7,8,9], with each version introducing improvements in speed and accuracy. YOLOv7, introduced in 2022, surpassed the other real-time object detectors of its time by achieving an unprecedented balance between speed and accuracy: at 30 FPS (frames per second) on a V100 GPU, it reached 56.8% AP (average precision), outperforming other real-time detectors in both detection accuracy and processing speed.
Despite its superior performance on high-performance GPUs, YOLOv7 faces challenges when deployed on resource-constrained platforms such as smartphones, drones, and embedded systems. These devices often lack the computational power and memory capacity needed to run large, complex models efficiently. To address this issue, various model compression and acceleration techniques have been developed to make deep learning models more suitable for deployment on these platforms [10,11,12,13]. Common methods for accelerating neural networks include lightweight networks, weight quantization [14], neural network pruning [15], and knowledge distillation [16].
Weight quantization is a widely used technique in model compression, in which the precision of the model's weights is reduced to lower-bit representations [17,18]. This approach significantly reduces storage requirements and accelerates computation, but it often comes at the cost of reduced accuracy. This is particularly problematic for object detection models, where small errors in weight quantization can result in large deviations in the predicted bounding box coordinates, affecting the overall detection quality. Binary and ternary quantization have been explored as means to further reduce the model size, but these approaches often introduce substantial accuracy losses, especially in the detection of small or occluded objects [19].
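As a concrete illustration of the general idea (not the specific quantization schemes of the works cited above), the following minimal PyTorch sketch applies symmetric per-tensor 8-bit quantization to a weight tensor and shows the round-trip error that underlies the accuracy loss discussed here; the function names are hypothetical.

```python
import torch

def quantize_weights_int8(w: torch.Tensor):
    """Symmetric per-tensor 8-bit quantization (illustrative sketch)."""
    scale = w.abs().max() / 127.0                      # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_weights(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Approximate reconstruction of the original weights."""
    return q.float() * scale

w = torch.randn(64, 32, 3, 3)                          # e.g., a convolutional weight tensor
q, scale = quantize_weights_int8(w)
w_hat = dequantize_weights(q, scale)
print((w - w_hat).abs().max())                         # quantization error that can shift box predictions
```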
Neural network pruning, another common technique, involves removing less important weights or entire neurons from the model to reduce its size. While pruning can effectively reduce a model's complexity, it also poses challenges when it comes to maintaining the model's performance, particularly when large portions of the network are removed. Song Han's pruning method [20], which combines pruning with weight sharing and Huffman encoding, has been proposed to address this challenge by further compressing the model while attempting to retain its predictive power [21,22,23]. However, as the network grows more complex, the implementation of such techniques becomes increasingly difficult. There are also studies that use early exit for acceleration [24,25].
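For intuition, the sketch below applies simple global magnitude pruning to a PyTorch module, zeroing out the smallest-magnitude weights; it is only a schematic illustration of the pruning idea, not the full pipeline (weight sharing plus Huffman coding) of [20].

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude conv/linear weights (illustrative sketch)."""
    weights = [m.weight.data for m in model.modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))]
    all_w = torch.cat([w.abs().flatten() for w in weights])
    threshold = torch.quantile(all_w, sparsity)        # global magnitude threshold
    for w in weights:
        w.mul_((w.abs() > threshold).float())          # in-place masking of small weights

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
magnitude_prune(model, sparsity=0.7)
```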
In recent years, lightweight networks have emerged as a promising solution for model acceleration, especially on mobile and embedded devices. Lightweight networks such as MobileNet [14], SqueezeNet [15], and ShuffleNet [16,26] have been specifically designed to reduce the number of parameters and the computational cost without sacrificing model accuracy. These networks utilize novel architectural techniques such as depthwise separable convolutions, which significantly reduce the number of operations required for convolutional layers while maintaining high performance. Depthwise separable convolutions decompose the standard convolution operation into two steps: a depthwise convolution followed by a pointwise convolution, which reduces both the amount of computation and the memory usage. This makes lightweight networks particularly effective for real-time object detection on devices with limited resources.
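As a minimal PyTorch sketch of this factorization (assuming a 3×3 kernel and ignoring normalization and activation layers), a standard convolution can be replaced by a depthwise convolution followed by a 1×1 pointwise convolution:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Weight-count comparison for 128 -> 256 channels with a 3x3 kernel:
#   standard convolution:   128 * 256 * 3 * 3 = 294,912 weights
#   depthwise separable:    128 * 3 * 3 + 128 * 256 = 33,920 weights
```

The roughly 8–9× reduction in weights for this single layer illustrates why depthwise separable convolutions are attractive on resource-constrained hardware.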
In this paper, we explore the application of lightweight network principles to the YOLOv7 object detection model to improve its performance on resource-constrained hardware. By incorporating depthwise separable convolutions into the YOLOv7 architecture, we aim to accelerate detection speeds without sacrificing detection accuracy. We validate the effectiveness of this acceleration method using a self-built dataset, demonstrating that the modified YOLOv7 model can achieve real-time performance while maintaining competitive accuracy.
The remainder of this manuscript is organized as follows. Section 2 presents related works, describing the YOLOv7 and ShuffleNet architectures. Section 3 introduces the enhancements made through depthwise separable convolutions. Section 4 describes the experimental setup, including the dataset, evaluation metrics, and the results obtained. Section 5 provides a discussion of the results, interpreting the findings and comparing them with existing models. Finally, Section 6 concludes the paper with a summary of the contributions, limitations, and directions for future work.
2. Related Works
2.1. YOLOv7
YOLOv7 is a one-stage deep neural network model for object detection. Compared to two-stage detection models such as the Faster R-CNN series [7], YOLOv7 achieves a better balance between detection speed and accuracy. The backbone network structure of YOLOv7 is shown in Figure 1.
The backbone network of the YOLO series models extracts features from images at different levels. In the YOLOv7 backbone network, C1, C2, and C3 are obtained as downsampled feature maps at 8×, 16×, and 32×, respectively. Among these, C1 represents the shallow features of the image, which contain more spatial information about the target, while C3 represents the deep features, which contain more semantic information. After feature extraction by the backbone network, YOLOv7 uses the PaFPN (Path Aggregation Feature Pyramid Network) structure and coupled heads to output the predicted values of the target.
For each anchor, the predicted output of YOLOv7 includes the four bounding-box values (x, y, w, h), the target confidence value c, and the category prediction values.
The loss function consists of three components: the bounding-box (IoU, Intersection over Union) loss of the predicted box, the confidence loss, and the category loss.
Equation (1) is the loss value of the prediction box:
$$L_{\mathrm{box}}=\sum_{i=0}^{K\times K}\sum_{j=0}^{M}\mathbb{1}_{ij}^{\mathrm{obj}}\left[1-\mathrm{IoU}\!\left(b_{ij},\hat{b}_{ij}\right)\right]\quad(1)$$
In Equation (1), $b_{ij}$ represents the predicted box and $\hat{b}_{ij}$ represents the true label value. $\mathbb{1}_{ij}^{\mathrm{obj}}$ is the indicator function, which equals 1 when the $j$-th anchor of the $i$-th grid cell is assigned to a target and 0 otherwise; $K$ represents the grid size, and $M$ represents the number of anchors per grid.
Equation (2) is the category loss:
$$L_{\mathrm{cls}}=-\sum_{i=0}^{K\times K}\sum_{j=0}^{M}\mathbb{1}_{ij}^{\mathrm{obj}}\sum_{c\in\mathrm{classes}}\left[\hat{p}_{ij}(c)\log p_{ij}(c)+\left(1-\hat{p}_{ij}(c)\right)\log\left(1-p_{ij}(c)\right)\right]\quad(2)$$
In Equation (2), $p_{ij}(c)$ represents the predicted category probability value and $\hat{p}_{ij}(c)$ represents the true label value.
Equation (3) is the confidence loss:
$$L_{\mathrm{obj}}=-\sum_{i=0}^{K\times K}\sum_{j=0}^{M}\left[\hat{c}_{ij}\log c_{ij}+\left(1-\hat{c}_{ij}\right)\log\left(1-c_{ij}\right)\right]\quad(3)$$
In Equation (3), $c_{ij}$ represents the predicted confidence value and $\hat{c}_{ij}$ represents the true label value.
The loss function value of YOLOv7 is obtained by summing the three parts, as shown in Equation (4):
$$L=L_{\mathrm{box}}+L_{\mathrm{obj}}+L_{\mathrm{cls}}\quad(4)$$
The loss function represents the distance between the predicted values of the network and the true target values. Using the loss function as a feedback signal, the loss is backpropagated through an optimizer to update the weights of each layer and train the network. The training process of the neural network model is shown in Figure 2.
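To make the role of the loss as a feedback signal concrete, the following PyTorch sketch shows one generic training step in which the three loss terms of Equations (1)–(3) are summed as in Equation (4) and backpropagated through an optimizer; the model, optimizer, and `compute_losses` objects are placeholders supplied by the caller, not the actual YOLOv7 training code.

```python
import torch

def train_step(model, optimizer, compute_losses, images, targets) -> float:
    """One generic training step: sum the three loss terms and backpropagate."""
    loss_box, loss_obj, loss_cls = compute_losses(model(images), targets)
    loss = loss_box + loss_obj + loss_cls       # Equation (4)
    optimizer.zero_grad()
    loss.backward()                             # loss acts as the feedback signal
    optimizer.step()                            # update the weights of each layer
    return loss.item()

# Hypothetical wiring:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
# loss_value = train_step(model, optimizer, yolo_losses, images, targets)
```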
2.2. Lightweight Network ShuffleNet V2
ShuffleNet v2 is a lightweight network [26]. To reduce the number of parameters and improve computational efficiency, the design of the network structure adheres to the following principles: (1) When the number of input and output channels of a convolutional layer is equal, memory access cost is minimized and the model runs at its fastest. (2) Excessive use of group convolution increases memory access cost, so the number of groups should not be made too large. (3) An increase in the number of network branches reduces the degree of parallelism and thus computational efficiency, so the number of branches in the model needs to be minimized. (4) Although element-wise operations require few parameters and computations, they incur non-negligible memory access cost at runtime, so these operations should be reduced.
Based on these principles, the basic structural units of ShuffleNet v2 are as shown in Figure 3. When ShuffleNet v2 requires downsampling and doubling of the number of channels, the channel split operation is removed, and downsampling is achieved using a depthwise separable convolution (DWConv) with a stride of 2. This variant is shown as Unit 2 in Figure 3.
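The sketch below is a simplified PyTorch rendering of the two units described above (channel split plus channel shuffle in the stride-1 unit, and a stride-2 depthwise branch on both paths for downsampling); it omits batch normalization, activations, and other details of the original implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across groups so information mixes between branches."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleUnit(nn.Module):
    """Simplified ShuffleNet v2 unit (BN and activations omitted for brevity)."""
    def __init__(self, channels: int, stride: int = 1):
        super().__init__()
        self.stride = stride
        branch_ch = channels // 2 if stride == 1 else channels
        self.branch = nn.Sequential(
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
            nn.Conv2d(branch_ch, branch_ch, 3, stride=stride, padding=1,
                      groups=branch_ch, bias=False),      # depthwise
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
        )
        if stride == 2:  # Unit 2: the shortcut also downsamples, channels double
            self.shortcut = nn.Sequential(
                nn.Conv2d(channels, channels, 3, stride=2, padding=1,
                          groups=channels, bias=False),   # depthwise, stride 2
                nn.Conv2d(channels, channels, 1, bias=False),
            )

    def forward(self, x):
        if self.stride == 1:                  # Unit 1: split, process, shuffle
            x1, x2 = x.chunk(2, dim=1)
            out = torch.cat((x1, self.branch(x2)), dim=1)
        else:                                 # Unit 2: no split, both branches stride 2
            out = torch.cat((self.shortcut(x), self.branch(x)), dim=1)
        return channel_shuffle(out, groups=2)
```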
Adopting this lightweight design concept, we use the ShuffleNet v2 convolutional neural network as the feature extraction (backbone) network to build a lightweight version of the YOLOv7 object detection model.
3. Method of Acceleration
In deep neural network models, several factors affect the accuracy, including the network’s downsampling rate, depth, and receptive field (RF). A lightweight model, such as one that uses ShuffleNet v2 as the backbone, reduces the depth of the network, which can decrease detection accuracy. While increasing the depth of the network and reducing the downsampling rate can improve accuracy, these modifications also increase computational complexity, which hinders the acceleration of the object detection model.
To address this, our method focuses on improving the receptive field of the network without significantly increasing complexity. The receptive field is a crucial factor in object detection because it defines the area of the image that influences each prediction. By enlarging the receptive field, the network can better capture the content of the entire image and understand the context surrounding the object. This improvement allows the model to make more informed predictions, especially in challenging environments with cluttered or occluded objects. In essence, the larger receptive field increases the number of connections between the feature extraction points and the input pixels, improving the model’s ability to detect objects even in complex scenarios. The receptive field of a convolutional neural network is primarily influenced by the convolutional layers and the downsampling layers, as described in Equations (5) and (6) for calculating the receptive field of convolutional networks.
$$RF_{l}=RF_{l-1}+\left(k_{l}-1\right)\times S_{l-1}\quad(5)$$
$$S_{l-1}=\prod_{i=1}^{l-1}s_{i}\quad(6)$$
In these formulas, the receptive field of the feature map at the $l$-th layer is denoted as $RF_{l}$, $k_{l}$ is the kernel size of the convolutional kernel at the $l$-th layer, and $S_{l-1}$ is the product of the strides $s_{i}$ of all network layers preceding the feature map at the $l$-th layer.
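As a small worked example of Equations (5) and (6), the following Python sketch computes the receptive field layer by layer for a list of (kernel size, stride) pairs; the layer configuration shown is illustrative, not the actual YOLOv7 or ShuffleNet v2 backbone.

```python
def receptive_field(layers) -> int:
    """layers: list of (kernel_size, stride) tuples, input layer first."""
    rf, jump = 1, 1                 # RF_0 = 1, cumulative stride product S_0 = 1
    for k, s in layers:
        rf += (k - 1) * jump        # Equation (5)
        jump *= s                   # Equation (6), accumulated incrementally
    return rf

# Illustrative stack: 3x3 stride-2 convs interleaved with 3x3 stride-1 convs.
print(receptive_field([(3, 2), (3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]))  # -> 43
```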
The backbone network of YOLOv7 has a receptive field of on the last layer feature map, while the ShuffleNet v2 network has a receptive field of on the last layer feature map.
To increase the receptive field of the lightweight network model, a depthwise separable convolution (DWConv) is added before each stage in the main network of ShuffleNet v2. This addition does not change the size of the output feature map but increases the receptive field. The receptive field of the modified backbone network on the last layer feature map is .
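The sketch below illustrates one way such a modification could be expressed in PyTorch: a stride-1 depthwise separable convolution is placed in front of each backbone stage so that spatial resolution is preserved while the receptive field grows. The stage modules and channel counts are placeholders, not the exact configuration of our modified backbone.

```python
import torch.nn as nn

def rf_dwconv(ch: int) -> nn.Sequential:
    """Stride-1 depthwise separable conv: keeps the feature-map size, enlarges the RF."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, stride=1, padding=1, groups=ch, bias=False),  # depthwise 3x3
        nn.Conv2d(ch, ch, 1, bias=False),                                  # pointwise 1x1
    )

def add_rf_dwconv(stages, channels) -> nn.Sequential:
    """Prepend a stride-1 DWConv block to each backbone stage (placeholder stages)."""
    return nn.Sequential(*[nn.Sequential(rf_dwconv(ch), stage)
                           for stage, ch in zip(stages, channels)])

# Hypothetical usage with three ShuffleNet v2-style stages and channel counts:
# backbone = add_rf_dwconv([stage2, stage3, stage4], channels=[116, 232, 464])
```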
In real-world applications, object detection often takes place in noisy, cluttered, or occluded environments, which can reduce accuracy. By adding depthwise separable convolutions to enlarge the receptive field, each prediction is informed by more of the surrounding context, which helps the model detect objects even when they are partially blocked or surrounded by background clutter and makes it easier to suppress noise, improving accuracy in these challenging conditions.
5. Discussion
In this paper, we addressed the challenge that slow detection speeds pose for hardware platforms with limited computing resources. The hardware used in the testing environment is an NVIDIA Jetson TX2, and the deep learning framework used is PyTorch 1.8.0. By incorporating depthwise separable convolutional layers into the YOLOv7 backbone network, we were able to significantly increase the detection speed while retaining a high level of detection accuracy. Depthwise separable convolutions effectively reduce computational complexity by decoupling spatial filtering (the depthwise convolution) from cross-channel combination (the pointwise convolution), leading to faster image processing. The improvements to the receptive field allowed the model to maintain detection performance, even in resource-constrained environments.
Compared to the original YOLOv7, the improved model showed a remarkable increase in detection speed, achieving 3.77 times faster performance while reducing detection accuracy by only 3.84%. This demonstrates that the model can be effectively deployed for real-time applications in scenarios where hardware limitations are a key concern. Our results indicate that lightweight YOLOv7 with these optimizations is particularly well suited for applications in fields such as autonomous vehicles, robotics, surveillance, and healthcare.
Despite the improvements, there are opportunities to further enhance the model’s performance. Future work could focus on optimizing the model further for edge devices, such as mobile phones or IoT systems, which would benefit from reduced model size and faster inference times. Additionally, integrating advanced techniques such as knowledge distillation or quantization could lead to even greater improvements in efficiency without compromising performance.