Dense Small Object Detection Based on an Improved YOLOv7 Model

Chen, Xun; Deng, Linyi; Hu, Chao; Xie, Tianyi; Wang, Chengqi

doi:10.3390/app14177665


by Xun Chen ¹,†, Linyi Deng ¹,†, Chao Hu ²,*, Tianyi Xie ¹ and Chengqi Wang ¹

¹ School of Information and Communication Engineering, Hainan University, Haikou 570228, China

² School of Electronic Information, Central South University, Changsha 410083, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2024, 14(17), 7665; https://doi.org/10.3390/app14177665

Submission received: 24 July 2024 / Revised: 18 August 2024 / Accepted: 26 August 2024 / Published: 30 August 2024


Abstract:

Detecting small, densely packed objects in images remains a significant challenge in computer vision. Existing object detection methods often exhibit low accuracy and frequent missed detections when identifying dense small objects, and they require a large number of model parameters. This study introduces a novel detection framework designed to address these limitations by integrating advanced feature fusion and optimization techniques. Our approach focuses on enhancing both detection accuracy and parameter efficiency. The approach was evaluated on the open-source VisDrone2019 data set and compared with mainstream algorithms. Experimental results demonstrate a 70.2% reduction in network parameters and a 6.3% improvement in mAP@0.5 over the original YOLOv7 algorithm. These results indicate that the enhanced model outperforms the compared algorithms in detecting small objects.

Keywords:

small object detection; YOLOv7; feature extraction

1. Introduction

Object detection plays an indispensable role in advanced visual tasks such as target tracking, image description, and scene understanding [1]. It is widely used in fields like intelligent security [2,3], automatic defect detection [4,5], autonomous driving [6,7], and remote sensing image analysis [8,9]. The YOLO (You Only Look Once) family of models [10,11,12,13,14,15,16,17,18,19] performs end-to-end detection and outputs prediction results directly. Owing to its excellent detection rate, it is a popular framework for industrial applications.

While YOLO performs well in general-purpose object detection tasks, dense small object detection remains challenging. Compared with conventional-sized objects, small objects occupy fewer pixels and a smaller coverage area, which makes their feature information difficult to extract. Small object feature extraction is also easily affected by environmental factors in real scenes; for example, in strong light, the textures of small objects are easily overexposed, which hampers feature extraction. Scale variations and dense connections can make the detection of small objects even more difficult [20]. Existing models typically require a large number of parameters and still fall short of high accuracy, often producing missed detections and false positives. Moreover, their feature extraction capabilities are not sufficiently robust to handle small objects in diverse and cluttered backgrounds.

To address these limitations, we propose an improved YOLOv7 architecture tailored for dense small object detection. We have upgraded our model in several ways: (1) We have added a new detection layer to improve the recognition of small objects by focusing on finer details. (2) We have removed the detection layer for large target detection to speed up processing and make the model more effective. (3) We have integrated advanced feature fusion techniques to better capture features at different scales. (4) We have replaced old downsampling techniques to preserve more details and enhance accuracy in detecting small objects. (5) We have optimized the number of network channels to further enhance the capability of small target detection. Comprehensive evaluations on the VisDrone2019 [21] data set highlight the superiority of this method compared to mainstream algorithms.

The remainder of this paper is organized as follows. Section 2 reviews related work in the field of small object detection. Section 3 details the proposed improvements to the YOLOv7 architecture. Section 4 presents the experimental setup and analysis of results. Finally, Section 5 concludes the paper.

2. Related Work

The YOLO series has seen continuous development, with recent research focusing on addressing challenges in dense and small object detection. In 2020, Xu et al. [22] enhanced YOLOv3, resulting in a 5.43% increase in detection accuracy. However, this improvement came at the cost of increased computational requirements due to the addition of an auxiliary network, leading to longer detection times. Similarly, Xu et al. [23] integrated DenseNet with YOLOv3 to achieve higher accuracy while maintaining real-time performance for remote sensing target detection. However, more recent YOLO models, now up to version 10, significantly outperform these earlier versions in both accuracy and efficiency. In 2021, Zhu et al. [24] introduced TPH-YOLOv5, combining a Transformer detection head and the CBAM attention mechanism to increase attention in dense small target areas. However, this led to a significant increase in the number of parameters. Wang et al. [25] improved YOLOv4 by using dense blocks instead of residual blocks and integrating SPPnet and PANnet in the Neck. Despite these modifications, the improvements were limited. In 2022, Zhang et al. [26] augmented YOLOv5 with BAM to enhance attention to small target information in shallow feature maps, which performed well in the detection of small figures but with a large increase in parameter scale. Huang et al. [27] developed TCA-YOLOv5m by integrating the Transformer algorithm with the Coordinate Attention mechanism, which significantly enhanced target detection capabilities. However, the model still exhibited instances of missed and false detections, particularly with small targets. In 2023, Zhao et al. [28] integrated a small target detection head and an attention mechanism into YOLOv7, enhancing the detection performance of small targets on the sea surface. However, they struggled to effectively control the increase in computational load. Bao et al. [29] addressed the significant information loss of small and medium-sized targets in remote sensing images by reconstructing the feature fusion pyramid and improving the up-sampling method to enhance the detection of small targets. This improvement, however, also increased the model’s complexity. In 2024, Sui et al. [30] enhanced YOLOv8s by integrating BiFPN and Dynamic Head, which improved the detection of small targets, but the proposed algorithm still has shortcomings such as long training and convergence times.

These advancements primarily focus on incorporating modules tailored for detecting small targets. While these enhancements boost detection accuracy, they also come with a significant increase in computational cost. Considering that, among recent versions of the YOLO algorithm, YOLOv7 achieves high accuracy in the detection of small targets, this paper proposes an improved YOLOv7 algorithm for dense small targets. This algorithm enhances the detection accuracy of small targets while reducing model parameters, resulting in improved overall performance.

3. Proposed Method

The proposed network structure is shown in Figure 1. Our model incorporates a new P2 prediction head operating on a 4-fold downsampled feature map, enhancing its ability to capture detailed information about small objects. The P5 detection layer, designed primarily for larger objects, is removed to reduce network parameters. Additionally, C3 and Res2Net [31] are integrated to form the multi-scale feature extraction module C3R2N, which replaces the ELAN_1 and ELAN_2 modules at 16-fold downsampling. This integration aims to improve network precision and reduce the parameter count. The original downsampling convolution layers often lead to spatial information loss, which is particularly detrimental to small object detection, so the SPD Conv [32] module is introduced to replace these conventional layers. Furthermore, doubling the channel number of the first ELAN_1 feature fusion module significantly enhanced the detection performance.

3.1. More Focus on Small Object Detection Head

Small objects have the disadvantage of having fewer pixels, and in scenarios where they are densely packed, a higher-resolution detection layer can help differentiate between closely spaced small objects, reducing false positives and improving detection accuracy. The P2 layer corresponds to an earlier stage in the network with higher spatial resolution. This means that the features at this level are more detailed, which is crucial for detecting small objects that might be missed at lower resolutions. YOLOv7 models use multiple detection layers (P3, P4, and P5) to handle objects of various sizes. By adding a P2 head, we enhance this multi-scale detection capability, specifically targeting very small objects.

Images in small target detection tasks typically contain a large number of tiny objects. Each layer in YOLO corresponds to a different resolution of the input image, and the P5 layer, being deeper, represents a downsampled version of the image (typically 1/32 of the original size). For small objects, this downsampling is too severe, and crucial details might be lost. By removing the P5 layer, the model can allocate more resources and attention to the higher-resolution P2, P3, and P4 layers, which are more relevant for detecting small objects. The P5 layer increases the computational load without significantly contributing to the detection of small objects, so removing it makes the model more efficient without sacrificing small object detection performance. These adjustments resulted in a 69% reduction in network parameters.
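To make the resolution argument concrete, the following minimal sketch computes the feature map size at each detection level and how many grid cells a small object spans. The strides for P2 and P5 follow the text; the P3 and P4 strides (8 and 16), the 640-pixel input size, and the 16-pixel object size are illustrative assumptions, and the function names are hypothetical.

```python
# Downsampling factor (stride) of each detection level relative to the input.
# P2 = 4 and P5 = 32 follow the text; P3 = 8 and P4 = 16 are assumed.
LEVEL_STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def grid_size(input_size, level):
    """Side length of the square feature map at `level` for a square input."""
    stride = LEVEL_STRIDES[level]
    if input_size % stride:
        raise ValueError("input size must be divisible by the stride")
    return input_size // stride


def cells_spanned(object_size, level):
    """Number of grid cells (per side) covered by an object of `object_size` pixels."""
    return object_size / LEVEL_STRIDES[level]


INPUT_SIZE = 640   # assumed input resolution, not taken from the paper
OBJECT_SIZE = 16   # assumed side length of a small object, in pixels

for level in ("P2", "P3", "P4", "P5"):
    print(level, grid_size(INPUT_SIZE, level), cells_spanned(OBJECT_SIZE, level))
```

Under these assumptions, a 16-pixel object spans four cells per side at P2 but only half a cell at P5, which illustrates why the P5 level contributes little to small object detection.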

3.2. Multi-Scale Feature Extraction Module

Feature maps at different levels differ in granularity and contribute differently to small objects. Blending features only at equal levels may therefore produce redundant and conflicting information, disturb context information, interfere with the network’s learning of small objects, and slow down inference. For this reason, this paper introduces Res2Net to redesign the residual structure within C3. As depicted in Figure 2, Res2Net employs grouped convolutions within a single residual block to evenly distribute channels across the input feature map. Utilizing a series of compact convolutional kernels ensures model efficiency. Furthermore, employing a ladder structure expands the range of scales represented by the output feature map, enhancing the model’s ability to capture multi-scale features that are essential for detecting small objects.

The calculation formula for Res2Net is as follows:

y_{i} = \begin{cases} x_{i}, & i = 1; \\ k_{i}(x_{i}), & i = 2; \\ k_{i}(x_{i} + y_{i-1}), & 2 < i \leq s. \end{cases}

(1)

Suppose that the feature map in Res2Net has n channels after the 1 × 1 convolution. The split operation then divides this feature map equally along the channel direction into s subsets, denoted x_{i}, where i ∈ {1, 2, ..., s}. Each subset x_{i} has the same spatial scale as the input feature map, but its number of channels becomes n/s. Every subset except the first, x_{1}, is followed by a corresponding 3 × 3 convolution, denoted k_{i}(), and the output for x_{i} is denoted y_{i}, where i ∈ {1, 2, ..., s}. For i > 2, the current subset x_{i} is added to the previous output y_{i-1}, and the result is the input of k_{i}(). Each k_{i}() therefore receives feature information from the subsets \{x_{j}, j \leq i\}, and this hierarchical connection pattern allows each y_{i} to extract richer, more multi-scale features on the basis of y_{i-1}.

As can be seen from Formula (1), the larger the value of s, the larger the receptive field that can be learned. However, a larger s also makes the network more redundant. We set s to 4 so that the output feature map contains receptive fields of different sizes and quantities. With this hierarchical residual structure, feature information is no longer extracted at a single scale per level but is represented at multiple scales and at a finer granularity. This inhibits the generation of conflicting information and gives the fused feature map richer semantic information and texture details, thus improving the attention of the network to small objects.
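The relationship between s and the receptive field in Formula (1) can be illustrated with a short sketch. It assumes that every k_{i} is a 3 × 3 convolution with stride 1 and no dilation, so each additional convolution on the path widens the receptive field by two pixels; the function name is hypothetical and the receptive fields are measured relative to the input of the split.

```python
def res2net_receptive_fields(s, kernel=3):
    """Receptive field (pixels per side) of each output y_i in Formula (1).

    Assumes stride-1, undilated kernel x kernel convolutions for every k_i.
    """
    fields = []
    for i in range(1, s + 1):
        if i == 1:
            rf = 1                           # y_1 = x_1 (identity branch)
        elif i == 2:
            rf = kernel                      # y_2 = k_2(x_2)
        else:
            rf = fields[-1] + (kernel - 1)   # y_i = k_i(x_i + y_{i-1})
        fields.append(rf)
    return fields


# With s = 4, as used in this paper, the four outputs see different scales.
print(res2net_receptive_fields(4))  # [1, 3, 5, 7]
```

The outputs with different receptive fields are what allow a single block to represent several scales at once, while a larger s adds further branches at the cost of redundancy.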

Combining C3 and Res2Net, we designed a module called C3R2N, as illustrated in Figure 3. The input feature map is split into two branches by two convolution layers that share the same kernel size and each halve the number of output channels; each convolution is followed by normalization and a Sigmoid operation, which achieves both channel reduction and feature extraction. One branch then passes through the Res2Net module and is concatenated with the other branch along the channel dimension. Finally, the combined output is processed through convolution, normalization, and the Sigmoid function to generate a feature map that is rich in multi-scale information.

This module is used to replace the feature fusion ELAN_1 and ELAN_2 modules with 16-fold downsampling of the original YOLOv7. This method can effectively improve the detection effect of small and medium-sized targets.

3.3. SPD Conv Downsampling

Traditional downsampling methods, such as max pooling and strided convolutions (SConv), have been fundamental components of CNNs. In most research scenarios, images have good resolution and moderate-sized objects, which allows strided convolutions and pooling layers to skip large amounts of redundant pixel information while still enabling effective feature learning. However, these methods have several shortcomings, especially in tasks like small object detection in YOLO models. Strided convolutions, for instance, apply filters to the input with a step size greater than one, effectively reducing the spatial dimensions. This reduction in spatial resolution can cause critical details of small objects to be missed. Figure 4 shows the downsampling process of SConv with a stride of 2 as an example. SConv compresses the input feature map with a 3 × 3 convolution kernel and a stride of 2, so the output has 1/2 the width and height of the input, and part of the feature information is directly filtered out.

To reduce the feature loss of small targets, we have introduced the SPD Conv module. By converting spatial dimensions into depth channels, SPD Conv allows the network to downsample high-resolution features without discarding spatial information. This provides a richer representation of the input, which is beneficial for detecting small and densely packed objects. SPD Conv slices the input feature map into four sub-maps, each downsampled by a factor of two, which together contain the global spatial information of the original feature map. The sub-maps are concatenated along the channel dimension, and the channel dimension is then adjusted by a non-strided convolution layer. As shown in Figure 5, SPD Conv retains more information in the intermediate feature maps than traditional downsampling methods. Replacing downsampling layers with SPD Conv can lead to a more effective and accurate model for detecting small objects in various applications.
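The slicing step described above can be sketched in plain Python on a single-channel map stored as a list of rows. This is an illustrative sketch of the space-to-depth idea only, not the authors' implementation: the function name is hypothetical, the order of the sub-maps is an arbitrary choice, and the subsequent non-strided convolution is omitted.

```python
def space_to_depth(feature_map, scale=2):
    """Slice a 2-D map (list of rows) into scale * scale sub-maps.

    Each sub-map keeps every `scale`-th pixel starting at a different offset,
    so it is downsampled by `scale` while no pixel is discarded overall.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    sub_maps = []
    for dy in range(scale):
        for dx in range(scale):
            sub_maps.append([row[dx::scale] for row in feature_map[dy::scale]])
    return sub_maps


# A 4 x 4 map with distinct values 0..15.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
subs = space_to_depth(grid)

print(len(subs))   # 4 sub-maps, to be stacked along the channel dimension
print(subs[0])     # [[0, 2], [8, 10]]
```

Every input value appears in exactly one sub-map, whereas a stride-2 convolution or pooling layer evaluates only one position per 2 × 2 neighborhood.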

3.4. ELAN Model Optimization

The structure of ELAN_1 can be seen in Figure 1. ELAN_1 plays a crucial role in the YOLOv7 backbone: it is designed to efficiently aggregate features from different layers of the network. This aggregation helps capture multi-scale information, which is essential for detecting objects of various sizes, and it enhances the representation capability of the model, allowing it to better identify and classify objects in the input image. ELAN_1 can also be adapted to different architectures and modified to further enhance performance.

Expanding the channel number of ELAN_1 enables the network to learn finer details and subtle differences in features, increasing the module’s capacity to process and integrate more feature information. This enhancement leads to better feature discrimination and improved detection accuracy. However, increasing the number of ELAN_1 channels also raises the number of parameters and the model’s complexity. Therefore, increasing the number of ELAN_1 channels only at the P2 layer achieves the best overall effect.

4. Experiments and Analysis of Results

4.1. Data Set

The VisDrone2019 data set is a comprehensive and challenging benchmark designed for visual object detection and tracking tasks in drone-captured imagery. It comprises over 10,000 images and video sequences captured by drones flying over diverse urban and rural environments. The data set includes a wide range of objects, such as pedestrians, vehicles, bicycles, and tricycles, annotated with precise bounding boxes and detailed labels. Figure 6 shows specific target categories and tag count information.

Figure 7 shows the height and width distribution of the target boxes in the training set. Most data points are concentrated in the lower left corner, indicating that the height and width of most targets are less than one-tenth of those of the original image, which conforms to the definition of small targets and matches the problem studied in this paper. The images in VisDrone2019 also present various complexities, including different weather conditions, large numbers of objects, lighting variations, and occlusions, making the data set an ideal testbed for evaluating the performance of algorithms for small and dense object detection.

4.2. Experimental Settings

This study used an Ubuntu 20.04 operating system, and the software environment was Python 3.8.10, PyTorch 1.11.0, and CUDA 11.3. All models were trained, validated, and run for inference on an NVIDIA GeForce RTX 4090 GPU.

In our experiments, no changes were made to the data set, and all methods were evaluated on the original validation set. The experimental settings are shown in Table 1 below.

4.3. Evaluation Metrics

In order to test the performance of the proposed model, precision (P), recall (R), average precision (AP), mean average precision (mAP), and network parameter size were used for evaluation.

False positive (FP) refers to the number of negative samples detected as positive samples, true positive (TP) refers to the number of positive samples correctly detected as positive samples, false negative (FN) refers to the number of positive samples that were missed, i.e., incorrectly treated as negative samples, and Intersection over Union (IoU) refers to the ratio of the area of overlap between the predicted bounding box and the ground truth box to the area of their union.
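As a small worked illustration of the IoU definition, the sketch below computes it for two axis-aligned boxes. The (x1, y1, x2, y2) corner format and the function name are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A prediction is typically counted as a true positive only if its IoU with a ground truth box of the same class reaches the chosen threshold, for example 0.5 for mAP@0.5.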

Precision (P) refers to the ratio of correctly predicted positive observations to the total predicted positives. It is a measure of how many detected targets are relevant. The formula is as follows:

P = \frac{T P}{T P + F P}

(2)

Recall (R) refers to the ratio of correctly predicted positive observations to all observations in the actual class. It measures how many relevant targets are detected. The formula is as follows:

R = \frac{T P}{T P + F N}

(3)
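Formulas (2) and (3) can be checked with a few lines of code. The counts in the example are made up for illustration and are not results from this paper; the function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, following Formulas (2) and (3)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative counts: 80 correct detections, 20 false alarms, 120 missed objects.
p, r = precision_recall(tp=80, fp=20, fn=120)
print(p, r)  # 0.8 0.4
```

The example shows a typical situation in dense scenes: most reported boxes are correct (high precision), yet many objects are missed (low recall).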

Average precision (AP) refers to the area enclosed by the precision–recall curve of each category and the coordinate axes. The formula is as follows:

A P = \int_{0}^{1} P (R) d R

(4)
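In practice, the integral in Formula (4) is approximated from a finite set of precision–recall points. The sketch below uses the common all-point interpolation, in which precision is first made non-increasing with recall. This convention is an assumption for illustration; the paper does not state which approximation its evaluation code uses.

```python
def average_precision(recalls, precisions):
    """Approximate the area under a precision-recall curve.

    `recalls` must be sorted in increasing order, with one precision per recall.
    """
    # Pad so that the curve spans recall 0 to 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas over each recall increment.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0: AP = 0.5 + 0.25.
print(average_precision([0.5, 1.0], [1.0, 0.5]))  # 0.75
```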

Mean average precision (mAP) refers to the mean of the average precision scores over all classes. It is calculated through Formula (5), where k is the number of classes and AP_{i} is the average precision of the i-th class. It is widely used for evaluating object detectors and provides a single-value summary of the precision–recall curves. In this study, mAP@0.5 refers to the value of mAP when the IoU threshold is 0.5, and mAP@0.5:0.95 refers to the mAP averaged over multiple IoU thresholds within the interval [0.5, 0.95]: ten IoU thresholds are taken in steps of 0.05, the mAP under each of these ten thresholds is calculated, and then the average value is taken. In addition, mAP(s), mAP(m), and mAP(l) refer to the mAP values of small, medium, and large targets, respectively.

m A P = \frac{1}{k} \sum_{i = 1}^{k} A P_{i}

(5)
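The following sketch illustrates Formula (5) and the averaging over IoU thresholds used for mAP@0.5:0.95. The function names and the per-class AP values are hypothetical and serve only to show the arithmetic.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds from `start` to `stop` inclusive, in increments of `step`."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]


def mean_average_precision(ap_per_class):
    """Formula (5): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def map_over_thresholds(ap_by_threshold):
    """Average the mAP computed at each IoU threshold.

    `ap_by_threshold` maps an IoU threshold to the list of per-class AP values.
    """
    per_threshold = [mean_average_precision(aps) for aps in ap_by_threshold.values()]
    return sum(per_threshold) / len(per_threshold)


thresholds = iou_thresholds()
print(len(thresholds), thresholds[0], thresholds[-1])  # 10 0.5 0.95

# Made-up AP values for three classes at a single IoU threshold.
print(mean_average_precision([0.2, 0.4, 0.6]))
```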

4.4. Experimental Analysis of Different Improvement Points

4.4.1. Improved Detection Head for YOLOv7

To verify the effect of adding a P2 detection head and removing the P5 detection layer on the detection performance of the YOLOv7 network model, the following experiments were conducted on the VisDrone2019 data set: Experiment 1 used the benchmark model (YOLOv7), whose structure is shown in Figure 8; in Experiment 2, the P2 detection head was added, with the structure shown in Figure 9; in Experiment 3, the P5 detection head was removed on the basis of Experiment 2, but P5 downsampling was retained, with the structure shown in Figure 10; in Experiment 4, the entire P5 layer was removed, with the structure shown in Figure 11.

The experimental results are presented in Table 2. Adding the P2 detection head (Experiment 2) improved the detection performance for small and medium-sized targets compared to the benchmark YOLOv7 model (Experiment 1). The mAP for small targets increased from 18.6% to 21.0%, and the mAP for medium-sized targets increased from 38.7% to 40.7%. Removing the P5 detection head but retaining P5 downsampling (Experiment 3) resulted in a slight improvement in the mAP for small targets (from 21.0% to 21.3%) compared to Experiment 2. However, the mAP for large targets decreased slightly from 47.8% to 47.2%. Completely removing the P5 layer (Experiment 4) resulted in the highest mAP for small targets at 21.6% and the highest overall mAP@0.5 at 51.4%. This also led to a significant reduction in the number of parameters, from 36.53 M in the benchmark model to 11.31 M in Experiment 4, indicating a more efficient model with less computational complexity.

The changes made in Experiments 2, 3, and 4 generally improved the detection performance for small and medium-sized targets without a significant loss in performance for large targets. Experiment 4, which completely removed the P5 layer, achieved the best balance between performance improvement and model efficiency. These conclusions highlight the trade-offs between detection performance and model complexity, demonstrating that modifications like adding a P2 detection head and removing the P5 layer can lead to better detection performance, particularly for small targets, while also reducing model complexity.

4.4.2. C3R2N Multi-Scale Feature Extraction Module

To verify the effect of introducing the designed C3R2N module into the model, the following experiments were conducted on the model with the P2 detection head added and the P5 detection layer removed: in Experiment 1, the third ELAN_1 of the backbone network was replaced with the C3R2N module; in Experiment 2, the second and third ELAN_1 modules of the backbone network were replaced with C3R2N modules; in Experiment 3, all ELAN_1 modules were replaced with C3R2N modules; in Experiment 4, the third ELAN_2 in the neck network was replaced with the C3R2N module; in Experiment 5, the third ELAN_2 in the neck network and the third ELAN_1 in the backbone network were replaced with C3R2N modules; in Experiment 6, on the basis of Experiment 5, the second ELAN_2 of the neck network was replaced with the C3R2N module; in Experiment 7, on the basis of Experiment 6, the first ELAN_2 in the neck network was replaced with the C3R2N module. The results are shown in Table 3.

As can be seen from the results in Table 3, replacing various ELAN_1 and ELAN_2 modules with the C3R2N module generally improved detection performance compared to the baseline model. The best mAP@0.5 was achieved in Experiment 1, where the third ELAN_1 of the backbone network was replaced, resulting in an mAP@0.5 of 52.1%, which is higher than the baseline of 51.4%. The model parameters were reduced significantly with the introduction of C3R2N modules. The baseline model had 11.31 M parameters, whereas Experiment 7, which replaced the most ELAN modules, had the fewest parameters (9.01 M). This suggests that the C3R2N module can reduce model complexity while maintaining competitive performance. While Experiment 1 achieved the highest mAP@0.5, Experiment 5 offered a good balance between high detection accuracy (51.9% mAP@0.5) and fewer model parameters (9.30 M), making it a potentially more efficient choice. Therefore, we chose to replace the third ELAN_2 in the neck network and the third ELAN_1 in the backbone network with C3R2N.

4.4.3. Introduce SPD Conv Downsampling Module

To verify the effect of introducing the SPD Conv module into the model, the following experiments were carried out on the model with C3R2N introduced: in Experiment 1, all strided convolutions were replaced by SPD Conv; in Experiment 2, all strided convolutions and max pooling layers were replaced by SPD Conv. The results are shown in Table 4.

Replacing all strided convolutions with SPD Conv (Experiment 1) improved detection performance across all target sizes compared to the baseline. The mAP for small targets increased from 21.7% to 23.0%, and the mAP@0.5 increased from 51.9% to 53.8%. Experiment 1, which replaced only the strided convolutions with SPD Conv, achieved the highest mAP@0.5 of 53.8% with a moderate increase in parameters, making it an optimal configuration for balancing performance improvement and model complexity. The results of Experiment 2 suggest that additionally replacing max pooling with SPD Conv is not as effective as replacing the strided convolutions alone. We therefore chose to replace only the strided convolutions with SPD Conv and kept the max pooling downsampling. The introduction of the SPD Conv module effectively enhances the detection performance of the model, particularly for small and medium-sized targets.

4.4.4. Adjusting the Number of Channels of the Feature Fusion Module

To enhance the feature representation of the model and improve the detection accuracy, we adjusted the number of channels of the backbone’s feature fusion modules. Widening the feature fusion modules at 8-fold and 16-fold downsampling greatly increases the number of model parameters, which is not in line with our original intention. Therefore, we only doubled the number of channels of the backbone’s first ELAN_1, the feature fusion module at 4-fold downsampling (Experiment 1). The experiment was carried out on the model with SPD Conv introduced, and the results are shown in Table 5.

Doubling the channel number of the ELAN feature fusion module led to a notable enhancement in detection performance. This modification allows the model to capture more detailed features, resulting in higher mAP values. These improvements highlight the effectiveness of the enhanced ELAN module in achieving better object detection accuracy, making it a valuable addition to our model’s architecture.

4.5. Ablation Experiments

In this experiment, we conducted an ablation study to evaluate the impact of various modifications to the baseline YOLOv7 model. These changes include adding a P2 detection head, removing the P5 detection layer, replacing the feature fusion modules at 16-fold downsampling with C3R2N, introducing SPD Conv, and doubling the number of channels of the first ELAN_1 module in the backbone network.

As can be seen from the ablation experiment results in Table 6, adding the P2 detection layer improved the mAP@0.5 to 50.5% and the mAP@0.5:0.95 to 29.9% with a slight increase in parameters to 37.10 M. Removing the P5 detection layer from the model in step B resulted in a slight increase in both mAP@0.5 (51.4%) and mAP@0.5:0.95 (30.0%), with a significant reduction in parameters to 11.31 M. Incorporating the C3R2N module improved the mAP@0.5 to 51.9% and the mAP@0.5:0.95 to 30.3% while reducing parameters further to 9.3 M. Introducing SPD Conv provided a significant boost, raising the mAP@0.5 to 53.8% and the mAP@0.5:0.95 to 31.8%. This configuration maintained a relatively low parameter count of 10.1 M, making it an efficient and effective modification. Finally, doubling the number of channels in the first ELAN_1 module (our method) achieved the highest mAP@0.5 of 55.1% and mAP@0.5:0.95 of 32.5%, demonstrating the effectiveness of this combination of modifications.
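The parameter counts quoted above can be turned into relative changes with a short calculation. The sketch uses only the figures stated in the text; the dictionary labels and the function name are hypothetical, and the final model is omitted because its parameter count is not quoted in this paragraph.

```python
def relative_change(before, after):
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


BASELINE_M = 36.53  # YOLOv7 parameters in millions, as quoted in the text

# Parameter counts (millions) after each cumulative modification.
PARAMS_M = {
    "+ P2 head": 37.10,
    "- P5 layer": 11.31,
    "+ C3R2N": 9.30,
    "+ SPD Conv": 10.10,
}

for name, params in PARAMS_M.items():
    print(f"{name}: {params:.2f} M ({relative_change(BASELINE_M, params):+.1f}% vs. baseline)")
```

Removing the P5 layer accounts for most of the savings: 11.31 M is about 69% fewer parameters than the 36.53 M baseline, which matches the reduction reported in Section 3.1.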

The ablation study demonstrated that each modification had a positive impact on the model’s performance to varying degrees. The final model showed the most significant improvement in detection precision while maintaining a smaller number of parameters. This suggests that our proposed method effectively enhances the YOLOv7 model’s performance and efficiency.

The training curves of mAP@0.5 (Figure 12) and mAP@0.5:0.95 (Figure 13) for our method and the baseline (YOLOv7) show that the improved model gained accuracy faster in the early stage of training and maintained higher mAP@0.5 and mAP@0.5:0.95 levels than the baseline throughout the training process. This result shows that our improvement of the model structure is effective. In the comparison of the training loss (Figure 14), the improved model shows a lower loss value and faster convergence. This means that the improved model fits the training data better during training, and the lower loss value also suggests that the model may achieve higher prediction accuracy.

4.6. Comparative Experiments

To evaluate the performance of our proposed method, we conducted a series of experiments on the VisDrone2019 validation set. We compared our method against several recent algorithms, including members of the YOLO family (YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, and YOLOv10) and the improved YOLO algorithms Drone-YOLO [33] and MS-YOLO [34]. We chose the algorithms that had a similar number of parameters to our model and those that performed best. In particular, YOLOv7m is our benchmark model, and the results for the improved YOLO algorithms are taken from their respective papers.

As can be seen from the comparative experiment results in Table 7, our method achieved the highest mAP@0.5 of 55.1% and mAP@0.5:0.95 of 32.5%, surpassing all other models and demonstrating its superior object-detection capabilities. Notably, it achieved this while maintaining a relatively low parameter count, comparable to lightweight models like YOLOv8s and YOLOv9s. These results demonstrate the effectiveness of our proposed method in terms of both accuracy and model complexity, making it a suitable choice for dense small target detection applications.

4.7. Visual Analysis

Figure 15 compares the detection results of YOLOv7 and our proposed algorithm in different scenarios. As can be seen from (a), in a dark environment, YOLOv7 fails to detect vehicles occluded at the roadside and vehicles whose colors are similar to the background, whereas our algorithm detects these objects well. In the dense small target image (b), YOLOv7 misses a large number of targets, while our algorithm reduces the rate of missed detections. In the scenario with a light change (c), the difference between the two detection results is smaller, but YOLOv7 still produces missed and false detections. These comparisons demonstrate the effectiveness of our improved algorithm.

Figure 16 shows the confusion matrices of YOLOv7 and the algorithm proposed in this paper. The accuracy for each object class is higher with our method than with YOLOv7, and the miss rate for each class is lower, which further supports the superiority of our proposed algorithm.
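
To make the reading of such a confusion matrix concrete, the following Python sketch derives the per-class correct-detection rate and miss rate from a matrix of raw counts. It is a minimal illustration under stated assumptions: rows index predicted classes, columns index true classes, the last row and column stand for the background, and each column is normalized to sum to one. The counts are toy values, not data from Figure 16.

```python
def column_normalize(counts):
    """Divide every entry by its column sum (columns with no samples stay at 0)."""
    n = len(counts)
    col_sums = [sum(counts[r][c] for r in range(n)) for c in range(n)]
    return [
        [counts[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]


def per_class_rates(counts, class_names):
    """Return the correct-detection rate (diagonal) and miss rate (background row)."""
    norm = column_normalize(counts)
    background = len(counts) - 1  # assumed position of the background row
    return {
        name: {"correct": norm[i][i], "missed": norm[background][i]}
        for i, name in enumerate(class_names)
    }


# Toy counts: rows = predicted, columns = true (car, pedestrian, background)
toy_counts = [
    [80, 5, 10],  # predicted car
    [4, 60, 6],   # predicted pedestrian
    [16, 35, 0],  # predicted background, i.e. the object was missed
]

rates = per_class_rates(toy_counts, ["car", "pedestrian"])
for name, r in rates.items():
    print(f"{name}: correct={r['correct']:.2f}, missed={r['missed']:.2f}")
```

In this toy case, 80 of 100 true cars are detected correctly and 16 are missed, so a lower value in the background row corresponds to fewer missed detections for that class.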

5. Conclusions

Existing object-detection methods often suffer from low precision and frequent missed detections when identifying dense small targets. To address these limitations, we propose an improved version of the YOLOv7 model tailored specifically for dense small-target detection. Adding a P2 detection head and removing the less efficient P5 detection layer improves detection accuracy and reduces missed and false detections. To enhance the multi-scale feature characterization capability of the network and improve feature extraction for small objects in complex backgrounds, C3 and Res2Net are integrated to design the C3R2N module. Additionally, the original downsampling convolution layers are replaced with the SPD Conv module, which reduces feature loss and improves the detection of dense small objects. Finally, the channels of the backbone's ELAN feature fusion module are optimized to further strengthen the detection of small objects. In our experiments, the improved algorithm achieves better detection accuracy than many mainstream target-detection algorithms, with a significant reduction in network parameters relative to YOLOv7. In addition, our model performs better in occluded and dense small-target scenes, demonstrating the effectiveness of our proposed modifications. These results underscore the potential of our method for small-target detection and may inspire further developments in the field.

Author Contributions

Conceptualization, L.D.; methodology, T.X.; software, L.D.; data curation, C.W.; writing—original draft preparation, L.D.; supervision, X.C.; funding acquisition, X.C. and C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Hainan University under Grant (Project No: KYQD(ZR) 21014), the National Key Research and Development Program of China (Project No: 2021YFC3340800), the National Natural Science Foundation of China (Project No: 62177046), and the High-Performance Computing Center of Central South University (HPC).

Data Availability Statement

Data set: https://github.com/VisDrone (accessed on 25 August 2024).

Acknowledgments

The author expresses sincere gratitude to Chen for his valuable support in accessing the High-Performance Computing Center of Central South University (HPC) resources. The author deeply appreciates his generous contribution.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Amit, Y.; Felzenszwalb, P.; Girshick, R. Object detection. In Computer Vision: A Reference Guide; Springer: New York, NY, USA, 2021; pp. 875–883.
2. Al Shibli, A.H.N.; Al-Harthi, R.A.A.; Palanisamy, R. Accurate Movement Detection of Artificially Intelligent Security Objects. Eur. J. Electr. Eng. Comput. Sci. 2023, 7, 49–53.
3. Altaher, A.W.; Hussein, A.H. Intelligent security system detects the hidden objects in the smart grid. Indones. J. Electr. Eng. Comput. Sci. (IJEECS) 2020, 19, 188–195.
4. Usamentiaga, R.; Lema, D.G.; Pedrayes, O.D.; Garcia, D.F. Automated surface defect detection in metals: A comparative review of object detection and semantic segmentation using deep learning. IEEE Trans. Ind. Appl. 2022, 58, 4203–4213.
5. Wang, X.; Jia, X.; Jiang, C.; Jiang, S. A wafer surface defect detection method built on generic object detection network. Digit. Signal Process. 2022, 130, 103718.
6. Feng, D.; Harakeh, A.; Waslander, S.L.; Dietmayer, K. A review and comparative study on probabilistic object detection in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2021, 23, 9961–9980.
7. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2147–2156.
8. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote. Sens. 2016, 117, 11–28.
9. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote. Sens. 2020, 159, 296–307.
10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
11. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
12. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
13. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
14. Jocher, G. YOLOv5 by Ultralytics. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 25 August 2024).
15. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
16. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
17. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 25 August 2024).
18. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
19. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458.
20. Zhang, M.; Pang, K.; Gao, C.; Xin, M. Multi-scale aerial target detection based on densely connected inception ResNet. IEEE Access 2020, 8, 84867–84878.
21. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
22. Xu, Q.; Lin, R.; Yue, H.; Huang, H.; Yang, Y.; Yao, Z. Research on small target detection in driving scenarios based on improved yolo network. IEEE Access 2020, 8, 27574–27583.
23. Xu, D.; Wu, Y. Improved YOLO-V3 with DenseNet for multi-scale remote sensing target detection. Sensors 2020, 20, 4276.
24. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788.
25. Wang, Z.Z.; Xie, K.; Zhang, X.Y.; Chen, H.Q.; Wen, C.; He, J.B. Small-object detection based on yolo and dense block via image super-resolution. IEEE Access 2021, 9, 56416–56429.
26. Zhang, X.; Feng, Y.; Zhang, S.; Wang, N.; Mei, S. Finding nonrigid tiny person with densely cropped and local attention object detector networks in low-altitude aerial images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 15, 4371–4385.
27. Huang, M.; Zhang, Y.; Chen, Y. Small target detection model in aerial images based on TCA-YOLOv5m. IEEE Access 2022, 11, 3352–3366.
28. Zhao, H.; Zhang, H.; Zhao, Y. Yolov7-sea: Object detection of maritime uav images based on improved yolov7. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 February 2023; pp. 233–238.
29. Bao, W. Remote-sensing Small-target Detection Based on Feature-dense Connection. J. Phys. Conf. Ser. 2023, 2640, 012009.
30. Sui, J.; Chen, D.; Zheng, X.; Wang, H. A new algorithm for small target detection from the perspective of unmanned aerial vehicles. IEEE Access 2024, 12, 29690–29697.
31. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662.
32. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; Springer: Cham, Switzerland, 2023; pp. 443–459.
33. Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526.
34. Zhao, L.; Zhu, M. MS-YOLOv7: YOLOv7 based on multi-scale for object detection on UAV aerial photography. Drones 2023, 7, 188.

Figure 1. Proposed network structure. We introduced the P2 detection head, removed the P5 detection layer, and used SPD Conv to replace the original downsampling convolution. In addition, C3R2N replaced an ELAN_1. The network structures of C3R2N and SPD Conv are described in the following text.

Figure 2. Structure of Res2Net module.

Figure 3. Structure of C3R2N (left) and original C3 (right). We used Res2Net to replace the residual module of C3.

Figure 4. SConv downsampling process. SConv compresses the input feature map with a stride-2 convolution kernel, which directly filters out part of the information.

Figure 5. SPD Conv downsampling process. SPD Conv achieves downsampling by slicing the input feature map and splicing the slices along the channel dimension, and then uses a non-strided convolution to adjust the number of channels.
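
The slicing-and-splicing step described for Figure 5 can be illustrated with a short Python sketch. It is a standard-library-only illustration on nested lists, assuming a scale factor of 2 and a particular ordering of the slices, and it is not the authors' implementation. The non-strided convolution that follows in SPD Conv is omitted, and the function name `space_to_depth` is our own.

```python
def space_to_depth(x, scale=2):
    """Rearrange a [C][H][W] nested list into [C*scale*scale][H//scale][W//scale].

    Every spatial position is kept: pixels are moved into extra channels
    instead of being skipped, as a stride-2 convolution would do.
    """
    channels = len(x)
    height, width = len(x[0]), len(x[0][0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")

    out = []
    # Assumed slice order for scale=2, as (row offset, column offset):
    # (0, 0), (1, 0), (0, 1), (1, 1)
    for col_off in range(scale):
        for row_off in range(scale):
            for c in range(channels):
                sub_map = [row[col_off::scale] for row in x[c][row_off::scale]]
                out.append(sub_map)
    return out


# One channel, 4 x 4 feature map with distinct values 0..15
feature_map = [[[0, 1, 2, 3],
                [4, 5, 6, 7],
                [8, 9, 10, 11],
                [12, 13, 14, 15]]]

sliced = space_to_depth(feature_map)
print(len(sliced), len(sliced[0]), len(sliced[0][0]))  # channels, height, width
print(sliced[0])  # sub-map taken at even rows and even columns
```

The output has four times as many channels at half the spatial resolution, and all 16 input values are still present, which is the sense in which this downsampling avoids discarding information.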

Figure 6. Statistics of labels for different target categories in the training set. It contains ten categories and hundreds of thousands of labels.

Figure 7. Height and width distribution of the target boxes in the training set. The color depth shows that most targets in the data set are much smaller than one-tenth of the original image size.

Figure 8. Base model structure diagram.

Figure 9. Added P2 detection head.

Figure 10. Removal of the P5 detection head.

Figure 11. Removal of the entire P5 layer.

Figure 12. Comparison diagram of mAP@0.5 between YOLOv7 and the proposed algorithm.

Figure 13. Comparison diagram of mAP@0.5:0.95 between YOLOv7 and the proposed algorithm.

Figure 14. Comparison diagram of Loss between YOLOv7 and the proposed algorithm.

Figure 15. Detection results of YOLOv7 (left) and the proposed algorithm (right). (a) Dark environment. In the lit area in front of the store, the left image has obvious missed detections because the objects are too small. In the dark area, the color of the targets is close to the background color, especially in the occluded area at the bottom of the tree, and the left image misses these targets as well. (b) Dense small-target image. The missed detections in the left image are very serious. The right image also has missed detections, but it is significantly improved compared with the left image. (c) Scene with a light change. The detection results of the two images do not differ much. The left image still has missed detections caused by target density, as well as an obvious false detection in the middle of the image.

Figure 16. Confusion matrices. The diagonal entries are the probabilities of correctly detecting each target class, the background FN entries are the miss rates, and the background FP entries are the false detection rates. (a) YOLOv7. (b) Method of this paper.

Table 1. Experimental settings.

Parameters	Value
image size	640 × 640
epochs	300
warmup	3
batch size	16
optimizer	Adam
learning rate	0.01
momentum	0.937
weight decay	0.0005

Table 2. Experiment of the detection head.

Experiment	mAP(s)%	mAP(m)%	mAP(l)%	mAP@0.5%	Parameters/M
1	18.6	38.7	48.8	48.8	36.53
2	21.0	40.7	47.8	50.5	37.10
3	21.3	40.8	47.2	50.9	15.49
4	21.6	40.7	49.0	51.4	11.31

Table 3. Experiment on the C3R2N module.

Experiment	mAP@0.5%	mAP@0.5:0.95%	Parameters/M
1	52.1	30.4	10.24
2	51.8	30.4	10.22
3	51.7	30.3	10.22
4	51.2	29.7	10.37
5	51.9	30.4	9.30
6	51.3	29.8	9.08
7	50.7	29.6	9.01
Baseline	51.4	30.0	11.31

Table 4. Experiment on the SPD Conv.

Experiment	mAP(s)%	mAP(m)%	mAP(l)%	mAP@0.5%	Parameters/M
1	23.0	43.3	46.8	53.8	10.10
2	22.9	42.3	46.5	53.4	10.65
Baseline	21.7	41.5	46.1	51.9	9.30

Table 5. Experiment with channel adjustment.

Experiment	mAP(s)%	mAP(m)%	mAP(l)%	mAP@0.5%	Parameters/M
Baseline	23.0	43.3	46.8	53.8	10.10
1	24.2	43.3	49.6	55.1	10.89

Table 6. Ablation experiment results.

	Method	mAP@0.5%	mAP@0.5:0.95%	Parameters/M
A	YOLOv7	48.8	27.6	36.53
B	A+P2	50.5	29.9	37.10
C	B-P5	51.4	30.0	11.31
D	C+C3R2N	51.9	30.3	9.30
E	D+SPD Conv	53.8	31.8	10.10
F	E+Channel(Ours)	55.1	32.5	10.89
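
The contribution of each step in Table 6 can be read off by differencing consecutive rows. The following Python sketch does this for the transcribed table; the helper name `step_deltas` and the dictionary keys are illustrative choices, not part of the original work.

```python
# (label, method, mAP@0.5 %, mAP@0.5:0.95 %, parameters in millions), from Table 6
ablation = [
    ("A", "YOLOv7", 48.8, 27.6, 36.53),
    ("B", "A+P2", 50.5, 29.9, 37.10),
    ("C", "B-P5", 51.4, 30.0, 11.31),
    ("D", "C+C3R2N", 51.9, 30.3, 9.30),
    ("E", "D+SPD Conv", 53.8, 31.8, 10.10),
    ("F", "E+Channel(Ours)", 55.1, 32.5, 10.89),
]


def step_deltas(rows):
    """Change in each metric between every row and the row before it."""
    deltas = []
    for prev, cur in zip(rows, rows[1:]):
        deltas.append({
            "step": cur[1],
            "d_map50": round(cur[2] - prev[2], 1),
            "d_map5095": round(cur[3] - prev[3], 1),
            "d_params": round(cur[4] - prev[4], 2),
        })
    return deltas


deltas = step_deltas(ablation)
for d in deltas:
    print(f"{d['step']:<16} mAP@0.5 {d['d_map50']:+.1f}  "
          f"mAP@0.5:0.95 {d['d_map5095']:+.1f}  params {d['d_params']:+.2f} M")

largest_gain = max(deltas, key=lambda d: d["d_map50"])
largest_saving = min(deltas, key=lambda d: d["d_params"])
print("largest mAP@0.5 gain:", largest_gain["step"])
print("largest parameter saving:", largest_saving["step"])
```

Read this way, the SPD Conv step gives the largest single mAP@0.5 gain (+1.9), while removing the P5 layer accounts for most of the parameter reduction (-25.79 M).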

Table 7. Comparative experiment results.

Method	mAP@0.5%	mAP@0.5:0.95%	Parameters/M
YOLOv5s	33.1	17.9	7.04
YOLOv5x	39.9	23.5	86.23
YOLOv6s	37.1	22.0	16.30
YOLOv6l	41.0	25.0	51.98
YOLOv7m	48.8	27.6	36.53
YOLOv7x	50.0	28.5	70.84
YOLOv8s	39.5	23.5	11.13
YOLOv8x	45.4	27.9	68.13
YOLOv9s	41.0	24.8	9.60
YOLOv9e	47.5	29.1	57.72
YOLOv10s	39.0	23.3	8.07
YOLOv10x	45.7	28.2	31.67
Drone-YOLOs	44.3	27.0	10.9
Drone-YOLOl	51.3	31.9	76.2
MS-YOLO	53.1	31.3	79.7
Ours	55.1	32.5	10.89

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

