Article

YOLOv7 Optimization Model Based on Attention Mechanism Applied in Dense Scenes

School of Computer Science, Hubei University of Technology, Wuhan 430068, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(16), 9173; https://doi.org/10.3390/app13169173
Submission received: 12 May 2023 / Revised: 28 July 2023 / Accepted: 28 July 2023 / Published: 11 August 2023

Abstract

With the maturation of object detection technology, real-time detection in dense scenes has become an important requirement in many industries, with great significance for improving production efficiency and ensuring public safety. However, mainstream object detection algorithms either lack sufficient accuracy or cannot run in real time when applied to dense scenes. To address this problem, this paper improves the YOLOv7 model using attention mechanisms that enhance critical information. Based on the original YOLOv7 network, part of the conventional convolutional layers are replaced with standard convolution combined with an attention mechanism. After comparing the optimization results of three attention mechanisms, CBAM, CA, and SimAM, the YOLOv7B-CBAM model is proposed, which effectively improves the accuracy of object detection in dense scenes. Results on the VOC dataset show that YOLOv7B-CBAM achieves the highest accuracy, 87.8%, 1.5% higher than the original model, and it also outperforms the original model and the other attention-improved models in two further dense-scene application scenarios. The model can be applied to public safety detection, agricultural detection, and other fields, saving labor costs, improving public health, reducing the spread of and losses from plant diseases, and realizing high-precision, real-time object detection.

1. Introduction

Since the development of computer vision, the detection of dense small targets has been a very challenging topic. In dense scenes, many targets need to be detected, the background is complex, and occlusion is common, which greatly affects accuracy. At the same time, practical applications such as mask-wearing detection in high-traffic scenarios and crop fruit detection involve a large manual workload, unstable recognition accuracy, and low efficiency when performed by humans. High-accuracy object detection methods for dense scenes are therefore of great research significance.
Nowadays, object detection is widely studied in the field of deep learning and can be divided into two main types: one-stage methods based on regression and two-stage methods based on the RPN (Region Proposal Network) [1]. The one-stage line starts from the original YOLO (You Only Look Once) [2] and gradually developed into SSD (Single Shot MultiBox Detector) [3], YOLOv2 [4], RetinaNet [5], YOLOv3 [6], etc.; it uses the whole image directly as the network input and obtains the target bounding boxes and classes in a single forward pass. One-stage algorithms are fast but suffer from lower accuracy and poor detection of small objects. The two-stage line starts from the original R-CNN [7] and gradually developed into SPPNet (Spatial Pyramid Pooling) [8], Fast-RCNN [9], and Faster-RCNN [10]. Two-stage algorithms generate region proposals by heuristic methods (selective search [11]) or CNN networks (RPN) and then classify the proposals. Although their accuracy is higher than that of one-stage algorithms, the repeated feature computation is heavy and training is slow. With continuous optimization of the one-stage approach, these drawbacks have been compensated in YOLOv4 [12], YOLOv5 [13], YOLOv6 [14], and YOLOv7 [15], which continue the development of the YOLO series.
Most of the previous studies have used older model versions, and suitable application scenarios for the latest YOLOv7 have not yet been found. So, this paper improves the latest YOLOv7 deep learning model and proposes the YOLOv7B-CBAM network model to deal with small target detection in dense scenes. The main innovations and contributions of this paper are as follows:
(1)
An improved YOLOv7B-CBAM model based on the attention mechanism is proposed to enhance the performance of the YOLOv7 model using the CBAM attention mechanism to achieve high-precision, real-time detection in complex and dense scenes.
(2)
Comparing the results of three improved YOLOv7 models based on different attention mechanisms on the VOC dataset and proposing the YOLOv7B-CBAM model with the highest accuracy, which demonstrates the superiority of the proposed model in accuracy.
(3)
Realizing real-time, high-accuracy detection on the two different datasets demonstrates the generalization and applicability of the proposed model in different complex scenes.

2. Related Work

2.1. Computer Vision and Deep Learning

Object detection in images, as one of the most advanced aspects of computer vision, is now widely applied. On one hand, in the field of defect detection, Ref. [16] reviewed the application of ultrasonic inspection, filtering, and computer vision to product defect detection and provided a detailed analysis of defect classification, feature description, and related topics. In [17], real-time detection of surface defects on arc magnets was achieved with a lightweight YOLOv5s model and a transfer learning mechanism, which guarantees high detection accuracy when training on small samples. On the other hand, [18] summarizes the open problems in object detection and classification on UAV (unmanned aerial vehicle) datasets, illustrating the wide application of deep-learning-based computer vision and object detection in the UAV domain. In the field of plant and pest detection, four different models were compared on a pine insect pest dataset, and a hybrid model was proposed that can be applied to monitoring and predicting various insect species in agriculture and forestry [19]. The Faster DR-IACNN, with higher accuracy, was proposed to achieve real-time detection and provide guidance for grape leaf disease detection and other plant pest and disease fields [20]. In summary, deep-learning-based object detection and computer vision have significant advantages in practical applications, saving manpower and simplifying the recognition process.

2.2. YOLO

The YOLO series is arguably the fastest and best-performing family of one-stage object detectors, and its continuous development is the main reason it has remained mainstream. YOLO appeared at the beginning of 2016, and YOLOv2, released at the end of the same year, introduced BN (batch normalization) layers and anchor boxes for bounding box prediction. In 2018, YOLOv3 used FPN (Feature Pyramid Network) upsampling and deepened the backbone. YOLOv4 appeared in April 2020 and added SPP and PAN (Path Aggregation Network) [21] structures, while YOLOv5, released in June of that year, reduced the model size by 90% compared with YOLOv4 at equivalent accuracy. The series was then followed by YOLOX [22], YOLOv6, and YOLOv7, which uses E-ELAN. The practical usefulness of the YOLO series has been proven since YOLOv3.
On the one hand, YOLOv3 and YOLOv4 have been widely used in various fields. In [23], a ratio-and-scale-aware YOLO method was proposed, which addresses the detection of objects with large aspect-ratio differences, such as the human body, and of smaller objects, and performs well on VOC 2012 and pedestrian detection. However, the accuracy of YOLO models based on these older versions is lower than that of the widely used YOLOv5 and YOLOv7. In [24], experiments on Camellia oleifera fruit detection showed that the average accuracy of YOLOv7 is better than that of YOLOv5s; in particular, YOLOv7 outperformed YOLOv5s in detecting occluded fruits, demonstrating the superiority of the YOLOv7 algorithm. In the field of hat and mask recognition in complex kitchen scenarios, an embedded model based on YOLOv5s has achieved real-time detection with a guaranteed accuracy of 85.7% [25]. These experiments show that the YOLO series performs well in dense scenes and that YOLOv7 has superior detection accuracy for small targets.

2.3. Attention Mechanism

One-stage algorithms have high detection speed, but the accompanying lower accuracy has always been their shortcoming. Attention mechanisms, in contrast, extract features that significantly improve recognition and classification accuracy, and they have consistently produced good results when used to improve YOLO models. The attention mechanism first came to prominence when it was applied to RNNs in [26], and it was first introduced into the image field in [27]. Later, the CNN-based attention mechanism RA-CNN was proposed in [28], and a variety of attention mechanisms gradually emerged in the subsequent development, such as channel attention, spatial attention, and self-attention [29]. SE-Net, proposed in [30], brought the channel attention mechanism to wide attention for the first time; it mainly models the correlation between different channels, and was followed by ECA-Net [31], GCT [32], and others. On the other hand, starting from STN [33], the spatial attention mechanism is mainly used to improve the feature expression of key regions, enhancing specific target regions and weakening irrelevant background regions; GE-Net [34] was developed along this line. In recent years, hybrid mechanisms combining channel and spatial attention have been widely used, mainly CBAM [35] (Convolutional Block Attention Module), BAM [36], scSE [37], DANet [38], CA [39] (coordinate attention), etc. In [40], improving YOLOv5s with the CA mechanism produced a model 30% smaller than the original while retaining good detection accuracy. CAM and parallel residual attention blocks were used in [41,42] to improve on the then most accurate models for vehicle model recognition and human pose estimation, respectively. It has been shown that mixed attention can effectively improve the robustness of a network as well as its accuracy in practical applications. In this paper, several attention mechanisms, mainly mixed ones, are used to modify the model, and comparison experiments are conducted with CBAM, CA, and SimAM [43] (simple, parameter-free attention module).

3. Proposed Model

3.1. Research Status and Application Analysis of Object Detection Based on YOLO

In the above research, it is clear that the YOLO series has advantages over other models in dense scenes and that the YOLOv7 model has high accuracy in detecting small targets. Therefore, this paper uses the YOLOv7 model to solve the problem of small-target detection in dense scenes. The mixed attention mechanism, moreover, performs well in YOLOv5 and can effectively improve model accuracy. Given the limitations of the YOLOv7 model itself, and in order to meet the requirements of a high detection rate and a low false detection rate in dense scenes, this paper improves the YOLOv7 model: based on the original YOLOv7 network, part of the conventional convolutional layers are replaced with standard convolution combined with an attention mechanism to take full advantage of the semantic information in the features. To find the attention mechanism that yields the best accuracy, CBAM, CA, and the newer SimAM were selected to modify the model, and CBAM was ultimately found to perform best.

3.2. Convolutional Block Attention Module

CBAM is a lightweight module, built on the benchmark SENet, that combines channel attention and spatial attention. With only a small increase in computation and parameters, it greatly improves model performance. CBAM emphasizes meaningful features along two dimensions, spatial and channel, which correspond to the spatial and channel attention sub-modules. In CBAM, the input intermediate feature map $F$ first passes through the one-dimensional channel attention module $M_c$ to obtain the intermediate output $F'$; the final result $F''$ is then obtained through the two-dimensional spatial attention module $M_s$, as follows:
$F' = M_c(F) \otimes F$,
$F'' = M_s(F') \otimes F'$,
where $\otimes$ denotes element-wise multiplication, while the channel attention and spatial attention modules are computed as follows:
$M_c(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$,
$M_s(F) = \sigma\left(f^{7 \times 7}\left([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]\right)\right)$,
where $\sigma$ denotes the sigmoid function, $\mathrm{MLP}$ indicates a multi-layer perceptron, and $f^{7 \times 7}$ is a $7 \times 7$ convolution. The specific CBAM structure is shown in Figure 1. CBAM makes the input intermediate feature map focus on important features, suppresses unnecessary ones, and reduces irrelevant clutter, which ultimately makes the network attend to objects more correctly. In this paper, CBAM takes the place of two of the initial conventional convolutional layers in the backbone of the YOLOv7 network, which effectively extracts image features and significantly improves model accuracy. The modified model structure is shown in Figure 2; the red box on the left is the backbone part of the YOLOv7 model, and the red box on the right is the head part. The location of the added CBAM attention module is marked with a yellow block in the figure, where “CBS, 3/1, 64” refers to the original convolutional layer with a kernel size of 3, a stride of 1, and 64 output channels.
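For illustration, the two sub-modules above can be written as a short PyTorch sketch. The CBAM module follows the published formulation, while the CBSWithCBAM wrapper is only an assumed illustration of how a CBS (Conv-BN-SiLU) layer followed by CBAM might be wired, not the authors' exact implementation; the reduction ratio of 16 and the 7 × 7 kernel are the defaults from the CBAM paper.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention M_c: shared MLP applied to avg- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        # sigma(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        return torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))


class SpatialAttention(nn.Module):
    """Spatial attention M_s: 7x7 convolution over channel-wise avg and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))


class CBAM(nn.Module):
    """F' = M_c(F) * F, followed by F'' = M_s(F') * F'."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = self.ca(x) * x
        return self.sa(x) * x


class CBSWithCBAM(nn.Module):
    """Hypothetical drop-in for a backbone CBS block (Conv-BN-SiLU) followed by CBAM."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)
        self.cbam = CBAM(out_ch)

    def forward(self, x):
        return self.cbam(self.act(self.bn(self.conv(x))))
```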

3.3. Coordinate Attention

CA is a lightweight mechanism proposed to remedy the shortcomings of the SE and CBAM attention mechanisms; it enables networks to capture information over a larger range by embedding location information into channel attention. To avoid compressing all spatial information and to capture accurate positional information, conventional channel attention is decomposed into two one-dimensional global pooling operations, which extract spatial information along the horizontal and vertical directions, respectively, and encode it. Concretely, the input feature map of size $(C, H, W)$ is pooled along the $x$ and $y$ directions instead of with a single global pooling: each channel is pooled with kernels of size $(H, 1)$ and $(1, W)$, and the outputs of the $c$-th channel at height $h$ and width $w$ are, respectively:
$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$,
$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$.
The two pooled maps are concatenated and transformed by a $1 \times 1$ convolution, the result is split into two independent tensors, each of which is restored to the original channel dimension by a further $1 \times 1$ convolution, and the attention vectors are finally obtained with the sigmoid activation function:
$f = \delta\left(f^{1 \times 1}\left([z^h, z^w]\right)\right)$,
$g^h = \sigma\left(f_h^{1 \times 1}(f^h)\right)$,
$g^w = \sigma\left(f_w^{1 \times 1}(f^w)\right)$,
from which the output $y$ is finally obtained; the whole process can be summarized as:
$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$,
where $[\cdot, \cdot]$ is the concatenation operation, $f^{1 \times 1}$ is a $1 \times 1$ convolution, $\delta$ is a nonlinear activation function, and $\sigma$ is the sigmoid activation function. In this paper, CA is used to modify the MP2 module in the head part of the original YOLOv7 network: one conventional convolutional layer in the original MP2 module is replaced with CA attention, so that one branch performs convolution after pooling while the other combines CA attention with its convolution, and the two branches are then concatenated. The specific structure is shown in Figure 3. The dotted boxes with yellow blocks in the figure show the modified MP2 structure, and the red CA module is the replacement. In MP2, 2c refers to the number of input channels of the pooling layer.
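A minimal PyTorch sketch of the coordinate attention computation described by the equations above is given below. The reduction ratio and the choice of Hardswish as the nonlinearity $\delta$ are illustrative assumptions rather than the exact configuration used inside the modified MP2 module.

```python
import torch
import torch.nn as nn


class CoordAtt(nn.Module):
    """Coordinate attention: directional pooling, shared 1x1 transform, split, re-weight."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # z^h: (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # z^w: (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish(inplace=True)           # delta: nonlinear activation
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                            # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # (B, C, W, 1)
        # f = delta(f_1x1([z^h, z^w]))
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        f_h, f_w = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                        # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))    # (B, C, 1, W)
        # y_c(i, j) = x_c(i, j) * g^h_c(i) * g^w_c(j)
        return x * g_h * g_w
```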

3.4. Simple, Parameter-Free Attention Module

SimAM is an attention module for convolutional neural networks based on neuroscientific knowledge. Unlike the existing one-dimensional channel attention and two-dimensional spatial attention, SimAM does not use traditional pooling; instead, it assigns weights to activations by solving an energy function derived from neuroscience theory and the principle of linear separability. It therefore adds no parameters and acts as a novel three-dimensional weighted attention mechanism. The energy function is as follows:
$e_t(w_t, b_t, \mathbf{y}, x_i) = \frac{1}{M-1} \sum_{i=1}^{M-1} \left(-1 - (w_t x_i + b_t)\right)^2 + \left(1 - (w_t t + b_t)\right)^2 + \lambda w_t^2$,
and its analytical solution is as follows:
$w_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda}$,
$b_t = -\frac{1}{2}(t + \mu_t) w_t$,
where
$\mu_t = \frac{1}{M-1} \sum_{i=1}^{M-1} x_i$,
$\sigma_t^2 = \frac{1}{M-1} \sum_{i=1}^{M-1} (x_i - \mu_t)^2$,
and, thus, the minimum energy can be obtained by the following equation:
$e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}$,
where the importance of each neuron is given by $1 / e_t^*$, and the features are enhanced as defined by the attention mechanism, resulting in the following formula:
$\tilde{X} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X$,
where $E$ groups all $e_t^*$ across the channel and spatial dimensions. In this paper, SimAM is used similarly to CA: a conventional convolutional layer in the MP2 module of the YOLOv7 head is replaced with SimAM attention. Since SimAM adds no extra parameters, the number of parameters is actually reduced compared with the original network. The specific structure is shown in Figure 4; the dotted boxes with yellow blocks in the figure show the modified MP2 structure.
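Because SimAM has a closed-form solution, it reduces to a few tensor operations. The sketch below follows the minimal-energy formula above; the regularization coefficient $\lambda$ (lam) is set to an assumed default of 1e-4, not a value reported by the authors.

```python
import torch
import torch.nn as nn


class SimAM(nn.Module):
    """Parameter-free SimAM: weight each activation by sigmoid(1 / e_t*)."""
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1                                  # M - 1 neurons per channel
        mu = x.mean(dim=(2, 3), keepdim=True)          # per-channel mean
        d = (x - mu) ** 2
        var = d.sum(dim=(2, 3), keepdim=True) / n      # sigma_t^2
        # inverse of the minimal energy: d / (4 * (var + lambda)) + 0.5
        e_inv = d / (4 * (var + self.lam)) + 0.5
        return x * torch.sigmoid(e_inv)
```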

3.5. Summary

The purpose of using attention mechanisms in this paper is mainly to address the current problems of the original YOLOv7 model: the difficulty of guaranteeing accuracy in dense scenes, insufficient accuracy on small targets, and the difficulty of distinguishing between various targets. Three different attention mechanisms, CBAM, CA, and SimAM, are used to improve the YOLOv7 model and are compared experimentally. To ensure the rigor of the comparison, after obtaining good results with CBAM in the backbone, this paper also tested CBAM in the head part. The final finding is that the YOLOv7 model improved by replacing convolutional layers in the backbone with CBAM has the highest accuracy, higher than the other models; the test results are given in Section 4.

4. Experiment and Result Analysis

4.1. Experiment Environment and Datasets

Table 1 describes the software and hardware environments used for model training and testing.
The experimental datasets used in this paper are Pascal VOC2007 and VOC2012, which together contain 21,493 images in 20 categories, including people, cats, cows, cars, buses, bicycles, sofas, TVs, bottles, etc. Of these, 50% of the Pascal VOC 2007 images and all Pascal VOC 2012 images are used as the training set, and the remaining 50% of the Pascal VOC 2007 images are used as both the test set and the validation set. For all datasets in this paper, training uses 400 epochs and a batch size of 8.
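As an illustration only, the split described above could be generated with a short script along the following lines; the directory layout, file names, and random seed are assumptions, not the authors' actual preprocessing code.

```python
import random
from pathlib import Path

random.seed(0)

# Assumed standard VOC layout: VOCdevkit/VOC2007 and VOCdevkit/VOC2012
voc2007 = sorted(Path("VOCdevkit/VOC2007/JPEGImages").glob("*.jpg"))
voc2012 = sorted(Path("VOCdevkit/VOC2012/JPEGImages").glob("*.jpg"))

random.shuffle(voc2007)
half = len(voc2007) // 2

train_set = voc2012 + voc2007[:half]   # all VOC2012 plus 50% of VOC2007
val_test_set = voc2007[half:]          # remaining 50% of VOC2007 for val/test

Path("train.txt").write_text("\n".join(str(p) for p in train_set) + "\n")
Path("val.txt").write_text("\n".join(str(p) for p in val_test_set) + "\n")
```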

4.2. Comparative Experiment on Attention Mechanisms

In this paper, the effectiveness of the three attention mechanisms above in improving model accuracy is tested through comparative experiments. Three metrics, mAP, parameter count, and GFLOPs (giga floating-point operations), were compared in the same experimental setting with an image size of 640 × 640. The results are summarized in Table 2. “YOLOv7B-CBAM” denotes the use of CBAM to replace the convolutional network in the backbone, and “YOLOv7H-CA” denotes the use of CA to replace the convolutional network of the MP2 module in the head. The experimental findings show that:
(1)
In terms of model accuracy, the best results are obtained using CBAM attention, with a 1.0% improvement in model accuracy when replacing the head part compared to the original YOLOv7 model, and the best improvement in model accuracy when replacing the backbone part, with a 1.5% improvement. In contrast, the other two attention mechanisms also improve the model accuracy, but the results show that YOLOv7B-CBAM accuracy is 0.3% and 0.1% higher than YOLOv7H-CA and YOLOv7H-SimAM, respectively, indicating that the other two attention mechanisms do not improve the model accuracy as much as CBAM.
(2)
When it comes to model size, it is clear that the three attention mechanisms have only minor effects on the number of parameters and operations. The model proposed in this paper increases computation by only 0.2 GFLOPs over the original YOLOv7 model, equivalent to an increase of 0.19%, which is almost negligible, while the number of parameters is essentially unchanged. Since the main requirement of this paper is improved accuracy, YOLOv7B-CBAM has the best overall performance in the experimental results.
(3)
To ensure the rigorousness of the results, the YOLOv7B-CA and YOLOv7B-SimAM models which used attention mechanisms to replace the backbone are also tested in this part. According to the results, from the perspective of accuracy, the accuracies of these two models are 0.829 and 0.827, which are 0.049 and 0.051 lower than that of YOLOv7B-CBAM, and 0.034 and 0.036 lower than that of the original YOLOv7 model. From the perspective of model size, these two models are not better than YOLOv7B-CBAM and the original YOLOv7 model. This proves that the performances of the YOLOv7B-CA and YOLOv7B-SimAM models are poor, which also fully indicates the superiority of YOLOv7B-CBAM. Therefore, there is no further research on the application of these two models in this paper.
Table 2. VOC dataset results.
Model | mAP50 | Parameters (M) | GFLOPs
YOLOv5s | 0.829 | 7.06 | 16.0
YOLOv7 | 0.863 | 36.58 | 103.6
YOLOv7H-CA | 0.875 | 36.57 | 103.5
YOLOv7H-SimAM | 0.877 | 36.56 | 103.4
YOLOv7H-CBAM | 0.873 | 36.38 | 103.8
YOLOv7B-SimAM | 0.827 | 36.56 | 103.4
YOLOv7B-CA | 0.829 | 36.58 | 103.5
YOLOv7B-CBAM | 0.878 | 36.58 | 103.8
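For reference, parameter counts and throughput figures of the kind reported in Table 2 (and later in Table 4) can be measured with a few lines of PyTorch. This is a minimal sketch assuming an already-loaded model; it is not the authors' benchmarking script, and CUDA availability is assumed for the FPS measurement.

```python
import time
import torch


def count_parameters_m(model: torch.nn.Module) -> float:
    """Trainable parameters in millions (the 'Parameters (M)' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6


@torch.no_grad()
def measure_fps(model: torch.nn.Module, img_size=640, runs=100, device="cuda"):
    """Average single-image inference throughput (the 'FPS' column in Table 4)."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(10):          # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return runs / (time.time() - start)
```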

5. Application

As an important application of computer vision, object detection has broad prospects because it can identify and locate objects in images or videos and reduce labor costs. The current practical needs for real-time detection in dense scenes lie mainly in public safety, in crowded places such as airports, stations, and subways where people must be monitored in real time, and in agriculture, where crops and animals must be detected and localized in real time to achieve intelligent agricultural management and precision production. Both are of great significance for improving production efficiency and ensuring public safety. How to achieve efficient and accurate real-time detection in dense scenes is therefore a current challenge for computer vision and object detection. Small-object detection in dense scenes must overcome several difficulties, specifically the following:
(1)
High density: there are a large number of objects in the scene, and their mutual occlusion and overlap can increase the difficulty of detection.
(2)
Small objects: small objects usually occupy only a few pixels and are difficult to detect and localize accurately in the image.
(3)
Diversity: objects in dense scenes may have different classes, shapes, colors, textures, and other features, which can increase the difficulty of training and testing the model.
In this setting, achieving fast real-time detection is a challenging task. The YOLOv7B-CBAM model, whose accuracy has been verified on the VOC dataset, is tested on the mask dataset and the tomato dataset to confirm its applicability in the public safety and agricultural domains, respectively. The tomato and mask datasets used in this paper are both open-source datasets from the Kaggle website, and the image size is uniformly 640 × 640 × 3 during training.

5.1. Detection in Tomato Dataset

In the field of object detection, there have been many studies on fruit detection. For example, [44] systematically summarizes the development of picking robots, emphasizes the importance of fruit identification techniques, and analyzes the characteristics of different detection techniques to illustrate the feasibility and necessity of applying object detection to picking robots. In [45], the problems encountered when using nano aerial vehicles in agricultural environments, and directions for solving them, are investigated; the effectiveness of YOLOv7 is tested on a flower detection dataset, demonstrating its robustness and the feasibility of its application. In typical fruit detection for picking robots, complex real-world scenes, large numbers of fruit and vegetable targets, small targets, and heavy occlusion commonly degrade accuracy and make fruits difficult to identify. Fruit detection therefore requires high model accuracy and robustness, which the original object detection models can hardly provide. This paper therefore verifies the effectiveness of the proposed model in the application scenario of tomato fruit detection in dense scenes.
To verify the superiority of the YOLOv7B-CBAM model in dense scenarios, we completed comparative experiments on the tomato dataset. There are 895 images in the dataset, including 695 images for training, 95 images for validation, and 105 images for testing. The hardware and software environments for the experiments are consistent with those described in Section 4.
Figure 5 is the original image of the tomato dataset, and Figure 6 is the image after the detection of the dataset using the YOLOv7B-CBAM model. It can be seen that the blocked tomato in the lower left corner can be accurately identified, and the recognition effect of small objects in dense scenes in the upper right corner image is also very good. Specific experimental data are shown in Table 3.
Table 3 shows the results obtained on this dataset for YOLOv5s, the original YOLOv7 model, and the YOLOv7 models optimized with attention mechanisms. The following conclusions can be drawn from the results.
As shown in Figure 7, compared with the other attention-enhanced YOLOv7 models, the accuracy of the proposed YOLOv7B-CBAM model is the best, 0.6% higher than that of the original YOLOv7 model and 0.9% higher than that of YOLOv7H-CBAM. This shows that the CBAM module is effective and that the model in this paper can identify small objects in complex and dense scenes more accurately. In contrast, the YOLOv7H-CA and YOLOv7H-SimAM models performed poorly on this dataset; the reason may be that these two attention mechanisms are less robust and struggle to maintain accuracy in dense or complex scenes.
The comparison shows that the improved YOLOv7B-CBAM model in this paper has higher robustness and can still maintain a high recognition rate in dense scenes and accurately identify tomatoes. The model can be integrated into an intelligent tomato-picking robot in practical applications to automate the tomato-picking process. The use of the YOLOv7B-CBAM model can effectively improve the picking efficiency and accuracy of tomato-picking robots, reduce the cost and time of manual picking, and also cope with complex picking environments, such as dense tomato bushes. In addition, the high robustness of the model can also ensure good recognition in different picking scenarios, thus improving the stability and reliability of the robot. The model can be further enhanced by data augmentation, and more training data can be added for other types of fruit recognition, as well as for plant pest and disease detection or fruit quality detection.

5.2. Detection in Face Mask Dataset

Object detection has been widely used in many aspects of daily life and is crucial in the prevention and management of epidemics. Since early 2020, the COVID-19 epidemic has spread throughout the world, posing a significant threat to health systems. COVID-19 is spread by direct transmission and contact, and masks block droplet nuclei carrying the virus and prevent wearers from inhaling them. Masks are therefore a crucial barrier against infection, significantly lowering the risk of COVID-19 infection and of cross-infection in public settings. Moreover, because of the particular characteristics of hospitals, airports, and similar locations, other infectious diseases and health requirements must also be considered, so detecting whether masks are worn is extremely significant. At present, however, mask-wearing detection relies mainly on human inspection, which is both a hygiene hazard and prone to missed detections due to fatigue. In addition, the large number of people in public places, the small targets, and face occlusion make detection difficult and demand high model precision, which current object detection models can hardly meet. The validity of the proposed model is verified in the application scenario of mask-wearing compliance detection in complex scenes.
In this paper, comparison experiments are conducted on the face mask dataset to verify the accuracy of the YOLOv7 model and of the attention-based improvements proposed above. There are a total of 1420 photos in the dataset: 990 for training, 136 for testing, and 294 for validation. The hardware and software environments are consistent with the description in Section 4. Figure 8 shows images from the face mask dataset before detection.
In Figure 9a, it can be seen that the YOLOv7 model incorrectly identified the person in the upper right picture as improperly wearing a mask, while YOLOv7B-CBAM correctly identified them as not wearing a mask in Figure 9b. The specific experimental data are shown in Table 4.
Table 4 shows the results obtained on this dataset for the YOLOv5 series, the YOLOv6 series, YOLOv7, and the YOLOv7 models improved with attention mechanisms. The following conclusions can be drawn from the results:
(1)
As shown in Figure 10, comparing the experimental results of the YOLOv5 series with the YOLOv7 model, only YOLOv5x, which has the largest number of parameters, is 0.2% more accurate than YOLOv7; in contrast, YOLOv7 has roughly half the parameters and computation of YOLOv5x and runs more than twice as fast. The accuracy of the YOLOv7 model is 10.2%, 3.6%, and 2.9% higher than that of YOLOv5s, YOLOv5m, and YOLOv5l, respectively, showing that model performance improves steadily with the development of the YOLO series. Moreover, the YOLOv7B-CBAM model proposed in this paper increases accuracy by a further 2.8% over the most accurate of them, YOLOv5x, which proves the superiority of the proposed model in terms of accuracy.
(2)
When comparing the YOLOv7 models with added attention mechanisms, as shown in Figure 11, the accuracy of the proposed YOLOv7B-CBAM model is the best, with a 3.0% improvement over the original YOLOv7 model and a 1.0% improvement over YOLOv7H-CBAM. Compared with YOLOv7H-CA and YOLOv7H-SimAM, accuracy improves by 3.0% and 3.6%, respectively. This shows that the model in this paper can more accurately identify whether a mask is worn and whether it is worn properly. The CBAM module effectively extracts the important image features and suppresses irrelevant information, and the resulting accuracy surpasses both the original model and the YOLOv7 models using the other attention mechanisms.
(3)
In terms of model size and recognition speed, YOLOv5s and YOLOv5m, with their smaller parameter counts and lower computation, reach 128.2 and 90.9 FPS, respectively, higher than the YOLOv7 model but at the expense of accuracy. Among the attention-enhanced YOLOv7 models, YOLOv7H-SimAM has less computation and fewer parameters than YOLOv7 thanks to its parameter-free design, but its accuracy is not improved. In contrast, YOLOv7B-CBAM keeps the number of parameters unchanged and slightly increases the computational cost, lowering the FPS from 66.7 to 49.5 (Table 4), in exchange for an improvement in accuracy.
This comparison shows that the improved YOLOv7B-CBAM model can more accurately locate faces and determine whether masks are worn properly, which greatly reduces labor costs and improves the safety and hygiene of mask detection. Thanks to its high accuracy and low missed-detection rate, beyond COVID-19 prevention the model can also be applied wherever proper mask-wearing must be ensured, for example in hospital outpatient clinics to reduce the spread of bacteria and viruses, and in restaurant or cafeteria kitchens to prevent droplet contamination of food, helping to ensure the hygiene and safety of the premises. In the future, the model may be applied to more face detection datasets to test and improve existing models.

5.3. Summary

The above two comparative experiments demonstrate the accuracy of the proposed YOLOv7B-CBAM model for small-object recognition in dense scenes, with fast detection speed and low computational cost while accuracy is maintained. Compared with other currently popular models, SSD must process feature maps of different scales, which can be a disadvantage in dense scenes, while Fast-RCNN first extracts regions from the image and then classifies and localizes them; Fast-RCNN may perform better in terms of accuracy but is slower and cannot meet the practical requirement of real-time detection. In addition, the model uses a special anchor-free technique, which does not require preset anchors and can adapt more flexibly than YOLOv5 to objects of different sizes and shapes. This makes the model promising for a wide range of applications in many related fields.
For normative mask-wearing recognition, the YOLOv7B-CBAM model can be applied to face mask-wearing detection to help relevant departments monitor the wearing of masks by the crowd. This application scenario can be used in public places, transportation, etc., to improve public health and epidemic prevention and control. In addition, the model has a wide range of applications in plant- or fruit-related fields. For example, it can be applied to the identification and classification of fruits, vegetables, and grains to help farmers better manage their crops and improve the efficiency and yield of agricultural production. Also, the model can be used for plant disease detection to help farmers take timely measures to reduce the spread and loss of diseases. It should be noted that although the model has high accuracy in recognizing normative mask-wearing and tomato fruit recognition, adjustments and optimizations need to be made based on specific scenarios and data in practical applications to improve the model’s accuracy and stability.

6. Conclusions

To cope with the insufficient accuracy of small-target recognition in dense scenes, this paper improves the overall accuracy and robustness of the YOLOv7 model by optimizing its network structure. By replacing conventional convolutional layers in the backbone with the CBAM attention mechanism, invalid and redundant features are suppressed, leaving more useful information for target localization and classification and thus improving both detection and localization accuracy. Experiments on Pascal VOC show that YOLOv7B-CBAM is superior in terms of accuracy. To further verify the validity and accuracy of the model, it is tested on the face mask dataset and the tomato dataset. The modified model can achieve a detection speed of more than 60 FPS while maintaining high accuracy in analyzing mask-wearing compliance, which is advantageous for practical use in applications such as mask-wearing compliance detection in public places and hospitals. The model also achieves higher accuracy on the tomato dataset than the YOLOv7 models improved with other attention mechanisms, and it can be integrated into an intelligent tomato-picking robot to automate the picking process. To facilitate the use of this detection model in mobile monitoring devices with limited computational resources and improve its practical value, future work will explore lightweight network designs that reduce the number of parameters and the computation of the model while maintaining high accuracy.

Author Contributions

Conceptualization, J.W. (Jun Wu); methodology, J.W. (Jun Wu); software, J.W. (Jiabao Wang); validation, J.W. (Junwei Wu); formal analysis, J.W. (Jun Wu) and J.W. (Jiabao Wang); investigation, J.W. (Jiabao Wang); resources, J.W. (Jiangpeng Wang); data curation, J.W. (Ji Wang); writing—original draft preparation, J.W. (Jiabao Wang); writing—review and editing, J.W. (Jun Wu); visualization, J.W. (Junwei Wu); supervision, J.W. (Jun Wu); project administration, J.W. (Jiabao Wang); funding acquisition, J.W. (Jun Wu). All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (Grant No. 61602161, 61772180), Hubei Province Science and Technology Support Project (Grant No: 2020BAB012), and The Fundamental Research Funds for the Research Fund of Hubei University of Technology (HBUT: 2021046).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets that support this study are openly available online.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hu, Q.; Zhai, L. RGB-D Image Multi-Target Detection Method Based on 3D DSF R-CNN. Int. J. Pattern Recognit. Artif. Intell. 2019, 33, 1954026.
2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
3. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
4. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017.
5. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2980–2988.
6. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
7. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
8. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1109.
9. Gavrilescu, R.; Zet, C.; Foalău, C.; Skoczylas, M.; Cotovanu, D. Faster R-CNN: An Approach to Real-Time Object Detection. In Proceedings of the 2018 International Conference and Exposition on Electrical and Power Engineering, Iasi, Romania, 18–19 October 2018; pp. 165–168.
10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28.
11. Uijlings, J.R.R.; van de Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
12. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
13. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. Ultralytics/YOLOv5: V3.0. Zenodo 2020.
14. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
15. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2022, arXiv:2207.02696.
16. Yang, J.; Li, S.; Wang, Z.; Dong, H.; Wang, J.; Tang, S. Using Deep Learning to Detect Defects in Manufacturing: A Comprehensive Survey and Current Challenges. Materials 2020, 13, 5755.
17. Huang, Q.; Zhou, Y.; Yang, T.; Yang, K.; Cao, L.; Xia, Y. A Lightweight Transfer Learning Model with Pruned and Distilled YOLOv5s to Identify Arc Magnet Surface Defects. Appl. Sci. 2023, 13, 2078.
18. Mittal, P.; Singh, R.; Sharma, A. Deep Learning-Based Object Detection in Low-Altitude UAV Datasets: A Survey. Image Vis. Comput. 2020, 104, 104046.
19. Lee, S.H.; Gao, G. A Study on Pine Larva Detection System Using Swin Transformer and Cascade R-CNN Hybrid Model. Appl. Sci. 2023, 13, 1330.
20. Xie, X.; Ma, Y.; Liu, B.; He, J.; Li, S.; Wang, H. A Deep-Learning-Based Real-Time Detector for Grape Leaf Diseases Using Improved Convolutional Neural Networks. Front. Plant Sci. 2020, 11, 751.
21. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
22. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
23. Hsu, W.Y.; Lin, W.Y. Ratio-and-Scale-Aware YOLO for Pedestrian Detection. IEEE Trans. Image Process. 2021, 30, 934–947.
24. Zhou, Y.; Tang, Y.; Zou, X.; Wu, M.; Tang, W.; Meng, F.; Zhang, Y.; Kang, H. Adaptive Active Positioning of Camellia oleifera Fruit Picking Points: Classical Image Processing and YOLOv7 Fusion Algorithm. Appl. Sci. 2022, 12, 12959.
25. Zhou, Z.; Zhou, C.; Pan, A.; Zhang, F.; Dong, C.; Liu, X.; Zhai, X.; Wang, H. A Kitchen Standard Dress Detection Method Based on the YOLOv5s Embedded Model. Appl. Sci. 2023, 13, 2213.
26. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. arXiv 2014, arXiv:1406.6247.
27. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv 2015, arXiv:1502.03044.
28. Fu, J.; Zheng, H.; Mei, T. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017.
29. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
30. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 7132–7141.
31. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
32. Ruan, D.; Wang, D.; Zheng, Y.; Zheng, N.; Zheng, M. Gaussian Context Transformer. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
33. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. arXiv 2015, arXiv:1506.02025.
34. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks. arXiv 2018, arXiv:1810.12348.
35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
36. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I. BAM: Bottleneck Attention Module. arXiv 2018, arXiv:1807.06514.
37. Roy, A.G.; Navab, N.; Wachinger, C. Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2018; pp. 421–429.
38. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
39. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
40. Wu, J.; Dong, J.; Nie, W.; Ye, Z. A Lightweight YOLOv5 Optimization of Coordinate Attention. Appl. Sci. 2023, 13, 1746.
41. Yu, Y.; Xu, L.; Jia, W.; Zhu, W.; Fu, Y.; Lu, Q. CAM: A Fine-Grained Vehicle Model Recognition Method Based on Visual Attention Model. Image Vis. Comput. 2020, 104, 104027.
42. Huo, Z.; Jin, H.; Qiao, Y.; Luo, F. Deep High-Resolution Network with Double Attention Residual Blocks for Human Pose Estimation. IEEE Access 2020, 8, 224947–224957.
43. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 24 July 2021; pp. 11863–11874.
44. Hua, X.; Li, H.; Zeng, J.; Han, C.; Chen, T.; Tang, L.; Luo, Y. A Review of Target Recognition Technology for Fruit Picking Robots: From Digital Image Processing to Deep Learning. Appl. Sci. 2023, 13, 4160.
45. Pinheiro, I.; Aguiar, A.; Figueiredo, A.; Pinho, T.; Valente, A.; Santos, F. Nano Aerial Vehicles for Tree Pollination. Appl. Sci. 2023, 13, 4265.
Figure 1. Structure diagram of CBAM.
Figure 2. Structure diagram of YOLOv7B-CBAM.
Figure 3. Structure diagram of YOLOv7H-CA.
Figure 4. Structure diagram of YOLOv7H-SimAM.
Figure 5. Image of tomato dataset before detection.
Figure 6. Results after identification of YOLOv7B-CBAM model.
Figure 7. YOLOv5 and YOLOv7B-CBAM comparison in tomato dataset.
Figure 8. Images of face mask dataset before detection.
Figure 9. Results after identification: (a) YOLOv7 original model's result; (b) YOLOv7B-CBAM model's results.
Figure 10. YOLOv5 and YOLOv7B-CBAM comparison.
Figure 11. Optimized YOLOv7 models comparison.
Table 1. Experiment environment.
Hardware and Software | Models and Versions
CPU | Intel(R) Core(TM) i7-11800H @ 2.30 GHz
GPU | NVIDIA GeForce RTX 3060
OS | Windows 10
Development Language | Python 3.6
Deep Learning Framework | PyTorch 1.7.0
Table 3. Results of tomato dataset.
Model | mAP50 | Parameters (M) | GFLOPs
YOLOv5s | 0.9 | 7.01 | 15.8
YOLOv7 | 0.913 | 36.48 | 103.3
YOLOv7H-CA | 0.889 | 36.47 | 103.1
YOLOv7H-SimAM | 0.869 | 36.47 | 103.1
YOLOv7H-CBAM | 0.91 | 36.48 | 103.3
YOLOv7B-CBAM | 0.919 | 36.48 | 103.5
Table 4. Results of face mask dataset.
Model | mAP50 | FPS | Parameters (M) | GFLOPs
YOLOv5s | 0.769 | 128.2 | 7.01 | 15.8
YOLOv5m | 0.835 | 90.9 | 20.86 | 48
YOLOv5l | 0.842 | 49.3 | 46.11 | 107.8
YOLOv5x | 0.873 | 26.2 | 86.19 | 204
YOLOv6s | 0.574 | 102 | 18.5 | 45.23
YOLOv6m | 0.515 | 58.8 | 34.8 | 85.74
YOLOv6l | 0.538 | 39.8 | 59.54 | 150.67
YOLOv7 | 0.871 | 66.7 | 36.49 | 103.4
YOLOv7H-CA | 0.871 | 62.9 | 36.48 | 103.2
YOLOv7H-SimAM | 0.865 | 67.1 | 36.47 | 103.1
YOLOv7H-CBAM | 0.891 | 66.2 | 36.49 | 103.4
YOLOv7B-CBAM | 0.901 | 49.5 | 36.49 | 103.5
YOLOv7X | 0.816 | 49.3 | 70.8 | 188.2
