Article

Improving Data Augmentation for YOLOv5 Using Enhanced Segment Anything Model

1 School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(5), 1819; https://doi.org/10.3390/app14051819
Submission received: 10 January 2024 / Revised: 20 February 2024 / Accepted: 21 February 2024 / Published: 22 February 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

As one of the state-of-the-art object detection algorithms, YOLOv5 relies heavily on the quality of its training dataset. To improve the detection accuracy and performance of YOLOv5 and to reduce its false positive and false negative rates, we propose an improved Segment Anything Model (SAM) for data augmentation. The feature maps and mask predictions generated by SAM are used as auxiliary inputs for the Mask-to-Matte (M2M) module. The experimental results show that, after processing the dataset with the improved Segment Anything Model, the detection performance of YOLOv5 improves to 99.9% precision and 99.1% recall. The improved YOLOv5 model achieves higher license plate recognition accuracy than the original detection model under heavy snowfall conditions, and the incidence of false negatives and false positives is greatly reduced. The enhanced model can meet the requirement of accurate real-time recognition of license plates under heavy snowfall conditions.

1. Introduction

With the development of convolutional neural networks, deep learning-based object detection algorithms have gradually replaced traditional ones. License plate detection is a subtask of object detection, and many general-purpose object detection algorithms can be applied to it directly. Classical two-stage detection algorithms, including R-CNN, Fast R-CNN, and Faster R-CNN, achieve high detection accuracy, but their detection speed is unsatisfactory, and they also show performance limitations in scenarios such as small, occluded, or densely packed license plates. Classical single-stage detection algorithms include SSD, RetinaNet, YOLO, FCOS, and CenterNet. Their advantage is very high detection speed at a modest cost in accuracy, but they remain limited in handling license plates at different scales and in robustness to occlusion. Even though YOLOv5 is one of the most advanced object detection algorithms available, its performance depends heavily on the training dataset. Therefore, there is still room for improvement when detecting license plates in complex scenarios such as abnormal weather and tilted plates. As the field of object detection continues to develop, solving these problems remains a focus of future research.
The Segment Anything Model (SAM) [1] has recently garnered significant interest due to its outstanding performance and has become a foundational model in computer vision. Images processed by SAM provide a solid foundation for downstream tasks, showcasing its immense potential in computer vision. The powerful segmentation and zero-shot transfer capabilities of SAM can enhance the performance of visual tasks. In the field of medical image segmentation, Lei et al. [2] cleverly integrated MedLAM (a limited-annotation framework) with SAM. By annotating only six extreme points in three directions on several templates, the model autonomously recognizes target anatomical areas on all data designated for annotation, significantly reducing the annotation workload in medical image segmentation. For 3D object detection, Zhang et al. [3] proposed a SAM-powered BEV processing pipeline that leverages SAM's substantial zero-shot capabilities for zero-shot 3D object detection, confirming SAM's applicability to 3D object detection. Zhang et al. [4] introduced an end-to-end SpA-Former incorporating SAM to recover shadow-free images from a single shadow image. Unlike traditional methods that require two separate steps for shadow detection and removal, SpA-Former unifies them into a single stage, eliminating the need for a separate shadow detection step and any post-processing refinement. Li et al. [5] used SAM as a guiding module, combined with the lightweight Mask-to-Matte (M2M) module, to refine mask outputs into alpha mattes for target instances, addressing various image matting tasks. Inspired by Matting Anything, we matte out the targets to be detected in advance, which facilitates YOLOv5 in object detection. This approach aims to reduce the false positive and false negative rates of YOLOv5, thereby enhancing detection accuracy and performance.

2. Materials and Methods

2.1. YOLOv5

YOLOv5 [6] is a one-stage object detection model with four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The network structures of these versions are identical, with differences in network depth and width achieved through the parameters of depth_multiple and width_multiple. The overall framework structure of the YOLOv5 model is shown in Figure 1.
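For illustration, the sketch below mimics the scaling rule behind these two parameters, using the depth/width multipliers published in the official YOLOv5 configuration files; the helper function names are ours:

```python
import math

# Depth/width multipliers of the four YOLOv5 variants
# (values as listed in the official ultralytics/yolov5 model configs).
VARIANTS = {
    "yolov5s": (0.33, 0.50),
    "yolov5m": (0.67, 0.75),
    "yolov5l": (1.00, 1.00),
    "yolov5x": (1.33, 1.25),
}

def scale_depth(n_repeats: int, depth_multiple: float) -> int:
    """Scale the number of repeated blocks in a CSP stage."""
    return max(round(n_repeats * depth_multiple), 1) if n_repeats > 1 else n_repeats

def scale_width(channels: int, width_multiple: float, divisor: int = 8) -> int:
    """Scale channel width and round up to a multiple of `divisor`."""
    return math.ceil(channels * width_multiple / divisor) * divisor

# Example: a stage defined with 9 repeats and 512 output channels in the base config.
for name, (gd, gw) in VARIANTS.items():
    print(name, scale_depth(9, gd), scale_width(512, gw))
```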
The structural characteristics of YOLOv5 include the adoption of Cross-Stage Partial (CSP) [7] and Spatial Pyramid Pooling Fast (SPPF) [6,8] methods in the backbone network, as well as the utilization of Feature Pyramid Network (FPN) [9] and Path Aggregation Network (PAN) [10] methods in the neck network. Additionally, techniques such as Mosaic data augmentation [11], adaptive anchor box calculation, and adaptive image scaling are employed to enhance training effectiveness.
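As an example of one of these components, the following is a simplified PyTorch rendering of the SPPF block, following the structure used in the official YOLOv5 code (layer names and defaults here are illustrative):

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast, as used in the YOLOv5 backbone.

    Three chained 5x5 max-pooling operations approximate the parallel
    5/9/13 pooling of the original SPP block at lower cost.
    """
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_hidden * 4, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```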
Since its release, YOLOv5 has gained widespread recognition in both academia and industry. Similar to one-stage algorithms such as SSD [12], the YOLO series [13,14,15,16,17], CenterNet [18], and EfficientDet [19], YOLOv5 treats object detection as a regression task, directly regressing bounding boxes and predicting categories at multiple positions across the entire image. As one of the most advanced object detection algorithms, YOLOv5 depends heavily on the dataset used for training, and there is still room for improvement in license plate detection under abnormal weather conditions.
The quality of the training dataset directly affects the model’s generalization ability and detection accuracy [20]. YOLOv5 requires a large amount of high-quality labeled data to learn the features and contextual information of the target. If the training dataset is of poor quality, the model may learn misleading patterns, leading to performance degradation. Second, the diversity of the training dataset is crucial for improving the robustness of the model. Factors such as different scenes, lighting conditions, weather conditions, and target poses all affect the performance of the detector. Therefore, the training dataset needs to cover a variety of possible real-world scenarios so that the model can perform effective detection in different environments. In addition, the size of the training dataset is also an important factor affecting model performance. Larger datasets usually provide more samples and richer contextual information, which helps the model learn a more complex and accurate representation of the target. In contrast, if the training dataset is small, the model may suffer from overfitting and insufficient generalization ability. Therefore, in this paper, the improved Segment Anything Model is used for data augmentation when building the dataset, making the detection targets more accurate and improving the detection performance of YOLOv5.

2.2. Segment Anything Model

The SAM is a breakthrough in image segmentation, carefully designed to help accurately delineate any object in an image. At its core, SAM utilizes state-of-the-art deep learning methods to meticulously analyze and segment a wide variety of objects, and initiates this complex process at the individual pixel level of granularity. This technology enables SAM to achieve unparalleled high-precision semantic segmentation, setting a new benchmark in the field. The model skillfully integrates the power of Deep Convolutional Neural Networks (DCNN) [21,22,23,24,25] with the advanced capabilities of semantic segmentation networks [26]. This strategic combination enables SAM to accurately interpret and segment images, effectively distinguishing between different objects and their respective contexts. In this way, SAM addresses a critical need in applications ranging from autonomous driving to medical imaging, where the ability to accurately recognize and segment objects can significantly improve performance and outcomes.
SAM adopts advanced attention mechanisms [21,23] to better capture contextual information of various regions in the image. It employs multi-scale feature fusion [25] to enhance segmentation accuracy. For a given image $I \in \mathbb{R}^{3 \times H \times W}$, SAM first utilizes a ViT-based image encoder to obtain deep feature maps $F \in \mathbb{R}^{C \times \frac{H}{16} \times \frac{W}{16}}$. Then, it encodes $N$ input prompts using a prompt encoder and sends them to the mask decoder along with the feature maps. The mask decoder returns a set of masks $m_i \in \mathbb{R}^{1 \times H \times W}$ ($i \le N$) specified by the input prompts.
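As a usage illustration only (not the exact pipeline used in this paper), the official segment-anything package exposes this promptable interface roughly as follows; the checkpoint file, image path, and box coordinates are placeholders:

```python
# pip install git+https://github.com/facebookresearch/segment-anything
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM checkpoint (model size and file name are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Read an image and run the ViT image encoder once.
image = cv2.cvtColor(cv2.imread("plate.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a bounding box (x0, y0, x1, y1) around the license plate.
box = np.array([136, 448, 441, 579])
masks, scores, low_res_logits = predictor.predict(box=box, multimask_output=False)
# masks has shape (1, H, W): a binary mask for the prompted instance.
```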
The sophisticated design of SAM is underscored by its flexible prompt mechanism, which facilitates not only interactive usage [27,28], but also ensures that it excels in dealing with intricate scenes and visuals populated with multiple objects. This capability is especially critical in environments where detail and nuance are paramount. SAM’s architecture is engineered to be highly sensitive to edge information, a feature that plays a pivotal role in accurately delineating the contours and boundaries of objects within an image. This level of precision is crucial for creating clear, distinct representations of each element in a scene, regardless of complexity. The inherent adaptability of SAM, stemming from its nuanced understanding of edge information and interactive prompt mechanism, makes it an ideal solution for a broad range of downstream tasks. These tasks could include, but are not limited to, image segmentation, object recognition, and even complex scene reconstruction, where the accurate capture and interpretation of object boundaries are essential for high-quality outcomes. This versatility and high performance in processing complex imagery positions SAM as a valuable tool in the field of computer vision and image processing, offering significant advantages for both research and practical applications.

2.3. Mask-to-Matte

Mask-to-Matte (M2M) is a highly practical image segmentation task aimed at accurately separating foreground objects in an image and generating a detailed transparency mask, referred to as “Matte.” This task provides crucial technological support for computer vision and graphics, especially in applications such as image editing, advertising design, virtual reality, and video production.
The key objective of M2M is to generate high-quality transparency masks from a simple foreground-background binary mask. To achieve this goal, M2M employs advanced techniques from deep learning, including Generative Adversarial Networks (GANs) [29]. Through training these networks, the generator of M2M learns how to generate accurate and detailed transparency masks from the input binary mask, while the discriminator evaluates the consistency between the generated Matte and real images, continuously improving the overall model’s performance.
In terms of technical principles, the M2M deep learning model analyzes images at the pixel level to learn the generation patterns of transparency masks. The generator is responsible for producing transparency masks, while the discriminator plays a role in assessing the quality of the generated Matte. This training approach enables the M2M model to better understand the relationship between foreground objects and the background in an image, producing more realistic and accurate Mattes, significantly enhancing segmentation accuracy.
M2M excels in its precise modeling of transparency, preserving the details of foreground objects by learning the transparency levels of each pixel in the image. The trained model exhibits strong generalization capabilities, adapting well to different scenes and image types, performing exceptionally well in a wide range of application scenarios.
In terms of applications, M2M finds widespread use in image editing, advertising design, virtual reality, and video production. Through its high-quality Matte generation, M2M offers users elevated control, allowing objects to interact transparently with other elements. Integrating M2M as part of the Matting Anything Model (MAM) aims to effectively and smoothly transform instance-aware mask predictions from SAM into instance-aware Alpha Matte predictions.
Given an input image $I \in \mathbb{R}^{3 \times H \times W}$, the pre-trained SAM model produces a feature map $F \in \mathbb{R}^{C \times \frac{H}{16} \times \frac{W}{16}}$ and a mask prediction $m \in \mathbb{R}^{1 \times H \times W}$ for the target instance under prompt guidance. The rescaled images, masks, and feature maps are concatenated to form the input $F_{m2m} \in \mathbb{R}^{(C+4) \times \frac{H}{8} \times \frac{W}{8}}$ to the M2M module. M2M employs several refinement blocks consisting of connected self-attention layers, batch-normalization layers, and activation layers to generate alpha matte predictions at 1/8 resolution, denoted $\alpha_{os8} \in \mathbb{R}^{1 \times \frac{H}{8} \times \frac{W}{8}}$. The feature maps are then up-sampled to produce alpha matte predictions at 1/4 and full resolution, denoted $\alpha_{os4} \in \mathbb{R}^{1 \times \frac{H}{4} \times \frac{W}{4}}$ and $\alpha_{os1} \in \mathbb{R}^{1 \times H \times W}$, respectively. Combining the multi-scale predictions of M2M enables MAM to handle objects at different scales and provide finer-grained alpha mattes for detailed object extraction.
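To make the data flow concrete, here is a minimal, simplified sketch of such a multi-scale refinement head. It substitutes plain convolutions for the self-attention blocks described above and is not the actual MAM implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class M2MSketch(nn.Module):
    """Illustrative Mask-to-Matte head (structure simplified from the MAM paper).

    Input: concatenation of the rescaled image (3 ch), SAM mask (1 ch), and
    SAM feature map (C ch) at 1/8 resolution -> (C + 4) channels.
    Output: alpha mattes at 1/8, 1/4, and full resolution.
    """
    def __init__(self, c_feat: int, c_hidden: int = 64):
        super().__init__()
        self.refine8 = nn.Sequential(
            nn.Conv2d(c_feat + 4, c_hidden, 3, padding=1),
            nn.BatchNorm2d(c_hidden), nn.ReLU(inplace=True),
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1),
            nn.BatchNorm2d(c_hidden), nn.ReLU(inplace=True))
        self.head8 = nn.Conv2d(c_hidden, 1, 1)
        self.refine4 = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1),
            nn.BatchNorm2d(c_hidden), nn.ReLU(inplace=True))
        self.head4 = nn.Conv2d(c_hidden, 1, 1)
        self.refine1 = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1),
            nn.BatchNorm2d(c_hidden), nn.ReLU(inplace=True))
        self.head1 = nn.Conv2d(c_hidden, 1, 1)

    def forward(self, f_m2m: torch.Tensor):
        x8 = self.refine8(f_m2m)                      # features at 1/8 resolution
        alpha_os8 = torch.sigmoid(self.head8(x8))
        x4 = F.interpolate(x8, scale_factor=2, mode="bilinear", align_corners=False)
        x4 = self.refine4(x4)                         # features at 1/4 resolution
        alpha_os4 = torch.sigmoid(self.head4(x4))
        x1 = F.interpolate(x4, scale_factor=4, mode="bilinear", align_corners=False)
        x1 = self.refine1(x1)                         # features at full resolution
        alpha_os1 = torch.sigmoid(self.head1(x1))
        return alpha_os8, alpha_os4, alpha_os1
```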

2.4. Matting Anything Model

Given an image $I$, it can be considered as a composition of a foreground image $F$ and a background image $B$, weighted by the alpha matte $\alpha$, as shown in Formula (1). Image matting is the estimation of $\alpha$ when only $I$ is provided as input. When the image $I$ contains multiple instances, the composition becomes Formula (2).

$$I = \alpha F + (1 - \alpha) B \quad (1)$$

$$I = \sum_{i=1}^{N} \alpha_i F_i + \Big(1 - \sum_{i=1}^{N} \alpha_i\Big) B \quad (2)$$

Here, $\alpha_i$ represents the alpha matte of instance $i$. Typically, target and reference masks are used to guide the prediction of instance-aware alpha mattes, or natural language descriptions are employed to estimate the alpha matte of the target instance. However, these methods are designed for specific scenarios with corresponding benchmarks, limiting their potential to handle various image matting tasks and benchmarks simultaneously.
The Matting Anything Model (MAM) is a versatile network designed to estimate the alpha matting of any target instance in an image using prompt-based user guidance, as illustrated in Figure 2.
MAM leverages the advanced SAM framework, which supports flexible prompts and outputs segmentation masks for any target instance for interactive use. Specifically, MAM takes feature maps and mask outputs from SAM as input, and incorporates a lightweight Mask-to-Matte module to predict the alpha mask for the target instance. This allows MAM to efficiently generate high-quality alpha mattes for diverse target instances based on SAM’s segmentation capabilities.
During the training process, a foreground instance $F \in \mathbb{R}^{3 \times H \times W}$ is combined with its corresponding ground truth $\alpha_{gt} \in \mathbb{R}^{1 \times H \times W}$ and a background image $B \in \mathbb{R}^{3 \times H \times W}$ to create synthetic images. The synthesis is conducted using Formula (3), followed by the extraction of bounding boxes $(x_0, y_0, x_1, y_1)$ enclosing the instances of interest in the synthetic images. The image $I$ and the bounding boxes are sent as prompts to the pre-trained SAM, which returns the mask prediction for the instance. Finally, the images, masks, and feature maps are concatenated and sent to the M2M module, which returns multi-scale alpha matte predictions $\alpha_{os8}$, $\alpha_{os4}$, $\alpha_{os1}$. The loss function $L$ is computed between the multi-scale predictions and the ground truth $\alpha_{gt}$, as formulated in (4), where the coefficients $\lambda_{L_1}$ and $\lambda_{L_{Lap}}$ control the contribution of each loss term. Both loss terms are computed on the multi-scale predictions, as in Formulas (5) and (6).
$$I = \alpha_{gt} F + (1 - \alpha_{gt}) B \quad (3)$$

$$L(\alpha_{gt}, \alpha_{os1}, \alpha_{os4}, \alpha_{os8}) = \lambda_{L_1} L_1 + \lambda_{L_{Lap}} L_{Lap} \quad (4)$$

$$L_1 = L_1(\alpha_{gt}, \alpha_{os1}) + L_1(\alpha_{gt}, \alpha_{os4}) + L_1(\alpha_{gt}, \alpha_{os8}) \quad (5)$$

$$L_{Lap} = L_{Lap}(\alpha_{gt}, \alpha_{os1}) + L_{Lap}(\alpha_{gt}, \alpha_{os4}) + L_{Lap}(\alpha_{gt}, \alpha_{os8}) \quad (6)$$
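A compact sketch of how such a composite loss could be assembled is given below; the Laplacian pyramid depth and the default weights are illustrative rather than the values used in MAM:

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 5):
    """Build a simple Laplacian pyramid by repeated downsampling/upsampling."""
    pyramid = []
    current = x
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)   # band-pass detail at this level
        current = down
    pyramid.append(current)            # low-frequency residual
    return pyramid

def matting_loss(alpha_gt, alpha_os1, alpha_os4, alpha_os8,
                 lambda_l1: float = 1.0, lambda_lap: float = 1.0):
    """L = lambda_l1 * L1 + lambda_lap * L_Lap, summed over the three output scales."""
    loss_l1, loss_lap = 0.0, 0.0
    for pred in (alpha_os1, alpha_os4, alpha_os8):
        # Bring the ground truth to the prediction's resolution.
        gt = F.interpolate(alpha_gt, size=pred.shape[-2:], mode="bilinear",
                           align_corners=False)
        loss_l1 = loss_l1 + F.l1_loss(pred, gt)
        for p_lvl, g_lvl in zip(laplacian_pyramid(pred), laplacian_pyramid(gt)):
            loss_lap = loss_lap + F.l1_loss(p_lvl, g_lvl)
    return lambda_l1 * loss_l1 + lambda_lap * loss_lap
```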

3. Experiments and Results

3.1. Dataset

The widely used open-source dataset for large-scale license plate detection and recognition in complex scenes in current domestic research is the CCPD [30] dataset created by the University of Science and Technology of China. The CCPD dataset captures license plate photos in various challenging environments, including blur, tilt, rainy, snowy, etc., comprising over 250,000 images of Chinese urban license plates. The authors have divided it into nine sub-datasets, as shown in Table 1.
This paper utilizes the CCPD-Weather sub-dataset, which does not inherently distinguish between training, validation, and test sets. Therefore, following the commonly used dataset splitting method in current domestic and international research [31], images are divided into training and test sets at a ratio of 4:1. Additionally, during training, the training set is further split into training and validation sets at a ratio of 4:1.
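This 4:1 / 4:1 split can be reproduced with a few lines of Python; the directory name below is a placeholder for the local CCPD-Weather image folder:

```python
import random
from pathlib import Path

# Split the CCPD-Weather image list 4:1 into train+val and test, then split
# the training portion 4:1 again into train and val.
random.seed(0)
images = sorted(Path("CCPD2019/ccpd_weather").glob("*.jpg"))
random.shuffle(images)

n_test = len(images) // 5
test_set = images[:n_test]
trainval = images[n_test:]

n_val = len(trainval) // 5
val_set = trainval[:n_val]
train_set = trainval[n_val:]

print(len(train_set), len(val_set), len(test_set))
```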
The dataset is annotated using the LabelImg annotation tool, and the annotations are saved in txt format. LabelImg is a widely used image annotation tool that supports visual annotation of targets in images and allows users to save the annotation information for each image in txt format. We chose the txt format because of its versatility and simplicity: it can be easily read and processed by a wide range of programming languages and computing frameworks. In our paper, the name of each image file encodes its data annotation, a naming convention intended to simplify data management and subsequent processing. For example, the filename “0478-4_1-136&448_441&579-441&558_141&579_136&469_436&448-0_0_8_32_29_24_32-87-45.txt” can be divided into seven fields (a parsing sketch is given after this list):
  • “0478” represents the proportion of the license plate area to the entire image area (47.8%).
  • “4_1” represents the horizontal tilt and vertical tilt degrees (horizontal tilt 4 degrees, vertical tilt 1 degree).
  • “136&448_441&579” represents the coordinates of the top-left and bottom-right vertices (top-left vertex (136, 448) and bottom-right vertex (441, 579)).
  • “441&558_141&579_136&469_436&448” represents the precise coordinates of the four vertices of the license plate in the entire image, starting from the bottom-right vertex.
  • “0_0_8_32_29_24_32” represents the index of each character. The last character in each array is the letter ‘O’, not the digit ‘0’. ‘O’ is used as a symbol for “no character” because there is no ‘O’ in Chinese license plate characters. Therefore, the license plate assembled from the characters is “皖AJ8508”.
  • “87” represents the brightness of the license plate region (87%).
  • “45” represents the blurriness of the license plate region (45%).
To ensure the accuracy and consistency of data annotation, we adopted strict annotation guidelines and conducted multiple rounds of quality checks. This process ensures the reliability and validity of our dataset for model training and evaluation.
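As referenced above, a minimal parsing sketch for this naming scheme (the function and dictionary keys are our own) could look like:

```python
def parse_ccpd_name(filename: str) -> dict:
    """Parse the seven dash-separated fields of a CCPD file name."""
    stem = filename.rsplit(".", 1)[0]
    area, tilt, bbox, vertices, chars, brightness, blur = stem.split("-")

    h_tilt, v_tilt = (int(v) for v in tilt.split("_"))
    (x0, y0), (x1, y1) = (tuple(int(c) for c in p.split("&")) for p in bbox.split("_"))
    corners = [tuple(int(c) for c in p.split("&")) for p in vertices.split("_")]
    char_indices = [int(c) for c in chars.split("_")]

    return {
        "area": int(area),             # area field as encoded in the name
        "tilt": (h_tilt, v_tilt),      # horizontal and vertical tilt degrees
        "bbox": (x0, y0, x1, y1),      # top-left and bottom-right vertices
        "corners": corners,            # four vertices, starting bottom-right
        "char_indices": char_indices,  # indices into the CCPD character tables
        "brightness": int(brightness),
        "blurriness": int(blur),
    }

print(parse_ccpd_name(
    "0478-4_1-136&448_441&579-441&558_141&579_136&469_436&448-0_0_8_32_29_24_32-87-45.txt"))
```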

3.2. Experimental Setup

The experimental platform and parameters used in this paper are presented in Table 2.
This paper evaluates the algorithm’s performance using Precision (P), Recall (R), and Average Precision (AP) as the main metrics. There are four different cases for the model’s detection results, namely True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Precision and Recall are calculated based on the counts of these four cases, leading to the derivation of other evaluation metrics. The formulas for Precision and Recall are given by Formula (7) and Formula (8) respectively.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (7)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (8)$$

$$AP = \sum_{i} P(r_i)\,\Delta r \quad (9)$$

Different precision and recall values are obtained at different confidence thresholds. Plotting Recall on the x-axis and Precision on the y-axis gives the P-R curve, and the area under this curve is the AP value, computed in discrete form as Formula (9), where $P(r_i)$ denotes the precision on the P-R curve at recall $r_i$ and $\Delta r$ is the recall interval between successive points, with $\sum_i \Delta r = 1$.
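For reference, a small sketch computing these metrics from detection counts and from discrete points on the P-R curve (simplified, without the precision interpolation used in some benchmarks):

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Area under the P-R curve, summed over discrete recall intervals."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], recalls[order]))              # prepend r = 0
    p = np.concatenate(([precisions[order][0]], precisions[order]))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Example: three operating points taken at different confidence thresholds.
r = np.array([0.50, 0.80, 0.95])
p = np.array([0.99, 0.98, 0.95])
print(precision_recall(tp=95, fp=1, fn=5))
print(average_precision(r, p))
```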

3.3. Experimental Process and Results Analysis

In our research, we initially utilize the raw CCPD-Weather dataset as the foundational input for training the YOLOv5 model. This particular dataset, which has not undergone any preprocessing steps, is directly employed to facilitate the model’s learning process. The objective is to assess the model’s baseline performance under varying weather conditions. Following the completion of the training phase, the outcomes, including the various parameters and metrics achieved, are comprehensively detailed and visually represented in Figure 3 of our study. This illustration serves to provide a clear and concise depiction of the model’s performance metrics, derived directly from the training conducted with the unprocessed CCPD-Weather dataset.
When we subjected YOLOv5 to training using the dataset in its raw, unprocessed form, the outcomes were remarkably high in terms of the model’s accuracy metrics. Specifically, the precision metric reached 98.2%, indicating a high rate of true positive predictions out of all positive calls made by the model. Similarly, the recall rate stood at 98.0%, showcasing the model’s efficiency in identifying true positives from the total actual positives present within the dataset. Additionally, the model demonstrated an average precision of 98.9%, which underscores its overall accuracy across different thresholds of detection.
Despite these impressive metrics, our subsequent evaluation of the model, particularly when applied to the designated test set, revealed some limitations. It became apparent that, although the model performs with high accuracy, there are instances where it generates noticeable errors in the form of false positives and false negatives. This observation is clearly illustrated in Figure 4a, where specific instances of these inaccuracies are highlighted. False positives refer to instances where the model incorrectly identifies an object as a target when it is not, whereas false negatives occur when the model fails to detect an actual target. These discrepancies, despite the high precision and recall rates, indicate areas where the model’s performance can still be optimized for even greater accuracy and reliability.
Figure 5 illustrates the process of processing a license plate image, showing how each image is processed individually by MAM to ensure clarity and accuracy of detection.
The processed license plate images are mixed into the CCPD-Weather dataset, and the resulting dataset is then fed into the YOLOv5 model for training. The parameter results of the training are shown in Figure 6, which provides a comprehensive overview of the training results and highlights the YOLOv5 model’s improved capability to detect license plates accurately under the challenging conditions presented by the new CCPD-Weather dataset.
After training on the dataset augmented by MAM, the YOLOv5 model achieves a precision of 99.9%, an increase of 1.7%; recall of 99.1%, an increase of 1.1%; and an average precision of 99.5%, an increase of 0.6%. Testing the model on the same test set reveals significant improvements in false positives, false negatives, and confidence levels for correctly predicted instances, as depicted in Figure 4b.
To demonstrate the effectiveness of the improved model proposed in this paper for license plate detection under heavy snowfall, comparative experiments are conducted with the YOLOv5, SSD, and Faster-RCNN object detection algorithms. As shown in Table 3, relative to the original dataset, the dataset processed by the improved SAM gives YOLOv5 a clear improvement in precision, recall, and average precision: the precision is 7.8% and 1.5% higher than that of the Faster-RCNN and SSD models, respectively, and the recall is 12% and 1.1% higher than that of SSD and the original YOLOv5 model, respectively. In testing, the model trained on the improved SAM-processed dataset avoids misdetections and omissions and yields higher detection confidence.

4. Conclusions

In the field of license plate detection under severe weather conditions, this study pioneers an enhancement of SAM focused on improving data augmentation, in order to mitigate the dependence of YOLOv5’s performance on its training dataset. By introducing this approach to augmenting the CCPD dataset, we obtain a much better-balanced dataset, which, as demonstrated by our experiments, alleviates the performance limitations YOLOv5 exhibits when trained on conventional datasets. The dataset processed by the improved SAM significantly improves the model’s adaptability and its ability to detect license plates, even under the most challenging weather conditions such as heavy snowfall.
The key to our study is the careful improvement of the SAM, which generates a well-balanced CCPD dataset. This strategic improvement played a crucial role in the training process of YOLOv5, enabling it to exhibit superior performance under severe weather conditions including, but not limited to, snow, rain, and fog. The adaptability shown by the model, especially in the case of extreme snowfall, proves the effectiveness of our data enhancement strategy.
The significance of this research cannot be overemphasized, as it not only addresses the existing limitations faced by YOLOv5 in harsh weather conditions, but also paves the way for wider applications in real-world scenarios where environmental factors play a key role in license plate detection. The expansion of the CCPD dataset through SAM processing makes a significant contribution to improving the adaptability and overall performance of the license plate detection model, highlighting the importance of continued efforts in dataset expansion. By focusing on expanding the scope of the dataset to cover extreme severe weather conditions, we can ensure the robustness and generalization capabilities of these models, making them more capable of dealing with the variety of challenging conditions prevalent in the real world.
In essence, this research not only reveals the potential of SAM in enhancing data augmentation techniques, but also emphasizes the need for more inclusive and comprehensive datasets. Such advances are critical to advancing the field of license plate detection, especially in the face of unforeseen and extreme environmental conditions, thus ensuring the reliability and validity of these models in a large number of real-world applications.
However, we must also recognize the limitations of the current dataset, i.e., the dataset is mainly dominated by intense snowfall conditions, ignoring other extreme environmental scenarios such as heavy rain, dense fog, and dust storms. In order to take the model to the next level of detection capability and practical applicability, future research must prioritize the expansion of the dataset to include license plates taken under more extreme weather conditions. Incorporating images from sandstorms, tornadoes, and other severe weather conditions will not only improve the model’s detection accuracy, but also enhance its usefulness in real-world applications. This comprehensive approach to dataset collection is essential to ensure the robustness and reliability of the model, allowing it to function optimally across a wide range of environmental challenges.

Author Contributions

Conceptualization, B.X. and S.Y.; methodology, B.X. and S.Y.; software, B.X. and S.Y.; validation, B.X. and S.Y.; formal analysis, B.X. and S.Y.; investigation, B.X. and S.Y.; resources, B.X. and S.Y.; data curation, B.X. and S.Y.; writing—original draft preparation, B.X.; writing—review and editing, S.Y.; supervision, S.Y.; project administration, B.X.; funding acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported by China National Science and Technology Support Program Project “Development and Demonstration of Automatic Production Line for Precision Assembly of Electronic Products” (Project No: 2015BAF10B00), Scientific Research Program Project of Shanghai Science and Technology Commission “Dyeing Robot Management Software System” (Project No: 17511110204).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be downloaded from https://github.com/detectRecog/CCPD (accessed on 16 December 2023).

Acknowledgments

The authors thank the editor and anonymous reviewers for providing helpful suggestions for improving the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
  2. Lei, W.; Wei, X.; Zhang, X.; Li, K.; Zhang, S. MedLSAM: Localize and Segment Anything Model for 3D Medical Images. arXiv 2023, arXiv:2306.14752. [Google Scholar]
  3. Zhang, D.; Liang, D.; Yang, H.; Zou, Z.; Ye, X.; Liu, Z.; Bai, X. SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model. arXiv 2023, arXiv:2306.02245. [Google Scholar]
  4. Zhang, X.F.; Gu, C.C.; Zhu, S.Y. SpA-Former: Transformer image shadow detection and removal via spatial attention. arXiv 2022, arXiv:2206.10910. [Google Scholar]
  5. Li, J.; Jain, J.; Shi, H. Matting Anything. arXiv 2023, arXiv:2306.05399. [Google Scholar]
  6. Available online: https://github.com/ultralytics/yolov5 (accessed on 5 July 2023).
  7. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  9. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  10. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  11. Bochkovskiy, A.; Wang, C.Y.; Liao, H. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  15. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  17. Wang, C.Y.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  18. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  19. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  20. Liu, T.; Bao, J.; Wang, J.; Wang, J. Deep learning for industrial image: Challenges, methods for enriching the sample space and restricting the hypothesis space, and possible issue. Int. J. Comput. Integr. Manuf. 2022, 35, 1077–1106. [Google Scholar] [CrossRef]
  21. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with Transformers. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the ICLR, Virtual, 3–7 May 2021. [Google Scholar]
  23. Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. In Proceedings of the NeurIPS, Virtual, 6–14 December 2021. [Google Scholar]
  24. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: La Jolla, CA, USA, 2017; p. 30. [Google Scholar]
  26. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  27. Xu, N.; Price, B.; Cohen, S.; Yang, J.; Huang, T.S. Deep interactive object selection. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  28. Mahadevan, S.; Voigtlaender, P.; Leibe, B. Iteratively trained interactive segmentation. In Proceedings of the BMVC, Newcastle, UK, 3–6 September 2018. [Google Scholar]
  29. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  30. Xu, Z.; Yang, W.; Meng, A.; Lu, N.; Huang, H.; Ying, C.; Huang, L. Towards end-to-end license plate detection and recognition: A large dataset and baseline. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 255–271. [Google Scholar]
  31. Ming, S. Research on Deep Learning-based License Plate Recognition Algorithm for Complex Scenes. Master’s Thesis, Shenyang University of Chemical Technology, Shenyang, China, 2021. [Google Scholar]
Figure 1. YOLOv5 Overall Frame Structure.
Figure 2. Matting Anything Model Processing Flow.
Figure 3. Parameter results for training before improvement.
Figure 4. Comparison of tests before and after improvement. (a) Pre-improvement. (b) Post-improvement.
Figure 5. Processing license plate images by Matting Anything Model.
Figure 6. Parameter results for improved training.
Table 1. Individual CCPD sub-datasets.

Class            Description                                                        Quantities
CCPD-Base        Generic license plates                                             199,996
CCPD-FN          License plates relatively close to or far from the camera          20,967
CCPD-DB          License plate area brighter, darker, or unevenly lit               10,132
CCPD-Rotate      License plates with horizontal and vertical tilt                   10,053
CCPD-Tilt        License plates with greater horizontal and vertical tilt           30,216
CCPD-Weather     License plates shot in rain, snow, and fog                         9999
CCPD-Challenge   Challenging images for license plate detection and recognition     50,003
CCPD-Blur        Blurred license plate images due to camera shake                   20,611
CCPD-NP          Pictures of new cars without license plates                        20,967
Table 2. Experimental platform and parameters.

Item                Version/Release
Operating System    Ubuntu 18.04
PyTorch             1.9.0
CUDA                11.1
Python              3.8.10
GPU                 GeForce RTX 3090 (24 GB), NVIDIA, Santa Clara, CA, USA
CPU                 Intel(R) Xeon(R) Platinum [email protected], Intel, Santa Clara, CA, USA
Table 3. Comparison of indicators.

Model               P        R         mAP
Faster-RCNN         92.1%    100.0%    99.9%
SSD                 98.4%    87.1%     99.2%
Pre-improvement     98.2%    98.0%     98.9%
Post-improvement    99.9%    99.1%     99.5%
