1. Introduction
Natural cork, harvested from the bark of Mediterranean oak trees, is widely used for its excellent elasticity, sealing properties, and abrasion resistance. It is applied in manufacturing premium badminton shuttlecocks and cork stoppers for sealing wine bottles, and therefore has significant commercial value [1]. In the production of badminton shuttlecock heads, whole sheets of cork bark are cut into small cork discs, and the quality screening of these discs is one of the crucial steps in shuttlecock-head manufacturing. However, natural cork is influenced by environmental factors during its growth and exhibits complex surface defects and varied textures, so no two samples share identical defect patterns. Traditional manual visual inspection is highly subjective and affected by factors such as fatigue and work experience, resulting in inconsistent sorting quality. Achieving stable quality control therefore presents substantial challenges.
Research on the quality screening of cork discs based on optical detection technology began in the 1990s, with cameras capturing cork images for automatic visual inspection of cork-related products. In 1997, Chang et al. [2] designed a feature extraction method involving morphological filtering, contour extraction, and tracking, and used a complex neural network as the classifier in a cork stopper quality classification system, distinguishing eight quality grades of cork stoppers. In 2000, Gonzalez-Adrados et al. [3] proposed a cork board quality grading system based on the analysis of dozens of data features from cross-sectional and tangential images of cork boards, identifying three different types of defects; discriminant analysis was then employed for quality grading, with results surpassing manual classification. Costa et al. [4,5,6] analyzed the contribution of each porosity feature to cork stopper grading and developed a cork stopper quality classification system based on canonical discriminant analysis and stepwise discriminant analysis, achieving a 14% error rate for the surface classification of seven standard commercial quality grades of cork stoppers. In 2009, Georgieva et al. [7] studied an intelligent machine vision system for classifying seven different types of cork bricks; features were generated using Laws’ masks, and the feature vectors were processed with linear discriminant analysis and principal component analysis for cork brick classification. In 2010, Paniagua et al. [8] constructed a cork stopper classification vision system that used a static threshold to determine defect areas and morphological calculations to measure defect sizes, classifying cork stoppers with a neuro-fuzzy classifier. In 2015, Oliveira et al. [9] characterized the surface porosity of cork stoppers of three grades using image analysis and established a predictive classification model with stepwise discriminant analysis, achieving a classification accuracy of 75%. Other advanced techniques have also been applied to cork inspection, such as neutron radiography and tomography for analyzing internal defects of cork, and volatile organic compound (VOC) analysis of natural cork stoppers with different porosity levels [10]. These methods generally involve two main steps: features are first extracted from cork images using texture feature generation and extraction techniques, and a classifier such as a neural network or discriminant analysis then performs the grading.
In recent years, automatic optical detection technology based on deep learning has been widely applied in intelligent manufacturing for tasks such as component positioning and surface quality inspection [11,12]. Examples include rubber threading line detection [13], metal surface defect detection [14,15,16], fabric defect detection [17], and sanitary ceramic defect detection [18], providing new solutions for cork disc quality screening. Machine vision combined with automated equipment enables automatic detection systems that are accurate, efficient, and capable of continuous operation. Since the introduction of AlexNet [19] in 2012, numerous excellent deep-learning algorithms have emerged, including R-CNN [20,21,22], ResNet [23], SSD [24], RetinaNet [25], CenterNet [26], and ConvNeXt [27]. In industrial applications, the real-time object detection algorithm YOLO (You Only Look Once) has become one of the most popular deep-learning detection frameworks. YOLO [28,29,30], proposed by Redmon et al., has since been combined with the latest object detection techniques by various researchers, leading to efficient detectors such as YOLOv4 [31], YOLOv5 [32], YOLOX [33], YOLOv6 [34], and YOLOv7 [35]. Among these, YOLOv5 has attracted extensive attention for its balance between accuracy and speed: it maintains high accuracy while offering fast detection, allows the model to be customized to user-specific requirements, and its compact size makes it well suited to edge computing and resource-limited deployment environments. Applying YOLOv5 to the quality inspection of cork discs therefore facilitates automatic extraction of cork defect features and a low-cost system. This study addresses the low efficiency and poor consistency of cork disc quality screening by investigating an improved YOLOv5-based quality detection method, achieving rapid and precise inspection of cork discs.
2. Materials
2.1. Dataset Construction
The cork discs in the dataset were sourced from a badminton production factory. Images of the front and back sides of each cork disc were captured with industrial cameras; an example image is shown in Figure 1a. After preliminary processing, the cork oak bark is cut into discs with a diameter of approximately 27 mm and a height of about 5 mm. Each shuttlecock head is formed by adhering three cork discs together. The top disc, which requires the highest quality, is punched with holes around its edge to insert feathers; discs of slightly lower quality are placed in the middle or bottom layers.
In this study, quality classification and defect detection of cork discs are performed simultaneously. Quality classification has three categories: qualified, unqualified, and outer bark; defect detection also has three categories: holes, notches, and black spots. Holes are annotated in Figure 1c. Pores in the cork bark appear as black spots of varying sizes on the disc surface, but only holes larger than 2 mm are annotated as hole defects, since smaller holes do not affect the quality of the shuttlecock head. Notches are also annotated in Figure 1c; they indicate missing portions at the edge of a disc, caused mainly by natural factors or by cutting. Black spots are annotated in Figure 1f; they originate from the hard, non-elastic, and unusable outer cuticle of the bark. Based on these defects, cork discs are annotated with different quality classifications. Figure 1b shows a qualified disc, whose surface has minimal defects and which is suitable for the upper layer of the shuttlecock head, where feathers are inserted. Figure 1c,e show unqualified discs, which cannot be used for the upper layer but can be used for the middle and lower layers. Figure 1f shows the outer bark category, which cannot be used at all. The disc in Figure 1d appears to have few surface defects but is annotated as unqualified because a hole defect lies in the 1–4 mm edge region (the feather insertion area). A disc with large holes in the central area may still be classified as qualified and used for the upper layer, so classifying hole defects requires further determination of hole size and distribution for precise quality grading.
To train the model, a total of 8570 cork images were collected. After preprocessing, each sample had a pixel size of 480 × 480. The images were annotated using the LabelImg tool, and the annotation results were saved in XML format. All images were randomly divided at a ratio of approximately 8:1:1, constructing an original cork disc dataset that includes 6770 training images, 900 validation images, and 900 test images.
2.2. Data Augmentation
In the cork disc dataset, the proportion of defective samples is small, which leads to imbalanced training data and poor generalization of the model. Because collecting a large number of such samples is difficult, a defect synthesis algorithm based on Generative Adversarial Networks (GANs) is proposed to address the shortage of defective samples. Since naturally occurring hole defects vary in shape, a GAN is employed to generate new hole defects that reproduce this variability. The GAN consists of two modules, a generator and a discriminator, and its main objective is to find the optimal solution of the objective function [36] shown in Equation (1):

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_{z}(z)}[\log(1 - D(G(z)))] \quad (1)$$

In Equation (1), x represents the training images, G the generator, D the discriminator, and z the input noise from which the generated images G(z) are produced. During the adversarial process, the generator creates fake samples and attempts to deceive the discriminator, while the discriminator tries to distinguish real samples from fake ones. When the discriminator can no longer determine the source of an image, the generator can be considered capable of producing images that follow the same distribution as the training set.
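As a concrete illustration of the adversarial objective in Equation (1), the following PyTorch sketch performs one discriminator/generator update. The network architectures, noise dimension, and optimizers are illustrative assumptions rather than the GAN configuration used in this study, and the discriminator is assumed to output a probability in [0, 1].

```python
# Minimal sketch of one adversarial update for Equation (1); G and D are any
# torch.nn.Module generator/discriminator pair (assumed, not the paper's networks).
import torch
import torch.nn as nn

def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
    bce = nn.BCELoss()
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(b, z_dim)
    fake = G(z).detach()                      # stop gradients flowing into G
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool D, i.e. maximize log D(G(z)).
    g_loss = bce(D(G(torch.randn(b, z_dim))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```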
Fake “holes” similar to those in the original dataset are generated with the GAN and integrated into cork disc images as new defects to create new samples.
Figure 2 illustrates the proposed data augmentation (DA) algorithm, which includes the following steps (a minimal code sketch of steps (2)–(5) follows the list):
- (1)
Selection of Background Images: “Qualified” samples from the original dataset are selected as background images. To prevent data repetition, each background image undergoes horizontal or vertical flipping.
- (2)
Transformation of Defect Images: Defect images generated by the GAN are randomly selected and subjected to affine transformations, including random horizontal flipping, vertical flipping, scaling, and rotation.
- (3)
Binarization and ROI Extraction: The transformed defect images are binarized using the Otsu method to obtain the Region of Interest (ROI) mask, identifying the defect ROI.
- (4)
Background Region Extraction: A center point is randomly selected within a specific area of the cork disc background image (1–4 mm from the outer diameter where holes are punched for feather insertion). A background region of the same size as the transformed defect image is cropped around this center point.
- (5)
Defect Integration: The defect ROI is fused with the background region and overlaid onto the cork disc background image.
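The sketch below illustrates steps (2)–(5) under stated assumptions: OpenCV is used, both images are 8-bit BGR arrays, and the 480 × 480 image of a 27 mm disc gives roughly 17.8 pixels per millimetre. The flip/rotation ranges, the ring-sampling geometry, and the hard paste of the ROI are simplified stand-ins for the procedure in Figure 2, and the Otsu mask polarity may need inverting depending on how the defect patches appear.

```python
# Sketch of the DA pipeline: transform a GAN-generated defect patch, extract its
# ROI with Otsu thresholding, and paste it into the 1-4 mm feather-insertion ring.
import cv2
import numpy as np

def synthesize_defect(background, defect, px_per_mm=480 / 27, rng=np.random):
    h, w = background.shape[:2]
    cx, cy = w / 2, h / 2

    # (2) Random flip, rotation, and scaling of the defect patch.
    if rng.rand() < 0.5:
        defect = cv2.flip(defect, rng.randint(0, 2))        # 0: vertical, 1: horizontal
    dh, dw = defect.shape[:2]
    M = cv2.getRotationMatrix2D((dw / 2, dh / 2), rng.uniform(0, 360), rng.uniform(0.8, 1.2))
    defect = cv2.warpAffine(defect, M, (dw, dh))

    # (3) Otsu binarization -> defect ROI mask.
    gray = cv2.cvtColor(defect, cv2.COLOR_BGR2GRAY) if defect.ndim == 3 else defect
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # (4) Random centre inside the ring 1-4 mm from the 13.5 mm outer radius.
    r = rng.uniform((13.5 - 4) * px_per_mm, (13.5 - 1) * px_per_mm)
    theta = rng.uniform(0, 2 * np.pi)
    px, py = int(cx + r * np.cos(theta)), int(cy + r * np.sin(theta))

    # (5) Overlay the defect ROI onto the background (assumes the patch fits inside the image).
    y0, x0 = py - dh // 2, px - dw // 2
    roi = background[y0:y0 + dh, x0:x0 + dw]
    roi[mask > 0] = defect[mask > 0]
    return background
```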
The GAN-based defect synthesis algorithm can generate a large number of hole defects. However, augmenting notch defects is challenging due to their occurrence only at the edges of cork discs and their directional nature; thus, random angle rotation is applied only to samples with notches. For black spot defects, due to their distinctive features and high detection accuracy, data augmentation is not performed. For hole and notch defects, 500 new samples are generated for each type. These samples are randomly divided into training, validation, and test sets at approximately an 8:1:1 ratio. After data augmentation, the dataset contains 7570 training images, 1000 validation images, and 1000 test images.
3. Quality Detection Method of Cork Discs Based on Improved YOLOv5
3.1. Overall Architecture
The overall architecture of the cork disc quality detection model is illustrated in
Figure 3. The model architecture consists of three parts: the backbone network, the neck, and the detection head. It is an improvement upon YOLOv5, with the primary modifications including the integration of a Convolutional Block Attention Module (CBAM) into the backbone network to further enhance the model’s feature representation capability. Given the uniform size of the cork discs, the number of prediction anchors in the detection layer has been reduced. A Center Match (CM) strategy is introduced to expand the range of positive sample selection, thereby balancing the number of positive samples. A shortest-distance label assignment (SDLA) strategy is proposed to address the issue of ambiguous sample regression. Additionally, a Detection Result Processing (DRP) algorithm is designed to further improve accuracy by aligning with the quality screening requirements for badminton cork discs.
In the backbone network, the 6 × 6 convolution in the Stem module provides a larger spatial receptive field, enabling richer image features to be captured. Downsampling convolutions expand the channel dimension while reducing the feature map scale, dividing feature extraction into different stages. The feature extraction structure at each stage combines the advantages of cross-stage partial (CSP) network structures and bottleneck structures [23], reducing computational cost while enhancing the backbone’s feature extraction capability. The CBAM is embedded at the end of Stage 3, Stage 4, and Stage 5 of the backbone, since a large amount of high-level semantic information is present in the deeper layers; CBAM improves the network’s ability to express these features.
The neck of YOLOv5 consists of an improved Path Aggregation Network (PAN) structure, which aggregates multi-scale features by connecting low-level physical features with high-level semantic features through bottom-up paths, thereby constructing pyramid feature maps and providing feature inputs for the detection head. The detection head performs predictions on the three scale feature maps generated by the neck, using a lightweight 1 × 1 convolutional layer. For quality detection, both sides of the cork disc are simultaneously captured and preprocessed to a size of 480 × 480. The two images are stacked into a batch of 2 and input into the neural network. Subsequently, the improved YOLOv5 performs data inference and detects both images simultaneously.
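A minimal sketch of the two-sided inference described above: the front and back images of a disc are preprocessed to 480 × 480, stacked into a batch of two, and passed through the network in a single forward call. The tensor layout and normalisation are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def detect_both_sides(model, front, back):
    """front, back: (3, 480, 480) float tensors already normalised to [0, 1]."""
    batch = torch.stack([front, back], dim=0)   # shape (2, 3, 480, 480)
    return model(batch)                          # predictions for both sides in one pass
```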
3.2. CBAM
To further enhance the feature expression capability of the model, the CBAM is embedded in the backbone network based on YOLOv5. The CBAM is a lightweight convolutional attention module that operates on both channel and spatial information. It consists of two submodules: the channel attention module (CAM) and the spatial attention module (SAM), as illustrated in
Figure 4. The two modules are connected in series and introduce parallel branches relative to the input feature path. This configuration allows the generation of attention feature map information sequentially along both the channel and spatial dimensions, thereby enabling adaptive feature refinement.
The computation of the CBAM is described by Equation (2):

$$F' = M_{c}(F) \otimes F, \qquad F'' = M_{s}(F') \otimes F' \quad (2)$$

In the formula, $F$ represents the input features, while $M_{c}$ and $M_{s}$ denote the channel attention features and spatial attention features, respectively. The symbol $\otimes$ signifies element-wise multiplication. The computation within the channel attention module is shown in Equation (3):
$$M_{c}(F) = \sigma\big(W_{1}(W_{0}(F^{c}_{\mathrm{max}})) + W_{1}(W_{0}(F^{c}_{\mathrm{avg}}))\big) \quad (3)$$

In the formula, $F^{c}_{\mathrm{max}}$ and $F^{c}_{\mathrm{avg}}$ represent the channel descriptors obtained by applying global maximum pooling and global average pooling, respectively, $\sigma$ denotes the sigmoid activation function, and $W_{0} \in \mathbb{R}^{C/r \times C}$ and $W_{1} \in \mathbb{R}^{C \times C/r}$ are the weights of the shared multilayer perceptron. Here $r$ is a dimensionality reduction factor employed to decrease computational load, with a default setting of 16, while the minimum value of $C/r$ is set to 8. The computation of the spatial attention module is expressed in Equation (4):
$$M_{s}(F') = \sigma\big(f^{7\times 7}([F^{s}_{\mathrm{avg}};\,F^{s}_{\mathrm{max}}])\big) \quad (4)$$

In the equation, $F^{s}_{\mathrm{avg}}$ and $F^{s}_{\mathrm{max}}$ represent the two-dimensional spatial maps obtained from the two pooling operations, $[\,\cdot\,;\,\cdot\,]$ signifies the concatenation operation, and $f^{7\times 7}$ represents a convolution with a 7 × 7 kernel.
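The following PyTorch sketch corresponds to Equations (2)–(4), using the reduction ratio r = 16 and the floor of 8 hidden channels mentioned above; the use of 1 × 1 convolutions for the shared MLP is an implementation choice assumed here, not necessarily identical to the module used in this study.

```python
# Minimal CBAM sketch: channel attention (Eq. 3), spatial attention (Eq. 4),
# applied sequentially as in Eq. (2).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16, min_hidden=8):
        super().__init__()
        hidden = max(channels // r, min_hidden)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))   # global average pooling branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))    # global max pooling branch
        return torch.sigmoid(avg + mx)                            # M_c(F), Eq. (3)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)                  # channel-wise average map
        mx, _ = torch.max(x, dim=1, keepdim=True)                 # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s(F'), Eq. (4)

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = self.ca(x) * x       # F'  = M_c(F) ⊗ F
        return self.sa(x) * x    # F'' = M_s(F') ⊗ F'
```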
The CBAM is embedded at the end of Stage 3, Stage 4, and Stage 5 because the deep layers of the backbone network contain abundant high-level semantic information; applying attention-based information fusion there allows the model to differentiate features more effectively.
3.3. CM Strategy
YOLOv5 is an anchor-based detector that sets three different scales of anchors for predictions at each detection layer. While multi-scale prediction can offer performance improvements, it results in slower prediction speed and higher postprocessing complexity. Additionally, given that the target scales of cork disc objects in the dataset are nearly uniform, the benefits of multi-scale prediction are minimal. Therefore, the number of anchors is set to one to reduce the number of predictions and enhance processing speed. The reduction in the number of anchors leads to a decrease in the number of positive samples used for training. To address this, a CM strategy is employed to expand the range of positive sample selection and balance the number of positive samples.
As shown in
Figure 5a, the positive sample selection strategy of YOLOv5 matches predictions from three grids with the ground truth targets. During the selection process, when the center of a ground truth target falls within a particular grid, that grid is chosen. When the number of anchors is reduced to one, the number of selected positive samples decreases from nine to three, resulting in a significant reduction in the number of positive samples, which in turn lowers the algorithm’s convergence rate and accuracy. To balance the number of positive samples, a Center Match (CM) strategy is proposed: predictions from the grid containing the center of the ground truth target and its eight neighboring grids are selected as positive samples, as shown in
Figure 5b. After replacing the original positive sample selection strategy with the Center Match strategy, the transformation formula for the prediction results relative to the ground truth bounding box center coordinates changes. The calculations for YOLOv5 with the Center Match strategy are given by Formula (5).
In the formula, $\sigma$ represents the sigmoid function, $t_x$ and $t_y$ denote the predicted offsets in the x and y directions, and $c_x$ and $c_y$ signify the coordinates of the top-left corner of the corresponding grid cell in the x–y plane.
Compared to the original method of selecting positive samples, the CM strategy maintains the same quantity of positive samples while reducing the number of model prediction outputs, enhancing the model’s prediction speed and reducing postprocessing complexity. Moreover, the positive samples involved in loss calculation come from the target’s central area, where prediction results are generally of higher quality. Calculating loss with higher-quality prediction results enriches the gradient information.
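A minimal sketch of the Center Match selection under stated assumptions: for each ground-truth centre, the grid cell containing it and its eight neighbours on a given detection layer are taken as positive-sample locations. The tensor layout is illustrative and does not follow YOLOv5’s internal build_targets format.

```python
import torch

def center_match(gt_centers, stride, grid_w, grid_h):
    """gt_centers: (N, 2) ground-truth box centres in image pixels."""
    cells = (gt_centers / stride).long()                       # cell containing each centre
    offsets = torch.tensor([[dx, dy] for dy in (-1, 0, 1) for dx in (-1, 0, 1)])
    candidates = cells[:, None, :] + offsets[None, :, :]       # (N, 9, 2): centre cell + 8 neighbours
    valid = ((candidates[..., 0] >= 0) & (candidates[..., 0] < grid_w) &
             (candidates[..., 1] >= 0) & (candidates[..., 1] < grid_h))
    return candidates, valid                                   # positive-sample grid cells per target
```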
3.4. SDLA Strategy
When selecting positive samples, targets that lie sufficiently close to one another create a challenging ambiguity: predictions from certain grids may be selected as positive samples for multiple ground truth targets, making it unclear which target the prediction should regress to. Such overlapping samples are referred to as ambiguous samples, as illustrated by the green striped grids in
Figure 6.
YOLOv5 does not address such cases. It is noted that selecting positive samples within a smaller spatial range can significantly reduce the occurrence of ambiguous samples. The CM strategy, however, expands the positive sample selection space, increasing the likelihood of ambiguous samples, which impacts accuracy to some extent. Therefore, a simple and effective SDLA strategy is proposed to eliminate this ambiguity. The process is detailed in
Figure 6. When an ambiguous sample is assigned to two ground truth targets, GT1 and GT2, the center point of the grid cell is first calculated, equal to the top-left corner coordinates plus 0.5 times the downsampling stride. The Euclidean distance from this center point to the centers of the two ground truth targets is then computed, and the target with the shortest distance is chosen as the regression target. As shown in
Figure 6, the ambiguous sample is ultimately assigned to GT2. The SDLA strategy can effectively and simply eliminate erroneous gradient information during backpropagation.
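A minimal sketch of the SDLA rule under the description above: the centre of the ambiguous grid cell is its top-left corner plus half the downsampling stride, and the cell is assigned to the ground truth with the smallest Euclidean distance to that centre. Data structures are illustrative.

```python
import torch

def sdla_assign(cell_xy, gt_centers, stride):
    """cell_xy: (2,) grid indices of the ambiguous cell; gt_centers: (K, 2) candidate GT centres in pixels."""
    cell_center = (cell_xy.float() + 0.5) * stride          # cell centre in image coordinates
    dists = torch.norm(gt_centers - cell_center, dim=1)     # Euclidean distance to each candidate GT
    return int(torch.argmin(dists))                          # index of the GT the cell regresses to
```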
3.5. DRP Algorithm
In the cork disc quality detection results, predictions for multiple categories are included. However, only one classification prediction for the cork disc is desired. Therefore, during non-maximum suppression of redundant detection boxes, the predictions for the three quality categories of the cork discs are processed together, retaining only the category with the highest confidence score. This approach suppresses multiple prediction categories and reduces the complexity of subsequent processing. However, for corks similar to those shown in
Figure 1d, where the probability of model inference errors is higher, the Detection Result Processing (DRP) algorithm is employed to correct the final output category labels based on the relationship between cork disc categories and surface defects, thereby further enhancing classification accuracy.
The logic of the Detection Result Processing (DRP) algorithm is illustrated in
Figure 7. Based on the quality classification, the results are further refined using the defect detection outcomes. The “black spot” defect is considered the most severe, and cork discs with this defect belong to the “outer bark” category. Therefore, when the predicted category is “outer bark”, the result is output directly without further processing. For the other categories, the presence of a “black spot” defect is checked; if it is present, the output category is changed to “outer bark”.
A more complex scenario arises when the cork disc is classified as “qualified” but contains defects. In this case, the following checks are applied in sequence: first, if a “black spot” defect is present, the category is changed to “outer bark”; otherwise, if a “notch” defect is present, the category is changed to “unqualified”.
Finally, for hole defects, the classification is refined using three criteria: (1) whether any hole lies in the 1–4 mm edge region (the feather insertion area), (2) whether the largest hole is greater than 4 mm, and (3) whether the total number of holes exceeds 3. If any of these conditions is met, the category is updated to “unqualified”.
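A minimal sketch of the DRP decision logic in Figure 7, assuming each detected defect is summarised as a dictionary with its class, size in millimetres, and a flag for the 1–4 mm edge region; the label strings and data layout are placeholders for illustration.

```python
def drp(category, defects):
    """category: 'qualified' | 'unqualified' | 'outer_bark';
    defects: e.g. [{'cls': 'hole', 'size_mm': 3.2, 'in_edge': True}, ...]."""
    if category == "outer_bark":
        return category                                    # output directly
    if any(d["cls"] == "black_spot" for d in defects):
        return "outer_bark"                                # black spot overrides every category
    if category == "qualified":
        if any(d["cls"] == "notch" for d in defects):
            return "unqualified"
        holes = [d for d in defects if d["cls"] == "hole"]
        if (any(h["in_edge"] for h in holes)               # hole in the feather-insertion ring
                or any(h["size_mm"] > 4 for h in holes)    # largest hole exceeds 4 mm
                or len(holes) > 3):                        # more than three holes
            return "unqualified"
    return category
```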
4. Experiment and Result Analysis
4.1. Experimental Environment and Evaluation Index
The hardware configuration for model training and testing includes an Intel Core™ i7-11700K processor and an NVIDIA RTX 3080 GPU with 10 GB of memory; the hardware platform is from Lenovo Group Co., Ltd., Beijing, China. The operating system is Ubuntu 18.04, and the software environment consists of Python 3.8, PyTorch 1.8, OpenCV 4.1.2, and CUDA 11.1.
During model training, the number of training epochs is set to 300, with a batch size of 32. K-means clustering is used to recalculate anchors suitable for the cork disc dataset. For the original YOLOv5 model, anchors are set as follows: [[49, 49], [61, 76], [114, 46], [66, 148], [154, 98], [180, 205], [332, 226], [229, 328], [443, 441]]. After applying the CM strategy, anchors are set as follows: [[66, 66], [240, 237], [442, 441]]. Other hyperparameters are evolved using a genetic algorithm (GA) and selected as follows: the initial learning rate is 0.01, the momentum is 0.946, the weight decay is 0.00047, the bounding box regression loss gain is 0.053, the classification loss gain is 0.86, the classification binary cross-entropy loss positive weight is 0.82, the objectness loss gain is 0.566, and the objectness binary cross-entropy loss weight is 0.949.
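A simplified sketch of recomputing anchors by clustering the labelled box widths and heights (here with scikit-learn’s Euclidean k-means); YOLOv5’s own autoanchor routine uses an IoU-style fitness measure and a genetic refinement step, which are omitted in this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_anchors(wh, k=3, seed=0):
    """wh: (N, 2) array of ground-truth box widths and heights in pixels."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]   # one anchor per detection layer, sorted by area
```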
For model evaluation, the confidence threshold is set to 0.45 and the IoU threshold to 0.7. The F1 score, shown in Equation (6), is chosen as the evaluation metric for the detection model because it considers both the precision and recall of the classification model, providing a more accurate reflection of the model’s classification capability:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (6)$$
It is noteworthy that the model task includes both quality classification and defect detection. Therefore, in addition to using the mean F1 score (mF1) metric, we also consider the classification detection F1 score (CDF1) and the defect detection F1 score (Defect_F1). The CDF1 score determines the accuracy of the quality classification, while the defect detection results are used by the DRP algorithm to correct the quality classification.
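A minimal sketch of how the grouped scores could be computed, assuming macro-averaged per-class F1 (the exact averaging is not spelled out here): CDF1 averages the three quality classes, Defect_F1 the three defect classes, and mF1 all six.

```python
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def grouped_f1(per_class_pr):
    """per_class_pr: dict mapping class name -> (precision, recall)."""
    quality = ["qualified", "unqualified", "outer_bark"]
    defect = ["hole", "notch", "black_spot"]
    scores = {c: f1(*pr) for c, pr in per_class_pr.items()}
    cdf1 = sum(scores[c] for c in quality) / len(quality)
    defect_f1 = sum(scores[c] for c in defect) / len(defect)
    mf1 = sum(scores.values()) / len(scores)
    return cdf1, defect_f1, mf1
```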
4.2. Ablation Experiment
To verify the effectiveness of the algorithm improvements, ablation experiments were conducted on different modifications made to the YOLOv5 base network.
4.2.1. Effects of Model Size and Pre-Training Weights
YOLOv5 provides models of various scales along with corresponding pre-trained weights to meet the needs of different application scenarios. Generally, larger models offer better performance but slower inference speeds. To determine the most suitable model scale, experiments evaluated three models (n, s, and m) on the cork disc dataset with and without pre-trained weights, assessing mF1 scores and inference speeds. The results are shown in
Figure 8. In
Figure 8, the blue solid line and red dashed line represent the detection results of models with and without pre-trained weights, respectively. It is evident that using pre-trained weights improves detection performance across various model scales, as it better initializes model parameters. Furthermore, while larger models provide limited improvement in mF1 scores, inference time increases rapidly. Among them, the YOLOv5s model, with an inference speed of 4.5 ms and a detection mF1 score of 85.7%, achieves the best balance between speed and accuracy. Therefore, the YOLOv5s model is selected as the baseline model, and pre-trained weights are used to initialize parameters in subsequent training.
4.2.2. Impact of Adding Defect Training
The core task of model inference is the quality classification of cork discs. Two variants were trained: a dedicated quality classification model and a unified model performing both quality classification and defect detection. On the augmented dataset, the dedicated quality classification model achieved a CDF1 score of 91.1%, whereas the unified model achieved 93.8%, an improvement of 2.7 percentage points. This significant improvement is attributed to the inherent relationship between cork disc categories and defects: through extensive data learning, the neural network gradually recognizes this relationship and considers the spatial activation features of defects when making classification predictions. This is further illustrated by the visual feature maps in
Figure 9, which are taken from the Head 5 layer of the detection model. In
Figure 9, the model trained without defect targets learns only global features, whereas the feature maps of the model trained with defect detection exhibit strong spatial activation at defect targets (e.g., the holes in the lower-left corner), leading to more accurate classification judgments.
4.2.3. Impact of DA and DRP
The amount of training data directly affects the final performance of the detection model. Data augmentation was employed to add 500 “hole” and 500 “notch” defect samples to the dataset to mitigate the imbalance of defect samples.
As shown in
Table 1, experimental results indicate that when the number of defect samples is low, the Defect_F1 score for the three types of defects is only 62.7%. After using DA to increase the number of defect samples, the Defect_F1 score significantly improved by 14.1%, making the model’s defect detection more robust. Additionally, the CDF1 score also increased slightly by 1.1%. This improvement is partly due to the model’s enhanced ability to extract defect features, which ultimately impacts the overall classification judgment. Although the unified model improves generalization and efficiency through shared lower-level features and joint training, this self-learned cognition is not perfect, and errors may still occur in some cases. Therefore, using the DRP algorithm to correct the quality classification based on defect detection results further enhances classification accuracy, with the CDF1 score increasing by 0.7% from 93.8%.
4.2.4. Impact of CM and SDLA
The performance of both the CM and SDLA optimization methods was evaluated in the experiment. These methods are “bag-of-freebies” techniques focused on optimizing the training process [
31], aimed at improving object detection accuracy by increasing training costs without adding to the inference cost.
Table 2 compares the impact of the two methods on model accuracy and training time.
Table 3 analyzes the proportion of ambiguous samples among all positive samples when different positive sample selection strategies are applied, with data from YOLOv5 tested using an anchor count of one.
Table 2 shows that the CM strategy slightly increased the CDF1 score by 0.1% and the mF1 score by 0.7% without any additional training cost. However, as indicated in
Table 3, the CM strategy expanded the positive sample selection range, causing the number of ambiguous samples during training to increase from 599 to 3824; since the number of positive samples also grew, this roughly doubled the proportion of ambiguous samples, which destabilized the training process.
The SDLA strategy, which applies a simple shortest spatial distance principle to handle ambiguous samples, reduced the number of ambiguous samples to zero, thereby eliminating erroneous gradient information during training. This led to a 0.3% increase in the CDF1 score. However, the SDLA strategy requires processing on individual images, extending the training time by 0.89 h. Additionally, repeated training of models using the SDLA strategy yields fully consistent training results. This indicates that the SDLA strategy thoroughly eliminates ambiguous samples, ensuring complete consistency in data gradient information under the same training images and parameters.
4.2.5. Impact of CBAM
The experimental results of inserting attention modules at different positions in the YOLOv5 + DA + DRP + CM + SDLA model backbone network are shown in
Table 4. CBAMn represents the placement of n CBAMs in the backbone network following a top-down order (Stage 5, Stage 4, Stage 3, Stage 2). Latency refers to the inference delay caused by CBAMn, excluding preprocessing and postprocessing times.
The experiment demonstrates that as the number of inserted attention modules increases, the model’s inference delay also increases, with each additional CBAM contributing an extra 0.3 ms to the inference delay. According to the results in
Table 4, embedding three CBAM attention modules at the end of Stage 5, Stage 4, and Stage 3 achieves the best balance, with an inference time of 5.6 ms and an mF1 score of 86.7%.
4.2.6. Accuracy Analysis
Table 5 summarizes the impact of different optimization methods on detection accuracy. Sequential application of DA, the DRP algorithm, the CM strategy, the SDLA strategy, and the CBAM attention mechanism to the YOLOv5s baseline model led to incremental improvements in model performance. On the cork disc dataset, a total improvement of 2.4% in CDF1 score and 9.0% in mF1 score was achieved, resulting in a final optimized detection algorithm with a CDF1 score of 95.1% and an mF1 score of 86.7%.
Additionally, further analysis revealed that the DRP algorithm effectively improved classification accuracy without affecting inference time, though it did increase postprocessing time by an average of 1.5 ms per detection. The CM strategy reduced inference time by 2.3 ms and postprocessing time by 1.1 ms, as fewer predictions resulted in faster prediction speeds and lower postprocessing complexity. The SDLA strategy enhanced the model’s classification accuracy without affecting inference speed, as it only increased training costs. The CBAM improved detection accuracy but increased inference time due to the additional parameters and computational load introduced to the model backbone network. Despite the increased inference time, CBAM’s contribution to accuracy improvement was more significant.
The results of the optimized model (Model 6 in
Table 5) compared to the original YOLOv5 model are shown in
Figure 10. For five typical samples selected from the actual test results, the improved model demonstrated enhanced detection accuracy, particularly for samples 3, 4, and 5, where the original YOLOv5 model made incorrect quality classifications. The improved YOLOv5 model correctly performed quality classification and accurately identified defect features.
4.3. Comparison Experiments
To validate the advancement of the improved algorithm, a comparative experiment was conducted with six popular object detection network models: Faster RCNN, RetinaNet, CenterNet, YOLOX, YOLOv4, and YOLOv7. Identical training parameters were set for all models, and their performance was tested on the same cork disc test dataset, as shown in
Table 6. The single-stage detector YOLO models were able to process images at speeds exceeding 100 FPS and achieved an mF1 score of no less than 80% on the cork disc test dataset. In contrast, the classic models Faster RCNN and RetinaNet exhibited lower performance in both accuracy and speed, while the keypoint-based CenterNet demonstrated faster inference speed but lower detection accuracy.
In the YOLO series models, YOLOv7 claims to surpass YOLOv5 in performance. However, the results in
Table 6 demonstrate that our improved YOLOv5 achieves the best balance between accuracy and speed, with an mF1 score of 86.7% and an mAP of 81.5%, surpassing all other detection models listed in terms of cork disc classification accuracy. Additionally, it processes at a speed of 178.5 FPS, which, while slightly lower than YOLOv7, still meets practical requirements.
5. Conclusions
Through in-depth research and experimental validation, a method for cork disc quality detection based on the improved YOLOv5 has been successfully developed. A proprietary cork disc dataset was constructed, including 8570 original images and 1000 augmented images, created using data augmentation algorithms. Tailored to the characteristics of cork disc detection and combined with the YOLOv5 model, a series of deep-learning optimization methods were proposed to enhance detection efficiency and accuracy. These include expanding the dataset using data augmentation, incorporating attention mechanism modules to strengthen feature representation, designing a center-matching strategy to balance the number of positive samples, designing a shortest-distance label assignment strategy to eliminate ambiguous samples, and further improving detection accuracy through result-processing algorithms.
Finally, ablation and comparative experiments were conducted on an NVIDIA RTX 3080 GPU platform. The ablation experiments demonstrated that the proposed optimization methods effectively improved model detection accuracy. The optimized model achieved a 2.4% increase in CDF1 score and a 9.0% increase in mF1 score compared to the original YOLOv5 model on the cork disc dataset. The final optimized model reached a CDF1 score of 95.1%, a processing speed of 178.5 FPS, and an mAP of 81.5%. Compared to mainstream algorithms like Faster RCNN, RetinaNet, and CenterNet, the improved algorithm achieved the best detection performance on the cork disc dataset while maintaining high processing speed. Future work will deploy this algorithm on embedded development platforms and explore its application in other optical inspection fields.