1. Introduction
Automobile tires are an important part of automobile safety, and their design, manufacture, and use require strict standards and specifications [1]. The tire production process involves a large number of character marks, such as specifications, models, and batches, which are of great significance for the quality control and traceability of tires. In production, tires must be classified by specification size and model. The traditional method of manually recognizing tire characters is time-consuming, inefficient, and error-prone. Tire character recognition using machine vision is more effective than manual recognition, but it places stricter requirements on the recognition environment. Therefore, it is of great practical significance to study an automatic tire identification method.
Currently, many achievements have been made in character recognition based on machine vision systems. Wang H. et al. [2] proposed a machine vision-based method for recognizing characters on tire rubber surfaces, which performs character localization, character segmentation, morphological processing, and template matching. Peng Q. et al. [3] proposed a character recognition system for spray-marked steel plate billets, which uses an image sensor to capture images in real time and the Baidu Paddle-OCR character recognition algorithm to recognize characters automatically. Chen S. et al. [4] proposed a multi-line character recognition method for clutch flywheels. First, the clutch flywheel image is preprocessed and the edge coordinates are extracted. Second, the circular ring area is located using the DBSCAN clustering algorithm, and the characters are segmented using the pixel projection method. Finally, LBP (local binary patterns) and a support vector machine are used for recognition. Bai S. et al. [5] proposed a machine tool information acquisition method which uses the projection transformation principle to determine the machine tool panel range, preprocesses the acquired image using filtering methods, and, finally, recognizes characters with convolutional neural networks. Yang G. et al. [6] developed a chip character recognition system. First, the character area is obtained by segmenting characters with the grayscale value projection method. Second, the chip is positioned using shape matching, and then the characters are recognized with BP (back propagation) neural networks. However, these traditional machine vision methods place high requirements on image quality and have low versatility, and their recognition performance changes when the environment changes.
With the development of deep learning technology, object detection and recognition have greatly improved. The YOLO (You Only Look Once) series comprises fast and accurate object detection algorithms [7]. Through improvements to the model structure and the introduction of new data augmentation techniques [8], YOLOv5 achieves higher detection speed and better accuracy, and it has been widely used. Gong P. et al. [9] proposed a steel stamp character recognition method based on YOLOv5, which first expanded the dataset through image preprocessing, then trained YOLOv5 on it, and finally recognized characters using the trained model. However, a large amount of computing resources is required to train and run the model. Zhang J. et al. [10] proposed a vehicle and tank number detection and recognition method based on an improved YOLOv5 network, which added attention mechanisms and GBN modules to enhance the feature extraction capabilities and improve detection speed. Adding attention mechanisms improved accuracy but increased the model’s computational complexity, leading to overfitting. Laroca [11] proposed an end-to-end, efficient, and independent automatic LP (license plate) recognition system based on the YOLO model, which included a unified LP detection and layout classification method. The system achieved a balance between accuracy and speed, but the CNNs (convolutional neural networks) used in this method required a large amount of labeled data for training and were less effective in some scenarios. Aduen [12] proposed YOLO-Z, an improved YOLOv5 method for detecting small objects in autonomous driving, which simplified PANet into FPN and replaced it with BiFPN. However, this method’s ability to handle scale changes was limited, making it difficult to detect objects with large scale differences in the same image. Jiang L. et al. [13] proposed a traffic sign detection method based on the YOLOv5 network model, which used a balanced feature pyramid structure and a global context block to enhance feature fusion and feature extraction capabilities. However, in some cases, the balanced feature pyramid structure resulted in information loss and could lead to decreased detection performance when there were too few low-resolution features.
To improve the accuracy and precision of recognizing characters on automobile tires, this paper proposes an improved automobile tire character recognition model based on YOLOv5. The model adds a decoupled head, replaces the C3 module with the C3-Faster module, and replaces the CIoU loss with the WIoU loss in the original YOLOv5 network, which enhances the accuracy and precision of the network in recognizing the characters on automobile tires.
2. Improving the YOLOv5 Network
YOLOv5 has made significant improvements compared to previous YOLO models and is currently one of the more advanced models in object detection [14]. YOLOv5 uses new techniques, such as adaptive training data augmentation [15], multi-scale training [16], and multi-scale prediction for initial detection layers [17], which make detection faster and more accurate. The YOLOv5 network model is shown in Figure 1. However, character recognition on the surface of a tire is complex because the characters are molded into the tire and blend into the complex background. Additionally, dust on the surface of a tire further complicates character recognition. To address these issues and strengthen the network’s recognition capabilities, this paper proposes an improved model based on YOLOv5 for recognizing characters on the surfaces of car tires.
This paper proposes three main improvements to the network. First, it adds a decoupled head that separates the feature extraction and output of the classification and localization tasks to accelerate model convergence and improve model accuracy. Second, it proposes the C3-Faster module, an improvement on the FasterNet Block, to reduce the number of convolution operations and increase the computational speed. Third, it uses WIoU-Loss as the loss function to measure the similarity between the predicted bounding boxes and the actual annotations. The improved network model in this study is shown in Figure 2. In the figure, the green rectangular frame indicates the improved part.
2.1. The YOLOv5 Decoupled Head
A decoupled head [18] is a technique that separates convolutional layers from fully connected layers, reducing computational complexity and model size and thus improving model speed and efficiency. The traditional detection head in YOLOv5 is a coupled head [19] that typically consists of a fully connected layer that converts the feature maps output by the convolutional layers into prediction vectors. This fully connected layer is usually trained together with the convolutional layers, requiring significant computational resources and longer training times. In this study, the decoupled head was added to YOLOv5 with the aim of separating the classification and localization branches to reduce computational resources and training time, as shown in Figure 3. First, a 1 × 1 convolution was applied to the output feature map to reduce the model’s complexity while also reducing the number of input data channels and adjusting the feature map size. Then, the result was split into two branches, each undergoing 3 × 3 convolutional processing. The first branch was further processed with a 1 × 1 convolution to obtain the classification branch, while the second branch was split into two further branches, each undergoing 1 × 1 convolutional processing to obtain the target coordinate and confidence branches.
By separating the classification and localization branches, the decoupled head reduced the number of parameters and calculations needed in the model and greatly sped up training and inference. It also improved the model’s perception of target features at different scales, which improved the model’s robustness and accuracy while reducing the occurrence of overfitting.
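The branch layout described above can be traced as a small shape walk. This is only an illustrative sketch, not the authors' implementation: the function name, the `hidden` stem width, and the assumption of stride-1 convolutions with "same" padding (so that the spatial size is unchanged) are all hypothetical, and the shapes are given per anchor.

```python
def decoupled_head_shapes(in_channels, height, width, num_classes, hidden=256):
    """Trace (channels, height, width) through a decoupled head.

    Assumptions (not from the paper): `hidden` is the stem width and every
    convolution keeps the spatial size (stride 1, "same" padding).
    """
    if min(in_channels, height, width, num_classes, hidden) <= 0:
        raise ValueError("all sizes must be positive")
    feat = (hidden, height, width)
    return {
        "input": (in_channels, height, width),
        "stem": feat,                          # 1x1 conv: channel reduction
        "cls_feat": feat,                      # 3x3 conv, classification branch
        "reg_feat": feat,                      # 3x3 conv, localization branch
        "cls": (num_classes, height, width),   # 1x1 conv on cls_feat
        "box": (4, height, width),             # 1x1 conv on reg_feat: x, y, w, h
        "obj": (1, height, width),             # 1x1 conv on reg_feat: confidence
    }


if __name__ == "__main__":
    for name, shape in decoupled_head_shapes(512, 20, 20, num_classes=36).items():
        print(f"{name:9s} {shape}")
```

The three outputs (`cls`, `box`, `obj`) correspond to the classification, target coordinate, and confidence branches in the text; the class count of 36 in the usage line is a placeholder.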
2.2. Improving FasterNet Block by Proposing C3-Faster
In YOLOv5, the C3 module [20] increases the depth and receptive field of the network and improves its feature extraction ability. The C3 module is shown in Figure 4 and consists of three 1 × 1 Conv kernels and several bottleneck modules. The first is a 1 × 1 Conv kernel with a step size of two, which halves the size of the feature map, reduces the number of parameters, and increases the receptive field of the network. The second and third are 1 × 1 Conv kernels with a step size of one; they retain more local information without changing the size of the feature map, further extract features, and increase the depth and receptive field of the network model. In the bottleneck module, the number of channels is first halved by a 1 × 1 Conv kernel with a step size of one and then doubled by a 3 × 3 Conv kernel with a step size of one. The number of channels therefore remains the same, while the parameters of the network are reduced and the depth is increased.
A single pass through the C3 module requires five convolution operations, and the large number of parameters consumes memory, which limits the operating efficiency of the model, prolongs the training time, and slows processing. To further improve the speed and accuracy of the network model for tire character recognition, in this study, the FasterNet Block [21] was improved and the C3-Faster module was proposed and added to the YOLOv5 network structure. The C3-Faster module is shown in Figure 5 and consists of one 3 × 3 PConv and two 1 × 1 Convs. First, the feature map passes through the PConv, which reduces redundant computation and memory usage. It then passes through the two 1 × 1 Conv kernels in turn to extract the effective information of the feature map, which is output to the next stage. In this study, the parameter count was 6.25 M when using the C3 module and 4.57 M when using the C3-Faster module. Fewer parameters reduce the memory footprint and computational cost of a model and reduce the risk of overfitting.
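To see why a partial convolution lowers the parameter count, the following sketch counts weights for a standard convolution and for a block made of one 3 × 3 PConv and two 1 × 1 convolutions. The function names, the channel split `n_div=4`, and the `expand=2` width multiplier are assumptions for illustration; they are not taken from this paper, and biases and normalization layers are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def pconv_params(channels, k=3, n_div=4):
    """Weights in a partial convolution that convolves only
    channels // n_div channels and passes the rest through untouched.
    n_div=4 is an assumed split ratio."""
    c_p = channels // n_div
    return k * k * c_p * c_p


def faster_block_params(channels, k=3, n_div=4, expand=2):
    """One k x k PConv followed by two 1 x 1 convs (expand, then project).
    `expand` is a hypothetical width multiplier for the middle layer."""
    hidden = channels * expand
    return (pconv_params(channels, k, n_div)
            + conv_params(channels, hidden, 1)
            + conv_params(hidden, channels, 1))


if __name__ == "__main__":
    c = 64
    print("standard 3x3 conv :", conv_params(c, c, 3))
    print("3x3 PConv         :", pconv_params(c))
    print("PConv + 2x 1x1    :", faster_block_params(c))
```

With 64 channels and a one-quarter split, the PConv needs 2304 weights against 36,864 for a full 3 × 3 convolution, a sixteenfold reduction for that layer. These counts illustrate the mechanism only and do not reproduce the 6.25 M and 4.57 M totals reported above.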
2.3. Improved Regression Loss Function
As a loss function, IoU-Loss is used to measure the similarity between predicted bounding boxes and actual annotations, with more emphasis on the overlap between the predicted results and the ground truth [22]. It is the most widely used metric for measuring the similarity between bounding boxes, but in tire character recognition, the characters are relatively small, and the predicted bounding box and the ground truth may not intersect, in which case the IoU is zero and cannot be optimized. GIoU [23] adds the minimum bounding box of the predicted and actual bounding boxes, which solves the problem of the IoU being zero, but when the predicted and actual bounding boxes have the same widths and heights and are on the same horizontal or vertical line, GIoU degenerates into IoU. DIoU [24] builds on GIoU by adding the Euclidean distance between the center points of the two bounding boxes and the Euclidean distance between the two diagonal vertices of the minimum bounding box, but DIoU degenerates into IoU when the two bounding boxes have the same center point but different aspect ratios. CIoU [25] is used as the loss function in YOLOv5. Building on DIoU, CIoU considers the consistency of the aspect ratio between the predicted and ground truth bounding boxes and adds a penalty term for the aspect ratio. However, because of its complex function calculations, CIoU consumes a lot of computing power, increasing the training time. WIoU [26] proposes a dynamic non-monotonic focus mechanism which uses “outlierness” instead of IoU to evaluate the quality of anchor boxes, and it adopts a gradient gain allocation strategy that reduces both the competitiveness of high-quality anchor boxes and the harmful gradients produced by low-quality anchor boxes. This enables WIoU to focus on ordinary-quality anchor boxes and improve the overall performance of the detector.
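The behaviour of the first three metrics can be checked numerically. The sketch below uses the standard definitions of IoU, GIoU, and DIoU for axis-aligned boxes given as (x1, y1, x2, y2); the function names are illustrative, and the corresponding losses are obtained as one minus each value.

```python
def _area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is empty."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def _enclosing(a, b):
    """Minimum bounding box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def _inter_union(a, b):
    inter = _area((max(a[0], b[0]), max(a[1], b[1]),
                   min(a[2], b[2]), min(a[3], b[3])))
    return inter, _area(a) + _area(b) - inter


def iou(a, b):
    inter, union = _inter_union(a, b)
    return inter / union if union > 0 else 0.0


def giou(a, b):
    """IoU minus the fraction of the enclosing box not covered by the union."""
    _, union = _inter_union(a, b)
    c = _area(_enclosing(a, b))
    return iou(a, b) - (c - union) / c if c > 0 else 0.0


def diou(a, b):
    """IoU minus squared center distance over squared enclosing diagonal."""
    e = _enclosing(a, b)
    diag2 = (e[2] - e[0]) ** 2 + (e[3] - e[1]) ** 2
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    return iou(a, b) - (dx * dx + dy * dy) / diag2 if diag2 > 0 else 0.0


if __name__ == "__main__":
    pred, near, far = (0, 0, 2, 2), (3, 0, 5, 2), (4, 0, 6, 2)
    print(iou(pred, near), iou(pred, far))    # both zero: no gradient signal
    print(giou(pred, near), giou(pred, far))  # GIoU still separates them
    print(diou(pred, near))
```

For the two disjoint pairs, IoU is zero in both cases while GIoU is -0.2 and about -0.333, so the farther box is penalized more. For two equal-sized boxes on the same horizontal line, such as (0, 0, 2, 2) and (1, 0, 3, 2), the enclosing box equals the union and GIoU equals IoU, which is the degenerate case mentioned above.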
There are three versions of WIoU: WIoUv1 constructs an attention-based bounding box loss, and WIoUv2 and WIoUv3 are obtained by adding a gradient gain to the focus mechanism on the basis of v1. In Figure 6 [26], the green rectangle indicates the annotation box, the gray rectangle indicates the prediction box, and the blue rectangle indicates the minimum bounding box.
The loss function L_WIoUv1 [26] of WIoUv1 is given in Equations (1)–(3):
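Assuming this study follows the formulation in the WIoU reference [26], the three expressions can be written as below. The symbols are assumptions taken from that formulation: (x, y) and (x_gt, y_gt) are the centers of the prediction box and the annotation box, W_g and H_g are the width and height of the minimum bounding box, and the superscript * marks a term that is detached from the computational graph so that it produces no gradient.

```latex
\begin{aligned}
L_{WIoUv1} &= R_{WIoU}\, L_{IoU} \\
R_{WIoU} &= \exp\!\left(\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{\left(W_g^{2} + H_g^{2}\right)^{*}}\right) \\
L_{IoU} &= 1 - IoU
\end{aligned}
```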
The loss function L_WIoUv2 [26] of WIoUv2 is given in Equation (4):
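Under the same assumption that the formulation of [26] is used, the v2 loss scales the v1 loss by a normalized power of the detached IoU loss. Here the exponent γ > 0 and the overline, which denotes a running mean of the IoU loss, are symbols assumed from that reference.

```latex
L_{WIoUv2} = \left(\frac{L_{IoU}^{*}}{\overline{L_{IoU}}}\right)^{\gamma} L_{WIoUv1}, \qquad \gamma > 0
```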
In Equation (4), L*_IoU represents the monotonic attention coefficient, and the mean value of L_IoU is used to normalize it in the formula so that the gradient gain is kept at a high level.
The loss function L_WIoUv3 [26] of WIoUv3 is given in Equations (5) and (6):
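Again assuming the formulation of [26], the v3 loss multiplies the v1 loss by a gradient gain r that depends on the outlier degree β. The overline denotes a running mean of the IoU loss and the superscript * a detached term; both are notational assumptions.

```latex
\begin{aligned}
L_{WIoUv3} &= r\, L_{WIoUv1}, \qquad r = \frac{\beta}{\delta\, \alpha^{\beta - \delta}} \\
\beta &= \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)
\end{aligned}
```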
In Equations (5) and (6), β is the outlier degree of the anchor box, r is the non-monotonic focusing coefficient (the gradient gain), and α and δ are hyperparameters. When β = δ, r = 1, and when the outlier degree of the anchor box satisfies β = C (C is a fixed value), the anchor box obtains the highest gradient gain. The mapping between β and r is controlled by the hyperparameters α and δ. The relationship between the hyperparameters α and δ, the outlier degree β, and the gradient gain r is shown in Figure 7 [26].
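The two properties stated above can be checked numerically, assuming the gradient gain has the form r = β / (δ · α^(β − δ)) used in the WIoU reference [26]. Under that assumption, setting the derivative of r with respect to β to zero gives the peak at β = 1 / ln α, which would be the fixed value C. The default values α = 1.9 and δ = 3 below are illustrative and are not taken from this study.

```python
import math


def gradient_gain(beta, alpha=1.9, delta=3.0):
    """Gradient gain r for outlier degree beta (assumed WIoUv3 form)."""
    if alpha <= 1.0 or delta <= 0.0:
        raise ValueError("expected alpha > 1 and delta > 0")
    return beta / (delta * alpha ** (beta - delta))


def peak_outlier_degree(alpha=1.9):
    """Outlier degree with the largest gain: solves dr/dbeta = 0."""
    if alpha <= 1.0:
        raise ValueError("expected alpha > 1")
    return 1.0 / math.log(alpha)


if __name__ == "__main__":
    c = peak_outlier_degree()
    print("r at beta = delta:", gradient_gain(3.0))
    print("peak beta C      :", round(c, 3))
    for beta in (0.5, c, 3.0, 6.0):
        print(f"beta={beta:5.2f}  r={gradient_gain(beta):.3f}")
```

The gain is small for both very low and very high outlier degrees, which is the non-monotonic behaviour that lets the loss down-weight high-quality and low-quality anchor boxes at the same time.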
3. Experiments
3.1. Model Training
The main tasks during model training are collecting and annotating the dataset and setting the training parameters of the network.
- (1) Dataset processing: We randomly selected car tires in a parking lot for image collection. A total of 1000 images were collected and divided into 800 training images, 100 validation images, and 100 test images. We used the annotation tool LabelImg to annotate the dataset and export annotation files in YOLO format for training. A labeled sample of the dataset is shown in Figure 8.
- (2) Setting the network training parameters: The operating system used was Windows 11, the GPU was an NVIDIA GeForce RTX 3060, and the programming language was Python 3.9. The network training parameters are shown in Table 1, and the smaller YOLOv5s model was selected for training.
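The dataset preparation step can be sketched as follows. The 800/100/100 split sizes come from the text; the random shuffle with a fixed seed, the file names, and the helper names are assumptions for illustration. The label line follows the YOLO text format, in which each object is written as a class index followed by the box center, width, and height normalized by the image size.

```python
import random


def split_dataset(items, n_train=800, n_val=100, n_test=100, seed=0):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    items = list(items)
    if len(items) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of items")
    random.Random(seed).shuffle(items)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x1, y1, x2, y2) into one YOLO-format label line."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


if __name__ == "__main__":
    # Hypothetical file names; the real dataset layout is not described.
    names = [f"tire_{i:04d}.jpg" for i in range(1000)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))
    print(to_yolo_line(3, (100, 200, 300, 400), img_w=1000, img_h=800))
```

A fixed seed keeps the split reproducible across runs, which matters when several model variants are compared on the same validation and test images.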
3.2. Evaluation Index
The trained model needed to be evaluated for detection accuracy. In this experiment, precision (P) and mean average precision (mAP) were used to evaluate the performance of the model, as follows:
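The standard definitions of the two metrics, which Formulas (7) and (8) are assumed to correspond to, are given below, with n denoting the number of classes.

```latex
P = \frac{TP}{TP + FP}, \qquad mAP = \frac{1}{n}\sum_{j=1}^{n} AP(j)
```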
Formula (7) calculates the precision, where TP is the number of true positives and FP is the number of false positives. Formula (8) calculates the mAP, where AP(j) is the average precision of the j-th class and n is the number of classes, i.e., j = 1, 2, …, n.
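As a minimal sketch of how these metrics can be computed, the functions below evaluate precision from counts and average the per-class AP values. The `average_precision` helper uses a simple non-interpolated definition (the mean of the precision values at each true positive in confidence order); this is an assumption for illustration and may differ from the interpolated AP computed by the YOLOv5 evaluation code.

```python
def precision(tp, fp):
    """Fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def average_precision(scored_hits, num_ground_truth):
    """Non-interpolated AP for one class.

    scored_hits: iterable of (confidence, is_true_positive) per detection.
    num_ground_truth: number of annotated objects of this class.
    """
    ranked = sorted(scored_hits, key=lambda s: s[0], reverse=True)
    tp = fp = 0
    total = 0.0
    for _, hit in ranked:
        if hit:
            tp += 1
            total += tp / (tp + fp)  # precision at this recall step
        else:
            fp += 1
    return total / num_ground_truth if num_ground_truth else 0.0


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    if not ap_per_class:
        raise ValueError("need at least one class")
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    detections = [(0.9, True), (0.8, False), (0.7, True)]
    ap = average_precision(detections, num_ground_truth=2)
    print(precision(tp=2, fp=1), ap, mean_average_precision([ap, 1.0]))
```

In the usage example, the two true positives are found at precisions 1 and 2/3, so the AP for that class is 5/6.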
3.3. Comparative Experiment
To validate the effect of the C3-Faster module on the training results of the YOLOv5s network model and find the best trade-off between training speed and accuracy, each of the eight C3 modules in the backbone and neck was replaced with C3-Faster in turn. The dataset and network training parameters mentioned above were used for training, and the comparative experimental results for the different replacement positions of the C3-Faster module are shown in Table 2.
In Table 2, Experiments 1, 2, 3, and 4 each replaced one of the four C3 modules of the backbone with C3-Faster: Experiment 1 replaced the first C3 module, Experiment 2 replaced the second, and so on. Experiments 5, 6, 7, and 8 likewise each replaced one of the four C3 modules of the neck: Experiment 5 replaced the first C3 module, Experiment 6 replaced the second, and so on. Experiment 9 was the original YOLOv5 model without any modifications. The results show that the training time of the first four experiments was reduced compared to that of the original YOLOv5. Replacing the first C3 module gave the shortest training time, whereas replacing the third gave the highest mAP and replacing the fourth gave the highest precision. To improve the training speed while maintaining accuracy, the third and fourth C3 modules of the backbone were replaced with C3-Faster. Similarly, based on Experiments 5, 6, 7, and 8, the first and fourth C3 modules of the neck were replaced with C3-Faster because they performed better.
To compare how GIoU, DIoU, CIoU, WIoUv1, WIoUv2, and WIoUv3 help to optimize model parameters and improve model accuracy, a control experiment was set up. The YOLOv5s model was trained with each of the six loss functions using the aforementioned dataset and training parameters. The results of the loss function comparison experiments are shown in Table 3.
Analysis of the experimental results showed that using the WIoU loss function led to improvements in both the mAP and precision while also further reducing training time. Among the different variants of the WIoU loss function, the WIoUv1 loss function performed best in terms of precision, while the WIoUv3 loss function performed best in terms of mAP and training time. The difference in precision between the WIoUv1 and WIoUv3 loss functions was not significant. To achieve a lightweight network and improve the speed and accuracy of car tire character recognition, the WIoUv3 loss function was used instead of the original CIoU loss function.
3.4. Ablation Experiment
To validate the performance improvement brought by the decoupled head, C3-Faster, and WIoU in the YOLOv5 network, ablation experiments were conducted. A total of five experiments were set up: the original YOLOv5s network, YOLOv5s with a decoupled head and C3-Faster, YOLOv5s with a decoupled head and WIoU, YOLOv5s with C3-Faster and WIoU, and YOLOv5s with a decoupled head, C3-Faster, and WIoU. These experiments were conducted on the same device with the same parameters. The results of the ablation experiments for the different modules of the improved YOLOv5s are shown in Table 4.
On the car tire dataset, the improved YOLOv5s network outperformed the original network model in the ablation experiment. The mAP of the improved YOLOv5s network was 97.1%, and the precision was 95.4%. Compared with the original model, the training time did not change much, but the other indicators were improved. The mAP increased by 3.7 percentage points, and the precision increased by 2.1 percentage points.
3.5. Comparison with Different Models
To further evaluate the model in this study, the improved YOLOv5s model was compared with YOLOX, YOLOv7, YOLOv5s, and YOLOv5x, the largest model in the YOLOv5 family. The comparative results of the different models are shown in Table 5.
From Table 5, it can be observed that YOLOv5x, which had the largest number of residual structures of the YOLOv5 models, exhibited enhanced feature extraction and fusion capabilities, resulting in improved detection accuracy. It achieved the highest mAP value (97.3%) but also required a longer training time. The YOLOv7 network achieved the highest precision (95.6%), surpassing the improved YOLOv5s (95.3%), but it fell behind the improved YOLOv5s in terms of mAP and training time. The improved YOLOv5s network had the shortest training time. Although it did not reach the mAP of YOLOv5x or the precision of YOLOv7, its shorter training time gave it better overall performance, which met the requirements of tire character recognition.
5. Conclusions
This study aimed to improve the efficiency and accuracy of recognizing tire specification characters for automobiles. Based on the YOLOv5s network model, a decoupled head was used to separate the classification and localization branches, which improved the robustness and accuracy of the model and reduced the occurrence of overfitting. The C3-Faster module was proposed by improving the FasterNet Block and was used to replace some C3 modules in the original backbone and neck, reducing the number of convolution operations, speeding up computation, and further reducing memory usage. Finally, WIoU-Loss was introduced to improve the regression loss function, and its gradient gain allocation strategy was used to reduce the harmful gradients produced by low-quality anchor boxes and further improve the network’s running speed.
Comparative experiments on the replacement position led to replacing the third and fourth C3 modules in the backbone and the first and fourth C3 modules in the neck with the C3-Faster module, and a comparison of loss functions led to using the WIoUv3 loss in place of the CIoU loss of the original YOLOv5. In the ablation experiments, the improved YOLOv5s network outperformed the original model, with a mAP improvement of 3.7 percentage points and a precision improvement of 2.1 percentage points. The final method validation confirmed the effectiveness of the proposed method, which met practical application requirements.