1. Introduction
The application of glass in numerous fields, such as mobile devices, household goods, industrial machinery, instruments and meters, and electronic products, has significantly boosted the demand for touch panel glass. The production of mobile phone flat glass requires deep processing operations including stretching, cutting, edging, and tempering [1]. During these operations, glass, as a transparent and smooth product, is easily damaged by mechanical or human operational errors. Glass with defects such as pitting, scratches, chipping, and watermarks cannot move on to subsequent processing operations. In addition, transferring mobile phone flat glass along the production line after processing operations unavoidably results in airborne dust adhering to its surface, degrading its quality. Currently, manual visual inspection is the common method for detecting defects in mobile phone flat glass. However, its relatively high subjectivity and uncertainty lead to low detection precision, a high false acceptance rate, and a high false rejection rate. With current conventional light sources, it is difficult to distinguish dust, a repairable defect, from other irreparable defects. For instance, the similar imaging characteristics of dust and point defects make misjudgments easy, resulting in serious resource waste and quality problems. To solve the problems caused by manual inspection, it has become urgent to realize the automatic defect inspection of mobile phone flat glass.
The realization of mobile phone flat glass defect detection strives to (1) achieve industrialized real-time detection with a fast detection speed; (2) achieve good generalization of the detection method so that it can be applied to a variety of defects; (3) achieve high detection accuracy and avoid omissions and misdetections in the detection process; (4) realize multi-scale defect detection; and (5) effectively distinguish point defects from dust defects. With the rise of intelligent equipment and the rapid progress of machine vision technology, intelligent inspection equipment based on machine vision is characterized by a fast inspection speed, high accuracy, low cost, and strong adaptability.
Since the end of the 1970s, scholars at home and abroad have conducted research on the defect detection of transparent media products, which roughly falls into two categories: identifying and detecting defects with traditional visual algorithms, and doing so with deep learning approaches. Zhang et al. [2] proposed an improved K-means clustering method based on backlight illumination to identify the defects in mobile phone flat glass, with the test precision reaching 96.5%. Nevertheless, this method turns out to be less satisfactory for low-contrast defects. With backlight illumination, Qi et al. [3] acquired images, conducted denoising and segmentation operations on the images, and then built a BP neural network to identify and classify them, attaining a detection precision of 90.2%. However, the sample size was small and the false detection rate was high. Professor Liu’s team from Huazhong University of Science and Technology [4,5,6,7,8,9] put forward a method combining texture features and a BP neural network to detect float glass defects. The algorithm, however, has high requirements for the quality of collected images and lacks ideal applicability. Jian [10,11,12] adopted the difference image method to identify the defects in mobile phone flat glass and reached a detection precision of 92%. Nevertheless, that method is greatly affected by natural illumination, resulting in low robustness. In recent years, target detection based on deep convolutional neural networks has come into being [13,14]. With the improved structure of the network model, the revised algorithm can perform better target recognition than traditional machine vision algorithms by virtue of its superior feature learning ability [15,16,17,18,19,20]. Aware of the fact that the YOLO algorithm offers satisfactory detection precision and speed in defect recognition, Guo et al. [21] introduced a confidence loss function and a cross attention network based on a gradient-harmonized mechanism into the YOLOv4 model framework, with which they detected defects on ceramic substrates. Zhu et al. [22] adopted Canopy with K-means clustering to improve the K-means clustering in YOLOv4 and then acquired the prior bounding boxes, thereby detecting the defects of bare PCB boards. Liu et al. [23] applied YOLOv5 to the detection of appearance defects in cigarettes, such as spots, wrinkles, and misaligned teeth on the surface. However, there is still room for improvement in the precision of defect recognition and real-time detection speed when it comes to hidden defect features, multi-scale defects, and defects with similar features.
Therefore, this paper proposes an illumination method based on total reflection lighting, which aims to accurately distinguish dust from point defects according to their characteristic differences in the imaging field. In addition, YOLOv5, a deep learning network with excellent performance, was adopted and improved, enabling the detection of scratches, point defects, stains, and dust on mobile phone flat glass, as well as industrial real-time detection.
3. Principle and Improvement of Defect Detection Algorithm
3.1. YOLOv5 Algorithm
At present, common target detection models are generally divided into single-stage and two-stage algorithms. The former is mainly dominated by YOLO and SSD, and the latter is represented by R-CNN [24] and Faster R-CNN [25]. YOLOv1 [26], a classic single-stage target detection algorithm, was proposed by Redmon and other researchers in 2015 and ushered in the YOLO family. Through continuous optimization and innovation, YOLOv2, YOLOv3, YOLOv4, and YOLOv5 came into being one after another [27,28,29], with constantly improving performance. YOLO formulates the target classification and target position regression of Faster R-CNN as a single regression problem, which significantly reduces the required computing power and improves the detection speed, thus enabling its wider application in real-time detection. YOLOv5 works as an end-to-end target detection algorithm that directly transforms image information into target positions and categories through a deep convolutional neural network, whose detection results take the form of target boxes and confidences. The improved network structure, as shown in Figure 6, roughly falls into four parts.
Input Section: The input module of YOLOv5 defines the specifications and format of the images received by the model. Typically, input images undergo preprocessing operations such as normalization and resizing so that they can be processed effectively within the model. During the training phase, the input section may also include information related to label data, such as the position and category information of target boxes.
Backbone Section: YOLOv5 employs CSPDarknet53 as its backbone network, which is responsible for extracting feature information from the input images. CSPDarknet53 is a deep convolutional neural network with multiple convolutional layers and residual connections that gradually extracts semantic and spatial information from the images. These feature maps are then passed to subsequent layers for object detection.
Neck Section: YOLOv5 utilizes PANet (Path Aggregation Network) as its neck. The primary role of the neck is to propagate and aggregate information between feature maps at different scales, enhancing the model’s ability to detect objects of varying sizes. PANet achieves this by introducing path aggregation modules and feature fusion modules, facilitating the interaction and integration of information between feature maps of different resolutions, thereby improving the model’s performance.
Prediction Section: The prediction section of YOLOv5 comprises a series of convolutional layers and activation functions used to predict the position and category information of objects from the feature maps. This network segment processes the feature maps to output the coordinates of object positions, category confidences, and other relevant information. The prediction section typically applies convolutions and activation functions to perform nonlinear transformations on the feature maps, extracting higher-level semantic information and ultimately generating the object detection results.
Through the collaborative effort of these modules, YOLOv5 accomplishes efficient and accurate object detection: the input module handles the reception and preprocessing of input images, the backbone extracts image features, the neck enhances feature representation, and the prediction section generates the final detection results.
3.2. Increase Attention Mechanism
The defects of mobile phone flat glass are micro-scale defects, which are difficult to detect. However, the detection performance of the YOLOv5 algorithm on micro-defects is not satisfactory: it sometimes fails to detect defects or to detect them accurately, resulting in low detection precision and a high missed detection rate, which falls short of conventional industrial needs. In order to boost its ability to detect micro-defects, the Convolutional Block Attention Module (CBAM) was added. Feature maps are generated when the input image is sliced by the Focus structure, as shown in Figure 7, and are then passed through a convolution operation. At this point, CSP1-CBAM, the new structure, is constructed by embedding the CBAM in the CSP1 structure of the backbone network, which performs feature aggregation on the feature map. The CBAM, a mixed-domain attention module, is composed of channel and spatial attention modules. Attention maps are generated by deriving the intermediate feature mapping along the two independent channel and spatial dimensions. An adaptive refinement of the features strengthens the network’s ability to express important features and suppress interfering information, thereby improving detection precision.
The CBAM hybrid attention module is shown in
Figure 8.
Figure 8b is the channel attention module, which generates a channel attention map based on the inter-channel relationships of features, focusing on “what” is meaningful in a given feature map. Its specific implementation involves aggregating the spatial information of the feature map using average pooling and global maximum pooling operations to produce feature descriptors in two different spatial dimensions: the global average pooling feature and the maximum pooling feature.
These descriptors are then fed into a shared multilayer perceptron (MLP) for learning. The outputs of the MLP are summed element-wise and then mapped through the Sigmoid function to generate the channel attention map. The formula is as follows:

$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$

In this formula, $\sigma$ represents the Sigmoid function, MLP represents the shared multilayer perceptron, $F$ represents the input feature, $\mathrm{AvgPool}(F)$ represents the average pooling, and $\mathrm{MaxPool}(F)$ represents the global maximum pooling. The channel attention weight helps the model better select and utilize features, reduces redundant information, and enhances the model’s interpretability and applicability.
The spatial attention module is shown in Figure 8c; it generates a spatial attention map according to the spatial relationships between features, focusing on “where” the informative parts are, and thus complements the channel attention module. Specifically, it carries out average pooling and global maximum pooling operations along the channel axis of the given feature map and concatenates the outputs into a single descriptor, aggregating the channel information. This descriptor is then passed through a standard convolution layer, and the result is mapped by the Sigmoid function to generate a two-dimensional spatial attention map. The formula is as follows:

$M_s(F) = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big)$

In this formula, $f^{7\times7}$ represents the convolution operation with a $7\times7$ convolution kernel (the standard CBAM setting), and $F$ represents the input feature map. The spatial attention weight helps the model better capture local features and structural information, adapt to inputs of different scales, reduce noise and interference, and enhance its interpretability.
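To make the structure above concrete, the following is a minimal PyTorch sketch of a CBAM block; the reduction ratio, layer arrangement, and 7 × 7 spatial kernel follow the standard CBAM design and are assumptions rather than settings reported in this paper.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: a shared MLP over average- and max-pooled descriptors."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)  # M_c(F), shape (B, C, 1, 1)


class SpatialAttention(nn.Module):
    """Spatial attention: a 7x7 convolution over channel-pooled maps."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s(F), (B, 1, H, W)


class CBAM(nn.Module):
    """Refine a feature map with channel attention followed by spatial attention."""

    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)


# Example: refine a 128-channel backbone feature map
feat = torch.randn(1, 128, 60, 60)
refined = CBAM(128)(feat)  # same shape as feat
```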
To verify the effect of adding the CBAM on the model’s ability to detect small targets, a training and testing process is conducted. The validation results, depicted in
Figure 9, clearly demonstrate that the original model struggles to accurately identify and detect certain small objects on the flat glass of a cell phone. However, upon integrating the CBAM, the model exhibits improved performance by correctly identifying and detecting those tiny targets. This notable improvement confirms the real and effective enhancement brought about by the CBAM in the model’s detection of small targets.
3.3. Add a Small Target Detection Layer
In the process of extracting image features with a convolutional neural network, the local receptive field perceives an ever-expanding region of the original image as the convolution layers deepen and the resolution of the feature map decreases. Moreover, the closer a feature map is to the top level, the more attention it pays to global information. As a result, the deep features extracted by a deep neural network are poorly suited to small target detection. Admittedly, shallow feature maps, with relatively small local receptive fields and more position and detail information, can partially compensate for the deficiency of deep features. However, if shallow features alone are employed to detect targets, they cause serious false positives and missed detections due to a lack of guidance from high-level semantic information. In YOLOv5, the feature maps of 60 × 60, 30 × 30, and 15 × 15, obtained from a 480 × 480 input image after 8×, 16×, and 32× downsampling, are exactly where the network conducts target detection. Among these three scales, the feature map downsampled 8 times has the smallest local receptive field; it is therefore difficult to learn the feature information of small targets with a resolution lower than 8 × 8. To solve this problem, a detection head operating on the feature map obtained after 4× downsampling is added to the original network model. A feature output is added after the first CSP1 module of the backbone network and connected to the neck to conduct feature fusion at the same scale. After the CSP1 module and convolution operations, it is fed into the newly added small target detection layer for prediction. The improved network model is represented in Figure 10.
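As a quick sanity check of the scale argument above, the short Python sketch below lists, for the 480 × 480 input used in this paper, the grid size at each detection stride and the image patch covered by a single grid cell; the stride-4 entry corresponds to the newly added small target detection head.

```python
# Grid sizes and per-cell coverage for the 480 x 480 input used in this paper.
# Strides 8, 16, and 32 are the original YOLOv5 detection scales; stride 4 is
# the newly added small target detection head.
INPUT_SIZE = 480

for stride in (4, 8, 16, 32):
    grid = INPUT_SIZE // stride
    print(f"stride {stride:>2}: {grid} x {grid} grid, "
          f"one cell per {stride} x {stride} pixel patch of the input")
# stride  4: 120 x 120 grid, one cell per 4 x 4 pixel patch of the input
# stride  8: 60 x 60 grid, one cell per 8 x 8 pixel patch of the input
# stride 16: 30 x 30 grid, one cell per 16 x 16 pixel patch of the input
# stride 32: 15 x 15 grid, one cell per 32 x 32 pixel patch of the input
```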
To validate the effect of adding a small target detection layer on the model’s ability to detect small targets, the model is trained and validated. The results are shown in
Figure 11. The improved model can correctly identify and detect small targets on the mobile phone’s flat glass, whereas the original model fails to detect some small targets. This proves the effect of such an improvement in enhancing the detection of small targets.
3.4. Data Augmentation and Improvement
Data augmentation is commonly employed to increase the diversity of training data when data are limited, thereby improving the generalization ability of network models. The YOLOv5 algorithm employs the Mosaic data augmentation method: four images, randomly selected from the data samples, are randomly scaled and clipped, and then randomly arranged and spliced into a single picture whose resolution remains consistent with that of the sample image. However, as the defects themselves are of pixel-level size, clipping a defect image of mobile phone flat glass may totally or partially cut off a defect. As a result, the target or its labeling box may be absent from the image area, which adversely affects the training of the neural network. Missed labels, labeling errors, and inaccurate labeling boxes are also likely to occur during manual annotation. Aiming to replace manual annotation with accurate automatic annotation and to realize diversified data expansion, this paper proposes an image segmentation and fusion method to expand and augment the data.
3.4.1. Frequency Domain Filtering
To reduce the impact of noise on image feature information, filter processing is applied to the image. Since filtering images in the spatial domain can easily result in the loss of edge features, while filtering in the frequency domain can preserve the edge information well, this paper adopts frequency domain filtering. The image is transformed from the spatial domain to the frequency domain for processing. The former refers to an image composed of image pixels, while the latter describes the distribution of image pixel values by spatial frequency and maps them out in the form of a spectrum. The transformation of the image domain can be achieved through a Fourier transform. After processing the image in the frequency domain, the processed image can be transformed back to the spatial domain using a Fourier inverse transform, which ensures the preservation of the information.
For a two-dimensional image of size $M \times N$, the sampled pixels form a two-dimensional discrete signal $f(x, y)$, where $(x, y)$ are the discrete spatial variables and $(u, v)$ are the corresponding frequency variables. The two-dimensional discrete Fourier transform pair is represented as follows:

$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j2\pi\left(\frac{ux}{M} + \frac{vy}{N}\right)}$

$f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\, e^{+j2\pi\left(\frac{ux}{M} + \frac{vy}{N}\right)}$

Among these formulae, $e^{-j2\pi(ux/M + vy/N)}$ is the forward transformation kernel and $e^{+j2\pi(ux/M + vy/N)}$ represents the inverse transformation kernel.
As shown in the frequency domain diagram in Figure 12b, the bright central region corresponds to the low-frequency components, which represent slowly varying gray levels, i.e., flat regions in the spatial domain. The darker outer regions correspond to the high-frequency components, which represent edges and noise in the spatial domain. In order to retain the edge information while suppressing the noise, a Butterworth low-pass filter was adopted, and its transfer function is as follows:

$H(u, v) = \dfrac{1}{1 + \left[ D(u, v) / D_0 \right]^{2n}}$

In this formula, $D(u, v)$ is the distance from point $(u, v)$ to the center of the spectrum, $D_0$ represents the cut-off frequency, and $n$ is the order. Let $F(u, v)$ be the spectrum of the original image in the frequency domain and $H(u, v)$ the filter transfer function; then the frequency-domain filtering operation, which corresponds to a convolution in the spatial domain, becomes what is shown in Equation (4):

$G(u, v) = H(u, v)\, F(u, v) \quad (4)$

The filtered spectrum $G(u, v)$ is acquired through this operation, and the filtered image is then gained by transforming it back to the spatial domain through an inverse Fourier transform. The comparative experiments conclude that the best performance comes when the cut-off frequency stands at 200 and the order is 6; the images of the results of the comparison experiment are shown in Figure 13.
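A minimal NumPy sketch of this frequency-domain filtering pipeline is given below, using the cut-off frequency of 200 and order of 6 reported above; the image loading line is a hypothetical placeholder.

```python
import numpy as np

def butterworth_lowpass(shape, d0=200.0, n=6):
    """Butterworth low-pass transfer function H(u, v) on a centered spectrum."""
    rows, cols = shape
    u = np.arange(rows) - rows / 2.0
    v = np.arange(cols) - cols / 2.0
    D = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)   # distance D(u, v) to the center
    return 1.0 / (1.0 + (D / d0) ** (2 * n))

def frequency_filter(img, d0=200.0, n=6):
    """Filter a grayscale image (2D float array) in the frequency domain."""
    F = np.fft.fftshift(np.fft.fft2(img))              # centered spectrum F(u, v)
    G = F * butterworth_lowpass(img.shape, d0, n)      # G(u, v) = H(u, v) F(u, v)
    return np.real(np.fft.ifft2(np.fft.ifftshift(G)))  # back to the spatial domain

# Usage (hypothetical grayscale sample):
# img = cv2.imread("glass_sample.png", cv2.IMREAD_GRAYSCALE).astype(float)
# smoothed = frequency_filter(img)
```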
3.4.2. Entropy Segmentation
An image can be divided into a foreground image and a background image, and the former is the target to be detected. Edge-based, threshold-based, and region-based approaches are the common defect segmentation methods. The second, which is the most intuitive, is simple and fast, making it the most commonly used image segmentation method. The defect area accounts for a small proportion of a defect image of mobile phone flat glass, the image histogram takes the form of a single-peak distribution, and the defects in mobile phone flat glass vary widely in kind, scale, and shape. As a result, global threshold-based segmentation often performs poorly. In addition, local threshold-based segmentation produces target images with poor connectivity and fails to retain the edge information of defects. In order to ensure that the edge information of the image is preserved completely for the later image fusion, this paper divides the complete glass defect image into single-defect sub-images before segmentation. In this way, the original edge information is retained and the needs of the later image fusion are also met.
This paper adopts the one-dimensional maximum entropy segmentation method. Information entropy is a concept created to measure the amount of information in a source; entropy quantifies the uncertainty of a random variable, and its maximum value corresponds to the maximum uncertainty. Let the image have $L$ gray levels with probability distribution $p_1, p_2, \ldots, p_L$. Take gray level $t$ as the segmentation threshold, so that the pixels with gray levels no higher than $t$ constitute the background $B$, and let

$P_t = \sum_{i=1}^{t} p_i$

be the sum of the gray-level probabilities from 1 to $t$. Then the probability distribution of the gray levels within the background is

$\dfrac{p_1}{P_t}, \dfrac{p_2}{P_t}, \ldots, \dfrac{p_t}{P_t}.$

The pixels constituting the target image $O$ have gray levels higher than $t$. Since $1 - P_t$ is the sum of the gray-level probabilities from $t + 1$ to $L$, the probability distribution of the gray levels within the target image is

$\dfrac{p_{t+1}}{1 - P_t}, \dfrac{p_{t+2}}{1 - P_t}, \ldots, \dfrac{p_L}{1 - P_t}.$

The information entropy of the background and target images is calculated as follows:

$H_B(t) = -\sum_{i=1}^{t} \dfrac{p_i}{P_t} \ln \dfrac{p_i}{P_t}$

$H_O(t) = -\sum_{i=t+1}^{L} \dfrac{p_i}{1 - P_t} \ln \dfrac{p_i}{1 - P_t}$

$H_B(t)$ represents the information entropy of the background, $H_O(t)$ is the information entropy of the target image, and the total entropy is their sum:

$H(t) = H_B(t) + H_O(t)$

Searching over all gray levels for the value $t$ that maximizes $H(t)$ yields the optimal segmentation threshold.
The defect mask image generated after segmentation is a binary image composed of 0s and 1s. Multiplying the original image by the mask image extracts the region containing only the defect itself and eliminates the background area. If an image of size $M \times N$ is represented as $I(x, y)$ and its mask image is $Mask(x, y)$, the extracted target area $ROI$ is calculated as follows:

$ROI(x, y) = I(x, y) \cdot Mask(x, y)$
The extraction results are shown in
Figure 14.
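The following compact NumPy sketch illustrates the one-dimensional maximum entropy threshold search and the subsequent mask multiplication described above; it assumes an 8-bit grayscale defect patch in which defects are brighter than the background, which is an illustrative assumption rather than a detail taken from this paper.

```python
import numpy as np

def max_entropy_threshold(gray):
    """Gray level t maximizing H_B(t) + H_O(t) for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                     # gray-level probabilities p_i
    best_t, best_h = 0, -np.inf
    for t in range(1, 255):
        Pt = p[: t + 1].sum()                 # cumulative probability up to t
        if Pt <= 0.0 or Pt >= 1.0:
            continue
        pb = p[: t + 1] / Pt                  # background distribution
        po = p[t + 1 :] / (1.0 - Pt)          # target (defect) distribution
        hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
        ho = -np.sum(po[po > 0] * np.log(po[po > 0]))
        if hb + ho > best_h:
            best_h, best_t = hb + ho, t
    return best_t

def extract_roi(gray):
    """Binary mask from the maximum entropy threshold, then ROI = image * mask."""
    t = max_entropy_threshold(gray)
    mask = (gray > t).astype(gray.dtype)      # assumes defects brighter than background
    return gray * mask, mask

# Usage (hypothetical defect patch):
# patch = cv2.imread("defect_patch.png", cv2.IMREAD_GRAYSCALE)
# roi, mask = extract_roi(patch)
```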
3.4.3. Image Fusion
Image fusion involves segmenting the target area in an image to seamlessly integrate it into a new image, meeting the demands of data augmentation. Image fusion primarily consists of two processes: segmenting the target area and post-image fusion. Utilizing entropy segmentation as discussed earlier enables the effective extraction of defective images and the generation of defect masks. Common image fusion methods include PCA-based fusion, logical filtering, neural networks, and weighted averaging, among others. In this fusion process, a method based on Poisson editing [
30] is employed to merge the target defect and clean background into a new image. By utilizing gradient information from the background image and defect image, along with the boundary distribution information of the defect image, a new pixel distribution is constructed for the fused image using interpolation. In the synthesized image, the background image size is specified as 480 × 480 and the final synthesized image size is also the same, ensuring that all images are of a uniform standard size for their input into network models for training.
As represented in Figure 15, let $S$, a closed subset of $\mathbb{R}^2$, be the image domain, and let $\Omega$ be a closed subset of $S$ with boundary $\partial\Omega$. Let $f^*$ be a known scalar function defined on $S$ minus the interior of $\Omega$, let $f$ be an unknown scalar function defined over the interior of $\Omega$, and let $\mathbf{v}$ be a vector field defined on $\Omega$.
The essence of Poisson image editing is to use the gradient field $\mathbf{v}$ as guidance for interpolation and to minimize the difference between the gradient of $f$ and the target gradient field, so as to solve for the unknown scalar function $f$:

$\min_{f} \iint_{\Omega} \left| \nabla f - \mathbf{v} \right|^2 \quad \text{with} \quad f\big|_{\partial\Omega} = f^*\big|_{\partial\Omega}$

This minimization has a unique solution, which satisfies Poisson’s equation with a Dirichlet boundary condition, as follows:

$\Delta f = \operatorname{div} \mathbf{v} \ \text{over} \ \Omega, \quad f\big|_{\partial\Omega} = f^*\big|_{\partial\Omega}$

where $\Delta$ is the Laplace operator and $\operatorname{div} \mathbf{v} = \partial v_1 / \partial x + \partial v_2 / \partial y$ is the divergence of $\mathbf{v} = (v_1, v_2)$.
As for discrete images, the underlying discrete pixel grid discretizes the problem naturally, and a finite difference discretization of the above equation is applied. For each pixel $p$ on $S$, let $N_p$ be the collection of its four connected neighbors, let $\langle p, q \rangle$ denote a pixel pair with $q \in N_p$, and let $v_{pq}$ represent the projection of $\mathbf{v}\!\left(\frac{p+q}{2}\right)$ onto the oriented edge $[p, q]$. The solution $f_p$ satisfies the following linear equations:

$|N_p|\, f_p - \sum_{q \in N_p \cap \Omega} f_q = \sum_{q \in N_p \cap \partial\Omega} f^*_q + \sum_{q \in N_p} v_{pq}, \quad \text{for all } p \in \Omega$

For pixels located strictly inside $\Omega$, whose neighbors all lie in $\Omega$, there is no boundary condition term on the right-hand side, and the equation simplifies to:

$|N_p|\, f_p - \sum_{q \in N_p} f_q = \sum_{q \in N_p} v_{pq}$
The result of the fusion is shown in
Figure 16.
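In practice, this Poisson-based fusion can be approximated with OpenCV’s seamless cloning, which implements the guided interpolation described above; the sketch below is only illustrative, and the file names and paste position are hypothetical placeholders rather than the exact pipeline of this paper.

```python
import cv2
import numpy as np

def fuse_defect(background, defect_patch, defect_mask, center):
    """Poisson-based fusion: blend a defect patch into a clean background image.

    background   : H x W x 3 uint8 image (e.g., a 480 x 480 clean glass image).
    defect_patch : h x w x 3 uint8 image containing the defect.
    defect_mask  : h x w uint8 mask, nonzero on defect pixels, 0 elsewhere.
    center       : (x, y) position in the background where the patch is placed.
    """
    mask = (defect_mask > 0).astype(np.uint8) * 255
    return cv2.seamlessClone(defect_patch, background, mask, center, cv2.NORMAL_CLONE)

# Usage with hypothetical inputs:
# bg = cv2.imread("clean_background.png")       # 480 x 480 clean glass image
# patch = cv2.imread("defect_roi.png")          # extracted defect region
# mask = cv2.imread("defect_mask.png", 0)       # binary defect mask
# fused = fuse_defect(bg, patch, mask, center=(240, 240))
```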
During the fusion process, the position of a defect in the background image is selected randomly. The upper left vertex of the maximum circumscribed rectangle of the defect is defined as $(x_1, y_1)$, the lower right vertex is defined as $(x_2, y_2)$, and the width and height of the background image are $(W, H)$. To ensure the integrity of the defect, the location of the defect and the background should meet the following requirements, i.e., the defect rectangle must lie entirely within the background image:

$0 \le x_1 < x_2 \le W, \quad 0 \le y_1 < y_2 \le H$
In the end, the label and position information are output. Bright spot defects are marked as 0, scratch defects as 1, stain defects as 2, and dust defects as 3. A normalization operation is performed on the position information to improve the convergence speed and precision of the model, and the label text data are generated. The distribution of the generated data is shown in Figure 17.
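The sketch below shows how a fused defect’s label and normalized position information might be written in YOLO text format; the class indices follow the mapping above, while the function name and example box are hypothetical.

```python
def yolo_label(class_id, x1, y1, x2, y2, W=480, H=480):
    """Convert a defect bounding box to a normalized YOLO label line.

    (x1, y1), (x2, y2): upper-left / lower-right corners of the defect rectangle
    in the fused background image of width W and height H.
    """
    xc = (x1 + x2) / 2.0 / W   # normalized box center x
    yc = (y1 + y2) / 2.0 / H   # normalized box center y
    w = (x2 - x1) / W          # normalized box width
    h = (y2 - y1) / H          # normalized box height
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Example: a dust defect (class 3) pasted with its rectangle at (100, 150)-(120, 170)
print(yolo_label(3, 100, 150, 120, 170))
# -> "3 0.229167 0.333333 0.041667 0.041667"
```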
4. Test Results and Evaluation
4.1. Experimental Setup
The code of the defect detection algorithm in this paper is implemented in the PyTorch framework and runs on a Windows 10 system. The computer configuration is as follows: the CPU is an AMD R9-5900H, the GPU is an RTX 3070, the memory is 32 GB, the video memory is 8 GB, and the sample size input into the network is 480 × 480.
The glass samples, acquired from a mobile phone flat glass manufacturing enterprise, were imaged on the constructed detection platform. The images obtained under the combined light source were manually cropped uniformly and classified, yielding 1000 images of bright spots, scratches, dust, and stains. The training set was generated through the previously described data augmentation method. In order to simulate the real environment, 800 clean background images were added, leading to a total of 1600 images containing 2400 bright spot, scratch, dust, and stain defects. The validation set contains 200 images with 300 bright spot, scratch, dust, and stain defects. The testing set is composed of non-synthetic and unlabeled original images of the same size as the training samples; it has a total of 200 images containing 300 bright spot, scratch, dust, and stain defects.
4.2. Evaluation Index
To provide an insightful evaluation, the indicators in
Table 1 are selected to objectively evaluate the model.
The specific formulae of the indicators are as follows:

$Precision = \dfrac{TP}{TP + FP}$

$Recall = \dfrac{TP}{TP + FN}$

$mAP = \dfrac{1}{k} \sum_{i=1}^{k} AP_i$

TP (True Positive) indicates that the actual value is positive and the prediction is positive, TN (True Negative) indicates that the actual value is negative and the prediction is negative, FN (False Negative) indicates that the actual value is positive but the prediction is negative, and FP (False Positive) indicates that the actual value is negative but the prediction is positive. $k$ represents the number of categories, and $AP_i$ is the AP value for the $i$-th category.
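As a quick illustration of these definitions, the following sketch computes precision, recall, and mAP from example counts and hypothetical per-class AP values:

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)

def mean_ap(ap_per_class):
    """mAP: mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical example with the four defect classes of this paper
print(precision(tp=95, fp=5))               # 0.95
print(recall(tp=95, fn=3))                  # ~0.969
print(mean_ap([0.99, 0.98, 0.97, 0.99]))    # 0.9825
```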
4.3. Experiment and Result Analysis
Firstly, the network model parameters are initialized. During the training process, adjustments are made to hyperparameters such as the learning rate, training batch size, and weight magnitudes to optimize the experimental outcomes. The evaluation of the final experimental results is conducted on both the training and validation datasets. Four models are employed for ablation contrast experiments: the original YOLOv5 algorithm (denoted as YOLOv5), the YOLOv5 algorithm with the attention mechanism (denoted as YOLOv5_c), the YOLOv5 algorithm with the improved data augmentation method (denoted as YOLOv5_z), and the YOLOv5 algorithm incorporating both the improved data augmentation method and the attention mechanism (denoted as YOLOv5_zc). The visual analysis of the training processes for these four models is presented in
Figure 18 and
Figure 19.
As shown in Figure 18, the loss functions of YOLOv5_c and YOLOv5_z converge faster and to smaller values than that of YOLOv5 as the number of epochs increases. This indicates that both improvement methods yield better training results than the original model. Furthermore, with the combination of both improvements in YOLOv5_zc, the loss function converges even faster and to smaller values than with YOLOv5_c or YOLOv5_z alone. This suggests that combining the two methods further enhances the training effectiveness of the model.
Figure 19 illustrates the curve of the mean average precision (mAP) of all four models as the number of training epochs increases. By comparison, it is evident that YOLOv5_zc achieves the highest mAP value and the most stable curve as the number of iterations increases.
The test results of the four optimal models on the validation set are presented in Table 2. According to the data, the improved YOLOv5_zc achieves a significant improvement in detection performance compared with the original YOLOv5 and the unilaterally improved YOLOv5_z and YOLOv5_c, with an increase in precision of 2.67%, an improvement in recall of 2.74%, an improvement in mAP of 2.62%, and a missed detection rate decreased to as low as 5%, which meets the requirements for industrial detection.
4.4. Testing Results and Analysis of Testing Set
After the comparative ablation experiment, the optimal model of the improved YOLOv5_zc is set as the detection model of the automatic optical detection system. An actual test of the model is carried out, with the NMS IoU threshold on the testing set set to 0.45.
Table 3 and
Figure 20 present the statistical results of defect correctness for each class and the confusion matrix, respectively.
Figure 21 illustrates the detection results of the four types of defects.
The test reveals that the correct identification rate of scratch and stain defects is 100%. Due to the similar image features of dust and bright spots and their relatively small sizes, some false detections and missed detections occur. According to the detection results, the average detection accuracy of the YOLOv5_zc model in this system stands at 98.75%, which basically meets the requirements for distinguishing good and defective mobile phone flat glass products.
5. Conclusions
To enable the industrial inspection of mobile phone flat glass, solve the problem of dust interference on the production line, and reduce the error rate of subsequent supporting products, this paper independently designed a total reflection–grazing incidence light source based on the distinct optical characteristics of point defects, scratches, stains, and dust, which allows these four types of defects to be effectively distinguished. In addition, the deep learning network YOLOv5 was introduced into the defect detection of mobile phone flat glass. Aiming at the problems of small target defect detection and inconspicuous defect features, the CBAM was embedded into the feature extraction module and a small target detection layer was added. To address the problems of manual labeling, inaccurate labeling boxes, and data deficiency, a data augmentation method based on segmented image fusion with automatic output of position information and labels was put forward. The comparative ablation experiments on the training set and testing set prove that the proposed algorithm performs better than YOLOv5 in the defect detection of mobile phone flat glass. The average precision of the optimal model on the validation set reaches 98.36%, the missed detection rate is 1.27%, the over-detection rate is 2.47%, the detection speed stands at 64 fps, and its average accuracy on the testing set reaches 98.75%. All of these data suggest that the model is applicable to multi-scale and multi-type glass defect detection and that it meets the industrial requirements for detecting mobile phone flat glass defects. It is worth noting that the research in this paper is based on the flat region of mobile phone flat glass, and the task of detecting defects in the edge region is ongoing. Therefore, future research will address defect detection at the edges of mobile phone flat glass and how to quickly deploy online inspection models in different industrial environments.