1. Introduction
3D pattern film is a 2D film that appears as a 3D pattern depending on the amount and angle of light. This film was created for marketing purposes, and it is used to attract consumer attention and make products look more luxurious by attaching the film to the exterior of a product. 3D pattern film is shown in
Figure 1.
Figure 2 shows an example of the film’s application, where the left image is without the film attached, and the right image is with the film attached. Since this film was developed recently, the distinction between good and bad 3D pattern films is currently being made by visual inspection. However, to facilitate mass production, it is necessary to establish an inspection system capable of identifying defective products.
3D pattern film must be inspected pattern by pattern to determine whether each product is good or bad. Methods for determining whether the shape of a product is defective continue to be published. Rule-based approaches include methods that use segmentation to detect objects and assess images, methods that inspect images using luminance or brightness values, and techniques based on the width at specific heights in the image histogram. Machine learning approaches include techniques that apply Canny edge detection to detect objects and then classify images using a Support Vector Machine (SVM), methods that apply Canny edge detection followed by inspection with a convolutional neural network (CNN), approaches that use few-shot models for datasets with a small number of images, and studies that classify products using deep models such as VGG16 and ResNet50. However, the images used in this paper are few in number and similar to each other, so deep and complex deep learning models have a high probability of overfitting. When data are scarce, it is common to use pre-trained models or data augmentation, but these methods do not always yield good performance. Additionally, the contours of 3D pattern films are not distinct, which makes pattern detection difficult. Furthermore, recent machine learning papers report high performance with highly complex model structures, but their experimental data mostly consist of complex images. In contrast, the data used in this study consist of images that are very similar to each other and simpler than the data used in other papers.
Therefore, using complex models for such data can lead to overfitting and result in significantly lower classification accuracy.
To overcome these limitations, this paper proposes a method that includes a preprocessing step where widths at specific heights in the image histogram are calculated, as suggested by Lee et al. [
1,
2]. Following this, an MLP is used to classify the images. At this stage, the optimal hyper-parameters for the MLP are determined using the random search method. The proposed method, utilizing the width data obtained from the method proposed by Lee et al. [
1,
2], can solve the problem of detecting patterns in 3D films, which were previously difficult to segment, and can increase accuracy. Additionally, by reducing the complexity of the MLP model and finding the optimal hyper-parameters through the random search method, the probability of overfitting can be reduced.
In summary, the contributions of this study are as follows:
3D film data are image data for which object detection is challenging due to the faintness of the contours. To address this issue, we resolved the problem of object detection by preprocessing the 3D film image data. Specifically, we calculated the width of the bins for each interval of the image histogram, allowing us to effectively obtain information from the 3D film;
If the method proposed in earlier work [
1,
2] is employed, the threshold values and accuracies for classifying products as ‘good’ and ‘bad’ will vary depending on the sampled data. To address these limitations, in this study, we propose a method of classifying images using deep learning instead of threshold values;
Because of the high similarity between ‘good’ and ‘bad’ images in 3D film image data, complex structures of deep learning models can lead to overfitting and lower accuracy. In this study, we achieved high accuracy by employing a simple deep learning model that considers the characteristics of the data;
The 3D film images used in this study are a recent development, and as there have not been many validation methods researched yet, quality verification in industrial settings still relies on manual inspection by workers. Likewise, in manufacturing and certain sectors, there is a need for data similar to those used in this study, and there is ongoing development and utilization of such data, necessitating validation methods. Therefore, this study will be helpful for quality research on similar products.
The remainder of this paper is organized as follows:
Section 2 describes conventional and recently published methods for classifying good and bad images.
Section 3 describes the method proposed in this paper, and
Section 4 presents the experimental results obtained using the proposed method. Finally,
Section 5 presents the conclusions of this study.
2. Related Works
The quality of 3D pattern film images can be inspected using rule-based methods or machine learning methods. Firstly, in the rule-based method, techniques have been proposed that classify images using the luminance or brightness of the image [
1,
2,
3,
4]. Among these methods, Lee and Kim published a study where they calculated the image histograms for 3D pattern film images, determined the width at the 10th percentile height, and then classified the images based on a threshold value [
1]. This method compensated for the blurred contours of 3D pattern films, which make pattern detection difficult, and achieved a classification accuracy of up to 99.34% in experiments. In [
2], widths for all heights of the image histogram were calculated and compared to delineate the ranges of good and bad pattern images. However, these methods have limitations in that the classification threshold values at specific heights in the image histogram, ranging from 1/10 to 9/10, can all be different, and these threshold values can change with varying amounts of data. Michelson contrast differentiates images based on the maximum and minimum luminance values of the image [
3]. The formula for this is as follows:
$$MC_i = \frac{L_{\max,i} - L_{\min,i}}{L_{\max,i} + L_{\min,i}} \qquad (1)$$
where $MC_i$ represents the Michelson contrast value of the $i$-th image, $L_{\max,i}$ is the maximum luminance value of the image, and $L_{\min,i}$ is the minimum luminance value of the image. However, as shown in Equation (1), Michelson contrast only considers the minimum and maximum luminance values of the image. Unlike the luminance values typically found in most good or bad images, there can be unusually large or small values in a minority of the luminance values. This variation can make it difficult to clearly distinguish between good and bad images.
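To make the sensitivity to extreme values concrete, the following is a minimal sketch of Equation (1) in plain Python. The two 3 × 3 patches are hypothetical values chosen for illustration and are not taken from the 3D pattern film data; treating an all-black image as contrast 0 is also an assumption of this sketch.

```python
def michelson_contrast(luminance):
    """Return (L_max - L_min) / (L_max + L_min) for a 2D grid of luminance values."""
    values = [v for row in luminance for v in row]
    l_max, l_min = max(values), min(values)
    if l_max + l_min == 0:
        return 0.0  # all-black image: contrast is undefined, treated as 0 here
    return (l_max - l_min) / (l_max + l_min)


# Hypothetical grayscale patches (0-255); the second differs by one bright pixel.
patch = [[100, 110, 105], [102, 108, 104], [101, 109, 103]]
patch_with_outlier = [[100, 110, 105], [102, 255, 104], [101, 109, 103]]

contrast_clean = michelson_contrast(patch)                 # (110 - 100) / (110 + 100)
contrast_outlier = michelson_contrast(patch_with_outlier)  # (255 - 100) / (255 + 100)
```

A single outlying pixel moves the contrast from roughly 0.05 to roughly 0.44, even though the remaining eight pixels are unchanged.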
In research on classifying the quality of images, segmentation, a method for detecting objects within images, is widely used. Segmentation refers to the process of differentiating objects in an image based on varying characteristics of pixels, such as color, texture, and brightness. Among the segmentation methods, there are techniques that allow for the morphological detection of objects, with morphological geodesic active contour being a prominent example. Morphological geodesic active contour is a method that combines the morphology snake method [
4] and the geodesic active contour method [
5], and it progressively evolves the segmented regions before morphologically segmenting objects. Recent studies have applied this family of methods in several ways: morphological geodesic active contour combined with image processing techniques to segment and classify welding beads when evaluating the performance of welding robots [
6], a deep learning model based on a non-parametric adaptive active contour method called fast morphological geodesic active contour (FGAC) to segment the left ventricle [
7], morphological geodesic active contour for the automatic segmentation of the aorta from CT images [
8], and FGAC to segment lung images based on an active contour model (ACM) without prior training [
9]. However, morphological geodesic active contour achieves high accuracy only when the position of the segmentation target is set accurately.
A traditional approach to segmentation is edge detection, which detects objects by locating areas where the brightness values of pixels change abruptly. There are various types of edge detection, including Sobel, Canny, and Laplacian edge detection. Sobel edge detection detects objects by highlighting areas where the first derivative of the image intensity is large. It is more robust to noise than other edge detection methods and is more sensitive to diagonal edges than to vertical and horizontal ones. Canny edge detection employs a Gaussian filter to remove noise from an image and applies two thresholds to determine its edges. It provides sharp edges and is known for its relatively accurate detection, which makes it a widely used default choice. Laplacian edge detection, unlike the other edge detection methods, uses second-order derivatives and excels at detecting edges between light and dark regions. Edge detection is typically used as part of preprocessing, and research combining it with machine learning techniques has recently been emerging consistently [
10,
11,
12,
13,
14,
15]. Mlyahilu et al. proposed a method where they detected the edges of a 3D pattern film using Canny, Sobel, and Laplacian edge detection during the preprocessing stage and then classified 3D pattern images using a convolutional neural network (CNN) [
10]. Salman et al. published a study in which they applied Canny edge detection to detect leaf contours and then used the Support Vector Machine (SVM) method for classification [
11]. Furthermore, Jun and Jung proposed a method for inspecting the quality of Printed Circuit Board (PCB) products using a combination of the Laplacian filter and CNN methods [
12]. The experimental results showed an improvement of 11.87% compared to the existing methods. Furthermore, in segmentation research, studies have proposed methods utilizing drones equipped with high-resolution proximity cameras for capturing images and then employing methods such as dual tree complex wavelet transform (DTCWT) and discrete wavelet transform (DWT) to segment and detect concrete cracks [
13], detecting fires and extracting fire features using different image-processing techniques such as Canny, Sobel, and HSV transformations [
14], and segmenting and detecting concrete cracks in images using edge detectors like Roberts, Prewitt, Sobel, and deep convolutional neural networks (DCNN) [
15]. However, 3D pattern films pose a challenge for pattern detection due to their blurred contours. Furthermore, finding optimal hyper-parameters for data classification in deep learning models and SVM still presents a challenge.
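The effect of blurred contours on gradient-based edge detectors can be illustrated with a small sketch. The code below applies the standard 3 × 3 Sobel kernels to two hypothetical images: one with a sharp step edge and one in which the same brightness change is spread over a ramp. Both images are invented for illustration; the sketch is not the preprocessing used in any of the cited papers.

```python
import math


def sobel_magnitude(image):
    """Gradient magnitude at interior pixels using the 3x3 Sobel kernels."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = sum(kx[i][j] * image[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(ky[i][j] * image[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            out[r][c] = math.hypot(gx, gy)
    return out


def peak_response(image):
    """Largest gradient magnitude found anywhere in the image."""
    return max(max(row) for row in sobel_magnitude(image))


# Hypothetical 5 x 6 images: a sharp vertical edge and a gradual (blurred) ramp.
sharp = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
blurred = [[0, 51, 102, 153, 204, 255] for _ in range(5)]

peak_sharp = peak_response(sharp)      # 1020.0
peak_blurred = peak_response(blurred)  # 408.0
```

The same total brightness change produces a much weaker peak response when it is spread across several pixels, which is why a fixed edge threshold can miss faint contours.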
Among machine learning models used for image data classification, popular and high-performance models include VGG16 and ResNet50 [
16,
17,
18,
19,
20,
21]. VGG16 is a model with 16 weight layers (13 convolutional and 3 fully connected layers) together with pooling layers, and it is widely used in image recognition and classification research [
16]. In their research using VGG16, Qu et al. proposed a method for detecting defects in paper using VGG16, particularly focusing on paper data with a small sample size [
17]. To overcome the issue of overfitting, especially when dealing with a small dataset, the authors froze the first seven layers of VGG16 and fine-tuned the remaining convolutional layers using paper defect images. Through this approach, they achieved a classification accuracy of 94.75% in their experimental results. Althubiti et al. presented a research study in which they developed a method for detecting defects in circuit manufacturing [
18]. They converted images to the HSV color space, identified regions of interest (ROI), and used VGG16 to detect faulty products. ResNet50, another widely used model, consists of a total of 50 layers and addresses the problem of vanishing gradients in deep layers by employing a technique known as ‘residual connections’ [
19]. Feng et al. published a research study in which they used ResNet50 to classify defects such as slag and scratches occurring on the surface of hot-rolled strip steel [
20]. In this study, to reduce the risk of misclassification during defect classification, Feng et al. added FcaNet and Convolutional Block Attention Module (CBAM) methods to ResNet50, achieving an approximate classification accuracy of 94.85%. Additionally, Kumar and Bai presented a method using ResNet50 to detect and classify defects (cut, color, hole, thread, metal contamination) occurring during fabric production, achieving a high accuracy of 96.4% in their experimental results [
21]. However, deep learning models with many layers can still experience overfitting, even when freezing some layers, especially when the dataset is small and the images are similar. Therefore, achieving high accuracy in such cases can be challenging.
Recently, studies have used few-shot learning to address the challenges of applying deep learning models when the dataset is small [
22,
23,
24]. Few-shot learning is designed to extract and adapt as much information as possible from a small amount of data, often using transfer learning and meta-learning techniques [
22]. Cao et al. proposed a method in which they fine-tuned only the parameters of the deep layers in a SqueezeNet-based model and integrated batch-size-independent Group Normalization (GN) for stable results. In their experimental results, they achieved accuracies of 97.69% and 82.92% on two different datasets, respectively [
23]. Nagy and Czúni proposed a method that combines few-shot learning using the EfficientNet-B7 deep neural network with randomized classifiers [
24]. In their experiments, they analyzed defect data from steel surfaces and achieved an accuracy of over 99%. However, few-shot learning can be difficult to generalize because it uses a very small number of training samples, and achieving high performance in specific domains or on complex problems can be difficult. Furthermore, recent deep learning models, while exhibiting high performance, tend to have highly complex structures. Because the 3D film images used in this study are very similar to each other, conventional deep learning models are likely to yield lower accuracy on them.
3. Proposed Method
The 3D pattern film images used in this study have blurred contours, are similar to one another, and form a small dataset, which makes deep learning models prone to overfitting. As mentioned earlier, because of these characteristics, conventional approaches such as segmentation, VGG, and ResNet face limitations in classifying images of these films. In addition, in the method proposed by Lee et al. [
1], while achieving a very high classification accuracy, there was a limitation of having different classification thresholds for each 10th percentile height of the image histogram. This required the cumbersome process of finding the optimal threshold and height before performing the classification. Furthermore, for the method of Lee et al. [
1], when classifying products into good and bad categories, the accuracy varies depending on the threshold value. In other words, while the threshold value may be suitable for the sampled data used in the experiment, the accuracy may vary when additional data are introduced. To address these challenges, in this study, we propose a method where we utilize the approach suggested by Lee et al. [
1] as a preprocessing step. After that, we calculate the widths at specific heights for each image and then use deep learning to classify the 3D pattern film images. The proposed method’s procedure is depicted in
Figure 3, and the details of step 1 are explained in
Section 3.1, while step 2 is described in
Section 3.2.
3.1. Calculating the Width at a Specific Height from the Image Histogram
We use the fast Fourier transform method proposed by Mlyahilu and Kim in [
25] to cut images for each pattern in the 3D pattern film, as shown in
Figure 2. Then, as shown in step 1 of
Figure 3, we calculate the width at a specific height for each image using the method proposed by Lee et al. in [
1]. In this context, the width, as described in Lee et al.’s [
1] paper, carries information about whether the image is classified as ‘good’ or ‘bad’. First, for each 3D pattern film image, we calculate the image histogram $H(k)$ representing the frequency of each pixel value. The formula is as follows:
$$H(k) = \sum_{x=1}^{M} \sum_{y=1}^{N} \mathbf{1}\left(I(x, y) = k\right)$$
where $H(k)$ represents the number of pixels with a brightness value $k$ in the grayscale image, $M$ and $N$ are the width and height of the image, and $I(x, y)$ represents the value of the pixel at position $(x, y)$. Additionally, the term $\mathbf{1}(\cdot)$ represents an indicator function, which returns 1 if $I(x, y)$ is equal to brightness value $k$, and 0 otherwise. After obtaining the image histogram, we calculate the heights $h_p$ corresponding to the 10th percentile $p$ of the image histogram as follows.
$$h_p = p \cdot \max_{k} H(k), \qquad p \in \{1/10, 2/10, \ldots, 9/10\}$$
Then, in the image histogram, we calculate the minimum value $x_{\min}(h_p)$ and maximum value $x_{\max}(h_p)$, which are the points of intersection with the x-axis at a specific height $h_p$, as follows.
$$x_{\min}(h_p) = \min\{k : H(k) \ge h_p\}, \qquad x_{\max}(h_p) = \max\{k : H(k) \ge h_p\}$$
Using the previously obtained minimum value $x_{\min}(h_p)$ and maximum value $x_{\max}(h_p)$ on the x-axis, we calculate the width $w_p$ at a specific height $h_p$ as given by the following equation.
$$w_p = x_{\max}(h_p) - x_{\min}(h_p)$$
We calculate and store the widths of the image histograms at heights ranging from 1/10 to 9/10 for each 3D pattern film image.
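The preprocessing step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of Lee et al.: it assumes that the height at fraction p is p times the peak count of the histogram, and that the histogram "intersects" that height at the smallest and largest brightness values whose counts reach it. The function names and the toy image are hypothetical.

```python
def image_histogram(image, levels=256):
    """Count how many pixels take each brightness value 0..levels-1."""
    counts = [0] * levels
    for row in image:
        for value in row:
            counts[value] += 1
    return counts


def width_at_tenth(counts, tenth):
    """Width of the histogram at a height of tenth/10 of its peak count.

    Assumption: a bin reaches the height when its count is at least that
    height. Integer arithmetic avoids floating-point comparisons.
    """
    peak = max(counts)
    bins = [k for k, c in enumerate(counts) if c > 0 and 10 * c >= tenth * peak]
    return bins[-1] - bins[0]


def width_features(image):
    """Nine widths, one for each height from 1/10 to 9/10 of the peak."""
    counts = image_histogram(image)
    return [width_at_tenth(counts, tenth) for tenth in range(1, 10)]


# Hypothetical 2 x 13 grayscale image whose histogram peaks at brightness 100
# with counts {98: 2, 99: 6, 100: 10, 101: 6, 102: 2}.
pixels = [98] * 2 + [99] * 6 + [100] * 10 + [101] * 6 + [102] * 2
toy_image = [pixels[:13], pixels[13:]]

features = width_features(toy_image)  # -> [4, 4, 2, 2, 2, 2, 0, 0, 0]
```

Each image is thus reduced to a nine-value vector, which is far smaller than the raw pixel grid and does not depend on locating contours.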
3.2. 3D Pattern Film Image Classification Using MLP
In the second step, we classify the width values at specific heights obtained from the previously calculated image histogram using an MLP, which is parameterized by its weights and its number of hidden layers. In deep learning, hyperparameters, which are parameters that determine the configuration of the method, must be set. Hyperparameters have a significant impact on the method’s performance and learning capability, so finding the optimal values is important. However, since they are not automatically determined by the training data, users either set them manually or use hyperparameter optimization techniques to find the optimal values. In this paper, we use the random search optimization technique, which is well known for efficiently exploring the hyperparameter space within a given time frame [
26]. Among the hyperparameters, we search for optimal values of the number of hidden layers, the number of hidden nodes, and the learning rate. This is because the 3D pattern film images used in this study have a small amount of data and are similar to each other, which can lead to the problem of overfitting. Therefore, setting an appropriate model complexity is crucial, which necessitates finding the optimal number of hidden layers and hidden nodes. Additionally, in defect detection tasks, processing speed is important in addition to accuracy, which is why we search for the optimal learning rate value. For the random search method, we set the ranges for the hyperparameters as follows: the number of hidden layers $L \in \{2, 3, 4\}$, the number of hidden nodes $n \in \{64, 128, 256\}$, and the learning rate $\eta \in \{0.001, 0.01, 0.1\}$. The ranges for the number of hidden layers and the number of hidden nodes were set to values that allow the model to be sufficiently complex while still being able to learn effectively. The learning rate was chosen from commonly used values. After finding the optimal values for the number of hidden layers, number of hidden nodes, and learning rate through the random search method, these values are applied to the MLP. The MLP takes the width values calculated from each 3D pattern film image as input and classifies each image as ‘good’ or ‘bad’.
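The random search procedure can be sketched as follows. The search space uses the ranges reported in Section 4 (two to four hidden layers; 64, 128, or 256 nodes per layer; learning rates of 0.001, 0.01, and 0.1) and ten trials. The function names are hypothetical, and `placeholder_score` is a stand-in for the cross-validated accuracy of a trained MLP; it is not a real training run and its values carry no meaning.

```python
import random

SEARCH_SPACE = {
    "hidden_layers": [2, 3, 4],
    "hidden_nodes": [64, 128, 256],
    "learning_rate": [0.001, 0.01, 0.1],
}


def sample_config(rng):
    """Draw one configuration: a node count per hidden layer plus a learning rate."""
    n_layers = rng.choice(SEARCH_SPACE["hidden_layers"])
    return {
        "hidden_nodes": [rng.choice(SEARCH_SPACE["hidden_nodes"])
                         for _ in range(n_layers)],
        "learning_rate": rng.choice(SEARCH_SPACE["learning_rate"]),
    }


def random_search(evaluate, n_trials=10, seed=0):
    """Score n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_trials):
        config = sample_config(rng)
        scored.append((evaluate(config), config))
    best_score, best_config = max(scored, key=lambda item: item[0])
    return best_score, best_config, scored


def placeholder_score(config):
    # Hypothetical stand-in for cross-validated accuracy; NOT a real model fit.
    return 1.0 / (1.0 + len(config["hidden_nodes"]) + config["learning_rate"])


best_score, best_config, all_trials = random_search(placeholder_score)
```

In practice, `evaluate` would train the MLP with the sampled configuration and return its validation accuracy. Unlike grid search, the number of trials is fixed in advance and does not grow with the size of the search space.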
4. Experimental Results
To evaluate the performance of the proposed method in this paper, we conducted an analysis using 3D pattern film, as shown in
Figure 2. As shown in
Figure 2, the 3D pattern film consists of multiple patterns printed on a large film. We used the fast Fourier transform method to cut images for each pattern, and the results are depicted in
Figure 4. In total, there are 2850 images, of which 2136 are ‘good’ 3D pattern film images and 714 are ‘bad’ 3D pattern film images. To perform deep learning, we divided the data into training and test sets in an 8:2 ratio. The training data consisted of 1710 ‘good’ images and 570 ‘bad’ images, while the test data included 426 ‘good’ images and 144 ‘bad’ images. The PC specifications were as follows: Windows 10 Pro, Intel® Core™ i7-10700K CPU, NVIDIA GeForce RTX 2080 SUPER, 16 GB RAM, and Python 3.6.
First, we performed preprocessing using the method proposed by Lee et al. [
1]. In the preprocessing step, we calculated the image histograms for the cropped 3D pattern film images, as shown in
Figure 5. Then, for each image histogram, we determined the minimum and maximum values on the x-axis that intersected with the height of 1/10 and used these values to calculate the width. In the same way, we calculated the widths of the image histograms at heights ranging from 2/10 to 9/10. After calculating the widths at heights from 1/10 to 9/10 for all images, we used deep learning to perform the classification. We used an MLP, and to find its optimal hyperparameters, we conducted a random search. The hyperparameters explored in the random search were the number of hidden layers, the number of nodes in the hidden layers, and the learning rate. The ranges for each hyperparameter were two to four hidden layers, 64, 128, or 256 nodes per hidden layer, and learning rates of 0.001, 0.01, and 0.1. In addition to these hyperparameters, for the MLP, the dropout was set to 0.1, the activation function for hidden layers was ReLU, the regularization was
, the loss function was binary cross-entropy, epochs were set to 20, the batch size was 100, and five-fold cross-validation was used; all were kept constant for the experiments. As shown in
Table 1, a total of 10 fittings were performed in the random search, and all but one achieved an accuracy of over 97%. The third experiment showed the highest accuracy of 99.3%, with the configuration of three hidden layers, with node counts of 64, 256, and 128 for each hidden layer, and a learning rate of 0.001. In addition, both the recall and precision were 99.5%. The total computation time was 3250.7 s, with 3240.2 s spent on preprocessing and 10.5 s spent on analysis using the MLP method.
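As a worked check of how accuracy, recall, and precision relate, the sketch below computes the three metrics from a confusion matrix. The counts are hypothetical: they are one set of values on the 426 ‘good’ and 144 ‘bad’ test images that happens to be consistent with the reported 99.3% accuracy and 99.5% recall and precision, under the assumption that ‘good’ is the positive class. The actual confusion matrix is not given in the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts: 426 'good' (positive) and 144 'bad' (negative) test images,
# with two 'good' images missed and two 'bad' images accepted.
metrics = classification_metrics(tp=424, fp=2, fn=2, tn=142)
rounded = {name: round(100 * value, 1) for name, value in metrics.items()}
# -> {'accuracy': 99.3, 'recall': 99.5, 'precision': 99.5}
```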
To evaluate the performance of the proposed method, comparative experiments were conducted. In the comparative experiments, the following methods were used: Michelson contrast [
3], morphological geodesic active contour [
6], CNN with Canny [
10], SVM with Canny [
11], few-shot (five-shot), VGG16, and ResNet50. In general, image data are known to achieve better performance with a CNN than with an MLP. Therefore, CNN models were included among the comparison methods. Furthermore, Michelson contrast is a preprocessing method similar to the one used in this study, and morphological geodesic active contour is a segmentation method capable of detecting the desired objects morphologically. Therefore, we also compared their performance in the experiments. Among these methods, Michelson contrast and morphological geodesic active contour used the structural similarity index (SSIM) to classify good and bad images after analysis [
27]. The SSIM evaluates the similarity between two images using their structural information, enabling a comparison of the pixel structures that make up the images. The formula for the SSIM is as follows:
$$\mathrm{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)$$
where $l(x, y)$, $c(x, y)$, and $s(x, y)$ represent the average brightness, contrast, and correlation of the two images $x$ and $y$, respectively. For both Michelson contrast and morphological geodesic active contour, we set the SSIM threshold at 0.5 to assess the similarity between two images. In this context, a higher SSIM value indicates greater similarity between the two images, while a lower SSIM value indicates lower similarity. For the CNN with Canny [
10], we followed the same configuration as in the referenced paper, with two hidden layers and node counts of 32 and 64 for each hidden layer. For few-shot, we conducted experiments using a five-shot approach. Additionally, in the case of the pre-trained models VGG16 and ResNet50, we augmented the data to increase its quantity before conducting the experiments. The reason for this is that both models have complex architectures, while the amount of data used in the experiments is relatively small, which could lead to overfitting.
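For reference, the SSIM comparison can be sketched in plain Python. This is a simplified single-window version that computes the statistics over the whole image rather than averaging over local windows, and it uses the commonly cited constants $K_1 = 0.01$ and $K_2 = 0.03$; both choices are assumptions of this sketch. The images are hypothetical, and applying the 0.5 threshold against a reference image is likewise an assumption about how the comparison was carried out.

```python
def ssim_global(x, y, dynamic_range=255):
    """Single-window SSIM of two equally sized grayscale images (lists of rows)."""
    a = [v for row in x for v in row]
    b = [v for row in y for v in row]
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the brightness term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


# Hypothetical 2 x 2 images: a reference and its brightness-inverted copy.
reference = [[10, 20], [30, 40]]
inverted = [[255 - v for v in row] for row in reference]

score_same = ssim_global(reference, reference)
score_inverted = ssim_global(reference, inverted)
is_similar = score_inverted >= 0.5  # threshold of 0.5 as used in the experiments
```

An image compared with itself scores 1, while the inverted copy scores below zero because its pixel structure is negatively correlated with the reference.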
The experiments were conducted based on defined criteria for comparison, and the results are presented in
Table 2 as average accuracy, average recall, and average precision. The confidence intervals were calculated using 2000 bootstrap replicates. In the comparative experiments, the proposed method achieved an average accuracy of 99.3%, followed by SVM with Canny at 99.0%, VGG16 at 97.3%, ResNet50 at 83.3%, CNN with Canny at 75.3%, morphological geodesic active contour at 74.9%, few-shot at 72.8%, and Michelson contrast at 67.6%. Although VGG16 and ResNet50 are known for their good performance, their performance was relatively lower on a dataset with little data and similar-shaped images, as used in this study. Additionally, few-shot learning, which is designed for small amounts of data, achieved an accuracy of 72.8%, which is lower than the 99.3% achieved by the proposed method. The computation times for each method were as follows: CNN with Canny took 58.0 s, Michelson contrast took 107.2 s, morphological geodesic active contour took 1931.1 s, and ResNet50 took 2882.9 s, all faster than the proposed method. However, their accuracy values were all more than 10 percentage points lower than that of the proposed method. On the other hand, SVM with Canny had a similar accuracy to the proposed method but had a high computational cost of 5238.0 s. Additionally, few-shot and VGG16 had significantly longer computation times of 26,710.9 s and 27,264.2 s, respectively. Therefore, in the comparative experiments, the proposed method had the highest accuracy, and among the models that achieved an accuracy of over 90%, it also had the shortest computation time.
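A bootstrap confidence interval for accuracy with 2000 replicates can be sketched as follows. The paper states only the number of replicates; the percentile method, the 95% level, and the per-image correctness flags below (566 correct out of 570 test images, chosen to be consistent with 99.3% accuracy) are assumptions made for illustration.

```python
import random


def bootstrap_accuracy_ci(correct_flags, n_replicates=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy.

    correct_flags holds 1 for a correctly classified image and 0 otherwise.
    """
    rng = random.Random(seed)
    n = len(correct_flags)
    accuracies = []
    for _ in range(n_replicates):
        # Resample n images with replacement and recompute the accuracy.
        hits = sum(correct_flags[rng.randrange(n)] for _ in range(n))
        accuracies.append(hits / n)
    accuracies.sort()
    lower = accuracies[round(alpha / 2 * n_replicates)]
    upper = accuracies[round((1 - alpha / 2) * n_replicates) - 1]
    return lower, upper


# Hypothetical outcome: 566 of 570 test images classified correctly.
flags = [1] * 566 + [0] * 4
lower, upper = bootstrap_accuracy_ci(flags)
```

The interval brackets the point estimate of 566/570; its width reflects how much the accuracy could vary if a different test sample of the same size had been drawn.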
5. Conclusions and Discussion
In this paper, we proposed a classification method for inspecting 3D pattern film images. In a preprocessing step, we calculated the widths of the image histograms at specific heights. Then, we used an MLP model to analyze the width data. We employed the random search method to find the optimal hyperparameters for the number of hidden layers, the number of nodes in each hidden layer, and the learning rate, which were used to build the MLP model. The proposed method addressed the limitation of blurry contours in 3D pattern film images by using the pixel histogram of each image in the preprocessing stage, thereby improving the accuracy of the analysis results. Furthermore, it mitigated overfitting by constructing a simple MLP model for data with small sample sizes and similar image characteristics. In the experiments, the proposed method achieved an accuracy of 99.3%, which was the highest among the models tested. In comparison, models with more complex structures, such as VGG16, ResNet50, CNN with Canny, and SVM with Canny, had accuracies of 97.3%, 83.3%, 75.3%, and 99.0%, respectively, all lower than that of the proposed method. The few-shot method, which is commonly used when data are scarce, achieved an accuracy of 72.8%, also lower than that of the proposed method. Furthermore, the Michelson contrast method achieved an accuracy of 67.6%, and morphological geodesic active contour achieved 74.9%. Therefore, this paper demonstrated that relatively simple deep learning models tailored to the characteristics of the data can achieve good performance.
The 3D film images used in this paper, as mentioned earlier, are a recently developed product that is continuously evolving. In addition, it is likely that other images currently under development will have similar characteristics to the images used in this study. Therefore, the results of this study could serve as foundational research for validation not only for newly developed 3D film images but also for similar products. In future research, we plan to conduct further studies to improve the preprocessing stage. The computation time for the MLP model used in the proposed method was approximately 10 s, but the preprocessing stage consumed most of the computation time. Therefore, we intend to conduct research which aims to improve the preprocessing stage to increase accuracy while reducing computation time.