1. Introduction
Surface mount technology (SMT) is a manufacturing method used to produce electronic circuits by assembling components onto the surface of a printed circuit board (PCB). Using SMT, solder paste is applied to the surface of a circuit board, a surface mount component (SMC) is installed, and the solder is then melted to produce the PCB. The automated optical inspection (AOI) machine of an SMT line uses a camera to inspect the PCB for assembly defects and classifies the defects according to type. After classification, the defects are used in maintenance tasks to identify root causes in the assembly and to improve the process sequence. In recent years, 3D AOI has become widespread, although 2D AOI remains commonly used due to its high economic efficiency.
Figure 1 shows a defect in an SMT assembly line obtained using a light-emitting diode (LED) light system. SMT assembly defects can occur either in the package or in the solder region. The defect is labeled as missing when the chip component is not in place according to the PCB design. The Manhattan defect occurs when the chip component is erected horizontally, and the Tombstone defect occurs when the chip component stands vertically on one solder, with the other side being separated from the solder. An over-soldering defect occurs when the amount of solder is higher than normal; an under-soldering defect occurs when the amount of solder is lower than normal. The missing component, Manhattan defect, and Tombstone defect can be considered package defects, while over-soldering and under-soldering are defects in the solder region.
Two-dimensional AOI-based inspection methods include printed circuit board inspection, component detection, and defect classification [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. In this paper, defect classification methods in the SMT line are divided into non-learning-based and learning-based methods. Non-learning-based defect classification compares templates, features, and so on against threshold values [4,5,6,7,8,9]. For instance, H. Xie [4] proposed a defect classification method using the difference in images between the input image and the reference image. First, the variance of gray of each defect is obtained from the training image dataset. Then, differences between a reference image for each defect and the input image are obtained. H. Xie [4] classified defects by comparing the differences between images and the variance of gray of each defect. H. Wu [5] proposed a method for discriminating defects by extracting a chip component region. They segmented the chip electrode region by applying the Otsu binarization algorithm to the red color channel of the chip component image. They first extracted the vertical projection and horizontal projection histograms from the binarized image, and then extracted the location of the chip component using the gradient of the vertical and horizontal projection histograms. Defect identification is performed by comparing the extracted chip component's location with the PCB design. F. Wu [6] proposed a method for classifying defects in IC chip leads. They divided the IC chip leads into subregions and obtained the ratio and center of gravity of the highlight regions from each subregion. A highlight region is a set of pixels above a certain pixel threshold in a subregion. Each subregion and its features are defined as 'logical features'. A decision tree based on logical features is then used to classify defects.
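The projection-based localization described above can be illustrated with a minimal sketch. It assumes the image has already been binarized (for example by Otsu thresholding) into rows of 0/1 values, and it simplifies the gradient step to finding where each projection histogram first and last becomes nonzero. The function names `projections`, `extent`, and `bounding_box` are hypothetical and are not taken from the cited work.

```python
def projections(binary):
    """Return (horizontal, vertical) projection histograms of a 0/1 image.

    horizontal[r] counts foreground pixels in row r;
    vertical[c] counts foreground pixels in column c.
    """
    horizontal = [sum(row) for row in binary]
    vertical = [sum(col) for col in zip(*binary)]
    return horizontal, vertical


def extent(hist):
    """First and last index where the histogram is nonzero, or None if empty."""
    idx = [i for i, v in enumerate(hist) if v > 0]
    return (idx[0], idx[-1]) if idx else None


def bounding_box(binary):
    """Bounding box (top, bottom, left, right) of the foreground, or None."""
    horizontal, vertical = projections(binary)
    rows, cols = extent(horizontal), extent(vertical)
    if rows is None or cols is None:
        return None
    return rows[0], rows[1], cols[0], cols[1]


# Toy binarized image: the foreground occupies rows 1-2 and columns 2-4.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(bounding_box(image))
```

The located box could then be compared with the position expected from the PCB design; how that comparison is thresholded is not specified here.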
Learning-based methods classify defects through features such as the average pixel value of each red, green and blue (RGB) channel and the highlight value of each RGB channel [10,11,15,16], and normalized cross-correlation (NCC) [10,11], extracted from massive chip component defect datasets obtained using an LED light system. The model analyzes and learns these features, then classifies the new input data [10,11,12,13,14,15,16]. H. Wu [10] proposed a Bayesian classifier and support-vector machine (SVM)-based defect classification method. They extracted the features described above from the solder region and classified them using a Bayesian classifier and SVM. The chip components were first classified as either normal or non-normal using the Bayesian classifier, and the defect type of the non-normal components was classified using the SVM. F. Wu [12] proposed a defect classification method using color grades, to be used in conjunction with the features of the above method. The color grades that describe the color sequence from one side to the other in a pointed region have different RGB patterns depending on the defect. They improved defect classification performance using color grades and decision trees. While the learning-based classification method generates a complex and detailed model considering the accumulated data, the non-learning-based method simply compares thresholds such as an NCC of template matching. Similarly, H. Wu [11] proposed a classification model that uses the loss of a multilayer perceptron (MLP) in a genetic algorithm to select optimal features. Song [15] further segmented the image into detailed regions to extract the features within. They specified nine subregions by considering the structure of the chip component. They extracted the average pixel value of each RGB channel and the highlight value of each RGB channel from each subregion. They then used the MLP to classify the defects found.
Learning-based methods are a recent advance in the field of defect classification. In particular, convolutional neural networks (CNNs) specialize in visual imagery; they are used in fields such as object detection, classification [17,18,19,20], and defect detection [21,22,23]. Q. Xie et al. [21] proposed a two-level hierarchical CNN system for the inspection of sewer defects. A normal-defect labeling CNN first classifies the input image into two types, normal and abnormal, and an inter-defect labeling CNN then classifies the abnormal images by defect type. In terms of system structure, the method of Q. Xie [21] can be said to be similar to that of H. Wu [10]. H. Yang [23] subsequently proposed a multi-scale feature clustering autoencoder (MS-FCAE) for the defect inspection of fabrics and metal surfaces. They detected defects using the differences between the original images and the images restored by an autoencoder. They improved the accuracy of defect detection by increasing the restoration performance of the autoencoder using clustering on the latent vector.
A number of studies use CNNs to classify defects in surface mount device (SMD) assembly [24,25]. For example, N. Cai [24] proposed a cascaded CNN for the detection of soldering defects. The solder region of a chip component was divided into several regions, and the CNNs corresponding to these regions were connected. Defects were then classified based on the output of the CNNs. However, this method was only able to detect defects occurring in the solder region, and not in the package region. Although it could determine whether there was a defect or not, it was unable to classify defects by type. Y. G. Kim [25] proposed a single-stream CNN that could detect and classify defects in both the soldering and package regions. However, this model uses the full chip component image as its input, and this image contains background that is not related to the defects of chip components, such as silk lines and circuit patterns. As a result, the model can become inefficient because it processes irrelevant items in the image. Moreover, in the training step, the weight of the convolution layer for the background can become higher than the weight for the actual defect region. In this case, there is a possibility that the CNN model will overfit the training dataset.
This paper proposes a new dual-stream CNN that uses the two solder regions in the chip component image. The network has two parallel CNN streams that merge after their own processing steps. The solder regions are used because most defect features of the chip component are contained in them. For example, missing defects can be distinguished by the presence or absence of electrodes in the solder region. Tombstone and Manhattan defects can also be distinguished by the size of the solder region. Therefore, the proposed method can detect both soldering and packaging defects. We use a merge step to combine the results of the two streams. Merging refers to the process of combining two streams into one stream, as the proposed method uses a multiple-input single-output (MISO) structure in which two solder regions are used as the input and one defect type as the output. A convolution layer follows the merge. Through the convolution operation, feature maps having a high weight are passed on to the following layers, and feature maps having a low weight disappear during inference. This series of steps is similar to a kind of feature selection. Here, merging is classified into two types: early merge and late merge. An early merge means performing the merge at the convolution layer level. The output feature map of the convolution layer, which is the front part of the CNN, is a low-level feature. Early merge combines the low-level feature maps output from the convolution layer of each stream into one feature map. Late merge means performing the merge in the fully connected layer stage. The output feature map of the fully connected layer is a high-level feature because the fully connected layer is located at the end of the CNN. Late merge combines the high-level feature maps output from the fully connected layer of each stream into one feature map.
The contributions of this paper can be summarized as follows:
We propose an inspection method that is optimized for classifying chip component defects using two solder regions. We propose a system that extracts features by dividing two solder region images into each stream.
We present experimental results according to the merge structure in a dual-stream CNN. We propose a CNN structure that is optimized for the defect classification of chip components by comparing an early-merge method in which the merge is performed in the convolution layer with a late-merge method in the fully connected layer.
In terms of the chip defect classification, the proposed method achieves a higher accuracy and F1-score than a single-stream CNN based on the full chip component image.
The structure of this paper is as follows. Section 2 defines the assembly defect classification problem, and Section 3 illustrates the light system used in this paper. Section 4 describes the structure of the assembly defect classification system. Section 5 describes the algorithm used for extracting the solder regions. Section 6 describes the proposed CNN structure. Section 7 presents the results of the experiments and analyzes their accuracy. Finally, Section 8 summarizes this paper.
4. System Structure
SMT assembly defects can be classified by a CNN, a deep neural network that has shown high performance in image recognition and processing. Figure 4 shows the structures of the CNNs for SMT assembly defect classification. In both cases, the probability $p_k$ of the occurrence of each defect type $k$ can be obtained. The defect type $\hat{k}$ for the input image is defined as:

$$\hat{k} = \arg\max_{k} \, p_k$$
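As an illustration of selecting a defect type from the per-type probabilities, the following minimal sketch picks the type with the highest probability. This argmax rule and the six label names in `DEFECT_TYPES` (a normal class plus the five defects named in the Introduction) are assumptions made for the example; `predict_defect_type` is a hypothetical helper.

```python
# Assumed label set: a normal class plus the five defects from the Introduction.
DEFECT_TYPES = [
    "normal",
    "missing",
    "manhattan",
    "tombstone",
    "over-soldering",
    "under-soldering",
]


def predict_defect_type(probabilities, labels=DEFECT_TYPES):
    """Return the label whose probability is largest (assumed decision rule)."""
    if len(probabilities) != len(labels):
        raise ValueError("one probability is required per defect type")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best]


print(predict_defect_type([0.05, 0.10, 0.60, 0.10, 0.10, 0.05]))
```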
Figure 4a is an assembly defect classification system having a single-stream CNN that uses the red, green, and blue color channel images of each chip component image as inputs. Although the chip component image includes both the package and solder regions, it cannot exclude the silk line background. Figure 5 shows the PCB patterns included in the background region of the chip component image.
Figure 4b shows the structure of the assembly defect classification system with the dual-stream CNN. The inputs are the two solder region images extracted from the chip component image, each given as its red, green, and blue color channel images. Note that the solder region image does not include the silk line, which disrupts the feature extraction of the CNN.
As described in the Introduction, most defect features are distributed in the solder region of the component, and as such the two solder regions can be used to classify both the package and solder defects. Figure 6 shows the solder region image based on defect type, wherein the defect feature is distributed in the solder region. Removing unnecessary PCB patterns through solder region extraction can reduce the chance of misclassification and thereby increase accuracy. In addition, the size of the input image can also be reduced. This optimization simplifies the CNN internal structure, reducing the overall weight capacity and computational time.
6. Convolutional Neural Network (CNN) Structure
CNN is a machine-learning method that generates a model by extracting and analyzing the features of the input image data. Commonly, a single-stream CNN that takes one image datum is used, but this paper will examine a dual-stream CNN that uses two solder regions as the input data.
6.1. Single-Stream CNN
Figure 9 shows the structure of a single-stream CNN. Most CNNs, such as AlexNet [28] and VGGNet [29], have a single-stream structure. The input data of the single-stream CNN consist of the red, green, and blue channel images of the chip component image. Each color channel of the input data has a size of 64 × 64 pixels.
In Figure 9, C#, P#, and R# represent a convolutional layer, a max pooling layer, and a ResBlock layer (Figure 10), respectively. The convolution layer C# has a kernel size of 3 × 3, padding = 1, and stride = 1. The max pooling layer P# has a kernel size of 2 × 2, padding = 0, and stride = 2. The ReLU function was used as the activation function of the convolution layer and the fully connected layer. Unless otherwise noted, the convolution layer and max pooling layer will have the same configuration.
First, for local feature extraction, the input data are converted into a feature map 64 × 64 × 64 in size through C1. After the initial conversion, the feature map is again converted into a 64 × 64 × 384 feature map through two convolution layers C2 and C3. The output feature map of C3 passes through C4 and is then, for global feature extraction, scaled down to 32 × 32 × 512 through the max pooling layer P1. It is converted into a feature map 16 × 16 × 512 in size through C5 and P2. To avoid the gradient-vanishing problem, ResBlock layers [30] R1 and R2 are attached at the rear of the network. Figure 10 shows the structure of the ResBlock layers R1 and R2 used in this paper. The feature map is converted into 8 × 8 × 512 size through R1 and P3, and 4 × 4 × 512 through R2 and P4. For the fully connected layer input, the feature map output of C6 passes through a 4 × 4 convolution layer C7 with padding = 0 and is converted to 1 × 1 × 512. A probability value for six types of defect is finally output through a fully connected layer having a size of 1 × 1 × 1024.
Table 1 shows the detailed structure of the ResBlock layers R1 and R2 used in this paper.
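The spatial sizes quoted above can be checked with a short calculation. This is a minimal sketch using the standard output-size formula for convolution and pooling; it assumes that the ResBlock layers preserve the spatial size and that C6 is a default 3 × 3 convolution placed between P4 and C7, since the text does not state where C6 sits. Channel counts are not tracked, and the helper names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3(size):
    # C#: 3x3 kernel, padding 1, stride 1 (keeps the size).
    return out_size(size, kernel=3, stride=1, padding=1)


def pool2(size):
    # P#: 2x2 kernel, padding 0, stride 2 (halves the size).
    return out_size(size, kernel=2, stride=2, padding=0)


def keep(size):
    # R#: ResBlock, assumed to preserve the spatial size.
    return size


def conv4_no_pad(size):
    # C7: 4x4 kernel, padding 0, stride 1.
    return out_size(size, kernel=4, stride=1, padding=0)


LAYERS = [
    ("C1", conv3), ("C2", conv3), ("C3", conv3), ("C4", conv3), ("P1", pool2),
    ("C5", conv3), ("P2", pool2),
    ("R1", keep), ("P3", pool2),
    ("R2", keep), ("P4", pool2),
    ("C6", conv3),  # assumed position
    ("C7", conv4_no_pad),
]


def single_stream_sizes(size=64):
    """Map each layer name to the spatial size of its output."""
    sizes = {}
    for name, op in LAYERS:
        size = op(size)
        sizes[name] = size
    return sizes


for name, size in single_stream_sizes().items():
    print(f"{name}: {size} x {size}")
```

Under these assumptions the trace reproduces the sizes given in the text: 32 after P1, 16 after P2, 8 after P3, 4 after P4, and 1 after C7.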
Our proposed method uses a 3 × 3 kernel for the convolution layer because the computational efficiency per receptive field is highest for this kernel size. Since each output of a convolution layer is the sum of the products of the kernel weights and the input values, the computational cost of a convolution layer per output value is the square of the kernel size. For example, the computational cost of a 3 × 3 convolution layer is 3 × 3 = 9; similarly, the computational cost of a 5 × 5 convolution layer is 5 × 5 = 25. Two stacked 3 × 3 convolution layers are required to obtain the same receptive field as a single 5 × 5 convolution layer. Since the cost of two 3 × 3 convolution layers is 2 × (3 × 3) = 18, the computational efficiency per receptive field is better than that of the 5 × 5 convolution layer. For these reasons, we used a 3 × 3 kernel size. We used padding because it maintains the size of the output feature map. With a 3 × 3 convolution layer and no padding, the size of the output feature map decreases at every layer. If the padding size is set to '0', the feature maps become very small in the early stages of the CNN, which adversely affects its performance. Here, we set the padding size to '1' to prevent this degradation.
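The kernel-size argument can be made concrete with a small calculation. The sketch below counts multiplications per output value for stacked stride-1 convolutions and compares stacks that cover the same receptive field; it also shows the effect of padding on the output size. The function names are hypothetical, and channel counts are ignored for simplicity.

```python
def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions of one kernel size."""
    return 1 + layers * (kernel - 1)


def cost_per_output(kernel, layers):
    """Multiplications per output value, ignoring channels."""
    return layers * kernel * kernel


def conv_out_size(size, kernel, padding):
    """Output size of a stride-1 convolution."""
    return size + 2 * padding - kernel + 1


# Two 3x3 layers see the same 5x5 region as one 5x5 layer, at lower cost.
print(receptive_field(3, 2), cost_per_output(3, 2))
print(receptive_field(5, 1), cost_per_output(5, 1))

# Padding 1 keeps a 32-pixel map at 32; padding 0 shrinks it to 30.
print(conv_out_size(32, 3, padding=1), conv_out_size(32, 3, padding=0))
```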
Common pooling methods include max pooling and average pooling. Max pooling selects the maximum value within the kernel window, and average pooling calculates the average of the values within the kernel window. In this paper, ReLU is used as the activation function, and a number of '0' values are output due to the ReLU function. If average pooling is used in this situation, a down-scale weighting phenomenon occurs in which strong activations are reduced by the average operation, potentially leading to overall performance degradation in the model. Therefore, we used max pooling in the proposed method, which is less affected by down-scale weighting.
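A small numeric example illustrates the down-scale weighting effect described above. The window values are made up for illustration, and `relu` and `pool` are hypothetical helpers.

```python
def relu(x):
    """ReLU activation: negative inputs become 0."""
    return max(0.0, x)


def pool(window, mode):
    """Pool a flat list of values with max or average pooling."""
    if mode == "max":
        return max(window)
    if mode == "avg":
        return sum(window) / len(window)
    raise ValueError(f"unknown pooling mode: {mode}")


# One strong response and three negative pre-activations in a 2x2 window.
window = [relu(v) for v in (-1.5, 4.0, -0.2, -3.0)]

print(window)
print(pool(window, "max"))
print(pool(window, "avg"))
```

After ReLU, three of the four values are zero, so average pooling reduces the strong response from 4.0 to 1.0, whereas max pooling keeps it at 4.0.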
The three color channel images used as the input of the single-stream CNN are obtained by extending the solder region by a margin. The image size increases with the margin used, and the size of the single-stream CNN that uses it increases proportionally. In addition, the extended regions of these images include various background regions of the PCB, such as the silk line. This may hinder the performance of the CNN since the background region is included in the defect classification.
6.2. Dual-Stream CNN
The dual-stream CNN generates a classification model using two solder images, each of which consists of its red, green, and blue color channel images. Figure 11 shows two different structures of the dual-stream CNN. Each input datum has a size of 32 × 32 pixels. The output of the final stage of the dual-stream CNN is the probability value for each defect, similar to that of the single-stream CNN.
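The feature value is the output value extracted by calculating the weight and bias value in a layer included in a deep-learning network. The feature value $f_{x,y}$ for the input data $I$ in the CNN is defined by Equation (12).

$$f_{x,y} = \sigma\left(\sum_{i=0}^{k_h-1}\sum_{j=0}^{k_w-1} w_{i,j}\, I_{x+i,\,y+j} + b\right) \qquad (12)$$

where $w_{i,j}$ is the weight of the kernel, $b$ is the bias, $k_h$ and $k_w$ are the kernel sizes, $\sigma$ is the activation function, and $x$ and $y$ are the location of the input data used to calculate the feature map. The feature matrix consists of output values of the deep-learning network in a 2D image. Let the kernel stride be $s$, and padding be $p$. The feature matrix $F$ is then defined as follows:

$$F = \left[\, f_{su-p,\; sv-p} \,\right]_{u,v}, \qquad u = 0,\dots,\left\lfloor \frac{H+2p-k_h}{s} \right\rfloor, \quad v = 0,\dots,\left\lfloor \frac{W+2p-k_w}{s} \right\rfloor$$

where $H \times W$ is the size of the input data and $I$ is taken to be zero outside these bounds.
In general, the deep-learning network layer consists of a multichannel kernel. Therefore, the feature matrix is output in multiple channels called the feature map. Let the number of layers be $L$, the number of kernels be $K$, and the output feature matrices for each kernel, $F_1, \dots, F_K$, make up the feature map $M_l$ of the $l$-th output layer.

$$M_l = \{F_1, F_2, \dots, F_K\}$$

Consequently, merging is the process of combining different feature maps into one map. To calculate the merge result $M$, let the number of kernels of the $l$-th output layer be $K_l$, and the feature map output from output layers with $K_l$ kernel channels be $M_l$.

$$M = M_1 \oplus M_2 \oplus \cdots \oplus M_L$$

where $\oplus$ denotes concatenation along the kernel channel axis, so that $M$ has $\sum_{l} K_l$ channels.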
By merging feature maps from different layers, multiple low-dimensional feature maps can be combined into a single high-dimensional feature map. This high-dimensional feature map can then be used to extract the global features directly related to defect classification, thereby improving the classification accuracy of the CNN model. Merging also simplifies the CNN by unifying feature maps: a dual-stream CNN has a multi-stream structure with multiple layer branches, but after merging, the feature maps are integrated into a single-stream structure with a single layer branch. Merging also reduces the weight of the CNN model by converting it from a two-stream structure to a single-stream structure.
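The effect of merging on feature map shapes can be sketched in a few lines. The example assumes that the merge is a concatenation along the channel axis, which is consistent with the sizes reported for the early-merge (two 32 × 32 × 32 maps into 32 × 32 × 64) and late-merge (two 1 × 1 × 384 maps into 1 × 1 × 768) structures. Feature maps are stored channel-first as nested lists, and `zeros`, `shape`, and `merge` are hypothetical helpers.

```python
def zeros(height, width, channels):
    """Create a channel-first feature map filled with zeros."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def shape(fmap):
    """Return (height, width, channels) of a channel-first feature map."""
    return (len(fmap[0]), len(fmap[0][0]), len(fmap))


def merge(*fmaps):
    """Concatenate feature maps along the channel axis (assumed merge rule)."""
    height, width, _ = shape(fmaps[0])
    merged = []
    for fmap in fmaps:
        if shape(fmap)[:2] != (height, width):
            raise ValueError("feature maps must share the same spatial size")
        merged.extend(fmap)
    return merged


# Early merge: low-level maps from the two streams.
early = merge(zeros(32, 32, 32), zeros(32, 32, 32))
# Late merge: high-level maps just before the fully connected layer.
late = merge(zeros(1, 1, 384), zeros(1, 1, 384))

print(shape(early))
print(shape(late))
```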
6.2.1. Early-Merge Dual-Stream CNN
An early-merge dual-stream CNN is a network structure in which the merge is performed at the front of the network (Figure 11a). In an early-merge dual-stream CNN, the input data of each stream are converted into a 32 × 32 × 32 feature map through the first convolution layer of that stream. Each of these feature maps then passes through an inception module (IM) [31]. The IM has a wide receptive field, as it combines convolution layers having various kernel sizes, allowing for efficient feature map extraction. Figure 12 and Table 1 show the structure of the inception modules used in this paper.
Through merging, the two 32 × 32 × 32 feature maps extracted from the two streams are combined into a single feature map of 32 × 32 × 64. After the merge, the feature map is converted to 32 × 32 × 48 by the next layer and then to 32 × 32 × 64 by the layer that follows. Through three convolution layers and max pooling layers, the feature maps are sequentially converted to 15 × 15 × 192, 7 × 7 × 210, and 3 × 3 × 768. At the last convolution layer, the 3 × 3 × 768 feature map is converted into a 1 × 1 × 768 feature map for input into the fully connected layer.
6.2.2. Late-Merge Dual-Stream CNN
The difference between the late-merge dual-stream CNN and the early-merge dual-stream CNN is that the late-merge dual-stream CNN merges immediately before the fully connected layer. Figure 11b shows the structure of a late-merge dual-stream CNN.
The input data of each stream of the late-merge dual-stream CNN are converted into a 32 × 32 × 32 feature map through the first convolution layer of that stream. The feature map then goes through two inception modules and two convolution layers. The resulting 32 × 32 × 64 feature map passes through further convolution layers and max pooling layers. Accordingly, the size of the feature map changes in the order of 15 × 15 × 128, 7 × 7 × 192, and 3 × 3 × 384. In the last convolution layer, the feature map of each stream is transformed into a 1 × 1 × 384 feature map. After the last convolution layer, the two 1 × 1 × 384 feature maps are merged to obtain a 1 × 1 × 768 feature map. The feature map output from the merge then proceeds to the fully connected layer.
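The spatial sizes 15, 7, and 3 reported for the dual-stream structures can be checked against the standard output-size formula. The text does not state which layer settings produce them, so the sketch below only tests an assumed configuration: each stage is a 3 × 3 convolution with padding 1 followed by a max pooling layer with stride 2, once with an assumed 3 × 3 pooling kernel and once with the 2 × 2 kernel given as the default. The function names are hypothetical.

```python
def out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_sizes(size, stages, pool_kernel):
    """Sizes after each (3x3 conv, padding 1) + (max pool, stride 2) stage."""
    sizes = []
    for _ in range(stages):
        size = out_size(size, kernel=3, stride=1, padding=1)
        size = out_size(size, kernel=pool_kernel, stride=2, padding=0)
        sizes.append(size)
    return sizes


# Assumed 3x3 pooling kernel: matches the sizes reported in the text.
print(stage_sizes(32, stages=3, pool_kernel=3))
# Default 2x2 pooling kernel: gives a different sequence.
print(stage_sizes(32, stages=3, pool_kernel=2))
# A final 3x3 convolution without padding reduces a 3x3 map to 1x1.
print(out_size(3, kernel=3, stride=1, padding=0))
```

Under this assumption the 3 × 3 pooling kernel yields 15, 7, and 3, while the 2 × 2 kernel yields 16, 8, and 4. Other settings, such as unpadded convolutions combined with rounding up in the pooling layer, could produce the same sequence, so this is only one consistent reading.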
The dual-stream CNN reduces the weight of the network through its smaller input data size in comparison to the single-stream CNN. The actual solder region is smaller than 32 × 32 pixels. However, input data less than 32 × 32 pixels in size degrades the classification performance of the CNN. Therefore, the optimal size of the input data was determined through experiment to be 32 × 32 pixels. In addition, unlike the chip component image used as the input of the single-stream CNN, the solder region does not include the silk line background that is unnecessary for classifying defects. This further increases the accuracy of defect classification in the dual-stream CNN.
6.3. Training
The training step is a process of adjusting the weights of the convolution and fully connected layers so that the CNN is able to classify the defect type of the chip component image. The training consists of five steps.
(1) Training data consisting of a full chip component image and a one-hot encoded ground truth are loaded.
(2) Two solder regions are extracted via the vertical and horizontal projection of the full chip component image (Section 5). The two solder regions are used as the inputs of the two streams of the dual-stream CNN.
(3) The solder regions are converted into a feature vector of the same size as the one-hot encoded ground truth, using the convolution layers, pooling layers, and inception layers of the dual-stream CNN.
(4) The cross-entropy loss between the output of the dual-stream CNN and the ground truth is calculated.
(5) The weights of the dual-stream CNN are updated by backpropagation [32], and the cross-entropy loss is optimized using the Adam [33] algorithm.
Steps 2–5 are repeated until the desired number of epochs is reached.
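The loss calculation in the training procedure can be illustrated with a minimal sketch. It assumes that the network output is turned into probabilities by a softmax and that the loss is the standard cross-entropy against a one-hot vector; the weight update by backpropagation and Adam is not shown. The function names and the example values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw network outputs into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))


target = [0, 0, 1, 0, 0, 0]  # one-hot ground truth for the third of six classes

uncertain = softmax([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
confident = softmax([0.0, 0.0, 5.0, 0.0, 0.0, 0.0])

print(cross_entropy(uncertain, target))
print(cross_entropy(confident, target))
```

The uniform prediction gives a loss of about ln 6, and the loss falls as the probability assigned to the correct class grows, which is the quantity the optimizer drives down over the epochs.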
8. Conclusions
In this paper, we proposed a CNN model that combines two solder regions as inputs. We extracted solder regions based on vertical and horizontal projections and then constructed a CNN model. Through defect classification experiments, we confirmed that this late-merge dual-stream CNN has a higher performance and lower weight than can be obtained by conventional methods. The proposed method improved the F1-score by 5.3% compared to single-stream CNN and reduced the weight by about 4000 KB. In addition, the inference time of the proposed method was slightly faster, about 0.3 ms, compared to a single-stream CNN. However, the proposed method requires an accurate solder region extraction. In addition, due to the loss of low-level features, the classification performance of defects was inferior to that of a single-stream CNN for some defects such as missing defects.
Notable advantages and disadvantages of our paper can be summarized as follows.
Advantages:
The proposed method achieves a higher accuracy and F1-score than single-stream CNN based on a full chip component image.
The weight of the model is less than for single-stream CNN.
The proposed method has a faster inference speed than single-stream CNN.
Disadvantages:
The proposed method requires accurate solder region extraction.
Due to the loss of low-level features, the classification performance for some defects, such as missing defects, is inferior to that of single-stream CNN.
In the future, it would be useful to study both partial and full images, as defect classification performance can be increased by using features extracted from these images. In addition, it is likely possible to reduce the inference time of the proposed method by using only the feature region in which the defect appears as the input of the CNN.