**2. Methodology**

#### *2.1. Simulation Method of Histogram Patterns*

Seven typical HPs of the production process are shown in Figure 2, including normal (NOR) patterns, bimodal (BIM) patterns, left and right island (LI and RI) patterns, left and right skew (LS and RS) patterns and flat top (FT) patterns.

**Figure 2.** Seven typical histogram patterns.

The Monte Carlo simulation algorithm is a recognized HP simulation method in the SPC field [48]. In this paper, it is used to simulate the quality data of each HP.

If *y*(*t*) is the value of the quality data measured at time *t*, then the quality data of the NOR pattern follows a normal distribution:

$$y(t) \sim \mathcal{N}(\mu, \sigma^2) \tag{1}$$

where μ is the mean of the quality data under controlled conditions and σ is the standard deviation.

The quality data of the BIM pattern is a combination of two normal distributions, which can be simulated by the following formula [48]:

$$y(t) \sim a\mathcal{N}\left(\mu - b\_1, \left(\frac{3\sigma - b\_1}{3}\right)^2\right) + (1 - a)\mathcal{N}\left(\mu + b\_2, \left(\frac{3\sigma - b\_2}{3}\right)^2\right) \tag{2}$$

where *a* and (1 − *a*) are the proportions of the two normal distributions, and *b*1 and *b*2 are the distances between the centers of the two normal distributions and μ.

The LI and RI patterns can be seen as a small normal distribution next to the main normal distribution, which can be simulated by the following formula:

$$y(t) \sim a\mathcal{N}(\mu, \sigma^2) + (1 - a)\mathcal{N}\left(\mu \pm b, \left(\frac{3\sigma - b}{3}\right)^2\right) \tag{3}$$

where *a* and (1 − *a*) are the proportions of the two normal distributions, and *b* is the distance between the center of the small normal distribution and μ; "+" represents the RI pattern, and "−" represents the LI pattern.

The quality data of the LS and RS patterns can be combined from several normal distributions, which can be simulated by the following formula [48]:

$$y(t) \sim \sum\_{i=0}^{m-1} \frac{1}{m} \mathcal{N}\left(\mu \pm 3\sigma\left(\left(\frac{2}{3}\right)^i - 1\right), \ \left(\left(\frac{2}{3}\right)^i \sigma\right)^2\right) \tag{4}$$

where *m* is the number of normal distributions; "+" represents the LS pattern, and "−" represents the RS pattern.

The FT pattern quality data can be formed by mixing a uniform distribution with a normal distribution, which can be simulated by the following formula:

$$y(t) \sim a\mathcal{N}(\mu, \sigma^2) + (1 - a)\mathcal{U}(\mu - 3\sigma, \mu + 3\sigma) \tag{5}$$

where *a* and (1 − *a*) are the proportions of the normal and uniform distributions, respectively.
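Taken together, Equations (1)–(5) amount to sampling from simple mixture distributions. The following Python sketch is illustrative only: the default parameter values (*a*, *b*1, *b*2, *b*, *m*) are placeholders rather than the Table 1 ranges, and the paper's own implementation is in MATLAB.

```python
import random

def simulate_hp(pattern, n=500, mu=0.0, sigma=1.0,
                a=0.5, b1=1.5, b2=1.5, b=1.5, m=4):
    """Monte Carlo sketch of Eqs. (1)-(5). Parameter names follow the
    text; the default values are illustrative, not the paper's."""
    data = []
    for _ in range(n):
        if pattern == "NOR":                       # Eq. (1)
            y = random.gauss(mu, sigma)
        elif pattern == "BIM":                     # Eq. (2): two-normal mixture
            if random.random() < a:
                y = random.gauss(mu - b1, (3 * sigma - b1) / 3)
            else:
                y = random.gauss(mu + b2, (3 * sigma - b2) / 3)
        elif pattern in ("LI", "RI"):              # Eq. (3): small "island"
            sign = 1 if pattern == "RI" else -1
            if random.random() < a:
                y = random.gauss(mu, sigma)
            else:
                y = random.gauss(mu + sign * b, (3 * sigma - b) / 3)
        elif pattern in ("LS", "RS"):              # Eq. (4): m-normal mixture
            sign = 1 if pattern == "LS" else -1
            i = random.randrange(m)                # each component has weight 1/m
            y = random.gauss(mu + sign * 3 * sigma * ((2 / 3) ** i - 1),
                             (2 / 3) ** i * sigma)
        elif pattern == "FT":                      # Eq. (5): normal + uniform
            if random.random() < a:
                y = random.gauss(mu, sigma)
            else:
                y = random.uniform(mu - 3 * sigma, mu + 3 * sigma)
        data.append(y)
    return data
```

Note that each mixture is sampled by first drawing the component (with probability *a* or 1 − *a*) and then drawing from that component's distribution, which is equivalent to sampling from the mixture density.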

#### *2.2. Simulation Method of Control Chart Patterns*

Nine typical CCPs of the production process are shown in Figure 3, including normal (NOR) patterns, cycle (CYC) patterns, systematic (SYS) patterns, upward and downward trend (UT and DT) patterns, stratification (STA) patterns, upward and downward shift (US and DS) patterns and mixture (MIX) patterns.

**Figure 3.** Nine typical CCPs.

Equation (6) is a general formula for simulating the various CCPs; it includes the process mean and two noise components [5]: *x*(*t*) is random noise and *d*(*t*) is a special disturbance caused by specific factors in the manufacturing process.

$$y(t) = \mu + x(t) + d(t) \tag{6}$$

where *y*(*t*) is the quality data at time *t*, μ is the mean value of the product, and the random noise *x*(*t*) obeys a normal distribution, *x*(*t*) ∼ *N*(0, σ²).
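As a rough illustration of Equation (6), the sketch below adds a few common *d*(*t*) disturbance forms (trend, shift, cycle). The exact forms and parameter ranges used in the paper are those of Section 4.6 and [49], so the forms and values here are placeholders in the usual SPC conventions.

```python
import math
import random

def simulate_ccp(pattern, n=25, mu=0.0, sigma=1.0,
                 g=0.1, s=2.0, k=1.5, period=8, shift_start=9):
    """Sketch of Eq. (6): y(t) = mu + x(t) + d(t). The d(t) forms and
    parameter values below are illustrative, not the paper's Table 4."""
    y = []
    for t in range(n):
        x = random.gauss(0.0, sigma)          # random noise x(t) ~ N(0, sigma^2)
        if pattern == "NOR":
            d = 0.0
        elif pattern == "UT":                 # upward trend
            d = g * t
        elif pattern == "DT":                 # downward trend
            d = -g * t
        elif pattern == "US":                 # upward shift: v = 0 before, 1 after
            d = s * (1 if t >= shift_start else 0)
        elif pattern == "DS":                 # downward shift
            d = -s * (1 if t >= shift_start else 0)
        elif pattern == "CYC":                # cyclic disturbance
            d = k * math.sin(2 * math.pi * t / period)
        else:
            raise ValueError(pattern)
        y.append(mu + x + d)
    return y
```

Setting `sigma=0` isolates the deterministic disturbance *d*(*t*), which is a convenient way to inspect each pattern shape.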

A detailed description of the simulation methods of nine typical CCPs can be found in Section 4.6 or [49]. We will not repeat the description here.

#### *2.3. Long Short-Term Memory Network Model*

In recent years, deep learning has achieved remarkable results in various fields. Deep learning algorithms have advantages that traditional shallow machine learning algorithms lack: complex data preprocessing and feature engineering are no longer needed, and the raw data can be used directly as the input of the model. The depth of the neural network and the special design of the network structure allow it to learn deeper latent knowledge from the raw data and handle more complex tasks. The features it learns are the most suitable for the classification task at hand and outperform features designed by human experts.

LSTM is a typical deep learning model and a variant of the RNN. The basic idea is still to take the previous output of the network as part of the next input, as shown in Equation (7), where *xt* is the *t*-th input of the network, *ht* is the output at time *t*, and *H* is the nonlinear transformation function.

$$h\_t = H(x\_t, h\_{t-1}) \tag{7}$$

Compared with the traditional RNN, LSTM can capture long-distance dependencies. It differs from the standard RNN in structure: LSTM adds several gate structures to each cell, which allow information to pass selectively. This can be understood as a mechanism for selecting and updating learned features [45]. The cell structure of LSTM is shown in Figure 4.

**Figure 4.** Typical LSTM cell structure.

The forget gate (the red part in Figure 4) is used to decide what information will be discarded from the cell state *C<sub>t−1</sub>*. The forget gate is mathematically represented by [45]:

$$f\_t = \text{sigmoid}(W\_f \cdot [x\_t, h\_{t-1}] + b\_f) \tag{8}$$

The input gate (the blue part in Figure 4) is used to determine what new information *C′<sub>t</sub>* will be stored in the cell state. An input gate is mathematically represented by [45]:

$$i\_t = \text{sigmoid}(W\_i \cdot [x\_t, h\_{t-1}] + b\_i) \tag{9}$$

$$C'\_t = \tanh(W\_c \cdot [x\_t, h\_{t-1}] + b\_c) \tag{10}$$

Then the cell state is updated, and the decisions made in the previous steps are implemented to get a new cell state *C<sub>t</sub>*. The mathematical representation of the state update is as follows:

$$C\_t = f\_t \* C\_{t-1} + i\_t \* C'\_t \tag{11}$$

The output gate (the green part in Figure 4) determines the final output *ht* according to the updated cell state. An output gate is mathematically represented by [45]:

$$o\_t = \text{sigmoid}(W\_o \cdot [x\_t, h\_{t-1}] + b\_o) \tag{12}$$

$$h\_t = o\_t \* \tanh(C\_t) \tag{13}$$

where *Wf*, *Wi*, *Wc* and *Wo* represent the weights of the forget gate, input gate, current cell and output gate respectively, and *bf*, *bi*, *bc* and *bo* represent the corresponding biases.

The final output *hT* of the last LSTM cell is the feature extracted adaptively from the raw data (where *T* refers to the last cell). The final recognition can be achieved by feeding this feature into a traditional ANN. The last layer is the output layer, whose activation function is Softmax and whose number of neurons is *N*, the number of pattern types to be identified. Finally, the stochastic gradient descent (SGD) algorithm is used to optimize the weights (*Wf*, *Wi*, *Wc* and *Wo*) and biases (*bf*, *bi*, *bc* and *bo*) of the LSTM. Commonly used SGD variants include SGD with momentum (SGDM), root mean square propagation (RMSProp), and adaptive moment estimation (Adam) [50].
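The gate equations (8)–(13) can be traced in a minimal scalar LSTM step. The toy sketch below uses scalar states and hand-supplied weights purely to mirror the algebra; it is not the paper's MATLAB implementation, and in a real network each weight is a matrix applied to the concatenated [*xt*, *h<sub>t−1</sub>*] vector.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step (Eqs. (8)-(13)) with scalar input and state.
    W holds (W_f, W_i, W_c, W_o), each a pair (w_x, w_h) applied to
    the concatenation [x_t, h_prev]; b holds (b_f, b_i, b_c, b_o)."""
    wf, wi, wc, wo = W
    bf, bi, bc, bo = b
    f_t = sigmoid(wf[0] * x_t + wf[1] * h_prev + bf)        # forget gate, Eq. (8)
    i_t = sigmoid(wi[0] * x_t + wi[1] * h_prev + bi)        # input gate, Eq. (9)
    c_cand = math.tanh(wc[0] * x_t + wc[1] * h_prev + bc)   # candidate state, Eq. (10)
    c_t = f_t * c_prev + i_t * c_cand                       # state update, Eq. (11)
    o_t = sigmoid(wo[0] * x_t + wo[1] * h_prev + bo)        # output gate, Eq. (12)
    h_t = o_t * math.tanh(c_t)                              # output, Eq. (13)
    return h_t, c_t
```

Running the cell over a sequence (feeding each output back in as `h_prev`, `c_prev`) reproduces the recurrence of Equation (7).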

In this paper, Bi-LSTM is used to complete CCPR and HPR. It differs slightly from standard LSTM in that Bi-LSTM consists of a forward LSTM and a backward LSTM. More details of the Bi-LSTM structure are introduced in the following sections.

## **3. Proposed Method**

This section describes the details of the proposed method. The input of the LSTM is the unprocessed data, such as the quality data in the control chart and the frequency of each interval in the histogram, and the output is the category of the pattern. Feature extraction and feature selection are completed entirely by the LSTM through learning. Compared with a unidirectional LSTM, Bi-LSTM can better learn the forward and backward relations of a sequence.

Figure 5 is a structural diagram of the proposed method. It has a bidirectional recursive layer as well as a multilayer structure. As the input passes through the multilayer structure, the information transmitted by each layer is represented in different dimension spaces. Therefore, the data is learned gradually as the number of network layers increases, and the mapping between input and output is refined to better describe the characteristics of the system [51]. In other words, the bidirectional and multilayer recursive structure raises the learning capacity and flexibility of the model [45]. Each small square in Figure 5 represents an LSTM cell.

The network consists of a multilayer Bi-LSTM and a Softmax classifier. The former is used to extract features from the raw data, and the latter is used to classify the various patterns. The input of the Softmax layer is the concatenation of the last forward LSTM output *h*→*T* and the last backward LSTM output *h*←*T*, as shown in Equation (14). The initial learning rate was 0.05, and each LSTM cell has 10 neurons. The next section discusses and optimizes the other structural and training parameters of the multilayer Bi-LSTM, such as the optimization algorithm, training batch size and number of network layers.

$$h\_T = \begin{bmatrix} h\_T^{\rightarrow}, h\_T^{\leftarrow} \end{bmatrix} \tag{14}$$
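A minimal sketch of the Softmax classifier and the concatenation in Equation (14); the two-dimensional vectors here are toy stand-ins for the real LSTM outputs, not values from the paper.

```python
import math

def softmax(z):
    """Softmax over the classifier's N output activations; subtracting
    the max is the usual numerical-stability trick."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Eq. (14): the Softmax layer input is the concatenation of the last
# forward and last backward LSTM outputs (toy 2-dim vectors here).
h_fwd = [0.2, -0.1]
h_bwd = [0.4, 0.3]
h_T = h_fwd + h_bwd
```

The Softmax output can then be read as class probabilities over the *N* pattern types, with the largest entry giving the recognized pattern.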

Because of the end-to-end ability of the deep learning method, the specific implementation process of the proposed method is as follows. First, the Monte Carlo algorithm is used to simulate the training and test sets for HPR and CCPR respectively. The one-hot encoding method is used to label the samples. Then, the training sets are used to train two Bi-LSTMs, one for HPR and one for CCPR. Finally, the performance of the two optimized Bi-LSTMs is verified on the test sets and on real production data.
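The one-hot labelling step can be sketched as follows; the class list is the seven HP names from Section 2.1, and the helper function is illustrative rather than the paper's code.

```python
def one_hot(label, classes):
    """One-hot encoding of a pattern label, as used to label the
    simulated training and test samples."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1
    return vec

# The seven HP class names from Section 2.1.
HP_CLASSES = ["NOR", "BIM", "LI", "RI", "LS", "RS", "FT"]
```

For the CCPR network the same encoding applies, only with the nine CCP class names from Section 2.2.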

**Figure 5.** The structure of the multilayer Bi-LSTM.

#### **4. Experiment and Discussion**

In this section, simulation data experiments and production data experiments were used to verify the superiority of the multilayer Bi-LSTM. It was implemented in the MATLAB environment, and the experiments were carried out on a 3.10 GHz CPU with 4 GB RAM. The correct recognition rate (CRR) was used as the evaluation standard of model performance. The discussion of the experimental results is also given in this section.

#### *4.1. Simulation Parameters of HPs*

In order to make the simulation data closer to complex production data, the Monte Carlo simulation parameters of the seven HPs were randomly selected within certain ranges using a uniform distribution. The ranges of the parameters are shown in Table 1. For example, the parameter *a* of every BIM pattern sample is uniformly distributed in the range [0.4, 0.6].
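The uniform parameter sampling can be sketched as below. Only the BIM range [0.4, 0.6] for *a* is stated in the text; any other ranges would have to come from Table 1, so the dictionary here is deliberately minimal.

```python
import random

def sample_params(ranges):
    """Draw each Monte Carlo simulation parameter uniformly from its
    Table 1 range (one fresh draw per simulated sample)."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

# Only the BIM range for parameter a is stated in the text.
BIM_RANGES = {"a": (0.4, 0.6)}
```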


**Table 1.** The simulation parameters of seven typical HPs.

It is worth noting that the simulation results are the quality data of each HP, not the histogram itself. Interval statistics are then needed to get the histogram. The purpose of this simulation method is to be more in line with the actual situation. The quality data length of each sample is 500, and the number of histogram groups is 25. The data set consists of 14,000 samples (2000 for each HP), which are randomly divided into two parts: 80% of the samples were used to train the Bi-LSTM, and the rest were used for testing.
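The interval-statistics step (500 quality values into 25 frequency counts) might look like the following illustrative sketch; the binning convention (equal-width bins over the sample range, maximum clamped into the last bin) is an assumption.

```python
def to_histogram(data, n_bins=25):
    """Interval statistics: convert simulated quality data into the
    n_bins frequency counts that form the histogram input."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins or 1.0        # guard against constant data
    freq = [0] * n_bins
    for y in data:
        idx = min(int((y - lo) / width), n_bins - 1)  # clamp max into last bin
        freq[idx] += 1
    return freq
```

The resulting 25-element frequency vector is what the Bi-LSTM receives as the histogram input.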

#### *4.2. Performance Comparison of Optimization Algorithm*

The SGD algorithm is a common optimization algorithm, often used to optimize the weights and biases of a neural network in the training stage. However, the performance of its variants differs. For this reason, in order to obtain the best training results, we compared several popular optimization algorithms, namely SGDM, RMSProp, and Adam [50], using each to train the Bi-LSTM. In this experiment, the training data set generated in Section 4.1 was used. The initial learning rate was 0.05, and the batch size was 100. Each network was trained for 20 epochs and the corresponding training losses were collected. The results are shown in Figure 6.

**Figure 6.** The training loss under different SGD algorithms.

It can be seen from Figure 6 that the SGDM algorithm has a slow convergence speed and a large training loss. The convergence speed of the RMSProp algorithm is fast, but the training process fluctuates greatly and the training loss is not ideal. The Adam algorithm has the fastest convergence speed, the smallest loss, and a stable training process. Therefore, this best-performing optimization algorithm is applied in the following experiments.

The SGDM algorithm uses a single learning rate for the whole training process. The other optimization algorithms use per-parameter learning rates to improve network training and can automatically adapt to the loss function being optimized; this is how the RMSProp algorithm works. Adam updates parameters similarly to RMSProp, but adds a momentum term [50]. Therefore, the neural network can obtain a fast and stable training effect.

#### *4.3. The Influence of Batch Size on Training Process*

The batch size determines the number of samples the network learns from in each iteration and is an important training parameter. Past experience shows that this parameter has a great influence on the training result and training time. Therefore, different batch sizes were compared in the experiment, and the results are shown in Figure 7.

**Figure 7.** The training time and loss under different batch sizes.

It can be seen from Figure 7 that, with the increase of batch size, the training time of each epoch decreases continuously and tends to be stable once the batch size exceeds 200. However, the training loss first decreases and then increases, and is best when the batch size is 200. Therefore, with a batch size of 200, the recognition accuracy of the Bi-LSTM can be guaranteed and the training time greatly reduced.

#### *4.4. The Influence of the Number of Network Layers*

In the field of deep learning, it is widely accepted that deepening the network improves its performance, because a multilayer structure can improve the capacity and flexibility of the model [45]. However, some scholars point out that increasing the number of network layers also increases the risk of over-fitting [51]. Therefore, this experiment compares the influence of different numbers of network layers on training loss and testing accuracy, as shown in Figure 8. In this experiment, the network structure of Figure 5 is used; that is, the output of the first forward layer is the input of the second forward layer, and the output of the first backward layer is the input of the second backward layer. Following this stacking rule, a multilayer network is formed. The data and training parameters are the same as in the above experiments. The training loss and testing accuracy of the different networks were collected.

**Figure 8.** The testing accuracy and training loss under different number of layers.

It can be seen from Figure 8 that, as the number of layers of the Bi-LSTM network increases, the training loss on the training set continues to decrease, which shows that the network fits the training data better and better. In contrast, the recognition rate of the trained multilayer Bi-LSTM on the test set first increases and then decreases. This is a typical over-fitting phenomenon: as the number of layers increases and the network deepens, the fit to the training data becomes too high and the generalization ability on the test set is reduced. According to the above experimental results, the number of layers of the multilayer Bi-LSTM is finally set to 3.

#### *4.5. Comparison of HPR Results of Several Methods*

Through the above comparison experiments, the structural parameters and training parameters finally determined for multilayer Bi-LSTM are shown in Table 2.


**Table 2.** The structural parameters and training parameters of multilayer Bi-LSTM.

In order to further verify the effectiveness of the proposed method in HPR tasks, we compare the multilayer Bi-LSTM with a traditional machine learning method (MLP) and several deep learning methods (DBN and 1D-CNN). These methods are described in detail as follows:


In order to observe the results of pattern recognition more intuitively, the confusion matrix is used. The values of the elements on the diagonal of the confusion matrix represent the proportion of correct recognition of each pattern; the other values represent the misjudgments of the classifier. The total CRR is equal to the average of all elements on the diagonal [1]. The confusion matrices of the various identification methods are shown in Figure 9, and the total CRR and the time consumption of each training epoch are shown in Table 3.
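A row-normalized confusion matrix and the total CRR (the average of the diagonal [1]) can be computed as in this sketch; the two-class toy data in the usage test is illustrative only.

```python
def confusion_matrix(true, pred, classes):
    """Row-normalized confusion matrix: the diagonal holds per-pattern
    correct-recognition proportions, and the total CRR is their average."""
    n = len(classes)
    counts = [[0] * n for _ in range(n)]
    for t, p in zip(true, pred):
        counts[classes.index(t)][classes.index(p)] += 1
    cm = [[c / max(sum(row), 1) for c in row] for row in counts]
    crr = sum(cm[i][i] for i in range(n)) / n
    return cm, crr
```

Note that this "total CRR" is a macro average over patterns, which matches the diagonal-average definition given above when the test set is class-balanced.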

**Figure 9.** The HPR confusion matrix for (**a**) the MLP and frequency, (**b**) the DBN and frequency, (**c**) the 1D-CNN and frequency, and (**d**) the multilayer Bi-LSTM and frequency.


**Table 3.** Comparison of different HPR methods.

Figure 9 shows that there are different levels of confusion in the recognition results of the several HPR methods. However, the performance of the three recognition methods based on deep learning is better than that of the traditional machine learning method. In the recognition results of several methods, there is a tendency to confuse the NOR pattern with the island patterns. This problem is most serious in the MLP method, followed by the DBN method, and may be due to the similarity between the NOR pattern and the island patterns. This is an unacceptable result: an HPR system based on these methods would produce very serious Type I errors (false alarms), bringing unnecessary trouble to the quality control work of the enterprise. On the contrary, the 1D-CNN-based method effectively improves the recognition of the NOR pattern, demonstrating its effectiveness in processing one-dimensional data and its ability to capture the details of the data. Due to its multilayer and bidirectional structure, the training time of the multilayer Bi-LSTM is a little longer, but it obtains a very satisfactory recognition result. Of the 2800 test samples, only 4 are misclassified, and the CRR is as high as 99.89%, achieving accurate HPR. Compared with the other methods, it has obvious advantages.

#### *4.6. Simulation Parameters of CCPs*

The simulation parameters of nine common CCPs are shown in Table 4.


**Table 4.** The simulation parameters of nine typical CCPs.

For the MIX pattern, *p* is a binary integer randomly selected from 0 or 1 at each sampling time *t*. The value of *v* is determined by where the mutation occurs: it is equal to 0 before the mutation and 1 after it. The starting position obeys a uniform distribution in the range [4, 9]. The quality data length of each CCP is 25. The data set consists of 18,000 samples (2000 for each CCP), which are randomly divided into two parts: 80% of the samples were used to train the multilayer Bi-LSTM, and the rest were used for testing.
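The binary *p* of the MIX pattern can be sketched as below, following the common SPC convention that the disturbance alternates randomly between +*k* and −*k*; the amplitude *k* is a placeholder, since the actual value comes from Table 4.

```python
import random

def mix_disturbance(n=25, k=2.0):
    """MIX pattern d(t) sketch: at each sampling time t, p is drawn
    from {0, 1}, flipping the disturbance between +k and -k."""
    return [k if random.choice((0, 1)) == 1 else -k for _ in range(n)]
```

Adding this *d*(*t*) to μ + *x*(*t*) as in Equation (6) yields a MIX sample.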

#### *4.7. Comparison of CCPR Results of Several Methods*

In order to further verify the effectiveness of the proposed method in CCPR tasks, we compare the multilayer Bi-LSTM with a traditional machine learning method (MLP) and several deep learning methods (DBN and 1D-CNN). Among them, MLP is widely used in the CCPR field [19,21,22], which is why we compare with it in this paper. The network structures of these methods are exactly the same as those used in Section 4.5; the difference is the input of the network.

The input of the MLP becomes the CCP feature set, which includes statistical features (mean, standard deviation, skewness and kurtosis) and shape features (S, NC1, NC2, APML and APSL). These features were designed by experts in this field and have proved very effective over many years of application [18,19,33–35]. Therefore, the input layer of the MLP has nine neurons, which receive the above nine features respectively. However, the input of the three deep-learning-based methods is still the raw data, that is, the quality data on the control chart, because they can adaptively extract the best features from the raw data. The results are shown in Figure 10 and Table 5.

Figure 10 shows that there are different levels of confusion in the recognition results of the several CCPR methods. The most serious confusion occurs with the DBN method: there are several Type I errors and Type II errors (missed disturbances) in its identification results, so it cannot be applied to the quality control of actual production. Although it is a deep learning method that can extract features from the raw data through RBM pre-training, its learning ability is limited and it cannot achieve satisfactory recognition results. On the contrary, the expert feature set helps the MLP get a better result, effectively reducing the occurrence of both types of errors and proving the validity of recognition methods based on the expert feature set. However, the disadvantage is that there is some confusion between UT and US, and between DT and DS. More serious confusion occurs between CYC and MIX, which indicates that the existing expert feature set is not sensitive enough to distinguish these two CCPs; more acute features are yet to be explored. In addition, the 1D-CNN obtained a better classification result, reducing the confusion between CYC and MIX and proving its effectiveness in processing one-dimensional data and capturing the details of the raw data. The recognition result of the multilayer Bi-LSTM is the most satisfactory and its confusion rate is the lowest, which shows that the multilayer Bi-LSTM has a strong ability to adaptively extract the best features from the raw data. The CRR reaches 99.26%, achieving accurate CCPR. Compared with the other methods, it has obvious advantages.

**Figure 10.** The CCPR confusion matrix for (**a**) the MLP and feature set, (**b**) the DBN and quality data, (**c**) the 1D-CNN and quality data, and (**d**) the multilayer Bi-LSTM and quality data.



In order to further verify the feature learning ability of the multilayer Bi-LSTM on the CCPs, the features it extracts from the raw data are visually compared with the expert features and the features extracted by the 1D-CNN; the results are shown in Figure 11. In this paper, the t-distributed stochastic neighbor embedding (t-SNE) [52] algorithm is used to reduce the dimensions of the feature sets so that they can be drawn in two-dimensional space for visualization.

**Figure 11.** The feature visualization results for (**a**) the expert features, (**b**) the features extracted by 1D-CNN, and (**c**) the features extracted by multilayer Bi-LSTM.

As we know, in the field of pattern recognition, a feature set should present a clustering distribution in the feature space: the smaller the distance within the same class and the larger the distance between different classes, the higher the quality of the feature set and the more conducive it is to classification. As shown in Figure 11a, the expert feature set shows a tendency toward clustering in the feature space, but the serious overlap between patterns indicates that the quality of the existing expert features is not good. This is the root cause of the confusion in Figure 10a. In Figure 11b, the features extracted by the 1D-CNN show a clustering distribution in the feature space and the clustering is significantly stronger, but the distances between different patterns are relatively small, so the classifier runs a risk of misclassification. In contrast, the features extracted by the multilayer Bi-LSTM are more tightly distributed within the same pattern and are better separated from the other patterns in the feature space. This shows that it has learned better features from the raw data. In addition, since the features are automatically extracted by the network, the accuracy of pattern recognition is improved and the workload of quality control personnel is greatly reduced.

#### *4.8. Application in Real Production Data*

In order to verify the engineering value of the proposed method, it is applied to the quality control of real production. The diameter of the controlled object is the key to judging whether its quality is good, and its standard size is ϕ750<sup>+0.235</sup><sub>+0.158</sub> mm. The width of the detection window is still 25. The multilayer Bi-LSTM is used to monitor the data of the production process, and some recognition results are listed in Figure 12.

**Figure 12.** Abnormal HPs and CCPs found by the multilayer Bi-LSTM from production data.

As shown in Figure 12, the proposed method can identify a variety of abnormal patterns from the quality data of the production environment. At the same time, the product quality distribution in this stage is skewed, which is also effectively recognized by the multilayer Bi-LSTM. It can be found that the key dimensions of this batch of products are generally smaller. According to discussions with the factory engineers, this histogram pattern is likely the result of deliberate processing choices by the workers.

It is worth noting that the multilayer Bi-LSTM trained with the above simulation data set is used directly for the quality control of the real production data. Before the real data are input into the network, a linear transformation is carried out to scale the data to the same range as the simulation data. This transformation is very simple and does not affect the speed or automation of recognition. At the same time, it shows that a network trained on simulation data can identify real data well, without the need to retrain the network for different product specifications. The proposed method can therefore be easily integrated into an existing Industry 4.0 system to provide an intelligent SPC data analysis scheme for manufacturing enterprises and improve production efficiency and product quality. In addition, the SYS pattern in Figure 12 was recognized as a CYC pattern by the 1D-CNN in a previous study [18], which shows that the recognition results of the previous method were not precise enough. In this paper, the number of CCPs is increased to nine, which achieves more refined CCPR.
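The linear rescaling applied to the real data before recognition might look like the following min-max sketch; the target range is an assumption chosen to stand in for the simulation data's range, not a value stated in the paper.

```python
def rescale(data, target_lo=-3.0, target_hi=3.0):
    """Linear transformation sketch: scale real production measurements
    into the same range as the simulation data before feeding them to
    the trained network (target range is an illustrative assumption)."""
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0          # guard against constant data
    return [target_lo + (y - lo) * (target_hi - target_lo) / span
            for y in data]
```

Because the transformation is linear, the shape of the pattern (trend, shift, cycle, skew, and so on) is preserved exactly; only the scale changes.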
