**3. Methods**

### *3.1. Analysis of Case Retrieval in Industrial Operational Optimization*

To improve product quality and enhance economic benefits, operational optimization has been widely implemented in industrial processes. CBR can find the optimal operational settings by learning from the historical optimal operational settings in the case base, so it has been widely studied in the industrial operational optimization community. Suppose that *k* cases in total are retrieved from the case base, and that *Xi* (*i* = 1, 2, ··· , *k*) and *Yi* (*i* = 1, 2, ··· , *k*) represent the problem description and the optimal solution of the *i*th retrieved case, respectively. Under the CBR framework, the suggested solution *Ỹt* of the target problem *Xt* can be determined as follows:

$$\widetilde{Y}\_{t} = \frac{\sum\_{i=1}^{k} S(X\_{i}, X\_{t})\,Y\_{i}}{\sum\_{i=1}^{k} S(X\_{i}, X\_{t})} \tag{1}$$

where *S*(*Xi*, *Xt*) represents the similarity between the target problem *Xt* and the problem description of the *i*th historical case *Xi*. In fact, the suggested solution *Ỹt* is a weighted sum of historical optimal solutions. Concretely, *k* historical cases are selected in the case retrieval step according to their similarity to the target problem. Moreover, Equation (1) shows that the weight of each retrieved case in the suggested solution is determined only by the similarity between the target problem and the problem description of that case. In other words, the case retrieval step not only provides candidates for the suggested solution, but also determines their weights in it. Hence, the accuracy of case retrieval is vital to the performance of industrial operational optimization.
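To make the reuse step concrete, the following minimal Python sketch implements the weighted sum of Equation (1); the array layout and the `similarity` callable are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def suggest_solution(retrieved_X, retrieved_Y, X_t, similarity):
    """Similarity-weighted reuse of retrieved cases, Equation (1).

    retrieved_X : (k, d) array of problem descriptions X_i of the k retrieved cases
    retrieved_Y : (k, p) array of optimal solutions Y_i of the k retrieved cases
    X_t         : (d,) array, problem description of the target problem
    similarity  : callable returning the scalar similarity S(X_i, X_t)
    """
    weights = np.array([similarity(X_i, X_t) for X_i in retrieved_X])
    # The suggested solution is the similarity-weighted average of the
    # retrieved optimal solutions.
    return weights @ retrieved_Y / weights.sum()
```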

Since CBR assumes that similar problem descriptions always have similar case solutions [32], most previous studies tend to seek the cases most similar to the target problem. Although classic case retrieval methods have proved effective in many fields, the accuracy of case retrieval is still inevitably affected by measuring error and by multiple working conditions. As a result, not all retrieved cases are helpful for solving the target problem. The concrete reasons are as follows.

### (a) Accuracy of case retrieval is affected by measuring error

Industrial data are collected by various kinds of sensors installed in the factory. Since perturbations and noises are inevitable in industrial processes, measuring error is naturally introduced into the case base. Consequently, the descriptions of historical cases are not accurate. For the *i*th case, its measured description *X̂i* can be represented as follows:

$$\hat{X}\_i = X\_i + W\_i \tag{2}$$

where *Xi* and *Wi* are the accurate description and the measuring error of the *i*th case, respectively. Accounting for the measuring error in the measured descriptions, the true Euclidean distance between *Xi* and *Xt* is calculated as follows:

$$D(X\_i, X\_t) = \sqrt{\left(\left(\hat{X}\_i - W\_i\right) - \left(\hat{X}\_t - W\_t\right)\right)\left(\left(\hat{X}\_i - W\_i\right) - \left(\hat{X}\_t - W\_t\right)\right)^T} \tag{3}$$

Then the similarity between *Xi* and *Xt* can be calculated as follows:

$$S(X\_i, X\_t) = \frac{1}{1 + D(X\_i, X\_t)} \tag{4}$$

Obviously, the measuring error in industrial data would degrade the accuracy of case retrieval and make it hard to evaluate the importance of historical cases in solving the target problem. Therefore, it is necessary to eliminate negative impacts from historical cases that have gross measuring error.
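As an illustration of Equations (2)–(4), the sketch below shows how a measuring error *Wi* perturbs the similarity that is actually computed from the case base; the numerical values are made up for demonstration only.

```python
import numpy as np

def similarity(X_i, X_t):
    """Similarity of Equation (4), based on the Euclidean distance of Equation (3)."""
    return 1.0 / (1.0 + np.linalg.norm(X_i - X_t))

rng = np.random.default_rng(0)
X_i = np.array([1.0, 2.0, 3.0])       # accurate description (unknown in practice)
W_i = rng.normal(scale=0.3, size=3)   # measuring error, assumed Gaussian
X_hat_i = X_i + W_i                   # measured description stored in the case base, Equation (2)
X_t = np.array([1.1, 2.1, 2.9])       # target problem

print(similarity(X_i, X_t))           # similarity based on the accurate description
print(similarity(X_hat_i, X_t))       # similarity actually available during retrieval
```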

### (b) Accuracy of case retrieval is affected by multiple working conditions

Industrial processes typically run under multiple working conditions, which leads to undesirable results if the number of retrieved cases is not appropriate. That is to say, not only the similarity *S*(*Xi*, *Xt*) but also the number *k* affects the accuracy of case retrieval. Therefore, an appropriate parameter *k* is crucial for the success of industrial operational optimization under the CBR framework. However, for a particular process, different working conditions contain different numbers of historical cases, which means that the case base is imbalanced: there are more cases in common working conditions and fewer cases in uncommon ones. Therefore, it is easy to retrieve enough cases from a common working condition, yet difficult to do the same from an uncommon one. Since the parameter *k* is fixed as a constant in classic CBR, it may perform well for some working conditions but poorly for others. The reason is that irrelevant cases from other working conditions may be retrieved when the target problem belongs to an uncommon working condition. Thus, the suggested solution may be inapplicable.

In summary, both measuring error and multiple working conditions decrease the accuracy of case retrieval, which in turn degrades the performance of operational optimization under the CBR framework. To reduce the negative impact of these abnormal cases, a local density-based abnormal case removal method for the case retrieval step is proposed in the following subsection.

### *3.2. Local Density-Based Abnormal Case Removal*

Most of the previous studies on case retrieval have focused only on similarity measurement, while the distribution of the retrieved cases was neglected. The goal of case retrieval is, in essence, to search the case base for valuable cases with which to solve the target problem. Section 3.1 analyzed in detail why abnormal cases commonly exist in industry. Consequently, the retrieval results may not be reliable, and the accuracy of case retrieval needs enhancing. In contrast to model-based methods, CBR directly uses the operational information in the retrieved cases, so the accuracy of the retrieved cases is vital to the performance of CBR. In other words, abnormal cases are harmful to industrial operational optimization, so they must be removed before case reuse. In this paper, it is believed that the distribution of the retrieved cases reflects their reliability. By eliminating low-reliability cases, the quality of the retrieved cases can be significantly enhanced. Figure 2 presents a demonstration of the relationship between the distribution and the reliability of cases.

**Figure 2.** Distribution and reliability of the retrieved cases in industrial processes.

As shown in Figure 2, the retrieved cases are not uniformly distributed in the whole space. Moreover, the accurate descriptions of historical cases are uncertain due to the existence of measuring error. In this paper, the measuring error is assumed to follow a Gaussian distribution. With a certain confidence level, the accurate descriptions of historical cases lie in the dashed circles centered at their corresponding measured descriptions. Since the similarity is usually calculated from the measured descriptions, the cases with the highest similarity are not necessarily the most helpful for the target problem. However, there exist some overlaps in the area with high-density cases, showing that cases in the high-density area have higher reliability than other cases, since their accurate descriptions are more likely to lie in the overlaps. Therefore, although cases in the low-density area may have a higher similarity to the target problem, they should not proceed to the case reuse step due to their lower reliability.

Another issue that impacts the accuracy of case retrieval is the multiple working conditions of industrial processes. For a target problem that lies on the edge of a working condition, its nearest neighbors probably include cases from other working conditions. Obviously, these cases will not help to solve the target problem and should not be included in the retrieved cases. This issue can be partly solved by assigning a different number of retrieved cases to every working condition, but this requires identifying the working conditions in advance and setting a different *k* for every working condition. Consequently, it demands more prior knowledge and becomes much more complicated. Considering that working condition identification can be transformed into a classic classification problem, the K-Nearest Neighbors (KNN) classifier assigns the target problem to the working condition to which the majority of its nearest neighbors belong. That is to say, the number of retrieved cases from other working conditions is smaller than the number of retrieved cases from the working condition to which the target problem belongs. Since all retrieved cases belong to the same neighborhood, cases from other working conditions are more likely to lie in the low-density area, so they can be identified by calculating the density of the retrieved cases.

To conclude, measuring error and multiple working conditions are two inevitable problems affecting the accuracy of case retrieval and degrading the performance of CBR. Therefore, developing an abnormal case removal method is urgent and necessary. Since cases in a high-density area are more reliable than those in a low-density area, the latter should be removed from the retrieved cases. In this subsection, a local density-based abnormal case removal algorithm is designed based on the Local Outlier Factor (LOF), a common index showing how isolated a data point is compared with its nearest data points. The LOF of historical case *Xi* is defined as follows:

$$LOF(X\_i) = \frac{1}{m} \sum\_{q=1}^{m} \frac{lrd(X\_q)}{lrd(X\_i)}\tag{5}$$

where *m* is an adjustable parameter; *lrd*(*Xq*) and *lrd*(*Xi*) stand for the local reachability density of case *Xq* and case *Xi*, respectively; *Xq* is the *q*th most similar case among the retrieved cases. In particular, *lrd*(*Xi*) can be represented as follows:

$$\operatorname{lrd}(X\_i) = \left(\frac{1}{m} \sum\_{q=1}^{m} D(X\_i, X\_q)\right)^{-1} \tag{6}$$

where *D*(*Xi*, *Xq*) is the Euclidean distance between *Xi* and *Xq*.

As shown in Equation (5), LOF reflects the average ratio of *lrd*(*Xq*) to *lrd*(*Xi*). Therefore, a bigger LOF indicates a smaller local density, and the corresponding case should be removed. Normally, the threshold of LOF is determined after the whole dataset has been analyzed, while in this paper, the threshold of LOF can be adaptively adjusted. To automatically eliminate the retrieved cases in a low-density area, the threshold of the local density-based abnormal case removal algorithm is designed as follows:

$$\xi = \mu + \alpha \sqrt{\frac{\sum\_{i=1}^{k} \left(LOF(X\_i) - \mu\right)^{2}}{k - 1}} \tag{7}$$

where *α* is an adjustable parameter of the threshold *ξ*, and *μ* is the average LOF of the retrieved cases, which can be calculated as follows:

$$\mu = \frac{1}{k} \sum\_{i=1}^{k} LOF(X\_i) \tag{8}$$
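A minimal Python sketch of Equations (5)–(8) is given below. It assumes that the *m* nearest neighbors of every retrieved case are taken from the retrieved set itself and that cases whose LOF exceeds the threshold *ξ* are discarded; the function name and array layout are illustrative, not prescribed by the paper.

```python
import numpy as np

def remove_abnormal_cases(retrieved_X, retrieved_Y, m, alpha):
    """Local density-based abnormal case removal, Equations (5)-(8)."""
    k = len(retrieved_X)
    # Pairwise Euclidean distances among the retrieved cases.
    D = np.linalg.norm(retrieved_X[:, None, :] - retrieved_X[None, :, :], axis=-1)

    # Indices of the m most similar cases for each case (the case itself is excluded).
    neighbours = np.argsort(D, axis=1)[:, 1:m + 1]

    # Local reachability density, Equation (6): inverse of the mean distance
    # to the m most similar cases.
    lrd = 1.0 / np.take_along_axis(D, neighbours, axis=1).mean(axis=1)

    # Local outlier factor, Equation (5): mean ratio of the neighbours'
    # densities to the case's own density.
    lof = np.array([lrd[neighbours[i]].mean() / lrd[i] for i in range(k)])

    # Adaptive threshold, Equations (7) and (8).
    mu = lof.mean()
    xi = mu + alpha * np.sqrt(((lof - mu) ** 2).sum() / (k - 1))

    # Keep only the cases whose LOF does not exceed the threshold.
    keep = lof <= xi
    return retrieved_X[keep], retrieved_Y[keep]
```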

In this paper, *k* is optimized according to the mean absolute error on the training set; *m* and *α* are determined by orthogonal experiments. With the optimal parameters *k*, *m*, and *α*, the pseudo-code of the designed local density-based abnormal case removal algorithm is shown in Algorithm 1.


Step 2: for a target problem, select the *k* most similar cases from the case base and construct the original retrieved cases *Ci* = {*Xi*, *Yi*} (*i* = 1, ··· , *k*);

Step 3: employ the local density-based abnormal case removal algorithm to remove wrongly retrieved cases;

Step 4: acquire the suggested solution for the target problem according to Equation (1);

Step 5: revise the suggested solution, if necessary;

Step 6: store the new case in the case base after the target problem is solved.
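Assuming the sketches given earlier (`similarity`, `suggest_solution`, and `remove_abnormal_cases`), Steps 2–4 can be wired together roughly as follows; the function and variable names are hypothetical.

```python
import numpy as np

def solve_target_problem(case_base_X, case_base_Y, X_t, k, m, alpha):
    # Step 2: retrieve the k cases most similar to the target problem.
    sims = np.array([similarity(X_i, X_t) for X_i in case_base_X])
    top_k = np.argsort(sims)[::-1][:k]
    C_X, C_Y = case_base_X[top_k], case_base_Y[top_k]
    # Step 3: remove wrongly retrieved (low local density) cases.
    C_X, C_Y = remove_abnormal_cases(C_X, C_Y, m, alpha)
    # Step 4: acquire the suggested solution through Equation (1).
    return suggest_solution(C_X, C_Y, X_t, similarity)
```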
