*3.2. Characteristics of the Model*

To address the weak resistance of the rough set model to noisy information in a dataset, this study proposes a mixed-integer linear programming model for rough set-based classification with flexible attribute selection (MILP-FRST). The model integrates mixed-integer linear programming with the rough set model to define the related concepts and describe the related theories. It is both an optimization of the original mining model and an extension of rough set theory. The model has the following characteristics:

(1) The model can filter attributes out of the attribute set. In practice, the first step in analyzing a high-dimensional dataset is dimensionality reduction. After dimensionality reduction, the dataset contains only part of the information in the raw dataset; that is, dimensionality reduction comes at the expense of some of the information in the raw dataset. MILP-FRST eliminates the attributes that have little influence on the decision accuracy and completes attribute selection automatically. Therefore, only simple preprocessing based on data quality analysis is needed, and the information contained in the raw dataset is preserved to the greatest possible extent.

(2) The model implements the partitioning of the universe by the attribute set, defines the upper approximation set and the lower approximation set, sets the variable precision, restricts the support of the lower approximation set, and calculates the determined region, all within a linear model. The attribute set partitioning scheme that maximizes the decision accuracy can thus be obtained.

(3) The model is highly extensible. Depending on the specific object of study, the attribute set, the way the universe is partitioned, and the method can be chosen to suit datasets composed of various data types.

#### **4. Application Study on Data from Diesel Engines**

In this section, we report the results of computational experiments on an assembly clearance parameter dataset from a diesel engine to test the models and compare them. The MIP solver AMPL/CPLEX (version 12.6.0.1) was used to solve problem instances. All computational experiments were performed on a MacBook with a 2.90 GHz Intel Core i7 Processor and 8 GB memory.

This paper takes a certain type of marine diesel engine as the verification object. This type of diesel engine has been on the market for many years, and its manufacturers have accumulated a large amount of valuable data. Table 1 lists the main technical parameters of this type of diesel engine.


**Table 1.** Main technical parameters.

Figure 1 shows the side view and main view of this diesel engine.

**Figure 1.** Side view and main view of this diesel engine.

#### *4.1. Data Set Introduction*

The objects of study are 29 16-cylinder diesel engines of the same type. The data set includes the assembly clearance parameter data and the quality grade data of the diesel engines. The assembly clearance parameter data are numerical, and the quality grade data are categorical.

The marine diesel engine has a complex structure and many components, so there are many assembly clearance parameters. Chybowski L. and Gawdzińska K. proposed recent techniques for component importance analysis of complex technical systems [28–30]. Selecting the important components of a complex system is a key step. The assembly clearance parameters of this type of diesel engine fall into four main groups: 2K, 5K, 10K, and 11K. Here, 2K refers to the mating clearance parameters of the crankshaft and the seat hole of the main bearing, 5K refers to the mating clearance parameters of the camshaft and the seat hole, 10K refers to the meshing clearance parameters of the gears, and 11K refers to the mating clearance parameters of the gear hole and the bearing. Table 2 lists the components involved in the four groups of assembly clearance parameters and the number of parameters in each.


**Table 2.** Assembling clearance parameters adopted.

A total of 28 assembly clearance parameters of the diesel engine were selected; that is, the experimental data set is 28-dimensional. The quality grade data come from the test runs that the manufacturer performs before a diesel engine is delivered, including tests on flammability, diesel viscosity, and reliability. Based on these test runs, the manufacturer determines the quality grade of the diesel engine. There are three quality grades: Qualified, First grade, and High grade. Table 3 shows part of the data for the 28 assembly clearance parameters and the corresponding quality grades of the diesel engines.


**Table 3.** Assembly clearance data and quality grade of the diesel engine.

#### *4.2. Data Pre-Treatment*

After the correlation analysis of the dataset, it is obvious that there is a strong correlation between the assembly clearance parameters of the same part of the diesel engine, and this strong correlation will affect the effectiveness and efficiency of the model. Therefore, according to the correlation analysis of the assembly clearance parameters of the diesel engine, the principal component analysis method is used to reduce the dimension of the dataset. Taking all diesel engine assembly clearance parameters as the input, principal component analysis is carried out, and the cumulative variance contribution rate of each principal component is obtained.

As listed in Table 4, the cumulative variance contribution rate of the first 15 principal components is up to 89%; that is, these 15 principal components can cover most of the information of the assembly clearance parameters. A new dataset made up of these 15 principal components is presented in Table 5.


**Table 4.** Results of the principal component analysis.

**Table 5.** A new dataset made up of these 15 principal components.


The new dataset simplifies the original dataset while retaining most of the information it contains. Consequently, we can avoid a series of problems that high-dimensional datasets create in data mining. At the same time, simplifying the original dataset improves the efficiency of the model.
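The component selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 29 × 28 input matrix is synthetic random data standing in for the real clearance measurements, and the function names and the 0.89 threshold argument are assumptions introduced only for this example.

```python
import numpy as np

def cumulative_variance_ratio(X):
    """Cumulative variance contribution rate of the principal components of X."""
    # Standardize each column so parameters on different scales are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Eigenvalues of the correlation matrix are the component variances.
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

def components_needed(cum_ratio, threshold):
    """Smallest number of leading components whose cumulative rate reaches threshold."""
    return int(np.searchsorted(cum_ratio, threshold) + 1)

# Synthetic stand-in for the 29 x 28 assembly clearance matrix (NOT the paper's data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(29, 6))
X = latent @ rng.normal(size=(6, 28)) + 0.3 * rng.normal(size=(29, 28))

cum_ratio = cumulative_variance_ratio(X)
n_keep = components_needed(cum_ratio, threshold=0.89)  # hypothetical threshold
print(n_keep, round(float(cum_ratio[n_keep - 1]), 3))
```

On the real data, the same procedure would report how many leading components are needed to reach a chosen cumulative contribution rate; the count obtained here reflects only the synthetic input.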

Finally, we combine the dimension-reduced assembly clearance parameters with the whole-machine quality grades to obtain the final dataset, which is used directly in the subsequent computation (see Table 6).


**Table 6.** The final dataset.

#### *4.3. Demonstration of the Process of the Model*

For the specific object and dataset considered here, the whole-machine quality grades of the diesel engines are known, so the partition of the universe by the decision attribute set is known as well. The model can therefore be simplified. We first remove attribute selection for the decision attribute set, together with the related variables and constraints for partitioning the universe. We then turn the variable *q<sub>ik</sub>* into a known parameter matrix that describes the partition of the universe by the decision attribute set.

The model will be implemented in the MIP solver AMPL/CPLEX (version 12.6.0.1). In the operation of the model, the following parameters need to be set in advance:


The model input consists of the principal component data of the assembly clearance parameters obtained by dimensionality reduction processing, quality grades of the diesel engine, and preset parameters described above. The output of the model includes the selection results of the principal components of the input, division results of the universe according to the conditional attribute set, calculation results of the lower approximation set, and calculation results of the number of elements in the determined region.

Fifteen principal component attributes are included in the conditional attribute set, which is composed of the assembly clearance parameters of a diesel engine. The model can filter the attribute set to eliminate the attributes that have little impact on the accuracy of the decision system, and the filtering result is expressed by the variable *sl<sub>c</sub>*. The result is:

$$\mathit{sl}_c = \begin{cases} 1 & c = 1 \\ 1 & c = 2 \\ 1 & c = 3 \\ 1 & c = 4 \\ 1 & c = 5 \\ 1 & c = 6 \\ 1 & c = 7 \\ 1 & c = 8 \\ 1 & c = 9 \\ 1 & c = 10 \\ 1 & c = 11 \\ 1 & c = 12 \\ 1 & c = 13 \\ 1 & c = 14 \\ 1 & c = 15 \end{cases}$$

If the *sl* value of attribute *c* is 1, this attribute will be selected; otherwise, this attribute will be eliminated. Therefore, the result shows that all 15 principal component attributes will be selected.

Partitioning the universe by the conditional attribute set is an important step in the calculation process of the model and a prerequisite for the subsequent calculation. Here, *k* = 1, ..., 10 indexes the 10 approximate equivalence classes. If a diesel engine belongs to an approximate equivalence class, the corresponding element of the partition matrix is 1; otherwise, it is 0. The resulting class sizes are:

$$Q_k = \begin{cases} 4 & k = 1 \\ 4 & k = 2 \\ 4 & k = 3 \\ 3 & k = 4 \\ 4 & k = 5 \\ 3 & k = 6 \\ 0 & k = 7 \\ 3 & k = 8 \\ 1 & k = 9 \\ 3 & k = 10 \end{cases}$$

This result indicates the number of diesel engines in each approximate equivalence class obtained when the conditional attribute set partitions the universe. Among the 10 approximate equivalence classes, one has not been allocated any element; this approximate equivalence class will be deleted. Another has been allocated only one element, which is fewer than the minimum support number; therefore, it will also be deleted. Only eight approximate equivalence classes can be regarded as the lower approximation set.
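The support filtering described above can be reproduced with a short sketch. The class sizes are those reported for *Q<sub>k</sub>*; the value of the minimum support number is not restated in this section, so `MIN_SUPPORT = 2` is an assumption chosen only because it is consistent with a one-element class being rejected and three-element classes being kept.

```python
# Class sizes Q_k reported in the text, indexed k = 1..10.
Q = [4, 4, 4, 3, 4, 3, 0, 3, 1, 3]

# Assumption: hypothetical minimum support number (not given in this section).
MIN_SUPPORT = 2

def supported_classes(sizes, min_support):
    """Return the 1-based indices of classes with at least min_support elements."""
    return [k for k, size in enumerate(sizes, start=1) if size >= min_support]

kept = supported_classes(Q, MIN_SUPPORT)
dropped = [k for k in range(1, len(Q) + 1) if k not in kept]

print(kept)     # [1, 2, 3, 4, 5, 6, 8, 10]
print(dropped)  # [7, 9]
```

The eight surviving indices match the eight approximate equivalence classes retained in the text; classes 7 (empty) and 9 (one element) are removed.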

$$E = \begin{bmatrix} 0 & 0 & 4 \\ 0 & 4 & 0 \\ 0 & 4 & 0 \\ 0 & 3 & 0 \\ 4 & 0 & 0 \\ 0 & 0 & 3 \\ 0 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \\ 3 & 0 & 0 \end{bmatrix}$$

The *E* matrix is the most important part of the model output. Each entry of *E* is the number of elements that belong both to approximate equivalence class *k* and to a given quality grade. The *E* matrix is therefore the basis for solving the lower approximation set. In the *E* matrix for this case, the 10 rows indicate that the conditional attribute set partitions the universe into 10 approximate equivalence classes. Similarly, the 3 columns indicate that the decision attribute set partitions the universe into 3 equivalence classes.
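As a quick consistency check on the reported output, the row sums of *E* should reproduce the class sizes *Q<sub>k</sub>*, and the column sums give the number of engines in each quality grade. The sketch below uses only the values printed in this section; which column corresponds to which quality grade is not stated here, so the columns are left unlabeled.

```python
# E[k][j]: engines in approximate equivalence class k+1 with quality grade column j.
E = [
    [0, 0, 4],
    [0, 4, 0],
    [0, 4, 0],
    [0, 3, 0],
    [4, 0, 0],
    [0, 0, 3],
    [0, 0, 0],
    [0, 3, 0],
    [0, 0, 1],
    [3, 0, 0],
]
Q = [4, 4, 4, 3, 4, 3, 0, 3, 1, 3]

row_sums = [sum(row) for row in E]        # class sizes
col_sums = [sum(col) for col in zip(*E)]  # engines per quality grade column

print(row_sums == Q)   # True
print(col_sums)        # [7, 14, 8]
print(sum(col_sums))   # 29
```

Every nonempty row has a single nonzero entry, so each class in this instance contains engines of one quality grade only.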

$$Y_k = \begin{cases} 4 & k = 1 \\ 4 & k = 2 \\ 4 & k = 3 \\ 3 & k = 4 \\ 4 & k = 5 \\ 3 & k = 6 \\ 0 & k = 7 \\ 3 & k = 8 \\ 0 & k = 9 \\ 3 & k = 10 \end{cases}$$

*Y<sub>k</sub>* is the number of elements in each lower approximation set. Analysis of the minimum support number and the variable precision shows that eight approximate equivalence classes qualify as members of the lower approximation set. Hence, the number of elements in the determined region is:

$$\sum_{k=1}^{10} Y_k = 28$$

The dependency degree of the model, that is, the share of the universe covered by the determined region, is:

$$\lambda = \frac{\sum_{k=1}^{10} Y_k}{|I|} = \frac{28}{29} \approx 0.97$$

According to rough set inference and functional dependence, 0 < λ < 1 indicates partial dependence between the conditional attribute set and the decision attribute set of the decision system:

$$\{\text{assembly clearance parameter}\} \xrightarrow{0.97} \{\text{quality grade}\}$$
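The calculation of *Y<sub>k</sub>* and λ from the *E* matrix can be sketched as follows. This is an illustration under stated assumptions, not the MILP formulation itself: the minimum support number (`MIN_SUPPORT = 2`) and the variable precision (`BETA = 0.8`) are hypothetical values, since the preset parameters are not listed in this section, and a class that passes both tests is assumed to contribute its full size to the determined region.

```python
E = [
    [0, 0, 4], [0, 4, 0], [0, 4, 0], [0, 3, 0], [4, 0, 0],
    [0, 0, 3], [0, 0, 0], [0, 3, 0], [0, 0, 1], [3, 0, 0],
]

MIN_SUPPORT = 2  # assumption: hypothetical minimum support number
BETA = 0.8       # assumption: hypothetical variable-precision threshold

def lower_approximation_sizes(E, min_support, beta):
    """Size contributed by each class to the determined region (0 if rejected)."""
    sizes = []
    for row in E:
        total = sum(row)
        # A class qualifies if it is large enough and its dominant grade
        # accounts for at least a fraction beta of its elements.
        ok = total > 0 and total >= min_support and max(row) / total >= beta
        sizes.append(total if ok else 0)
    return sizes

Y = lower_approximation_sizes(E, MIN_SUPPORT, BETA)
n_objects = sum(sum(row) for row in E)  # |I| = 29 engines
lam = sum(Y) / n_objects

print(Y)              # [4, 4, 4, 3, 4, 3, 0, 3, 0, 3]
print(sum(Y))         # 28
print(round(lam, 2))  # 0.97
```

Because every nonempty class in this instance contains a single quality grade, the result does not depend on the exact value of `BETA`; only the support threshold removes a class (class 9).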

#### *4.4. Performance Comparison of Models*

To validate the effectiveness and advantages of the model, we compare its accuracy experimentally with that of the Φ-rough set model.

As listed in Table 7, the accuracy of MILP-FRST is higher than that of the Φ-rough set model. The accuracy is close to one, which shows that the proposed model can establish an accurate decision rule between the diesel engine assembly clearance parameters and the whole-machine quality grades and uncover a stronger correlation between them.


**Table 7.** Comparison of the accuracy of the two models.

MILP-FRST is an extension of the rough set. A defining characteristic of a linear programming model is its ability to find the optimal solution. This enables the model to find the best way to classify by attributes, so that ideal results can be obtained even when the dataset has undergone only simple preprocessing, and it considerably increases the resistance to noisy data.
