1. Introduction
Mechanisms and data-driven models are playing increasingly important roles in the field of engineering [1]. Traditional machine learning algorithms can only be trained on labeled datasets [2]. Considerable manpower and material resources are consumed when experts label unlabeled samples, whereas unlabeled data are easily obtained from production equipment. In addition, in some cases manually labeled samples contain a small number of errors, which affects the subsequent training. For traditional supervised learning methods, using only a few labeled samples results in a loss of information, because the information in unlabeled samples cannot be completely extracted [3,4,5], and the accuracy of the trained classifiers is not ideal. Semi-supervised learning directly affects the effectiveness of machine learning technology on practical tasks, because it uses only a small number of labeled samples and makes full use of a large number of pseudo-labeled samples to improve the performance of the classifiers. In recent years, it has gradually become a research hotspot in the mainstream study of learning from unlabeled samples [6].
Semi-supervised learning is an important research direction in data mining and machine learning because it makes full use of limited labeled samples and selected unlabeled samples to train the model. According to the learning task, semi-supervised learning is mainly divided into semi-supervised clustering methods and semi-supervised classification methods [7]. Both improve learning performance under limited labeled samples. Current semi-supervised learning methods can be roughly divided into four categories: generative model-based methods, semi-supervised SVM methods, graph-based methods, and disagreement-based methods [8]. Methods based on generative models rest on the assumption that all data (whether labeled or not) are generated by the same underlying model. For example, Shahshahani assumed that the original samples satisfy a Gaussian distribution and used maximum likelihood estimation to fit what is usually called a Gaussian mixture model (GMM) [9]; however, the result depends heavily on the choice of model. As it turns out, the model assumptions must be accurate, and the hypothesized generative model must match the real data distribution. The semi-supervised support vector machine (SVM) is a generalization of the support vector machine to semi-supervised learning [10]. The margin of the SVM over all training data (both labeled and unlabeled) is maximized by jointly determining the hyperplane and the label assignment of the unlabeled data. Joachims proposed the transductive SVM (TSVM) for binary classification problems, which takes a specific test set into account and tries to minimize the misclassification of those specific examples; this method has been widely applied to text classification and achieves very good results [11]. Graph-based methods map all labeled and unlabeled data onto a graph, with each sample in the dataset corresponding to a node. Sample labels are propagated from labeled to unlabeled samples with a certain probability, so the performance of this family of methods also depends strongly on how the data graph is constructed [12].
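To make the graph-based idea concrete, the following is a minimal sketch using scikit-learn's label propagation; the dataset, the fraction of hidden labels, and the kernel parameters are illustrative assumptions and are not part of the method proposed in this paper.

```python
# Illustrative sketch of graph-based semi-supervised learning (label propagation).
# The dataset, split, and parameters are assumptions chosen only for demonstration.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

# Hide the labels of 80% of the samples; unlabeled points are marked with -1.
y_train = y.copy()
unlabeled_mask = rng.rand(len(y)) < 0.8
y_train[unlabeled_mask] = -1

# Labels spread from labeled to unlabeled nodes over a similarity graph.
model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X, y_train)

accuracy = (model.transduction_[unlabeled_mask] == y[unlabeled_mask]).mean()
print(f"Accuracy on originally unlabeled samples: {accuracy:.3f}")
```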
Disagreement-based semi-supervised learning studies the same dataset from multiple different perspectives and focuses on the "divergence" between multiple learners, which predict unlabeled samples through the trained classifiers in order to expand the training set. This type of technology is less affected by model assumptions, by the non-convexity of loss functions, and by data size. Its theoretical foundation is relatively well established, and it is so simple and effective that its applicable scope is broader [7].
Co-training is a form of disagreement-based semi-supervised learning which requires the data to have two fully redundant views that satisfy conditional independence [13]. The general process of standard co-training, as a typical representative of disagreement-based semi-supervised learning, is as follows: (1) use the labeled data to train a classifier on each view; (2) each classifier labels unlabeled samples, selects a certain number of labels with the highest confidence, and adds the corresponding samples to the training set; (3) each classifier updates its model with the new samples; and (4) steps two and three are repeated until, for each classifier, the number of iterations reaches the preset value or the classifier no longer changes [14]. Usually, the data information is required to be sufficient and redundant, and the conditions for finding two independent and complementary views are too harsh. It is difficult to ensure that the two models have different views, let alone different and complementary information. Researchers have found that complementary classification sub-models can be used to relax this requirement [15]. Even if the same dataset is used, the resulting prediction distributions are different, and therefore the classifiers can still be enhanced according to the co-training scheme [16].
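The following is a minimal sketch of the standard co-training loop described above, under simplifying assumptions: the two views are given as separate feature matrices, naive Bayes is used as a placeholder base classifier, and each classifier simply keeps its most confident predictions per round; none of these choices reflect the specific configuration used later in this paper.

```python
# Minimal sketch of standard co-training on two views (steps (1)-(4) above).
# The base classifier, the per-round budget n_add, and the stopping rule are
# illustrative assumptions, not the configuration used in this paper.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1, X2, y, X1_u, X2_u, n_iter=10, n_add=5):
    """X1/X2: two views of the labeled data; X1_u/X2_u: the same views of unlabeled data."""
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(n_iter):
        if len(X1_u) == 0:
            break
        # (1) train one classifier per view on the current labeled set
        h1.fit(X1, y)
        h2.fit(X2, y)
        # (2) each classifier labels the unlabeled pool and keeps its most confident picks
        pseudo = {}
        for h, Xu in ((h1, X1_u), (h2, X2_u)):
            proba = h.predict_proba(Xu)
            for i in np.argsort(proba.max(axis=1))[-n_add:]:
                pseudo[int(i)] = int(h.classes_[proba[i].argmax()])
        picked = sorted(pseudo)
        # (3) add the pseudo-labeled samples to the shared training set
        X1 = np.vstack([X1, X1_u[picked]])
        X2 = np.vstack([X2, X2_u[picked]])
        y = np.concatenate([y, [pseudo[i] for i in picked]])
        # (4) remove them from the unlabeled pool and repeat
        X1_u = np.delete(X1_u, picked, axis=0)
        X2_u = np.delete(X2_u, picked, axis=0)
    return h1, h2
```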
Abdelgayed proposed a semi-supervised machine learning method based on co-training to solve fault classification in transmission and distribution systems of microgrids [
17]. Yu proposed a multi-classifier integrated progressive semi-supervised learning method, using different classifiers obtained by random subspace technology and expanded the training set through a progressive training set generation process [
18]. Liu used a random forest algorithm for semi-supervised learning to select pseudo-labeled samples from unlabeled samples and considered the label confidence of the unlabeled samples and the positional relationship with the edge of the classification surface to improve the classification performance [
19]. Zhang integrated regression prediction results to solve the image processing problem of separating the foreground of an image from its background [20].
While semi-supervised learning methods have been successfully applied in a variety of scenarios, the randomness of the selection of unlabeled samples during the training process has resulted in unstable and less robust results [
21]. The use of ensemble learning can alleviate this problem caused by the selection of unlabeled samples by unifying the results of different learning methods. Using semi-supervised ensemble learning methods to obtain more stable classification results is essential for industry [
22,
23]. At the same time, however, the use of semi-supervised ensemble learning methods has the following limitations when applied in industry:
1. A large amount of data collected in the industry is mainly labeled by workers through subjective and uncertain empirical knowledge [
24]. Wrongly labeled samples have a huge impact on the training result when the training set is not large enough;
2. Most semi-supervised ensemble learning methods do not consider how to develop a strategy to select useful samples and eliminate redundant samples in order to expand the training set.
In order to address the limitations of traditional semi-supervised ensemble learning methods, a semi-supervised neighbor ensemble learning method (SSNEL) using the multivariate adaptive regression splines (MARS) and adaptive boosting (Adaboost) algorithms as sub-models is proposed in this paper. This method performs co-training based on the "complementarity" that arises from the significant difference in the weighting rules of the MARS and Adaboost algorithms during training. On the one hand, the MARS algorithm pays attention to the influence of each feature on the classification results and the relationships between features during the training process, and ignores the role of individual samples in shaping the classifier [25]. On the other hand, the Adaboost algorithm is concerned only with the different roles of the samples, and ignores the differences between features and the connections among them [26]. Theoretically, co-training with two strong subclassifiers based on the MARS and Adaboost algorithms reduces, to a certain extent, the harm that wrongly labeled samples cause to the generalization ability of the model. In addition, it improves robustness and prevents "convergence" between homogeneous classifiers during continued training [27]. Each time the subclassifiers produce pseudo-labeled samples, a series of redundant sample removal algorithms based on the near neighbor degree model is applied to detect wrongly labeled samples and samples that do not significantly improve the generalization ability of the model. Therefore, the SSNEL finds the optimal pseudo-labeled samples to expand the training set and ensures that the best classification results are obtained after retraining on the expanded training set. The experiment used multiple real datasets from the University of California Irvine (UCI) machine learning repository for verification. The results show that the SSNEL outperforms traditional semi-supervised ensemble learning methods on most of the real datasets and performs particularly well in the industrial application of aluminum electrolytic superheat state classification.
The MARS algorithm builds models of the form
$\hat{f}(x)=\sum_{i=1}^{k} c_i B_i(x)$.
The model is a weighted sum of basis function values. Each $c_i$ is a constant coefficient, and each basis function $B_i(x)$ can take one of the following three forms: (1) a constant; (2) a hinge function of the form $\max(0, x - t)$ or $\max(0, t - x)$, where the MARS algorithm automatically selects the variables and the knot values $t$ of the hinge functions; and (3) a product of two or more hinge functions, which can model interactions between two or more variables. The MARS algorithm builds a model in two phases, i.e., a forward pass and a backward pass. This two-stage approach is the same as that used by recursive partitioning trees [28].
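To make the model form above concrete, the short sketch below evaluates a hand-specified MARS-style model as a weighted sum of a constant, two hinge functions, and one hinge product; the coefficients and knot locations are purely illustrative assumptions, since MARS selects them automatically during the forward and backward passes.

```python
# Evaluating a MARS-style model: a weighted sum of basis functions, where each basis
# function is a constant, a hinge max(0, x - t) / max(0, t - x), or a product of hinges.
# Coefficients and knot locations below are illustrative assumptions only.
import numpy as np

def hinge(x, knot, direction):
    """direction=+1 -> max(0, x - knot); direction=-1 -> max(0, knot - x)."""
    return np.maximum(0.0, direction * (x - knot))

def mars_predict(X):
    x1, x2 = X[:, 0], X[:, 1]
    return (
        2.0                                              # constant basis function
        + 1.5 * hinge(x1, knot=3.0, direction=+1)        # single hinge on feature 1
        - 0.8 * hinge(x2, knot=1.0, direction=-1)        # single hinge on feature 2
        + 0.3 * hinge(x1, 3.0, +1) * hinge(x2, 1.0, +1)  # product term: interaction
    )

X = np.array([[2.0, 0.5], [4.0, 2.0], [5.0, 0.0]])
print(mars_predict(X))
```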
The Adaboost algorithm first obtains a weak classifier by learning from a training sample set. Subsequently, in every round of retraining, a new training sample set is formed that emphasizes the misclassified samples together with supplementary samples, and thus the classification ability of the weak classifier is improved [29]. The final classification of a sample is determined by the weights of the individual classifiers and samples, but the links between different attributes are not taken into account during classification. Inspired by the redundant view condition of co-training, the strong complementarity between the two algorithms should, in theory, make them cooperate better in co-training.
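The reweighting behavior described above can be illustrated with a short sketch of the classical binary (labels in {-1, +1}) weight-update rule; the decision-stump base learner and the number of rounds are placeholder assumptions rather than the settings used in this paper.

```python
# Sketch of the Adaboost reweighting idea for binary labels in {-1, +1}:
# misclassified samples receive larger weights, so later weak learners focus on them.
# The weak learner and the number of rounds are illustrative placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)              # uniform sample weights at the start
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err == 0 or err >= 0.5:       # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)   # up-weight misclassified samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    score = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(score)
```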
This paper is divided into six sections, and the main content of each section is summarized as follows:
Section 1 is an introduction, which describes the research background and core issues of this method;
Section 2 presents the components and design of semi-supervised neighbor ensemble learning;
Section 3 details the algorithm implementation and how to select useful pseudo-labeled samples; the validation work is in
Section 4 which includes the comparison of this method with its competitors on the University of California Irvine (UCI) datasets;
Section 5 presents the application and mechanism verification of this method on the aluminum electrolytic industry dataset; and the conclusions and contributions of this study are in
Section 6.
3. Algorithm Implementation
Algorithm 1 gives the implementation steps of the SSNEL. For all the data obtained, first divide the data samples into a labeled sample set $L=\{(x_i,y_i)\}_{i=1}^{l}$ containing $l$ samples and an unlabeled sample set $U=\{x_j\}_{j=1}^{u}$ containing $u$ samples, where $(x_i,y_i)$ indicates a labeled sample ($y_i$ denotes the label of the corresponding sample $x_i$) and $x_j$ is an unlabeled sample. Then, use the labeled sample set $L$ to initialize the multivariate adaptive regression spline and Adaboost classifiers; the two obtained classifiers are expressed as $h_1$ and $h_2$. In the first iteration, given a subset $U_t$ of the unlabeled sample set $U$, the samples can be predicted by $h_1$ and $h_2$. Next, a pseudo-labeled dataset selection algorithm based on near neighbor degree (DSSA) is applied to pick the more useful result from the results obtained by $h_1$ and $h_2$.
Algorithm 1. SSNEL
Input: the labeled training set $L$; the selected unlabeled sample set $U$; the number of samples $n$ of the subset in every iteration; the classifier set containing MARS ($h_1$) and Adaboost ($h_2$).
1: Initialize the current classifiers $h_1$ and $h_2$ using $L$;
2: While:
3: Select a subset $U_t$ from $U$;
4: Obtain two pseudo-labeled sample sets $P_1$ and $P_2$ according to the classification results of $U_t$ with $h_1$ and $h_2$;
5: Call the pseudo-labeled dataset selection algorithm based on near neighbor degree (DSSA) in Algorithm 2 to generate a more reliable pseudo-labeled sample set $P^{*}$;
6: For the pseudo-labeled sample set $P^{*}$, call the pseudo-labeled sample detection algorithm based on near neighbor degree (SPDA) in Algorithm 3 to find the useful sample set $P_r$;
7: Update the training set $L \leftarrow L \cup P_r$ and the unlabeled dataset $U \leftarrow U \setminus U_t$;
8: Until the unlabeled samples run out;
Output: Final training set including labeled samples and useful pseudo-labeled samples.
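As a reading aid, the following is a high-level Python sketch of the loop in Algorithm 1. The `mars_classifier` argument is a stand-in for a MARS classifier (scikit-learn does not provide one), `dssa` and `spda` are placeholders for Algorithms 2 and 3, and the subset selection order is a simplifying assumption.

```python
# High-level sketch of Algorithm 1 (SSNEL). `mars_classifier` is a stand-in for a MARS
# classifier, and `dssa` / `spda` stand for Algorithms 2 and 3 below.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def ssnel(L_X, L_y, U_X, mars_classifier, dssa, spda, subset_size):
    h1, h2 = mars_classifier, AdaBoostClassifier()
    h1.fit(L_X, L_y)                        # step 1: initialize both classifiers on L
    h2.fit(L_X, L_y)
    while len(U_X) > 0:                     # step 8: stop when unlabeled samples run out
        idx = np.arange(min(subset_size, len(U_X)))
        Ut = U_X[idx]                       # step 3: select a subset of U
        P1 = (Ut, h1.predict(Ut))           # step 4: two candidate pseudo-labeled sets
        P2 = (Ut, h2.predict(Ut))
        P_star = dssa(L_X, L_y, Ut, P1, P2, h1, h2)  # step 5: pick the more reliable set
        Pr_X, Pr_y = spda(L_X, L_y, P_star)          # step 6: keep only useful samples
        L_X = np.vstack([L_X, Pr_X])        # step 7: update training and unlabeled sets
        L_y = np.concatenate([L_y, Pr_y])
        U_X = np.delete(U_X, idx, axis=0)
        h1.fit(L_X, L_y)
        h2.fit(L_X, L_y)
    return L_X, L_y, h1, h2
```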
For a single classifier, the labels predicted for the subset $U_t$ can be denoted by $\hat{Y}$, and the corresponding pseudo-labeled sample set by $P$. Retrain the MARS and Adaboost classifiers with the corresponding augmented set $L \cup P$; the result predicted by the retrained classifier is denoted by $\tilde{Y}$, and the near neighbor degree $d_1$ of $\tilde{Y}$ and $\hat{Y}$ is expressed as follows:
The near neighbor degree $d_2$ between the selected subset $U_t$ and the labeled sample set $L$ in this iteration is also considered as follows:
where $c$ is the number of categories in the dataset and $n$ is the number of samples in the selected subset $U_t$.
On the one hand, it is obvious that $d_1$ reflects the effect of introducing the pseudo-labeled samples into the training set on the classification. The larger $d_1$ is, the more reliable the result. On the other hand, $d_2$ reflects the near neighbor degree between the pseudo-labeled samples and the training set samples of the same category. When the value of $d_2$ is too small, the distribution of the samples classified into the same class differs significantly from that of the training set. However, a value that is too large means that the pseudo-labeled samples bring little improvement in classification ability and may even reduce the stability of the classifiers.
Then, the final near neighbor degree $D$ is obtained by combining $d_1$ and $d_2$:
where $\lambda$ is a preset parameter. Of the two pseudo-labeled sample sets obtained by the MARS and Adaboost classifiers, the one with the larger $D$ is selected. The pseudocode of the DSSA is given in Algorithm 2.
Algorithm 2. DSSA
Input: the labeled training set $L$; the selected unlabeled sample subset $U_t$; the classifier set containing MARS ($h_1$) and Adaboost ($h_2$).
1: For a single classifier, generate the classification results of the set $U_t$ and obtain a group of pseudo-labeled samples $P$;
2: Retrain MARS and Adaboost with the corresponding $L \cup P$; the results predicted by the retrained classifiers are $\tilde{Y}$;
3: Calculate the near neighbor degree $d_1$ of $\tilde{Y}$ and $\hat{Y}$;
4: Compare the samples in the datasets $U_t$ and $L$ to calculate the near neighbor degree $d_2$;
5: Calculate the final near neighbor degree $D$;
6: In the two pseudo-labeled sample sets obtained by MARS and Adaboost, select the one which has the larger $D$.
Output: A more reliable pseudo-labeled sample set $P^{*}$.
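The exact near-neighbor-degree expressions and their combination did not survive extraction, so the sketch below should be read as one assumed realization of the DSSA steps: $d_1$ is approximated by the agreement between the retrained classifier and the pseudo-labels, $d_2$ by a mean distance-based similarity to same-class training samples, and the combination $D = d_1 + \lambda d_2$ is an assumption rather than the paper's formula.

```python
# Sketch of the DSSA selection step. The near-neighbor-degree proxies and the way d1
# and d2 are combined (D = d1 + lam * d2) are assumptions for illustration only.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import pairwise_distances

def d1_agreement(h, L_X, L_y, P_X, P_y, Ut):
    """Retrain on L union P and measure how well the retrained classifier reproduces P's labels."""
    h_new = clone(h).fit(np.vstack([L_X, P_X]), np.concatenate([L_y, P_y]))
    return (h_new.predict(Ut) == P_y).mean()

def d2_nearness(L_X, L_y, P_X, P_y):
    """Mean similarity between each pseudo-labeled sample and labeled samples of its class."""
    sims = []
    for x, c in zip(P_X, P_y):
        same = L_X[L_y == c]
        if len(same) > 0:
            sims.append(1.0 / (1.0 + pairwise_distances(x.reshape(1, -1), same).mean()))
    return float(np.mean(sims)) if sims else 0.0

def dssa(L_X, L_y, Ut, P1, P2, h1, h2, lam=0.5):
    scores = []
    for (P_X, P_y), h in ((P1, h1), (P2, h2)):
        scores.append(d1_agreement(h, L_X, L_y, P_X, P_y, Ut)
                      + lam * d2_nearness(L_X, L_y, P_X, P_y))
    return P1 if scores[0] >= scores[1] else P2
```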
However, not all the samples in the selected pseudo-labeled dataset are useful for classification. Fortunately, the pseudo-labeled sample detection algorithm based on near neighbor degree (SPDA) is a useful approach to test the validity of a sample. The selected pseudo-labeled dataset is denoted by $P^{*}$, and the training sample subset in the training set corresponding to the category of a pseudo-labeled sample $x_p$ is denoted by $L_c$. The outlier factor of the pseudo-labeled sample can be calculated by:
where $s(x_p, x_q)$ is the near neighbor degree between sample $x_p$ and sample $x_q$, $m$ represents the number of attributes, and $w_j$ represents the weight of the effect of the $j$-th feature on the classification result.
For every unlabeled sample, count the number of training samples whose near neighbor degree is greater than the mean, and obtain the outlier factor vector $O$. A sign vector $S$ is then constructed for the selected pseudo-labeled dataset $P^{*}$ (where $n_p$ is the number of samples in $P^{*}$), and each element of $S$ is defined as follows:
The new training set is then augmented as $L \leftarrow L \cup P_r$, and the unlabeled dataset is updated as $U \leftarrow U \setminus U_t$, where $P_r$ denotes the set of retained pseudo-labeled samples. Finally, the subclassifiers are retrained with the new training set in the next iteration, until all unlabeled samples are used up.
Algorithm 3. SPDA
Input: the labeled training set $L$; the selected unlabeled sample subset $U_t$; the more reliable pseudo-labeled sample set $P^{*}$ selected by the DSSA, including the samples and their pseudo-labels.
1: Divide the current training set into $c$ groups according to the types of pseudo-labels ($c$ is the number of categories included in the sample set);
2: For each pseudo-labeled sample $x_p$ in $P^{*}$:
3: Calculate the near neighbor degrees between $x_p$ and the training set samples in the corresponding category;
4: According to the result of step 3, compute the outlier factor of $x_p$;
5: Obtain the outlier factor vector $O$ and the sign vector $S$, and screen out the sample set $P_r$ that is beneficial to the improvement of classification ability.
Output: the sample set $P_r$.
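Similarly, since the outlier-factor formula is not recoverable from the extracted text, the following sketch uses an assumed feature-weighted near neighbor degree, the "count above the mean" rule described in the text, and an assumed retention threshold, only to illustrate how Algorithm 3 could filter the pseudo-labeled set.

```python
# Sketch of the SPDA filtering step. The weighted near neighbor degree and the retention
# threshold below are assumptions consistent with the text, not the paper's exact formulas.
import numpy as np

def near_neighbor_degree(xp, xq, w):
    """Assumed similarity: inverse of a feature-weighted distance (w = feature weights)."""
    return 1.0 / (1.0 + np.sqrt(np.sum(w * (xp - xq) ** 2)))

def spda(L_X, L_y, P_star, w=None):
    P_X, P_y = P_star
    w = np.ones(L_X.shape[1]) if w is None else w
    keep = []
    for i, (xp, c) in enumerate(zip(P_X, P_y)):
        same = L_X[L_y == c]                          # training samples of the same category
        if len(same) == 0:
            continue
        degrees = np.array([near_neighbor_degree(xp, xq, w) for xq in same])
        outlier_factor = (degrees > degrees.mean()).sum()   # count above the mean
        if outlier_factor >= len(same) / 2:           # assumed retention threshold
            keep.append(i)
    return P_X[keep], P_y[keep]
```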
5. Application on Aluminum Electrolyzer Condition Dataset
The stable running state of the aluminum electrolytic cell can ensure that the aluminum electrolytic cell has a good shape of the bore and improve the current efficiency [
32]. By analyzing the relationship between the key process indicators and the state of the electrolyzer condition, 10 process indicators are selected in this experiment as the input of the prediction model, which are needle vibration, swing, effect peak pressure, effect number, aluminum level, electrolyte level, number of overfeeds, number of underfeeds, molecular ratio, and temperature [
33]. The judgment results include the over stability index (H), the moderate stability index (F), and the low stability index (L). Experienced workers can add labels to a small number of data samples to judge the current status. In this paper, 2000 samples from a factory in Shandong Province are adopted to validate the applicability of the SSNEL in the field of engineering. We performed ten classification experiments on the same dataset for each algorithm. Because the selection of the initial labeled training set and of the unlabeled samples is random, different sample selections bring some fluctuations for a given algorithm.
Table 3 shows the comparison of the results of this method and its competitors on the aluminum electrolytic cell condition dataset. The results contain the average error rates and the fluctuation ranges of every algorithm over the ten experiments.
The comparison shows that when a supervised learning algorithm is used to deal with the problem of lacking labeled samples, the classification effect is not satisfactory and the classification error rate is greatly affected by the training set. When traditional semi-supervised learning algorithms such as co-training and M-Co-Forest are used, the error rate is clearly lower than that of the supervised learning algorithms, which shows that semi-supervised learning extracts information from the unlabeled samples. Finally, the SSNEL has the best performance, with an average error rate of only 0.186, which can relieve the reliance on manual observation to identify the condition of the electrolytic cell in actual industry. The main reasons are as follows:
The SSNEL uses integrated classifiers instead of weak classifiers (such as decision tree, k-nearest neighbor) for semi-supervised learning. It can fully mine the information in unlabeled samples, and also reduces the impact of a small number of wrong labels.
The SSNEL applies near neighbor-based pseudo-labeled sample selection strategies (DSSA and SPDA) to gradually obtain more accurate pseudo-labeled samples to improve the classifiers.
To sum up, in the absence of labeled samples, the SSNEL algorithm proposed in this paper achieves a more satisfactory classification effect. On the one hand, the classification error rate is lower; on the other hand, when dealing with large uncertainties in the sample set, the SSNEL shows a degree of robustness, and its classification results are not significantly affected by random factors.
In order to further illustrate the effectiveness of the method in the industrial field, the range of various parameters of the aluminum electrolytic cell under stable operating conditions is taken into account, and is summarized in
Table 4. The mechanism analysis of three samples determined to be unstable by the SSNEL is shown in
Table 5.
From a mechanistic perspective, workers can judge the samples according to data and experience [
34]. Take the sample on 21 June 2018 as an example. On the one hand, the temperature of the electrolyte was too high, which caused the superheat to be too high; on the other hand, the molecular ratio was too high, which inhibited the separation of carbon slag from the electrolyte; both reduced current efficiency. Mechanism-based data analysis can thus explain the reasons for the fluctuation of the cell condition according to the changes in the data. At the same time, it validates the aluminum electrolytic cell condition evaluation model from a mechanistic point of view. Therefore, the results of the SSNEL can be confirmed by mechanism analysis. In addition, the SSNEL can reduce the dependence on experience to a certain extent.