1. Introduction
As an indispensable component of industrial applications, mechanical equipment is an important force in promoting sustainable development and industrial upgrading [1,2]. However, even a tiny failure may cause production downtime or even catastrophic consequences, and components are especially prone to failure when the equipment operates under high loads for a long time [3]. It is therefore of great significance to study equipment fault diagnosis to improve equipment safety and reliability, a topic that has attracted increasing attention in the industrial safety community [4,5].
In the last decade, the rapid development of information technology has brought new perspectives and challenges to traditional fault diagnosis methods for rotating machinery and has promoted the evolution of fault diagnosis from traditional shallow models to deep learning models. Zhang et al. [6] proposed an improved residual network (ResNet) based on hybrid attention for wind turbine gearbox fault diagnosis. Shao et al. [7] presented a novel convolutional deep belief network (DBN) for bearing fault diagnosis. He et al. [8] explored a transfer learning fault diagnosis method based on a convolutional neural network (CNN). Shao et al. [9] provided a modified stacked autoencoder (SAE) based on an adaptive Morlet wavelet for rotating machinery fault diagnosis. Nie et al. [10] developed a fault diagnosis framework based on recurrent neural networks (RNN) to relax the impact of noisy labels. These deep learning models achieve impressive results and overcome the main shortcoming of shallow models, namely their heavy reliance on manual feature extraction. However, the number of samples available to train a deep model seriously affects its accuracy. Moreover, in real industrial applications, it is difficult or even impossible to collect a large amount of labeled data, which leaves deep learning-based methods with poor generalization ability. It is therefore essential to capture discriminative knowledge from limited training data to obtain a generalized deep model.
Few-shot learning (FSL) is an impressive paradigm that utilizes limited labeled samples to learn quickly and achieve stable classification results, and it has received widespread attention and made encouraging progress [11,12]. To date, several FSL methods have been reported, such as the Prototypical Network (ProNet) [13,14], Matching Network (MatNet) [15,16], and Siamese Network (SiaNet) [17,18]. Among them, ProNet transforms the classification problem into a distance measurement problem in the feature embedding space, which has lower time complexity and is widely applied in pattern recognition fields. Chowdhury et al. [19] used the maximum mean discrepancy (MMD) to evaluate the influence of the distributions including and excluding a sample, and obtained the sample weights simply by subtracting from 1. Wang et al. [20] presented a weighted prototypical network for bearing fault diagnosis, in which the Kullback–Leibler (KL) divergence was adopted to estimate the influence of specific samples on the sample distribution. Gao et al. [21] designed a novel prototypical network for noisy few-shot problems based on instance-level and feature-level attention schemes to accentuate the significance of instances and features, respectively. Ye et al. [22] proposed a learning-with-a-strong-teacher framework for few-shot learning, in which a strong classifier was constructed to supervise the few-shot learner for image recognition. Zhao et al. [23] employed a dual adaptive representation alignment network for cross-domain few-shot learning, which updates the support instances as prototypes and renews the prototypes in a differentiable manner. In summary, the above FSL-based methods provide new ideas and have made certain progress in mitigating the problem of scarce training samples. However, they focus only on how to weight samples and do not overcome the limitation of small sample sizes itself.
To tackle the problem of FSL at its root, semi-supervised learning (SSL) utilizes a few labeled samples together with massive unlabeled samples to improve learning performance; its methods can be divided into three categories: adversarial generation, consistency regularization, and pseudo-labeling. Pseudo-labeling techniques, which label unlabeled samples that are easy to obtain in order to expand the training set, have received increasing attention recently. He et al. [24] proposed a semi-supervised prototypical network based on pseudo-labeling for bearing fault diagnosis, in which a fixed threshold was used to select pseudo-labels and the optimal threshold was obtained through a large number of experiments. Fan et al. [25] presented a semi-supervised fault diagnosis method that screens pseudo-labels with thresholds and adjusts the model's dependence on pseudo-labels through learning. Zhang et al. [26] explored a self-training semi-supervised method that iteratively selects unlabeled data with high predictive confidence on a trained model and extracts their pseudo-labels. Zhang et al. [27] adopted Monte Carlo uncertainty as the threshold to screen pseudo-labels and built a gearbox fault diagnosis scenario with small samples based on a momentum prototypical network. Zhou et al. [28] designed an adaptive prototypical network for few-shot learning with sample quality assessment and pseudo-label screening to weaken the impact of unreliable pseudo-labels.
Most existing SSL methods based on pseudo-labeling achieve impressive accuracy on few-shot learning tasks by enlarging the training set with labeled versions of the unlabeled samples. However, the following limitations remain: (1) a single threshold for pseudo-label screening cannot guarantee the accuracy of the pseudo-label collection, which dramatically degrades the performance of SSL methods; (2) insufficient consideration of iteration-stopping conditions can easily lead to the propagation and accumulation of incorrect information during the iteration process. To resolve the network degradation caused by inaccurate pseudo-label screening and insufficient sample selection, a semi-supervised learning method based on a pseudo-labeling multi-screening strategy is proposed for few-shot bearing fault diagnosis. In this paper, a composite threshold for pseudo-label screening that combines Monte Carlo uncertainty and classification probability is explored to overcome the limitations of screening with a single threshold. Then, a multi-round pseudo-label accumulation model based on network optimization is employed to solve the network degradation caused by mislabeling. Finally, three well-known bearing datasets are used to verify the effectiveness of the proposed model. The main contributions are summarized as follows:
(1) A multi-screening strategy based on Monte Carlo uncertainty and classification probability is proposed for pseudo-label selection, which helps ensure the accuracy of pseudo-label screening and improves generalization ability.
(2) A semi-supervised learning method based on AdaBoost-style adaptation is explored to integrate multiple samples into a class prototype and thus obtain a more accurate prototype, which overcomes the drawbacks caused by low-quality labeled samples hidden in the dataset.
(3) An estimation strategy for the individual sample contribution rate is presented to accurately obtain individual sample weights and improve the performance of the AdaBoost-style adaptation, which tackles the common neglect of individual sample differences.
3. The Proposed Semi-Supervised Learning Method
In this section, a pseudo-labeling multi-screening-based semi-supervised learning method for few-shot fault diagnosis is proposed. The overall structure is presented in Figure 1 and includes three main components: (1) an AdaBoost-based adaptive weighted prototypical network (AWPN); (2) a pseudo-labeling multi-screening strategy; and (3) semi-supervised learning-based fault diagnosis.
3.1. Squeeze and Excitation-Based Feature Extractor
To make the model pay attention to the differences between different perspectives during learning and to automatically learn the importance of features from different perspectives, Roy et al. [35] proposed concurrent spatial and channel squeeze and excitation (scSE) to achieve feature recalibration in both space and channel.
The application of scSE to one-dimensional data is given in Figure 2. An input feature set $U \in \mathbb{R}^{D \times C}$ is a combination of $C$ channels, $U = [\mathbf{u}^{1}, \mathbf{u}^{2}, \ldots, \mathbf{u}^{C}]$ with $\mathbf{u}^{c} \in \mathbb{R}^{D}$, and can also be rewritten as a combination of $D$ feature-layer slices, $U = [\mathbf{u}_{1}, \mathbf{u}_{2}, \ldots, \mathbf{u}_{D}]$ with $\mathbf{u}_{d} \in \mathbb{R}^{C}$. Vector $\mathbf{z} \in \mathbb{R}^{C}$ is generated by spatial squeeze, which is executed by a global average pooling layer, and vector $\mathbf{q} \in \mathbb{R}^{D}$ is generated by channel squeeze, obtained by a convolution $\mathbf{q} = \mathbf{W}_{sq} * U$ with kernel $\mathbf{W}_{sq} \in \mathbb{R}^{1 \times 1 \times C}$; this is in fact a projection of the multi-channel features at the feature level. The vector $\mathbf{z}$ is converted into the vector $\hat{\mathbf{z}}$ through two fully connected layers $\mathbf{W}_{1} \in \mathbb{R}^{C \times (C/r)}$ and $\mathbf{W}_{2} \in \mathbb{R}^{(C/r) \times C}$, where $r$ denotes the bottleneck ratio of the channel excitation. To ensure that the excitation remains within an appropriate range, $\hat{\mathbf{z}}$ is mapped to $[0, 1]$ by a sigmoid function $\sigma(\cdot)$. It is worth noting that after channel squeeze, the obtained feature projection is still applicable to the encode–decode operation. Therefore, two fully connected layers $\mathbf{W}_{3} \in \mathbb{R}^{D \times (D/r')}$ and $\mathbf{W}_{4} \in \mathbb{R}^{(D/r') \times D}$ are used to convert $\mathbf{q}$ to $\hat{\mathbf{q}}$, where $r'$ denotes the bottleneck ratio of the spatial excitation, and the sigmoid function is likewise used to keep $\hat{\mathbf{q}}$ within an appropriate range. Finally, the recalibrated features from the two branches are fused in a max-out manner.
The calculation process of scSE is shown in Equations (4) to (10):

$$z_{c} = \frac{1}{D} \sum_{i=1}^{D} u^{c}(i) \quad (4)$$

$$\hat{\mathbf{z}} = \mathbf{W}_{1}\,\delta(\mathbf{W}_{2}\mathbf{z}) \quad (5)$$

$$\hat{U}_{cSE} = [\sigma(\hat{z}_{1})\mathbf{u}^{1}, \sigma(\hat{z}_{2})\mathbf{u}^{2}, \ldots, \sigma(\hat{z}_{C})\mathbf{u}^{C}] \quad (6)$$

$$\mathbf{q} = \mathbf{W}_{sq} * U \quad (7)$$

$$\hat{\mathbf{q}} = \mathbf{W}_{3}\,\delta(\mathbf{W}_{4}\mathbf{q}) \quad (8)$$

$$\hat{U}_{sSE} = [\sigma(\hat{q}_{1})\mathbf{u}_{1}, \sigma(\hat{q}_{2})\mathbf{u}_{2}, \ldots, \sigma(\hat{q}_{D})\mathbf{u}_{D}] \quad (9)$$

$$\hat{U}_{scSE} = \max(\hat{U}_{cSE}, \hat{U}_{sSE}) \quad (10)$$

where $\delta(\cdot)$ and $\sigma(\cdot)$ are the ReLU function and sigmoid function, respectively, and $*$ represents the convolution operation.
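For concreteness, the following is a minimal PyTorch sketch of the 1-D scSE block described above. The class name, the bottleneck ratios, and sizing the spatial-excitation fully connected layers to a fixed input length are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class SCSE1d(nn.Module):
    """Sketch of a 1-D scSE block (Eqs. (4)-(10)); names and defaults assumed."""

    def __init__(self, channels: int, length: int, r: int = 2, r_prime: int = 2):
        super().__init__()
        # Spatial squeeze -> channel excitation, Eqs. (4)-(6).
        self.avg_pool = nn.AdaptiveAvgPool1d(1)
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // r),   # W2 (reduce)
            nn.ReLU(inplace=True),                # delta
            nn.Linear(channels // r, channels),   # W1 (expand)
            nn.Sigmoid(),                         # sigma
        )
        # Channel squeeze (1x1 conv) -> spatial excitation, Eqs. (7)-(9).
        self.conv_sq = nn.Conv1d(channels, 1, kernel_size=1)  # W_sq
        self.spatial_fc = nn.Sequential(
            nn.Linear(length, length // r_prime),  # W4 (reduce)
            nn.ReLU(inplace=True),
            nn.Linear(length // r_prime, length),  # W3 (expand)
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, d = u.shape                              # u: (batch, C, D)
        z = self.avg_pool(u).view(b, c)                # spatial squeeze
        u_cse = u * self.channel_fc(z).view(b, c, 1)   # channel recalibration
        q = self.conv_sq(u).view(b, d)                 # channel squeeze
        u_sse = u * self.spatial_fc(q).view(b, 1, d)   # spatial recalibration
        return torch.max(u_cse, u_sse)                 # max-out fusion, Eq. (10)

# Example: y = SCSE1d(channels=64, length=128)(torch.randn(8, 64, 128))
```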
3.2. Adaptive Weighted Prototypical Network
Inspired by the AdaBoost theory that weak classifiers can be integrated into a strong classifier, a prototypical network is proposed that adaptively weights sample features into a strong feature representation. Each sample is treated as a weak classifier, and its weight is calculated by measuring the influence of removing that sample on the whole sample distribution; the weighted features are then combined to build a strong feature representation, that is, the class prototype.
As a commonly used criterion, the maximum mean discrepancy (MMD) is widely adopted to measure the distribution discrepancy between two domains. For a given feature set $U = \{f_{\varphi}(x_{i})\}_{i=1}^{n}$, where $f_{\varphi}(\cdot)$ is the feature extractor based on the squeeze and excitation mechanism, let $U_{\setminus i}$ denote the feature set with feature $f_{\varphi}(x_{i})$ absent. The influence of sample $x_{i}$ on the distribution of the sample set can therefore be converted into calculating the maximum mean discrepancy between $U$ and $U_{\setminus i}$, as shown below:

$$\mathrm{MMD}_{i} = \mathrm{MMD}(U, U_{\setminus i}) = \left\| \frac{1}{n} \sum_{u \in U} \phi(u) - \frac{1}{n-1} \sum_{u \in U_{\setminus i}} \phi(u) \right\|_{\mathcal{H}} \quad (11)$$
where $\phi(\cdot)$ represents the mapping function into the reproducing kernel Hilbert space $\mathcal{H}$ for any $u \in U$. The smaller $\mathrm{MMD}_{i}$ is, the closer sample $x_{i}$ lies to the rest of the samples, and vice versa: a large value means the sample deviates from the sample distribution. When $\mathrm{MMD}_{i} = 0$, $U$ and $U_{\setminus i}$ follow the same distribution. AdaBoost does not place high performance requirements on weak classifiers and only needs them to be better than random guessing, so $\mathrm{MMD}_{i}$ is projected to $[0, 0.5)$ by (12):

$$\varepsilon_{i} = \frac{1}{1 + e^{-\mathrm{MMD}_{i}}} - \frac{1}{2} \quad (12)$$
Therefore, the weight $\alpha_{i}$ of feature $f_{\varphi}(x_{i})$ is rewritten as:

$$\alpha_{i} = \frac{1}{2} \ln \frac{1 - \varepsilon_{i}}{\varepsilon_{i} + \epsilon} \quad (13)$$

Noting that $\varepsilon_{i}$ may be 0, a sufficiently small positive number $\epsilon$ is added to the denominator term in (13).
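As a concrete illustration, the following sketch computes the leave-one-out MMD and the resulting weights of Equations (11) to (13), assuming an identity feature mapping $\phi$ (i.e., a linear-kernel MMD) and the sigmoid projection reconstructed above; the function name and guard constant are illustrative.

```python
import torch

def adaptive_sample_weights(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of Eqs. (11)-(13). features: (n, d) embeddings f_phi(x_i)."""
    n = features.shape[0]
    full_mean = features.mean(dim=0)                 # mean embedding of U
    # Leave-one-out means, i.e., the mean of U_{\i} for every i (Eq. (11)).
    loo_means = (full_mean * n - features) / (n - 1)
    mmd = (full_mean - loo_means).norm(dim=1)        # MMD(U, U_{\i}) per sample
    err = torch.sigmoid(mmd) - 0.5                   # project into [0, 0.5), Eq. (12)
    # AdaBoost-style weight with a small constant guarding err = 0 (Eq. (13)).
    alpha = 0.5 * torch.log((1.0 - err) / (err + eps))
    return alpha
```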
Define the support set $S = \{(x_{i}, y_{i})\}_{i=1}^{N}$, where $y_{i} \in \{1, 2, \ldots, L\}$ is the label for the $L$-class samples. $S_{l}$ is the set of samples with labeled class $l$, and the prototype $\mathbf{c}_{l}$ of support class $l$ can be calculated by

$$\mathbf{c}_{l} = \frac{\sum_{(x_{i}, y_{i}) \in S_{l}} \alpha_{i} f_{\varphi}(x_{i})}{\sum_{(x_{i}, y_{i}) \in S_{l}} \alpha_{i}} \quad (14)$$
Assuming that a sample $x$ needs to be classified, the feature extractor is first used to map it into the feature space, and then the Euclidean distance $d(\cdot, \cdot)$ is adopted to compute the distances between $f_{\varphi}(x)$ and the $L$ prototype vectors, respectively. The probability that sample $x$ belongs to category $l$ is:

$$p_{\varphi}(y = l \mid x) = \frac{\exp(-d(f_{\varphi}(x), \mathbf{c}_{l}))}{\sum_{l'=1}^{L} \exp(-d(f_{\varphi}(x), \mathbf{c}_{l'}))} \quad (15)$$

Therefore, the loss function $J(\varphi)$ of the query set $Q$ is:

$$J(\varphi) = -\frac{1}{|Q|} \sum_{(x, y) \in Q} \log p_{\varphi}(y \mid x) \quad (16)$$
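The weighted prototype of Equation (14) and the distance-based classification of Equations (15) and (16) can then be sketched as follows; the function names are illustrative, and plain Euclidean distance is used as stated in the text.

```python
import torch
import torch.nn.functional as F

def weighted_prototypes(embeddings, labels, weights, num_classes):
    """Sketch of Eq. (14): a weight-averaged prototype per class.
    Assumes every class appears at least once in the support set."""
    protos = []
    for l in range(num_classes):
        mask = labels == l
        w = weights[mask].unsqueeze(1)           # alpha_i for class l
        protos.append((w * embeddings[mask]).sum(0) / w.sum())
    return torch.stack(protos)                   # (L, d)

def query_loss(query_emb, query_labels, protos):
    """Sketch of Eqs. (15)-(16): softmax over negative Euclidean distances,
    followed by the negative log-likelihood on the query set."""
    dists = torch.cdist(query_emb, protos)       # (n_query, L)
    log_p = F.log_softmax(-dists, dim=1)         # Eq. (15)
    return F.nll_loss(log_p, query_labels)       # Eq. (16)
```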
3.3. Pseudo-Labeling Multi-Screening Strategy
For data-driven classification models, the number of trainable samples affects the accuracy of the model, especially in semi-supervised learning. In this paper, a pseudo-labeling multi-screening strategy based on uncertainty and classification probability is proposed, which can effectively expand the training set and improve the model training accuracy.
Network prediction error and model output uncertainty are positively correlated: the lower the uncertainty of the model, the smaller its prediction error and the higher its accuracy [36]. Therefore, model uncertainty is used as one of the indicators for pseudo-label screening. The uncertainty of the output for each sample is calculated using the Monte Carlo dropout model: by keeping the dropout layers active during forward propagation in the testing phase, Monte Carlo dropout generates output distributions that emulate the variability observed across different network architectures. The predictive outcome and the model uncertainty are then calculated as the mean and the statistical variance of these outputs, respectively.
Supposing $\{\hat{y}_{t}\}_{t=1}^{T}$ are the outputs of the prototypical network over $T$ forward passes with random dropout, the uncertainty can be calculated as in Equation (17) [37]:

$$u = \frac{1}{T} \sum_{t=1}^{T} (\hat{y}_{t} - \bar{y})^{2}, \qquad \bar{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_{t} \quad (17)$$

where $\bar{y}$ represents the predicted posterior mean. The model architecture does not need to be modified, which reduces network overfitting and improves computational efficiency. The predictive mean and the model uncertainty are assessed by collecting the results of the stochastic forward passes.
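A minimal sketch of the Monte Carlo dropout procedure of Equation (17) follows; the number of passes T and the aggregation of the per-class variance into a single scalar are illustrative assumptions, not values given in the paper.

```python
import torch

def mc_dropout_uncertainty(model, x, T: int = 20):
    """Sketch of Eq. (17): T stochastic forward passes with dropout active."""
    model.train()  # keeps dropout active at test time; a stricter version
                   # would switch only the dropout modules to training mode
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=1) for _ in range(T)])
    p_bar = probs.mean(dim=0)                        # predicted posterior mean
    u = probs.var(dim=0, unbiased=False).sum(dim=1)  # Eq. (17), summed over classes
    return p_bar, u
```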
A pseudo-labeling multi-screening strategy based on the dual threshold of Monte Carlo dropout uncertainty and softmax output probability is constructed as:

$$s_{i} = \begin{cases} 1, & u_{i} < \theta_{u} \ \text{and} \ p_{i}^{\max} > \theta_{p} \\ 0, & \text{otherwise} \end{cases} \quad (18)$$

where $u_{i}$ is the estimated uncertainty of sample $x_{i}$ and $p_{i}^{\max}$ is the maximum value of its predicted probability. $\theta_{u}$ and $\theta_{p}$ represent the thresholds of uncertainty and prediction probability, respectively. When $s_{i}$ is 1, sample $i$ is selected as a pseudo-labeled sample.
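The dual-threshold rule of Equation (18) then reduces to a few lines; the default threshold values below are illustrative placeholders, not the paper's settings.

```python
import torch

def screen_pseudo_labels(p_bar: torch.Tensor, u: torch.Tensor,
                         theta_u: float = 0.05, theta_p: float = 0.9):
    """Sketch of Eq. (18): keep sample i only when u_i < theta_u
    and p_i^max > theta_p."""
    conf, labels = p_bar.max(dim=1)          # p_i^max and its class index
    selected = (u < theta_u) & (conf > theta_p)
    return labels[selected], selected        # pseudo-labels and selection mask
```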
In order to select as many trainable samples as possible while ensuring the accuracy of the pseudo-labels, a multi-accumulation strategy is proposed in this paper, outlined in the sketch below. The pseudo-labeled samples selected in the previous round are combined with the training samples to update the AWPN, and the updated AWPN is used for a new round of pseudo-label screening; this accumulates layer by layer until a stopping condition is met. If incorrect pseudo-labels are added to the shallow network, the erroneous information will accumulate through the iterations up to the deep network layers, greatly reducing the accuracy of fault diagnosis. A timely stopping criterion is therefore key to ensuring the accuracy of the model. The labeled training set is used to judge whether the network has degraded: if adding pseudo-labels reduces the accuracy of the AWPN on the training set, the false pseudo-labels are deemed to have caused network degradation, and the accumulation is stopped. The accumulation strategy thus stops in two cases: (1) the set of filtered pseudo-labeled samples is empty; (2) network degradation is detected.
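The multi-accumulation loop with its two stopping conditions can be outlined as below; `screen`, `retrain`, and `evaluate` are placeholder callables (e.g., the dual-threshold rule above, an AWPN training routine, and training-set accuracy), since the paper does not fix their interfaces at this level of detail.

```python
def accumulate_pseudo_labels(model, labeled_set, unlabeled_pool,
                             screen, retrain, evaluate):
    """Outline of the multi-accumulation strategy with its stopping rules."""
    best_acc = evaluate(model, labeled_set)
    train_set = list(labeled_set)
    while True:
        pseudo = screen(model, unlabeled_pool)    # dual-threshold screening
        if not pseudo:                            # stop case 1: empty selection
            break
        candidate = retrain(train_set + pseudo)
        acc = evaluate(candidate, labeled_set)
        if acc < best_acc:                        # stop case 2: network degradation
            break
        model, best_acc = candidate, acc          # accept this accumulation round
        train_set += pseudo
        unlabeled_pool = [s for s in unlabeled_pool if s not in pseudo]
    return model
```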
3.4. Overview of the Proposed Method
The semi-supervised few-shot learning method based on an adaptive weighted prototypical network and multiple accumulation of pseudo-labeled samples proposed in this article is summarized in the pseudo-code of Algorithm 1.
Algorithm 1: The proposed learning strategy