1. Introduction
Deep learning has been a focus of machine learning research since it was proposed by Hinton et al. [1]. As a typical deep learning algorithm, the stacked autoencoder (SAE) [2] extracts hierarchical abstract features from samples with autoencoders (AEs) and then maps the abstract features to the output with a classifier or regression algorithm. Compared with traditional neural networks, the multilayer structure of the SAE provides a strong feature extraction capability and avoids the limitations of traditional machine learning algorithms in manual feature selection [3]. Meanwhile, the greedy layer-wise training of the SAE determines the network parameters layer by layer and accelerates convergence [4]. Owing to this excellent performance, the SAE has been applied to mechanical fault diagnosis [5,6], disease association prediction [7,8] and network intrusion detection [9,10].
The SAE has been extensively studied, and many improvements have been introduced into it. Vincent et al. [11] combined the SAE with a local denoising criterion and proposed the stacked denoising autoencoder (SDAE). Different from the SAE, the SDAE reconstructs noise-free samples from noise-corrupted inputs, which enhances the robustness of the abstract features. To obtain a sparse feature representation, Ng et al. [12] integrated a sparsity constraint into the SAE and proposed the stacked sparse autoencoder (SSAE). The SSAE reduces the activation of the hidden nodes and uses only a few network nodes to extract representative abstract features. Masci et al. [13] proposed the stacked convolutional autoencoder (SCAE) by replacing the fully connected layers with convolutional and pooling layers to preserve the spatial information of the training images. By introducing the attention mechanism into the SAE, Tang et al. [14] constructed the stacked attention autoencoder (SAAE) to improve the feature extraction capability. Tawfik et al. [15] utilized the SAE to extract unsupervised features and fuse multimodal medical images. In addition, many other methods [16,17,18,19] have been proposed for the development and application of the SAE.
However, manually labeling large numbers of samples is impractical because of limited expert knowledge and efficiency. In many fields, such as speech emotion recognition [20], medical image classification [21] and remote sensing image detection [22], the raw training samples are usually only partially labeled, and the majority of samples are unlabeled. The supervised learning of the SAE requires sample labels to train the network and cannot exploit the feature and category information contained in unlabeled samples, making it difficult to improve its generalization performance on semi-supervised classification tasks. To tackle this problem, some recent studies have combined the SAE with semi-supervised learning. For the classification of partially labeled network traffic samples, Aouedi et al. [23] proposed the semi-supervised stacked autoencoder (Semi-SAE) to realize semi-supervised learning of the SAE. This method performs unsupervised feature extraction on all samples in the pre-training stage and fine-tunes the network parameters based on the classification loss of the labeled samples. By introducing the sparsity criterion into the Semi-SAE, Xiao et al. [24] proposed the semi-supervised stacked sparse autoencoder (Semi-SSAE). A Kullback–Leibler (KL) divergence regularization term added to the loss function improves the sparsity of the network parameters, and the Semi-SSAE has been applied to cancer prediction. These improved SAE algorithms use only part of the information from the unlabeled samples, in the feature extraction stage, and therefore have limited generalization performance on semi-supervised classification tasks.
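For reference, the KL-divergence sparsity penalty commonly used in sparse autoencoders, and referred to above, typically has the following form, where $h$ is the number of hidden nodes, $\rho$ is the target sparsity level and $\hat{\rho}_j$ is the average activation of hidden node $j$; the exact weighting and notation used in the Semi-SSAE may differ:
\[
\Omega_{\mathrm{sparse}} = \sum_{j=1}^{h} \mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \sum_{j=1}^{h} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \right]
\]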
The pseudo label [25] is a simple and efficient method for implementing semi-supervised learning. It uses a model trained on the labeled samples to predict the classes of the unlabeled samples and then trains the network on both the labeled and pseudo-labeled samples. Semi-supervised learning methods based on the pseudo label have gradually been applied to automatic speech recognition [26] and image semantic segmentation [27]. To overcome the limitations of the traditional supervised SAE and to improve the generalization performance, the pseudo label-based semi-supervised stacked autoencoder (PL-SSAE) is proposed by combining the SAE with the pseudo label. The PL-SSAE first stacks AEs to extract the feature information of all samples through layer-wise pre-training. Then, supervised classification and iterative fine-tuning on the labeled samples are used to predict the classes of the unlabeled samples. Finally, the pseudo-label regularization term is constructed, and the labeled and pseudo-labeled samples are combined to complete the training of the network. Different from the SAE and Semi-SAE, the PL-SSAE exploits both the feature information of the unlabeled samples for feature extraction and their category information for classification and fine-tuning, aiming to improve its semi-supervised learning performance. To the best of our knowledge, the PL-SSAE is the first attempt to introduce the pseudo label into the SAE, and it extends the implementation methods of the semi-supervised SAE.
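To make the three training stages just described more concrete, the following is a minimal PyTorch sketch (the experiments in Section 4 use PyTorch): layer-wise pre-training on all samples, supervised fine-tuning on the labeled samples, and pseudo-labeling followed by combined fine-tuning with a weighted pseudo-label loss. The layer sizes, sample counts, the weighting parameter alpha and all other names are illustrative assumptions rather than the exact implementation of the PL-SSAE.

```python
# Minimal PL-SSAE-style training sketch (illustrative; sizes and hyperparameters are assumptions).
import torch
import torch.nn as nn

def pretrain_layer(encoder, decoder, data, epochs=10, lr=0.01):
    """Greedy layer-wise pre-training of one AE on the features of all samples."""
    opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(decoder(encoder(data)), data)  # reconstruction loss
        loss.backward()
        opt.step()
    return encoder(data).detach()  # features fed to the next layer

# Assumed toy setup: a 784-300-100-10 network and random stand-in data.
dims = [784, 300, 100]
n_labeled, n_unlabeled, n_classes = 600, 5400, 10
x_labeled = torch.rand(n_labeled, dims[0])
y_labeled = torch.randint(0, n_classes, (n_labeled,))
x_unlabeled = torch.rand(n_unlabeled, dims[0])

# Stage 1: layer-wise pre-training on all samples (labeled + unlabeled).
x_all = torch.cat([x_labeled, x_unlabeled], dim=0)
encoders, features = [], x_all
for d_in, d_out in zip(dims[:-1], dims[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())
    dec = nn.Sequential(nn.Linear(d_out, d_in), nn.ReLU())
    features = pretrain_layer(enc, dec, features)
    encoders.append(enc)

# Stage 2: supervised fine-tuning on the labeled samples only.
model = nn.Sequential(*encoders, nn.Linear(dims[-1], n_classes))
ce = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(100):
    opt.zero_grad()
    loss = ce(model(x_labeled), y_labeled)
    loss.backward()
    opt.step()

# Stage 3: pseudo-label the unlabeled samples, then fine-tune on the combined loss.
with torch.no_grad():
    pseudo_labels = model(x_unlabeled).argmax(dim=1)
alpha = 0.1  # assumed regularization parameter balancing labeled and pseudo-labeled losses
for _ in range(100):
    opt.zero_grad()
    loss = ce(model(x_labeled), y_labeled) + alpha * ce(model(x_unlabeled), pseudo_labels)
    loss.backward()
    opt.step()
```

For simplicity this sketch uses full-batch gradient descent; the reported experiments use mini-batches of 100 samples.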
The research contributions of this study can be summarized as follows:
A new semi-supervised SAE named the PL-SSAE is proposed. By integrating the pseudo label with the SAE, the pseudo labels of the unlabeled samples are generated and the category information in the unlabeled samples is effectively exploited to improve the generalization performance of the PL-SSAE. The experimental results on various benchmark datasets show that the semi-supervised classification performance of the PL-SSAE outperforms the SAE, SSAE, Semi-SAE and Semi-SSAE.
The pseudo-label regularization term is constructed. It represents the classification loss of the pseudo-labeled samples and is added to the loss function to balance the losses of the labeled and pseudo-labeled samples and to prevent over-fitting (a hedged sketch of this loss form is given below).
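As a point of reference, a pseudo-label objective of this kind typically takes the following form, where $L$ and $U$ denote the labeled and pseudo-labeled sample sets, $\ell$ is the classification loss, $f_\theta$ is the network, $\tilde{y}_j$ is the pseudo label of sample $x_j$ and $\lambda$ is the regularization parameter; this is an assumed generic form rather than the exact loss defined for the PL-SSAE in Section 3:
\[
J(\theta) = \frac{1}{|L|} \sum_{(x_i, y_i) \in L} \ell\big(f_\theta(x_i), y_i\big) + \lambda \, \frac{1}{|U|} \sum_{x_j \in U} \ell\big(f_\theta(x_j), \tilde{y}_j\big)
\]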
The rest of this study is organized as follows. In Section 2, a brief introduction to the AE and SAE is given. In Section 3, the network structure and training process of the proposed PL-SSAE are detailed. In Section 4, the evaluation setup and the results on the benchmark datasets are presented. In Section 5, the conclusions of this study are summarized.
4. Experiments
To verify the semi-supervised classification performance of the proposed PL-SSAE, the following evaluations were designed and carried out:
Experiment 1: Influence of different hyperparameters. Observe the accuracy change in the PL-SSAE with a variable regularization parameter, variable percentage of labeled samples and variable number of hidden nodes, then analyze their influence on the classification performance of the PL-SSAE.
Experiment 2: Comparison of semi-supervised classification. Record the classification accuracy of the SAE, SSAE, Semi-SAE, Semi-SSAE and PL-SSAE with different percentages of labeled samples and compare the semi-supervised learning capability of different algorithms.
Experiment 3: Comparison of comprehensive performance. Observe the accuracy, precision, F1-measure, G-mean, training time and testing time of the SAE, SSAE, Semi-SAE, Semi-SSAE and PL-SSAE to compare their generalization performance and computational complexity.
4.1. Experimental Settings
4.1.1. Data Description
The benchmark datasets used in the evaluations are Rectangles, Convex, USPS [28], MNIST [29] and Fashion-MNIST [30]. The datasets are taken from the UCI Machine Learning Repository [31] and have been normalized. Details of the benchmark datasets are shown in Table 1.
4.1.2. Implementation Details
All evaluations were carried out in PyTorch 1.9, running on a desktop with a 3.6 GHz Intel 12700K CPU, an Nvidia RTX 3090 GPU, 32 GB of RAM and a 2 TB hard disk. To avoid uncertainty and ensure a fair comparison, all reported results are the averages of 20 repeated experiments, and the same network structure is used for the different algorithms. The network structure of each algorithm used in Experiments 2 and 3 is shown in Table 2.
The experimental details of Experiment 1 are as follows: The dataset is MNIST, the batch size is 100, the number of iterations is 100, the learning rate is 0.01 and the activation function is a ReLU function. Suppose parameter represents the percentage of labeled samples in the training data. When changing the regularization parameter and the percentage of labeled samples, the network structure is 784-300-200-100-10, the range of regularization parameter is and the range of label percentage is . When changing the number of hidden nodes, the network structure is , the range of is , the range of is , the regularization parameter is and the label percentage is .
The experimental details for Experiment 2 are as follows: The datasets are Convex, USPS, MNIST and Fashion-MNIST. The batch size is 100, the number of iterations is 100, the learning rate is 0.01, the activation function is the ReLU function and the range of the label percentage is . The sparsity parameter of the SSAE and Semi-SSAE is , and the regularization parameter of the PL-SSAE is .
The experimental details for Experiment 3 are as follows: The batch size is 100, the number of iterations is 100, the learning rate is 0.01, the activation function is the ReLU function, and the range of label percentage is . The sparsity parameter of the SSAE and Semi-SSAE is and the regularization parameter of the PL-SSAE is . For multiclass classification tasks, the precision, F1-measure and G-mean are the averages of different classes.
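As an illustration of how such macro-averaged metrics can be computed, the following is a small sketch using scikit-learn and NumPy on hypothetical labels and predictions; the G-mean is taken here as the geometric mean of the per-class recalls, which is one common multiclass definition and may differ from the exact definition used in this study.

```python
# Macro-averaged precision, F1 and G-mean for a multiclass task (illustrative sketch).
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])   # hypothetical ground-truth labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])   # hypothetical model predictions

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")    # per-class precision, averaged
f1 = f1_score(y_true, y_pred, average="macro")                  # per-class F1, averaged
recall_per_class = recall_score(y_true, y_pred, average=None)   # per-class recall (sensitivity)
g_mean = float(np.prod(recall_per_class) ** (1.0 / len(recall_per_class)))  # geometric mean of recalls

print(accuracy, precision, f1, g_mean)
```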
4.2. Influence of Different Hyperparameters
As predetermined parameters of the network, the hyperparameters affect the semi-supervised learning and classification performance of the PL-SSAE. The regularization parameter, the percentage of labeled samples and the number of hidden nodes are important hyperparameters of the PL-SSAE. The regularization parameter controls the balance between the empirical loss and the regularization loss. The percentage of labeled samples determines the numbers of labeled and pseudo-labeled samples. The number of hidden nodes controls the structural complexity and fitting ability of the network. To analyze the specific influence of different hyperparameters, a variable regularization parameter, a variable percentage of labeled samples and a variable number of hidden nodes are used to observe the accuracy change of the PL-SSAE. The generalization performance of the PL-SSAE with different regularization parameters and label percentages is shown in Figure 4. The generalization performance and training time of the PL-SSAE with different numbers of hidden nodes are shown in Figure 5.
As shown in Figure 4, the semi-supervised classification performance of the PL-SSAE varies with the regularization parameter and the percentage of labeled samples. When the label percentage is fixed, the classification accuracy of the PL-SSAE first increases and then decreases as the regularization parameter increases. When the regularization parameter is fixed, the classification accuracy increases as the label percentage increases. This is because the regularization parameter controls the importance of the pseudo-label loss in the loss function. A proper regularization parameter allows the PL-SSAE to exploit the feature and category information contained in the unlabeled samples to improve its semi-supervised learning. However, an excessively large regularization parameter causes the PL-SSAE to ignore the labeled samples, and the difference between the pseudo labels and the true labels leads to under-fitting. Therefore, it is important to choose an appropriate regularization parameter for different samples. However, the trial-and-error method used for regularization parameter selection in the PL-SSAE is time-consuming and inefficient. Meanwhile, the labeled samples are the prior knowledge of the network. As the label percentage increases, the number of labeled samples in the training data increases, and more category information improves the generalization performance of the network.
As shown in Figure 5, the classification accuracy and training time of the PL-SSAE vary with the number of hidden nodes. As the number of hidden nodes increases, the generalization performance of the PL-SSAE first increases and then decreases. The reason is that the hidden nodes control the function approximation ability of the network. As the number of hidden nodes increases, the generated pseudo labels become closer to the true labels, and more category information contained in the pseudo-labeled samples improves the semi-supervised learning of the PL-SSAE. However, too many hidden nodes lead to over-fitting of the network, and the difference between the training and testing samples causes the classification accuracy to decrease. In addition, the training time of the PL-SSAE increases with the number of hidden nodes. This is because the number of hidden nodes is positively correlated with the computational complexity of the network. When the computational power is fixed, an increase in computational complexity leads to an increase in training time.
4.3. Comparison of Semi-Supervised Classification
The semi-supervised classification performance is a direct reflection of the ability to learn from unlabeled training samples. To evaluate the semi-supervised classification performance of the different algorithms, different percentages of labeled samples are adopted, the accuracy on the testing samples is recorded and the accuracy curves are plotted. The experiment in this section compares the PL-SSAE with the SAE, SSAE, Semi-SAE and Semi-SSAE. The variation in classification accuracy of each algorithm on the datasets with different label percentages is shown in Figure 6.
As shown in Figure 6, the semi-supervised classification performance of the PL-SSAE outperforms that of the SAE, SSAE, Semi-SAE and Semi-SSAE on the different datasets. As the label percentage increases, the number of labeled training samples increases; thus, more label information is exploited to learn the function mapping, and the generalization performance of each algorithm gradually increases. The classification accuracy of the PL-SSAE is higher than that of the other algorithms at different label percentages. The reason is that the PL-SSAE is an effective semi-supervised algorithm. Compared with the supervised SAE and SSAE, the PL-SSAE uses the feature information and category information of the unlabeled samples to make the learned mapping function closer to the real mapping. Compared with the Semi-SAE and Semi-SSAE, the PL-SSAE not only utilizes the unlabeled samples for feature extraction but also exploits the pseudo-label information for classification mapping. The advantage of the PL-SSAE in semi-supervised classification becomes more apparent when the percentage of labeled samples is small. However, when there are sufficient labeled samples, the PL-SSAE tends to lose this performance advantage, and the inconsistency between the pseudo labels and the true labels reduces its generalization performance.
4.4. Comparison of Comprehensive Performance
To test the comprehensive performance of the PL-SSAE, all the benchmark datasets mentioned above are used to compare the PL-SSAE with the SAE, SSAE, Semi-SAE and Semi-SSAE. Different metrics, such as the accuracy, precision, F1-measure and G-mean, of each algorithm with different label percentages are recorded to evaluate the semi-supervised performance. The training and testing times of each algorithm are recorded to compare the computational complexity. The classification accuracy, precision, F1-measure, G-mean, training time and testing time of each algorithm are shown in Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, respectively (the numbers in bold indicate the best results). Since the experimental results are the averages of repeated experiments, the standard deviation of the results is listed after each average to reflect the performance stability of the algorithms.
As shown in Table 3, Table 4, Table 5 and Table 6, the comprehensive performance of the PL-SSAE is better than that of the SAE, SSAE, Semi-SAE and Semi-SSAE. For each dataset, the PL-SSAE achieves higher classification accuracy, precision, F1-measure and G-mean than the other algorithms at different label percentages. The reason is that the SAE and SSAE do not use the unlabeled samples in the training process, and the Semi-SAE and Semi-SSAE only use the unlabeled samples in the feature extraction process. The PL-SSAE introduces the pseudo label and makes appropriate use of the labeled samples to generate the pseudo labels of the unlabeled samples. The category information contained in the pseudo-labeled samples guides the feature extraction and class mapping of the network, which improves the semi-supervised learning and classification performance of the PL-SSAE. Moreover, the PL-SSAE integrates the pseudo-label regularization into the loss function. The balance between the classification losses of the labeled and pseudo-labeled samples avoids over-fitting and improves the generalization performance.
As shown in Table 7 and Table 8, the training time of the PL-SSAE is higher than that of the SAE, SSAE, Semi-SAE and Semi-SSAE, while the testing time of each algorithm is the same. The PL-SSAE requires additional fine-tuning on the pseudo-labeled samples; as a result, its computational complexity and training time are roughly twice those of the other algorithms. However, given the improvement in generalization performance, the increase in training time of the PL-SSAE is worthwhile. Regarding the testing speed, the testing time depends on the sample size and network structure; therefore, the different algorithms, which use the same testing samples and network structure, have the same testing speed.
5. Conclusions
To overcome the limitations of the traditional SAE with respect to unlabeled samples, this study integrates the pseudo label into the SAE and proposes a new semi-supervised SAE called the PL-SSAE. The PL-SSAE assigns pseudo labels to the unlabeled samples using the network trained on the labeled samples and adds a pseudo-label regularization term to the loss function. Different from the SAE, the PL-SSAE exploits the feature and category information contained in the unlabeled samples to guide the feature extraction and classification of the network. Evaluations on different benchmark datasets show that the PL-SSAE outperforms the SAE, SSAE, Semi-SAE and Semi-SSAE.
However, the various hyperparameters of the PL-SSAE in this study are determined by a time-consuming trial-and-error method. Thus, it would be valuable to combine the PL-SSAE with the particle swarm optimization algorithm [32] or the ant colony algorithm [33] to achieve automatic optimization of the hyperparameters. In addition, the PL-SSAE determines the pseudo labels simply by taking the maximum of the predicted class probabilities, which tends to introduce label noise. Therefore, a more effective method needs to be investigated to generate more reasonable pseudo labels.
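As a small illustration of the argmax pseudo-labeling mentioned above, the following hypothetical sketch also shows a confidence threshold, one possible (though not the only) way to filter out low-confidence, potentially noisy pseudo labels; the threshold value and variable names are assumptions.

```python
# Argmax pseudo-labeling with an optional confidence threshold (illustrative sketch).
import torch

logits = torch.randn(8, 10)                    # hypothetical network outputs for 8 unlabeled samples
probs = torch.softmax(logits, dim=1)           # predicted class probabilities
confidence, pseudo_labels = probs.max(dim=1)   # argmax pseudo label and its confidence

threshold = 0.9                                # assumed confidence threshold
keep = confidence >= threshold                 # keep only confident pseudo labels
selected_labels = pseudo_labels[keep]
print(keep.sum().item(), "of", len(pseudo_labels), "pseudo labels kept")
```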