1. Introduction
Nowadays, with the advancement of distributed hardware systems, substantial data are typically collected and stored at multiple nodes across different geographical regions [1,2,3,4,5,6,7,8,9,10]. To handle such data, distributed learning (DL), in which multiple nodes collaboratively perform a global-like task based on their own local data and limited information provided by one-hop neighboring nodes, has been developed and has attracted much research attention. DL is commonly used in many areas, such as anomaly detection [3], the industrial Internet of Things [4], environmental monitoring [8], and data mining [7,10], due to its excellent learning performance and adaptability to node failures.
For these DL algorithms, having a sufficient number of high-quality data with complete features is a precondition for obtaining satisfactory learning performance. However, due to various causes, such as absent features or acquisition failures, the collected data vectors often contain a certain number of missing features [6]. A lack of high-quality data can degrade classification performance. Recent years have witnessed some efforts to tackle this problem. Most current missing data classification (MDC) methods address the tasks of missing feature imputation and predictive model induction independently. To be specific, they first utilize an imputation method, such as mean imputation [11], kNN imputation [12,13], logistic regression imputation [14], or auto-encoder imputation [15,16,17], to induce an imputation model that fills in the missing features at an early stage, and then learn the classifier based on the recovered features. Although extensive experiments have shown that these data imputation methods can boost learning performance to some extent, a certain amount of training data with complete features and precise labels is required to induce the imputation model, which may be infeasible in many real applications. In addition to the above methods, in the literature [18,19,20], a probabilistic generative model was designed to seek the optimal completion solution based on the learned model. Although such methods do not require complete data samples to learn the imputation model, the missing features of some training data need to be pre-imputed before training the classifier. Since the accuracy of pre-imputation depends heavily on accurate supervision information, it is difficult to obtain good learning performance when a large amount of training data is unlabeled or ambiguously labeled. Another strategy, suggested in [21], was to lessen the negative impact of missing features on classification performance by reducing the importance of training data with many missing features. However, this method does not consider information about the data distribution when weighting the training data, which may degrade the induced classifier's performance. Lately, a few novel MDC methods have been proposed [6,22,23] that jointly address the tasks of missing feature imputation and classifier induction in an integrated framework. These MDC approaches usually require accurate label information to impute missing features and induce classifiers, which implies an underlying assumption that all the available labels of incomplete data are error-free. However, this assumption does not hold in many scenarios.
Actually, gathering a substantial quantity of data samples with missing features is simple, but labeling these incomplete data without any ambiguity is an expensive and time-consuming process. It is more likely that only partially labeled data annotated with a series of ambiguous labels can be obtained. Therefore, in such cases, it is preferable to exploit the valuable information carried by the ambiguous labels to perform missing feature imputation.
Recently, partial label learning (PLL), which induces a classifier from training data annotated with ambiguous labels, has emerged as a new approach in machine learning. Most traditional PLL algorithms are designed for multi-class classification (MCC) [24,25,26] and typically employ disambiguation procedures that recover the correct label from the candidate label set and then train the classifier based on the recovered labels. For example, two novel instance-dependent PLL algorithms were recently proposed in [27], which characterize the latent label distribution to disambiguate the ambiguous labels by inferring the variational posterior density and exploiting the mutual information between the label and feature spaces. In [9], a distributed semi-supervised PLL algorithm was developed in which the model parameters, labeling confidence, and weights of the training data are iteratively updated in a collaborative manner. Recently, PLL was extended to deal with the problem of multi-label classification (MLC) [28,29,30,31,32,33]. For example, in [33], a distributed partial multi-label learning method was introduced that identifies reliable labeling information from a series of ambiguous labels based on globally common basic data and induces a predictive model by making full use of the identified credible label information. Although these approaches have been shown to be effective, a significant limitation is that they do not account for the effect of missing features on the performance of the disambiguation strategy.
Taking the above considerations into account, in this article, as shown in Figure 1, the problem of the distributed classification of partially labeled incomplete data is considered. For this study, an integrated framework was designed that jointly addresses the tasks of missing feature imputation and predictive classifier induction over a network. Specifically, the main contributions of this paper are summarized as follows:
1. In the proposed algorithm, a distributed, information-theoretic-learning-based (ITL-based) data imputation method is developed based on the Gaussian mixture model (GMM), which exploits the weakly supervised information of ambiguous labels to guide model parameter estimation. The missing features can then be imputed by computing their conditional expectation given the observed features and the estimated parameters (a minimal sketch of this step is given after this list).
2. To induce the classifier based on the imputed data, information-theoretic measures, including the logistic loss with respect to the imputed data and the mutual information with respect to the cluster centers of the Gaussian components, are used to design the cost function. By using a random feature map in place of the kernel feature map when constructing the discriminant function, a non-linear multi-class classifier can be learned distributively. Moreover, to make the estimated labeling confidence more suitable for guiding missing feature imputation and model induction, we introduce a novel normalized sigmoid function to scale the value of the labeling confidence.
3. We alternate between the two steps above in a collaborative manner and thereby develop the dPMDC algorithm, which addresses the distributed classification of training data characterized by partially available features annotated with ambiguous labels.
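To make the first contribution concrete, the following is a minimal, centralized sketch of conditional-expectation imputation under a fitted GMM. It assumes the mixture parameters (weights, means, and covariances) have already been estimated; the function name and the NumPy/SciPy usage are illustrative, and the actual dPMDC update estimates these parameters distributively.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_impute(x, observed, weights, means, covs):
    """Impute the missing entries of x by the GMM conditional expectation
    E[x_m | x_o], mixing the per-component linear regressions with the
    posterior responsibilities computed from the observed block."""
    o = np.where(observed)[0]          # observed feature indices
    m = np.where(~observed)[0]         # missing feature indices
    K = len(weights)
    resp = np.zeros(K)
    cond_means = np.zeros((K, len(m)))
    for k in range(K):
        mu_o, mu_m = means[k][o], means[k][m]
        S_oo = covs[k][np.ix_(o, o)]
        S_mo = covs[k][np.ix_(m, o)]
        # responsibility of component k given only the observed features
        resp[k] = weights[k] * multivariate_normal.pdf(x[o], mu_o, S_oo)
        # conditional mean of the missing block under component k
        cond_means[k] = mu_m + S_mo @ np.linalg.solve(S_oo, x[o] - mu_o)
    resp /= resp.sum()
    x_imp = x.copy()
    x_imp[m] = resp @ cond_means       # mixture of conditional means
    return x_imp
```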
The subsequent sections of this article are organized as follows: Section 2 formulates the problem of the distributed classification of partially labeled incomplete data and presents relevant preliminaries. Then, Section 3 describes the technical details of the proposed dPMDC algorithm. Following this, Section 4 reports the experimental results of the dPMDC algorithm and state-of-the-art methods on multiple datasets. Finally, we conclude this paper in Section 5.
4. Experiment
In this section, to validate the efficacy of the proposed approach, a series of experiments on several artificial and real-world PLL datasets are described, including the Double Moon [9], mHealth [41], Gas Drift [41], Pendigits [41], Segmentation [41], Ecoli [41], Vertebral [41], Lost [42], Birdsong [43], and MSRCv2 [44] datasets.
The profiles of these datasets are shown in Table 4. It is noted that the seven artificial PLL datasets (Double Moon, mHealth, Gas Drift, Pendigits, Segmentation, Vertebral, and Ecoli) were generated by adding a series of noisy labels into the set of candidate labels under the configuration of two controlled parameters, s and ε [9]. In this context, s represents the number of noisy labels inside the candidate label set and ε represents the co-occurrence probability between a coupling noisy label and the correct label. That is, for each partially labeled data sample, a randomly selected coupling noisy label and the correct label occur as a pair with probability ε. For the three real-world PLL datasets (Lost, Birdsong, and MSRCv2), the original labels of the training data were already ambiguous and, thus, no extra noisy labels were added to the set of candidate labels.
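As an illustration of this (s, ε) corruption protocol, the snippet below sketches one plausible way to build the candidate label sets; the exact procedure of [9] may differ in details such as how the coupling label is drawn.

```python
import numpy as np

def corrupt_labels(y, num_classes, s, eps, seed=0):
    """Candidate label sets under the (s, eps) protocol: each sample keeps
    its true label, a randomly chosen coupling label co-occurs with it
    with probability eps, and the set is padded with random noisy labels
    until it contains s false positives (requires num_classes > s)."""
    rng = np.random.default_rng(seed)
    candidates = []
    for yi in y:
        others = [c for c in range(num_classes) if c != yi]
        cand = {yi}
        coupling = int(rng.choice(others))   # the coupling noisy label
        if rng.random() < eps:
            cand.add(coupling)
        while len(cand) < s + 1:             # pad to s noisy labels in total
            cand.add(int(rng.choice(others)))
        candidates.append(sorted(cand))
    return candidates
```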
To investigate the impact of different proportions of missing features on classification performance under the missing completely at random (MCAR) assumption, a metric ρ, namely, the percentage of missing features relative to the total number of features in the training dataset, was defined.
For each experiment, a total of 50 Monte Carlo cross-validation simulations were conducted, and the average results of these simulations are reported herein. Furthermore, in each Monte Carlo simulation, every dataset was randomly partitioned into 10 folds; the training phase used 8 folds, while the testing phase used the remaining 2 folds. To simulate the performance of a distributed network, an interconnected network consisting of J = 10 nodes and 23 edges was randomly generated. All training data were randomly partitioned into J equal-sized parts and allocated to these nodes. To conduct the following experiments, the data instances were preprocessed in the initial state, i.e., the attribute values were normalized into [0, 1].
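The random topology and data partition can be reproduced along the following lines; the use of networkx and the resample-until-connected loop are assumptions about the generation procedure, which the text does not fully specify.

```python
import networkx as nx
import numpy as np

def random_connected_network(num_nodes=10, num_edges=23, seed=0):
    """Resample G(n, m) random graphs until a connected topology appears."""
    while True:
        g = nx.gnm_random_graph(num_nodes, num_edges, seed=seed)
        if nx.is_connected(g):
            return g
        seed += 1

def partition_training_data(X, num_nodes=10, seed=0):
    """Randomly split the training set into equal-sized shards, one per node."""
    idx = np.random.default_rng(seed).permutation(len(X))
    return [X[part] for part in np.array_split(idx, num_nodes)]
```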
At each trial of the simulation, the trade-off parameters, the step size, and the initial weight parameters W of the proposed dPMDC were fixed in advance. The mean vector of each Gaussian component was randomly initialized to a vector uniformly distributed in the range (0, 0.5), and the covariance matrix was initialized to the D-dimensional identity matrix I_D. Furthermore, in the initial state, the mixing parameters of the Gaussian components and the labeling confidences were initialized uniformly.
Taking “mHealth” and “Gas Drift” as representatives of all the considered datasets, we depict the learning curves of the imputation error and classification accuracy of the proposed dPMDC algorithm in Figure 3 and Figure 4. To test the robustness of the algorithm under different network sizes, we also compare the curves of the imputation error and classification accuracy of the proposed dPMDC on multiple distributed networks. It should be noted that in this experiment, the distributed network topology was characterized by two controlled parameters: the number of nodes J and the number of edges.
By observing the simulation results presented in Figure 3 and Figure 4, we can see that the learning curves of the imputation error converged significantly faster than those of the classification accuracy. The imputation error decreased rapidly during the first 15 iterations and then converged to a stable state. The learning curve of the classification accuracy was relatively smooth: during the first 50 iterations, the classification accuracy steadily increased, and after about 70 iterations, it gradually converged to the optimal value. We can also observe that the learning curves of the proposed algorithm under different network topologies, for both the imputation error and the classification accuracy, were very close to each other. These simulation results show that the size of the network did not significantly affect the learning performance of our proposed algorithm.
Furthermore, we also compare the CPU times of the proposed algorithm at each individual node under different networks on the “mHealth” and “Gas Drift” datasets in Figure 5. To ensure fairness in this experiment, we set the amount of training data at each node to be the same. From Figure 5, we can see that the CPU time of the proposed algorithm at each node remained nearly unchanged. This result indicates that the network topology does not affect the computational efficiency of the algorithm, as long as the data size at a single node remains unchanged.
Additionally, to assess the robustness of the proposed algorithm against the initial settings of the model parameters, we compared the learning performance of the proposed algorithm under different parameter settings. To be specific, two cases were taken into consideration.
Case 1: We maintained the original settings. That is, the mean vector of each Gaussian component was randomly initialized to a vector uniformly distributed in the range (0, 0.5), the covariance matrix of each Gaussian component was initialized to the D-dimensional identity matrix I_D, and the weight parameters were initialized as in the original configuration.
Case 2: We reset the initial settings. That is, the mean vector of each Gaussian component was randomly initialized to a complete-feature training sample of the local node, and the covariance matrix of each Gaussian component and the weight parameters were re-initialized to different values.
We depict the learning curves of the imputation error and the classification accuracy of the proposed algorithm with the different initial settings in Figure 6 and Figure 7. It should be noted that, to distinguish them from each other, we refer to Case 1 as dPMDC with the original initial settings and Case 2 as dPMDC with the new initial settings. From the simulation results in Figure 6 and Figure 7, we can observe that the curves of the two variants almost overlap, indicating that our proposed algorithm is insensitive to the initial settings of the model parameters.
Moreover, we investigated the impact of the two trade-off parameters in the cost function and the discretization level U of the random feature map on the classification performance of our proposed algorithm using the “mHealth” and “Gas Drift” datasets. In this experiment, we investigated the performance change of the proposed algorithm by varying the value of one parameter while keeping the others unchanged. The changing trends of the two trade-off parameters were similar. The simulation results presented in Figure 8 indicate that, as long as their values were set within [0.1, 1] and [0.01, 0.1], respectively, good learning performance could be obtained. We can also see that the classification accuracy of the proposed method gradually improved as the discretization level U of the random feature map increased. A possible reason is as follows: as the value of U increases, the random feature map gives a more precise approximation of the kernel feature map, which boosts the classification performance of the induced classifier to some extent. When the value of U exceeded 4, however, the performance improvement resulting from further increments of U diminished progressively. Since larger values of U lead to higher computational complexity and communication cost, an appropriate choice is U = 4, which achieves a balance between classification performance and computational complexity.
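The kernel approximation underlying this experiment can be illustrated with standard random Fourier features (Rahimi and Recht, 2007). How exactly the discretization level U enters the construction in dPMDC (e.g., as a quantization level of the random features) is not fully specified here, so the sketch below simply treats the random feature dimension as the fidelity knob; the parameter names are assumptions.

```python
import numpy as np

def random_fourier_features(X, num_features, gamma=1.0, seed=0):
    """Map X to a random feature space whose inner products approximate
    the Gaussian kernel exp(-gamma * ||x - x'||^2): a finer map (more
    features) yields a more precise kernel approximation."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # spectral sampling of the Gaussian kernel: w ~ N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```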
Considering that the value of K had a significant effect on both the imputation error and the classification accuracy, we investigated their changing trends versus K, shown in Figure 9. The simulation results indicate that the learning performance of the proposed dPMDC gradually improved as the number of Gaussian mixture components K increased. When the value of K was greater than 30, the extent of the learning performance improvement rapidly decreased. These trends were similar to those of U. Therefore, we set the number of Gaussian mixture components K to 30 to strike a balance between computational complexity and learning performance.
Furthermore, we investigated the learning performance of the proposed dPMDC algorithm on data with different distributions. Given the challenge of characterizing the distribution of existing real datasets, we adopted a common synthetic dataset known as “Double Moon”. Following the operations in [9], we randomly generated 20,000 training data samples and divided the upper and lower moons into two classes, as shown in Figure 10. Then, a specific number of noisy labels was added to the candidate labels under fixed values of the controlled parameters s and ε. To assess the learning performance of the proposed algorithm, we added noise from different distributions to the training data. Specifically, the following three cases were considered (a generation sketch is given after the list).
Case 1: We added zero-mean Gaussian noise to the training data such that the signal-to-noise ratio was 15 dB.
Case 2: We added 0–1 noise with a magnitude of 0.3 and a probability of 0.5 to the training data.
Case 3: We added uniformly distributed noise with a magnitude of 0.3 to the training data.
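The three noise models can be generated as follows. This is a sketch under the stated magnitudes; the sign convention of the 0–1 noise (additive and non-negative) is an assumption.

```python
import numpy as np

def add_noise(X, case, seed=0):
    """Corrupt training features according to the three cases above."""
    rng = np.random.default_rng(seed)
    if case == 1:    # zero-mean Gaussian noise at 15 dB SNR
        signal_power = np.mean(X ** 2)
        noise_std = np.sqrt(signal_power / 10 ** (15 / 10))
        return X + rng.normal(scale=noise_std, size=X.shape)
    if case == 2:    # 0-1 noise: each entry gains 0.3 with probability 0.5
        mask = rng.random(X.shape) < 0.5
        return X + 0.3 * mask.astype(float)
    if case == 3:    # uniform noise with magnitude 0.3
        return X + rng.uniform(-0.3, 0.3, size=X.shape)
    raise ValueError(f"unknown case: {case}")
```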
For clarity, we depict the training data after adding noise in Figure 10.
The learning curves of the imputation error and classification accuracy of the proposed dPMDC for the three cases are presented in Figure 11. We can see that the learning curves were quite similar except for the first few iterations and all converged to ideal levels, indicating that the GMM can effectively characterize training data with different distributions.
To validate the efficacy of the proposed approach in imputing the missing features, we investigated the imputation error of the proposed dPMDC algorithm under different types of missingness, including MCAR, missing at random (MAR), and missing not at random (MNAR), defined as follows (a generation sketch is given after the definitions).
MCAR: The missingness of a feature was completely independent of the variable values. In the following experiments, we used ρ to measure the probability of missing values.
MAR: The probability of a feature being missing was related to the observed variables and unrelated to the characteristics of the unobserved data. In the following experiments, we randomly selected a pair of strongly correlated features; when one feature was larger than a given threshold, the coupled feature was missing with probability ρ.
MNAR: The missingness of a feature depended entirely on the unobserved variable itself. In the following experiments, we assumed that when the value of an attribute was greater than a given threshold, it was missing with probability ρ.
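A sketch of the three missingness mechanisms is given below; the threshold value and the choice of the correlated feature pair are illustrative assumptions, since the text leaves them unspecified.

```python
import numpy as np

def missing_mask(X, mechanism, rho, threshold=0.5, pair=(0, 1), seed=0):
    """Return a boolean mask (True = missing) under MCAR, MAR, or MNAR,
    following the three protocols described above."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(X.shape, dtype=bool)
    if mechanism == "MCAR":      # each entry missing independently w.p. rho
        mask = rng.random(X.shape) < rho
    elif mechanism == "MAR":     # one feature drives missingness of its partner
        drive, dep = pair
        hit = X[:, drive] > threshold
        mask[hit, dep] = rng.random(hit.sum()) < rho
    elif mechanism == "MNAR":    # an entry's own value drives its missingness
        mask = (X > threshold) & (rng.random(X.shape) < rho)
    return mask
```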
In the following experiments, for all the artificially generated PLL datasets (Double Moon, mHealth, Gas Drift, Pendigits, Ecoli, Segmentation, and Vertebral), a specific number of noisy labels was added to the candidate labels under fixed values of the controlled parameters s and ε. For the three real-world PLL datasets (Lost, Birdsong, and MSRCv2), no extra noisy labels were added.
For the purpose of comparison, the imputation errors of other existing imputation methods, including kNN imputation [13], subspace learning (SL) imputation [6], logistic regression (LR) imputation [14], extreme learning machine auto-encoder (ELM-AE) imputation [15], multi-layer auto-encoder (MAE) imputation [17], support vector regression imputation-based support vector machine (SVR-SVM) [22], and missing-data-importance-weighted auto-encoder imputation (MIWAE) [23], under the MCAR, MAR, and MNAR assumptions were also evaluated. All the simulation results are shown in Table 5, Table 6 and Table 7.
From Table 5 and Table 6, it can be observed that the comparison algorithms exhibited similar imputation performance under the MCAR and MAR assumptions. The following observations can be made. First, for the same degree of ρ, the kNN imputation method performed the worst in most cases, which indicates that simply exploiting the attribute values of neighboring data cannot accurately describe the global data distribution. The SL, LR, ELM-AE, and SVR-SVM imputation methods induce the imputation model based on the global data distribution and thus performed better than the kNN imputation method. The MAE imputation method is based on an auto-encoder framework and can achieve relatively good imputation accuracy. However, as a supervised algorithm, it requires sufficient complete data with unambiguous labels to train the auto-encoder network and fine-tune the imputation results. When the value of ε was 0.3, the ambiguous labels had a negative effect on its imputation accuracy. Similarly, MIWAE outperformed most comparison algorithms by inducing the imputation model based on the auto-encoder framework. Even so, its imputation performance was still inferior to that of our suggested algorithm due to the coupling noisy labels. Moreover, our suggested dPMDC performed significantly better than the other existing imputation approaches. These results indicate the superiority of the proposed imputation method in characterizing the missing feature distribution under the MCAR and MAR assumptions.
By observing Table 7, we notice that, unlike the results in Table 5 and Table 6, our proposed algorithm had no significant performance advantage under the MNAR assumption. Specifically, the imputation error of the proposed dPMDC algorithm was lower than those of kNN, SVR-SVM, LR, ELM-AE, and SL for all ten datasets. The imputation performance of the proposed dPMDC algorithm was better than that of MAE for six datasets and better than that of MIWAE for five datasets. We analyze the possible reasons in detail below. Under the MNAR assumption, all the features with values larger than the threshold were missing, making it challenging to characterize the distribution of these features. Therefore, it was difficult for our algorithm to achieve learning results as excellent as in the MCAR and MAR cases. Nevertheless, owing to its exploitation of weakly supervised information, the performance of our proposed algorithm was better than those of most comparison algorithms and close to those of the MAE and MIWAE methods.
To highlight the superiority of the proposed algorithm in imputing the missing features, the Friedman test was implemented to determine whether there was a significant difference among the comparison algorithms [45]. Based on the number of comparison methods and the number of datasets used, we computed the Friedman statistic and its associated critical value of 2.16. The value of the Friedman statistic was larger than the critical value. This result indicates that the null hypothesis, stating that there is no significant difference among the evaluated comparison algorithms, was rejected at the 0.05 significance level.
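For reference, the Friedman statistic (in its Iman–Davenport F form, as commonly used together with the Bonferroni–Dunn post hoc test [45]) and the critical difference used in Figure 12 can be computed as follows; q_alpha is the tabulated Bonferroni–Dunn critical value at the chosen significance level and is assumed given.

```python
import numpy as np

def friedman_and_cd(ranks, q_alpha):
    """Friedman statistic (Iman-Davenport F variant) and the Bonferroni-Dunn
    critical difference, from an (N datasets x k algorithms) rank matrix."""
    N, k = ranks.shape
    avg_rank = ranks.mean(axis=0)          # average rank of each algorithm
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(avg_rank ** 2) - k * (k + 1) ** 2 / 4)
    f_stat = (N - 1) * chi2 / (N * (k - 1) - chi2)   # compare to F(k-1, (k-1)(N-1))
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))    # Bonferroni-Dunn CD
    return f_stat, cd, avg_rank
```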
Additionally, the Bonferroni–Dunn test was employed to determine whether our suggested method significantly surpassed the other comparison algorithms [45]. Figure 12 illustrates the average rankings of all comparison algorithms, depicted by black lines, with the proposed dPMDC serving as the control algorithm in this evaluation. The comparison algorithms whose average ranks fell within one critical difference (CD) of the control algorithm at a significance level of 0.05 are connected to it by red lines; otherwise, no line links them. Figure 12 illustrates that our suggested method achieved the highest ranking among all comparison algorithms. Furthermore, the results of the Bonferroni–Dunn test demonstrate that the performance of our proposed method was significantly superior to those of the LR, ELM-AE, SVR-SVM, and kNN imputation methods.
To assess the classification performance of the proposed dPMDC, we tested its classification accuracy under the MCAR, MAR, and MNAR assumptions on the ten datasets. We also evaluated the classification performance of four MDC methods, including dS²MDC [6], ALS-SVM [21], SVR-SVM [22], and MIWAE [23], for comparison. Moreover, to highlight the advantages of the proposed algorithm in simultaneously handling ambiguous labels and missing features, the performances of three novel, state-of-the-art PLL algorithms, including dS²PLL [9], LWS [25], and VALEN [27], were simulated based on complete features imputed by an existing imputation method. According to the results of the previous experiments, the imputation error of MAE was second only to that of our proposed method and significantly lower than those of the other comparison methods, so we utilized MAE to impute the missing features of the training data in the following simulations. To distinguish them from the original algorithms, we respectively use im-dS²PLL, im-LWS, and im-VALEN to denote them. All the simulation results are given in Table 8, Table 9 and Table 10, and the average ranks of all the comparison algorithms are depicted in Figure 13.
The simulation results in Table 8, Table 9 and Table 10 indicate that the performance of ALS-SVM was significantly inferior to those of the other comparison algorithms. The main reason is that ALS-SVM only reduces the weight of data with a high proportion of missing attributes without filling in the missing attributes. Unlike ALS-SVM, SVR-SVM trains the multi-class classifier based on training data imputed by the induced SVR model, making its performance significantly better than that of ALS-SVM. However, the simulation results in Table 5, Table 6 and Table 7 show that the performance of the induced SVR imputation model was relatively poor, which, in turn, had a negative impact on the performance of the SVM classifier. The dS²MDC algorithm performed better than ALS-SVM since it benefits from the interplay of missing feature imputation and model induction in the learned subspace. On the other hand, suffering from the negative effect of noisy labels, its performance was worse than that of our proposed algorithm. It can also be observed that the im-dS²PLL, im-LWS, and im-VALEN algorithms achieved superior performance compared to ALS-SVM by imputing the missing features via the MAE imputation method and eliminating the effect of ambiguous labels using their disambiguation strategies. However, their performances were still poorer than that of the proposed dPMDC algorithm. MIWAE adopts an integrated framework that jointly addresses missing data imputation and classifier induction; benefiting from this, its performance ranked second among all the comparison algorithms. Additionally, our proposed dPMDC algorithm performed best among the eight comparison algorithms in almost all cases. Although our algorithm had a relatively high imputation error under the MNAR assumption, the classifier's performance still remained at a high level due to the interaction between the imputation model and the classifier.
Similar to the previous experiment, we used the Friedman test [45] to verify the performance difference among the eight compared algorithms. With the significance level set to 0.05, we calculated the Friedman statistic and its associated critical value. The Friedman statistic was significantly larger than the critical value, indicating that there was a significant difference among the evaluated comparison algorithms.
Additionally, the Bonferroni–Dunn test was employed to compare the performance difference between our suggested method and the other comparison algorithms [45]. All the results of the Bonferroni–Dunn test are depicted in Figure 13, which illustrates that the learning performance of our suggested method was significantly better than those of the im-VALEN, dS²MDC, im-dS²PLL, SVR-SVM, and ALS-SVM algorithms.
To test the robustness of the proposed algorithm against coupling noisy labels, we compared the classification accuracy of the comparison algorithms versus ε on four artificial PLL datasets (Pendigits, Ecoli, Segmentation, and Vertebral). From Figure 14 and Figure 15, we can see that as the co-occurrence probability ε between the correct label and the coupling noisy label gradually increased, the classification performance of all the considered algorithms gradually deteriorated. Among all the comparison algorithms, our proposed algorithm demonstrated superior performance, especially when the value of ε was smaller than 0.3. When the proportion of coupling noisy labels was larger than 0.3, the performance advantage of the dPMDC algorithm gradually diminished. These simulation results validate the benefits of our proposed algorithm in handling a small proportion of coupling noisy labels.
Finally, we evaluated the CPU times of all the considered algorithms on the ten datasets, shown in Table 11. It should be noted that for the distributed learning algorithms, all the computation operations were performed at the J nodes over the network, while for the centralized learning algorithms, all the computation operations were concentrated at a single fusion node. Owing to distributed parallel computation, the CPU times of all the distributed learning methods were significantly lower than those of the centralized learning methods. Compared with im-dS²PLL, our proposed dPMDC and dS²MDC require fewer computations during the induction of the imputation model. Therefore, the CPU times of the proposed dPMDC and dS²MDC were significantly shorter than those of im-dS²PLL.