1. Introduction
Nowadays, with the advancement of distributed hardware systems, substantial data are typically collected and stored at multiple nodes across different geographical regions [1,2,3,4,5,6,7,8,9,10]. To handle such data, distributed learning (DL), in which multiple nodes collaboratively perform a global-like task based on their own local data and limited information provided by one-hop neighboring nodes, has been developed and has attracted much research attention. DL is commonly used in many areas, such as anomaly detection [3], the industrial Internet of Things [4], environmental monitoring [8], and data mining [7,10], due to its excellent learning performance and adaptability to node failures.
For these DL algorithms, having a sufficient number of high-quality data with complete features is a precondition for obtaining satisfactory learning performance. However, due to various causes, such as absent features or acquisition failures, the collected data vectors often contain a certain number of missing features [6]. A lack of high-quality data can degrade classification performance. Recent years have witnessed some efforts to tackle this problem. Most current missing data classification (MDC) methods address the tasks of missing feature imputation and predictive model induction independently. To be specific, they first utilize an imputation method, such as mean imputation [11], kNN imputation [12,13], logistic regression imputation [14], or auto-encoder imputation [15,16,17], to induce an imputation model that fills in the missing features at an early stage, and then learn the classifier based on the recovered features. Although extensive experiments have shown that these data imputation methods can boost learning performance to some extent, a certain amount of training data with complete features and precise labels is required to induce the imputation model, which may be infeasible in many real applications. In addition to the above methods, in the literature [18,19,20], a probabilistic generative model was designed to seek the optimal completion solution based on the learned model. Although such methods do not require complete data samples to learn the imputation model, the missing features of some training data need to be pre-imputed before training the classifier. Since the accuracy of pre-imputation depends heavily on accurate supervision information, it is difficult to obtain good learning performance when a large amount of training data is unlabeled or ambiguously labeled. Another strategy, suggested in [21], was to lessen the negative impact of missing features on classification performance by reducing the importance of training data with many missing features. However, this method does not consider information about the data distribution when weighting the training data, which may degrade the induced classifier's performance. Lately, a few novel MDC methods have been proposed [6,22,23] that jointly address the tasks of missing feature imputation and classifier induction in an integrated framework. These MDC approaches usually require accurate label information to impute missing features and induce classifiers, which implies an underlying assumption that all the available labels of incomplete data are error-free. However, this assumption does not hold in many scenarios.
Actually, gathering a substantial quantity of data samples with missing features is simple, but labeling these incomplete data without any ambiguity is an expensive and time-consuming process. It is more likely that only partially labeled data annotated with a series of ambiguous labels can be obtained. Therefore, in such cases, it is preferable to exploit the valuable information carried by the ambiguous labels to perform missing feature imputation.
Recently, partial label learning (PLL), which induces a classifier from training data annotated with ambiguous labels, has emerged as a new approach in machine learning. Most traditional PLL algorithms are designed for multi-class classification (MCC) [24,25,26] and typically employ disambiguation procedures that recover the correct label from the candidate label set and then train the classifier based on the recovered labels. For example, two novel instance-dependent PLL algorithms were recently proposed in [27], which characterize the latent label distribution to disambiguate the ambiguous labels by inferring the variational posterior density and exploiting the mutual information between the label and feature spaces. In [9], a distributed semi-supervised PLL algorithm was developed in which the model parameters, labeling confidence, and weights of the training data are iteratively updated in a collaborative manner. Recently, PLL was extended to deal with the problem of multi-label classification (MLC) [28,29,30,31,32,33]. For example, in [33], a distributed partial multi-label learning method was introduced that identifies reliable labeling information from a series of ambiguous labels based on globally common basic data and induces a predictive model by making full use of the identified credible label information. Although these approaches have been shown to be effective, a significant limitation is that they do not account for the effect of missing features on the performance of the disambiguation strategy.
Taking the above considerations into account, in this article, as shown in Figure 1, the problem of the distributed classification of partially labeled incomplete data is considered. For this study, an integrated framework was designed that jointly addresses the tasks of missing feature imputation and predictive classifier induction over a network. Specifically, the main contributions of this paper are summarized as follows:
1. In the proposed algorithm, a distributed, information-theoretic-learning-based (ITL-based) data imputation method is developed based on the Gaussian mixture model (GMM), which exploits the weakly supervised information of ambiguous labels to guide model parameter estimation. The missing features can then be imputed by computing their conditional expectation given the observed features and the estimated parameters (a minimal sketch of this step is given after this list).
2. To induce the classifier based on the imputed data, information-theoretic measures, including the logistic loss with respect to the imputed data and the mutual information with respect to the cluster centers of the Gaussian components, are used to design the cost function. By using a random feature map in place of the kernel feature map when constructing the discriminant function, a non-linear multi-class classifier can be learned distributively. Moreover, to make the estimated labeling confidence more suitable for guiding missing feature imputation and model induction, we introduce a novel normalized sigmoid function to scale the value of the labeling confidence.
3. We alternate between the two steps above in a collaborative manner and thereby develop the dPMDC algorithm, which addresses the distributed classification of training data characterized by partially available features annotated with ambiguous labels.
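To make the first contribution concrete, the following is a minimal, centralized sketch of conditional-expectation imputation under a fitted GMM. It assumes the mixture parameters (weights, means, and covariances) have already been estimated; the function name and the NumPy/SciPy usage are illustrative, and the actual dPMDC update estimates these parameters distributively.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_impute(x, observed, weights, means, covs):
    """Impute the missing entries of x by the GMM conditional expectation
    E[x_m | x_o], mixing the per-component linear regressions with the
    posterior responsibilities computed from the observed block."""
    o = np.where(observed)[0]          # observed feature indices
    m = np.where(~observed)[0]         # missing feature indices
    K = len(weights)
    resp = np.zeros(K)
    cond_means = np.zeros((K, len(m)))
    for k in range(K):
        mu_o, mu_m = means[k][o], means[k][m]
        S_oo = covs[k][np.ix_(o, o)]
        S_mo = covs[k][np.ix_(m, o)]
        # responsibility of component k given only the observed features
        resp[k] = weights[k] * multivariate_normal.pdf(x[o], mu_o, S_oo)
        # conditional mean of the missing block under component k
        cond_means[k] = mu_m + S_mo @ np.linalg.solve(S_oo, x[o] - mu_o)
    resp /= resp.sum()
    x_imp = x.copy()
    x_imp[m] = resp @ cond_means       # mixture of conditional means
    return x_imp
```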
The subsequent sections of this article are organized as follows: Section 2 formulates the problem of the distributed classification of partially labeled incomplete data and presents relevant preliminaries. Then, Section 3 describes the technical details of the proposed dPMDC algorithm. Following this, Section 4 reports the experimental results of the dPMDC algorithm and state-of-the-art methods on multiple datasets. Finally, we conclude this paper in Section 5.
4. Experiment
In this section, to validate the efficacy of the proposed approach, a series of experiments on several artificial and real-world PLL datasets are described, including the Double Moon [9], mHealth [41], Gas Drift [41], Pendigits [41], Segmentation [41], Ecoli [41], Vertebral [41], Lost [42], Birdsong [43], and MSRCv2 [44] datasets.
The profiles of these datasets are shown in Table 4. It is noted that the seven artificial PLL datasets (Double Moon, mHealth, Gas Drift, Pendigits, Segmentation, Vertebral, and Ecoli) were generated by adding a series of noisy labels into the set of candidate labels under the configuration of two controlled parameters, s and ε [9]. In this context, s represents the number of noisy labels inside the candidate label set and ε represents the co-occurrence probability between a coupling noisy label and the correct label. That is, for each partially labeled data sample, a randomly selected coupling noisy label and the correct label occur as a pair with probability ε. For the three real-world PLL datasets (Lost, Birdsong, and MSRCv2), the original labels of the training data were already ambiguous and, thus, no extra noisy labels were added to the set of candidate labels.
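As an illustration of this (s, ε) corruption protocol, the snippet below sketches one plausible way to build the candidate label sets; the exact procedure of [9] may differ in details such as how the coupling label is drawn.

```python
import numpy as np

def corrupt_labels(y, num_classes, s, eps, seed=0):
    """Candidate label sets under the (s, eps) protocol: each sample keeps
    its true label, a randomly chosen coupling label co-occurs with it
    with probability eps, and the set is padded with random noisy labels
    until it contains s false positives (requires num_classes > s)."""
    rng = np.random.default_rng(seed)
    candidates = []
    for yi in y:
        others = [c for c in range(num_classes) if c != yi]
        cand = {yi}
        coupling = int(rng.choice(others))   # the coupling noisy label
        if rng.random() < eps:
            cand.add(coupling)
        while len(cand) < s + 1:             # pad to s noisy labels in total
            cand.add(int(rng.choice(others)))
        candidates.append(sorted(cand))
    return candidates
```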
To investigate the impact of different proportions of missing features on classification performance under the missing completely at random (MCAR) assumption, a metric ρ, namely, the percentage of missing features relative to the total number of features in the training dataset, was defined.
For each experiment, a total of 50 Monte Carlo cross-validation simulations were conducted, and the average results of these simulations are reported herein. Furthermore, in each Monte Carlo simulation, every dataset was randomly partitioned into 10 folds; the training phase used 8 folds, while the testing phase used the remaining 2 folds. To simulate the performance of a distributed network, an interconnected network consisting of J = 10 nodes and 23 edges was randomly generated. All training data were randomly partitioned into J equal-sized parts and allocated to these nodes. To conduct the following experiments, the data instances were preprocessed in the initial state, i.e., the attribute values were normalized into [0, 1].
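The random topology and data partition can be reproduced along the following lines; the use of networkx and the resample-until-connected loop are assumptions about the generation procedure, which the text does not fully specify.

```python
import networkx as nx
import numpy as np

def random_connected_network(num_nodes=10, num_edges=23, seed=0):
    """Resample G(n, m) random graphs until a connected topology appears."""
    while True:
        g = nx.gnm_random_graph(num_nodes, num_edges, seed=seed)
        if nx.is_connected(g):
            return g
        seed += 1

def partition_training_data(X, num_nodes=10, seed=0):
    """Randomly split the training set into equal-sized shards, one per node."""
    idx = np.random.default_rng(seed).permutation(len(X))
    return [X[part] for part in np.array_split(idx, num_nodes)]
```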
At each trial of the simulation, the trade-off parameters, the step size, and the initial weight parameters W of the proposed dPMDC were fixed in advance. The mean vector of each Gaussian component was randomly initialized to a vector uniformly distributed in the range (0, 0.5), and the covariance matrix was initialized to the D-dimensional identity matrix I_D. Furthermore, in the initial state, the mixing parameters of the Gaussian components and the labeling confidences were initialized uniformly.
Taking “mHealth” and “Gas Drift” as representatives of all the considered datasets, we depict the learning curves of the imputation error and classification accuracy of the proposed dPMDC algorithm in Figure 3 and Figure 4. To test the robustness of the algorithm under different network sizes, we also compare the curves of the imputation error and classification accuracy of the proposed dPMDC on multiple distributed networks. It should be noted that in this experiment, the distributed network topology was characterized by two controlled parameters: the number of nodes J and the number of edges.
By observing the simulation results presented in Figure 3 and Figure 4, we can see that the learning curves of the imputation error converged significantly faster than those of the classification accuracy. The imputation error decreased rapidly during the first 15 iterations and then converged to a stable state. The learning curve of the classification accuracy was relatively smooth: during the first 50 iterations, the classification accuracy steadily increased, and after about 70 iterations, it gradually converged to the optimal value. We can also observe that the learning curves of the proposed algorithm under different network topologies, for both the imputation error and the classification accuracy, were very close to each other. These simulation results show that the size of the network did not significantly affect the learning performance of our proposed algorithm.
Furthermore, we also compare the CPU times of the proposed algorithm at each individual node under different networks on the “mHealth” and “Gas Drift” datasets in Figure 5. To ensure fairness in this experiment, we set the amount of training data at each node to be the same. From Figure 5, we can see that the CPU time of the proposed algorithm at each node remained nearly unchanged. This result indicates that the network topology does not affect the computational efficiency of the algorithm, as long as the data size at a single node remains unchanged.
Additionally, to assess the robustness of the proposed algorithm against the initial settings of the model parameters, we compared the learning performance of the proposed algorithm under different parameter settings. To be specific, two cases were taken into consideration.
Case 1: We maintained the original settings. That is, the mean vector of each Gaussian component was randomly initialized to a vector uniformly distributed in the range (0, 0.5), the covariance matrix of each Gaussian component was initialized to the D-dimensional identity matrix I_D, and the weight parameters were initialized as in the original configuration.
Case 2: We reset the initial settings. That is, the mean vector of each Gaussian component was randomly initialized to a complete-feature training sample of the local node, and the covariance matrix of each Gaussian component and the weight parameters were re-initialized to different values.
We depict the learning curves of the imputation error and the classification accuracy of the proposed algorithm with the different initial settings in Figure 6 and Figure 7. It should be noted that, to distinguish them from each other, we refer to Case 1 as dPMDC with the original initial settings and Case 2 as dPMDC with the new initial settings. From the simulation results in Figure 6 and Figure 7, we can observe that the curves of the two variants almost overlap, indicating that our proposed algorithm is insensitive to the initial settings of the model parameters.
Moreover, we investigated the impact of the two trade-off parameters in the cost function and the discretization level U of the random feature map on the classification performance of our proposed algorithm using the “mHealth” and “Gas Drift” datasets. In this experiment, we investigated the performance change of the proposed algorithm by varying the value of one parameter while keeping the others unchanged. The changing trends of the two trade-off parameters were similar. The simulation results presented in Figure 8 indicate that, as long as their values were set within [0.1, 1] and [0.01, 0.1], respectively, good learning performance could be obtained. We can also see that the classification accuracy of the proposed method gradually improved as the discretization level U of the random feature map increased. A possible reason is as follows: as the value of U increases, the random feature map gives a more precise approximation of the kernel feature map, which boosts the classification performance of the induced classifier to some extent. When the value of U exceeded 4, however, the performance improvement resulting from further increments of U diminished progressively. Since larger values of U lead to higher computational complexity and communication cost, an appropriate choice is U = 4, which achieves a balance between classification performance and computational complexity.
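The kernel approximation underlying this experiment can be illustrated with standard random Fourier features (Rahimi and Recht, 2007). How exactly the discretization level U enters the construction in dPMDC (e.g., as a quantization level of the random features) is not fully specified here, so the sketch below simply treats the random feature dimension as the fidelity knob; the parameter names are assumptions.

```python
import numpy as np

def random_fourier_features(X, num_features, gamma=1.0, seed=0):
    """Map X to a random feature space whose inner products approximate
    the Gaussian kernel exp(-gamma * ||x - x'||^2): a finer map (more
    features) yields a more precise kernel approximation."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # spectral sampling of the Gaussian kernel: w ~ N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```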
Considering that the value of K had a significant effect on both the imputation error and the classification accuracy, we investigated their changing trends versus K, shown in Figure 9. The simulation results indicate that the learning performance of the proposed dPMDC gradually improved as the number of Gaussian mixture components K increased. When the value of K was greater than 30, the extent of the learning performance improvement rapidly decreased. These trends were similar to those of U. Therefore, we set the number of Gaussian mixture components K to 30 to strike a balance between computational complexity and learning performance.
Furthermore, we investigated the learning performance of the proposed dPMDC algorithm on data with different distributions. Given the challenge of characterizing the distribution of existing real datasets, we adopted a common synthetic dataset known as “Double Moon”. Following the operations in [9], we randomly generated 20,000 training data samples and divided the upper and lower moons into two classes, as shown in Figure 10. Then, a specific number of noisy labels was added to the candidate labels under fixed values of the controlled parameters s and ε. To assess the learning performance of the proposed algorithm, we added noise from different distributions to the training data. Specifically, the following three cases were considered (a generation sketch is given after the list).
Case 1: We added zero-mean Gaussian noise to the training data such that the signal-to-noise ratio was 15 dB.
Case 2: We added 0–1 noise with a magnitude of 0.3 and a probability of 0.5 to the training data.
Case 3: We added uniformly distributed noise with a magnitude of 0.3 to the training data.
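The three noise models can be generated as follows. This is a sketch under the stated magnitudes; the sign convention of the 0–1 noise (additive and non-negative) is an assumption.

```python
import numpy as np

def add_noise(X, case, seed=0):
    """Corrupt training features according to the three cases above."""
    rng = np.random.default_rng(seed)
    if case == 1:    # zero-mean Gaussian noise at 15 dB SNR
        signal_power = np.mean(X ** 2)
        noise_std = np.sqrt(signal_power / 10 ** (15 / 10))
        return X + rng.normal(scale=noise_std, size=X.shape)
    if case == 2:    # 0-1 noise: each entry gains 0.3 with probability 0.5
        mask = rng.random(X.shape) < 0.5
        return X + 0.3 * mask.astype(float)
    if case == 3:    # uniform noise with magnitude 0.3
        return X + rng.uniform(-0.3, 0.3, size=X.shape)
    raise ValueError(f"unknown case: {case}")
```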
For clarity, we depict the training data after adding noise in Figure 10.
The learning curves of the imputation error and classification accuracy of the proposed dPMDC for the three cases are presented in Figure 11. We can see that the learning curves were quite similar except for the first few iterations and all converged to ideal levels, indicating that the GMM can effectively characterize training data with different distributions.
To validate the efficacy of the proposed approach in imputing the missing features, we investigated the imputation error of the proposed dPMDC algorithm under different types of missingness, including MCAR, missing at random (MAR), and missing not at random (MNAR), defined as follows (a generation sketch is given after the definitions).
MCAR: The missingness of a feature was completely independent of the variable values. In the following experiments, we used ρ to measure the probability of missing values.
MAR: The probability of a feature being missing was related to the observed variables and unrelated to the characteristics of the unobserved data. In the following experiments, we randomly selected a pair of strongly correlated features; when one feature was larger than a given threshold, the coupled feature was missing with probability ρ.
MNAR: The missingness of a feature depended entirely on the unobserved variable itself. In the following experiments, we assumed that when the value of an attribute was greater than a given threshold, it was missing with probability ρ.
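A sketch of the three missingness mechanisms is given below; the threshold value and the choice of the correlated feature pair are illustrative assumptions, since the text leaves them unspecified.

```python
import numpy as np

def missing_mask(X, mechanism, rho, threshold=0.5, pair=(0, 1), seed=0):
    """Return a boolean mask (True = missing) under MCAR, MAR, or MNAR,
    following the three protocols described above."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(X.shape, dtype=bool)
    if mechanism == "MCAR":      # each entry missing independently w.p. rho
        mask = rng.random(X.shape) < rho
    elif mechanism == "MAR":     # one feature drives missingness of its partner
        drive, dep = pair
        hit = X[:, drive] > threshold
        mask[hit, dep] = rng.random(hit.sum()) < rho
    elif mechanism == "MNAR":    # an entry's own value drives its missingness
        mask = (X > threshold) & (rng.random(X.shape) < rho)
    return mask
```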
In the following experiments, for all the artificially generated PLL datasets (Double Moon, mHealth, Gas Drift, Pendigits, Ecoli, Segmentation, and Vertebral), a specific number of noisy labels was added to the candidate labels under fixed values of the controlled parameters s and ε. For the three real-world PLL datasets (Lost, Birdsong, and MSRCv2), no extra noisy labels were added.
For the purpose of comparison, the imputation errors of other existing imputation methods, including kNN imputation [13], subspace learning (SL) imputation [6], logistic regression (LR) imputation [14], extreme learning machine auto-encoder (ELM-AE) imputation [15], multi-layer auto-encoder (MAE) imputation [17], support vector regression imputation-based support vector machine (SVR-SVM) [22], and missing-data-importance-weighted auto-encoder imputation (MIWAE) [23], under the MCAR, MAR, and MNAR assumptions were also evaluated. All the simulation results are shown in Table 5, Table 6 and Table 7.
From Table 5 and Table 6, it can be observed that the comparison algorithms exhibited similar imputation performance under the MCAR and MAR assumptions. The following observations can be made. First, for the same degree of ρ, the kNN imputation method performed the worst in most cases, which indicates that simply exploiting the attribute values of neighboring data cannot accurately describe the global data distribution. The SL, LR, ELM-AE, and SVR-SVM imputation methods induce the imputation model based on the global data distribution and thus performed better than the kNN imputation method. The MAE imputation method is based on an auto-encoder framework and can achieve relatively good imputation accuracy. However, as a supervised algorithm, it requires sufficient complete data with unambiguous labels to train the auto-encoder network and fine-tune the imputation results. When the value of ε was 0.3, the ambiguous labels had a negative effect on its imputation accuracy. Similarly, MIWAE outperformed most comparison algorithms by inducing the imputation model based on the auto-encoder framework. Even so, its imputation performance was still inferior to that of our suggested algorithm due to the coupling noisy labels. Moreover, our suggested dPMDC performed significantly better than the other existing imputation approaches. These results indicate the superiority of the proposed imputation method in characterizing the missing feature distribution under the MCAR and MAR assumptions.
By observing Table 7, we notice that, unlike the results in Table 5 and Table 6, our proposed algorithm had no significant performance advantage under the MNAR assumption. Specifically, the imputation error of the proposed dPMDC algorithm was lower than those of kNN, SVR-SVM, LR, ELM-AE, and SL for all ten datasets. The imputation performance of the proposed dPMDC algorithm was better than that of MAE for six datasets and better than that of MIWAE for five datasets. We analyze the possible reasons in detail below. Under the MNAR assumption, all the features with values larger than the threshold were missing, making it challenging to characterize the distribution of these features. Therefore, it was difficult for our algorithm to achieve learning results as excellent as in the MCAR and MAR cases. Nevertheless, owing to its exploitation of weakly supervised information, the performance of our proposed algorithm was better than those of most comparison algorithms and close to those of the MAE and MIWAE methods.
To highlight the superiority of the proposed algorithm in imputing the missing features, the Friedman test was implemented to determine whether there was a significant difference among the comparison algorithms [45]. Based on the number of comparison methods and the number of datasets used, we computed the Friedman statistic and its associated critical value of 2.16. The value of the Friedman statistic was larger than the critical value. This result indicates that the null hypothesis, stating that there is no significant difference among the evaluated comparison algorithms, was rejected at the 0.05 significance level.
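For reference, the Friedman statistic (in its Iman–Davenport F form, as commonly used together with the Bonferroni–Dunn post hoc test [45]) and the critical difference used in Figure 12 can be computed as follows; q_alpha is the tabulated Bonferroni–Dunn critical value at the chosen significance level and is assumed given.

```python
import numpy as np

def friedman_and_cd(ranks, q_alpha):
    """Friedman statistic (Iman-Davenport F variant) and the Bonferroni-Dunn
    critical difference, from an (N datasets x k algorithms) rank matrix."""
    N, k = ranks.shape
    avg_rank = ranks.mean(axis=0)          # average rank of each algorithm
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(avg_rank ** 2) - k * (k + 1) ** 2 / 4)
    f_stat = (N - 1) * chi2 / (N * (k - 1) - chi2)   # compare to F(k-1, (k-1)(N-1))
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))    # Bonferroni-Dunn CD
    return f_stat, cd, avg_rank
```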
Additionally, the Bonferroni–Dunn test was employed to determine whether our suggested method significantly surpassed the other comparison algorithms [45]. Figure 12 illustrates the average rankings of all comparison algorithms, depicted by black lines, with the proposed dPMDC serving as the control algorithm in this evaluation. The comparison algorithms whose average ranks fell within one critical difference (CD) of the control algorithm at a significance level of 0.05 are connected to it by red lines; otherwise, no line links them. Figure 12 illustrates that our suggested method achieved the highest ranking among all comparison algorithms. Furthermore, the results of the Bonferroni–Dunn test demonstrate that the performance of our proposed method was significantly superior to those of the LR, ELM-AE, SVR-SVM, and kNN imputation methods.
To assess the classification performance of the proposed dPMDC, we tested its classification accuracy under the MCAR, MAR, and MNAR assumptions on the ten datasets. We also evaluated the classification performance of four MDC methods, including dS²MDC [6], ALS-SVM [21], SVR-SVM [22], and MIWAE [23], for comparison. Moreover, to highlight the advantages of the proposed algorithm in simultaneously handling ambiguous labels and missing features, the performances of three novel, state-of-the-art PLL algorithms, including dS²PLL [9], LWS [25], and VALEN [27], were simulated based on complete features imputed by an existing imputation method. According to the results of the previous experiments, the imputation error of MAE was second only to that of our proposed method and significantly lower than those of the other comparison methods, so we utilized MAE to impute the missing features of the training data in the following simulations. To distinguish them from the original algorithms, we respectively use im-dS²PLL, im-LWS, and im-VALEN to denote them. All the simulation results are given in Table 8, Table 9 and Table 10, and the average ranks of all the comparison algorithms are depicted in Figure 13.
The simulation results in Table 8, Table 9 and Table 10 indicate that the performance of ALS-SVM was significantly inferior to those of the other comparison algorithms. The main reason is that ALS-SVM only reduces the weight of data with a high proportion of missing attributes without filling in the missing attributes. Unlike ALS-SVM, SVR-SVM trains the multi-class classifier based on training data imputed by the induced SVR model, making its performance significantly better than that of ALS-SVM. However, the simulation results in Table 5, Table 6 and Table 7 show that the performance of the induced SVR imputation model was relatively poor, which, in turn, had a negative impact on the performance of the SVM classifier. The dS²MDC algorithm performed better than ALS-SVM since it benefits from the interplay of missing feature imputation and model induction in the learned subspace. On the other hand, suffering from the negative effect of noisy labels, its performance was worse than that of our proposed algorithm. It can also be observed that the im-dS²PLL, im-LWS, and im-VALEN algorithms achieved superior performance compared to ALS-SVM by imputing the missing features via the MAE imputation method and eliminating the effect of ambiguous labels using their disambiguation strategies. However, their performances were still poorer than that of the proposed dPMDC algorithm. MIWAE adopts an integrated framework that jointly addresses missing data imputation and classifier induction; benefiting from this, its performance ranked second among all the comparison algorithms. Additionally, our proposed dPMDC algorithm performed best among the eight comparison algorithms in almost all cases. Although our algorithm had a relatively high imputation error under the MNAR assumption, the classifier's performance still remained at a high level due to the interaction between the imputation model and the classifier.
Similar to the previous experiment, we used the Friedman test [45] to verify the performance difference among the eight compared algorithms. With the significance level set to 0.05, we calculated the Friedman statistic and its associated critical value. The Friedman statistic was significantly larger than the critical value, indicating that there was a significant difference among the evaluated comparison algorithms.
Additionally, the Bonferroni–Dunn test was employed to compare the performance difference between our suggested method and the other comparison algorithms [45]. All the results of the Bonferroni–Dunn test are depicted in Figure 13, which illustrates that the learning performance of our suggested method was significantly better than those of the im-VALEN, dS²MDC, im-dS²PLL, SVR-SVM, and ALS-SVM algorithms.
To test the robustness of the proposed algorithm against coupling noisy labels, we compared the classification accuracy of the comparison algorithms versus ε on four artificial PLL datasets (Pendigits, Ecoli, Segmentation, and Vertebral). From Figure 14 and Figure 15, we can see that as the co-occurrence probability ε between the correct label and the coupling noisy label gradually increased, the classification performance of all the considered algorithms gradually deteriorated. Among all the comparison algorithms, our proposed algorithm demonstrated superior performance, especially when the value of ε was smaller than 0.3. When the proportion of coupling noisy labels was larger than 0.3, the performance advantage of the dPMDC algorithm gradually diminished. These simulation results validate the benefits of our proposed algorithm in handling a small proportion of coupling noisy labels.
Finally, we evaluated the CPU times of all the considered algorithms on the ten datasets, shown in Table 11. It should be noted that for the distributed learning algorithms, all the computation operations were performed at the J nodes over the network, while for the centralized learning algorithms, all the computation operations were concentrated at a single fusion node. Owing to distributed parallel computation, the CPU times of all the distributed learning methods were significantly lower than those of the centralized learning methods. Compared with im-dS²PLL, our proposed dPMDC and dS²MDC require fewer computations during the induction of the imputation model. Therefore, the CPU times of the proposed dPMDC and dS²MDC were significantly shorter than those of im-dS²PLL.