Semi-Supervised Learning with Close-Form Label Propagation Using a Bipartite Graph

Peng, Zhongxing; Zheng, Gengzhong; Huang, Wei

doi:10.3390/sym16101312


by Zhongxing Peng, Gengzhong Zheng * and Wei Huang *

School of Computer Information Engineering, Hanshan Normal University, Chaozhou 521041, China

* Authors to whom correspondence should be addressed.

Symmetry 2024, 16(10), 1312; https://doi.org/10.3390/sym16101312

Submission received: 26 August 2024 / Revised: 25 September 2024 / Accepted: 2 October 2024 / Published: 4 October 2024

(This article belongs to the Section Computer)


Abstract

In this paper, we introduce an efficient and effective algorithm for Graph-based Semi-Supervised Learning (GSSL). Unlike other GSSL methods, our proposed algorithm achieves efficiency by constructing a bipartite graph, which connects a small number of representative points to a large volume of raw data by capturing their underlying manifold structures. This bipartite graph, with a sparse and anti-diagonal affinity matrix which is symmetrical, serves as a low-rank approximation of the original graph. Consequently, our algorithm accelerates both the graph construction and label propagation steps. In particular, on the one hand, our algorithm computes the label propagation in closed-form, reducing its computational complexity from cubic to approximately linear with respect to the number of data points; on the other hand, our algorithm calculates the soft label matrix for unlabeled data using a closed-form solution, thereby gaining additional acceleration. Comprehensive experiments performed on six real-world datasets demonstrate the efficiency and effectiveness of our algorithm in comparison to five state-of-the-art algorithms.

Keywords:

graph-based semi-supervised learning; semi-supervised learning; machine learning; data mining

1. Introduction

In recent decades, the volume of user-generated content has experienced explosive growth. As an illustration, millions of new photos are added to Flickr daily, in addition to its already substantial collection of several billion images [1]. This massive influx of multimedia data presents novel challenges and vast opportunities for the fields of machine learning and data mining. Given the infeasibility of manually annotating such extensive datasets, Semi-Supervised Learning (SSL) emerges as a pivotal approach, utilizing abundant unlabeled data in conjunction with a relatively small amount of labeled data. This paper focuses on one principal category of SSL techniques: Graph-based SSL (GSSL) [2,3].

GSSL is an appealing approach due to its superior performance, despite its cubic computational complexity. In GSSL, we assume that nearby data points share similar labels. Based on a small number of given labels, GSSL constructs an affinity graph and uses it to perform closed-form label propagation, resulting in high accuracy. Representative GSSL methods include Harmonic Energy Minimization (HEM) [2], Local and Global Consistency (LGC) [4], and General Graph Semi-Supervised Learning (GGSSL) [5]. However, the computational complexity of their label propagation is cubic (i.e., $O(n^{3})$) for n data points, as it requires computing the inverse of the graph Laplacian. In addition, $O(n^{2})$ operations are needed to construct their dense affinity graphs, which is impractical for large-scale datasets. Therefore, fast GSSL algorithms are highly desirable.

Several studies have sought efficient GSSL algorithms. In [6], label propagation is first carried out on a randomly selected subset, and then on a truncated graph Laplacian which is constructed by connecting the selected subset with the remaining data points. In [7], the Nyström method is used to approximate the eigenfunction with that of a randomly selected subset. However, both of these methods experience performance degradation, as they do not take into account the topological structure of the dataset during approximation. In [8], a fast SSL algorithm is proposed by sparsifying manifold regularization. In [9], a method combining the parametric mixture model with the harmonic property is proposed to handle raw data and label propagation. Based on spectral graph theory, in [10], an efficient method is proposed to approximate the eigenvectors of the normalized graph Laplacian. However, its assumptions, such as dimension independence and 1-D uniform distributions, are often difficult to satisfy in many real-world datasets. In [11,12], Anchor Graph Regularization (AGR) is introduced, which considers manifold regularization by employing a low-rank approximation of the original graph. Nevertheless, it necessitates a linear transformation between raw data and anchor points. Evidently, the efficiency of these algorithms comes at the expense of accuracy. In [13], Optimal Bipartite Graph Semi-Supervised Learning (OBGSSL) is introduced, which learns a new bipartite graph iteratively instead of fixing the input data graph. In [14], semi-supervised learning via Bipartite graph Construction and Adaptive Neighbors (BCAN) considers both bipartite graph construction and label propagation simultaneously. Since the bipartite graph [15] must be updated iteratively until convergence in both OBGSSL and BCAN, their accuracy and speed are either inferior or only marginally superior to AGR, as evidenced by their own experimental results in [13,14]. These inconsistent performances compared to AGR motivate us to develop our algorithm, which consistently outperforms AGR, as confirmed by the experimental results presented at the end of this paper.

In this paper, we introduce an efficient and effective algorithm termed Semi-Supervised Learning with Close-Form label propagation using Bipartite Graph (SSLCFBG). Here, closed-form label propagation implies that the soft label matrix $F_{u}$ for unlabeled data can be computed analytically through an equation, rather than by an iterative algorithm. The algorithm is based on a bipartite graph, which uses a small set of representative points to represent a large set of raw data. Label propagation is subsequently performed on the graph Laplacian of this structure. Owing to the unique structure of its affinity matrix, the proposed SSLCFBG is efficient in both the graph construction and label propagation steps. Firstly, constructing its sparse graph incurs a cost of $O(n m)$, as opposed to the $O(n^{2})$ cost associated with traditional GSSL algorithms, where m is the number of representative points and satisfies $m \ll n$. Secondly, the block structure of the graph Laplacian facilitates accelerated closed-form label propagation through an efficient matrix inversion, with a computational complexity that scales nearly linearly with the number of data points n. The advantages of the proposed SSLCFBG in terms of speed and accuracy are validated by experiments on six real-world datasets, where it outperforms five state-of-the-art and baseline algorithms.

Next, we will first elucidate our notations and then briefly delve into the graph construction of GSSL and the label propagation process in related algorithms.

1.1. Notations

Let $X = \{x_{1}, \dots, x_{l}, x_{l + 1}, \dots, x_{n}\}$ be the dataset, where $x_{i} \in R^{d}$ and $i = 1, \dots, n$. The label set is $\mathcal{L} = \{1, \dots, c\}$. Without loss of generality, we assume the first l data points $x_{i}$ $(i \leq l)$ are labeled as $y_{i} \in \mathcal{L}$ and the remaining data points $x_{u}$ $(l + 1 \leq u \leq n)$ are unlabeled. Similar to other SSL problems, we assume $l \ll u$. $Q = \{q_{1}, \dots, q_{m}\} \subset R^{d}$ denotes the set of representative points (or anchor points for AGR) with $m \ll n$. A soft label matrix is defined as $F \in \mathcal{F}$, where $\mathcal{F}$ denotes the set of $(n + m) \times c$ matrices with nonnegative entries. The label indicator matrix is defined as $Y \in R^{(n + m) \times c}$ with $Y_{i j} = 1$ if $x_{i}$ is labeled as $y_{i} = j$, and $Y_{i j} = 0$ otherwise. For convenience, we express F and Y in sub-matrices: $F = {[F_{l}^{T}, F_{u}^{T}, F_{q}^{T}]}^{T}$, $Y = {[Y_{l}^{T}, Y_{u}^{T}, Y_{q}^{T}]}^{T}$, where $F_{l}$, $F_{u}$ and $F_{q}$ correspond to the rows for labeled, unlabeled and representative points in F, respectively. Similarly, we define $Y_{l}$, $Y_{u}$ and $Y_{q}$ for Y.

1.2. Graph Construction in GSSL

The first step of GSSL is to construct an undirected graph $G (V, E, W)$, where $V = X \cup Q$ is the vertex set comprising raw data and representative points. The set $E \subseteq V \times V$ consists of the edges between vertices. Each edge $e_{i j} \in E$ has a weight $w_{i j} \geq 0$, and these weights form an affinity matrix $W \in R^{(n + m) \times (n + m)}$, where $w_{i j} = 0$ means there is no direct connection between $v_{i}$ and $v_{j}$.

There are different ways to construct the affinity matrix. For example, the k-nearest neighbor (kNN) method connects a pair of points if one is among the k nearest neighbors of the other [16]. The computational complexity of kNN in full graph construction is $O({(n + m)}^{2})$, making it unsuitable for real-world datasets with large n. In this paper, we propose SSLCFBG with a fast graph construction. Meanwhile, the graph Laplacian is defined as $L = D - W$, where $L \in R^{(n + m) \times (n + m)}$ and D is a diagonal matrix with diagonal elements $d_{i i} = \sum_{j} w_{i j}$.
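To make the cost of the full-graph construction concrete, the following minimal NumPy sketch builds a symmetric kNN affinity matrix and its Laplacian for a handful of toy points. The function names, the binary edge weights, and the toy data are assumptions made for illustration only; the source does not prescribe them.

```python
import numpy as np

def knn_affinity(X, k):
    """Symmetric 0/1 kNN affinity matrix (binary weights are an assumption)."""
    n = X.shape[0]
    # All pairwise squared distances: this is the quadratic-cost step.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(sq_dist, np.inf)          # a point is not its own neighbor
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                  # edge if either point lists the other

def graph_laplacian(W):
    """L = D - W with d_ii = sum_j w_ij."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))                   # 12 hypothetical points in the plane
W = knn_affinity(X, k=3)
L = graph_laplacian(W)
```

The pairwise distance table is what makes this approach quadratic in the number of vertices, which is the cost the bipartite construction is meant to avoid.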

1.3. Related Algorithms

After constructing the graph in the above subsection, we are going to discuss label propagation of three closely related algorithms: HEM, LGC and AGR.

Considering the given labels as additional constraints, HEM in [2] aims to minimize

\begin{matrix} J_{HEM} (F) := tr (F^{T} L F), \quad \text{s.t.} \; F_{i} = Y_{i}, \end{matrix}

(1)

for $i = 1, \dots, l$, where $tr (\cdot)$ is the matrix trace. Here, consistency with the given labels is guaranteed by the explicit constraints $F_{i} = Y_{i}$. The closed-form label propagation of HEM is

\begin{matrix} F_{u}^{*} = {(D_{u u} - W_{u u})}^{- 1} W_{u l} Y_{l}, \end{matrix}

(2)

where $D = diag (D_{l l}, D_{u u})$ is the degree matrix in the graph Laplacian, while $D_{l l}$ and $D_{u u}$ correspond to labeled and unlabeled data. $W_{u u} \in R^{u \times u}$ is the weight matrix among unlabeled data, and $W_{u l} \in R^{u \times l}$ is the weight matrix between unlabeled and labeled data. For every unlabeled data point $x_{k}$, its label $y_{k}$ can be estimated through

\begin{matrix} y_{k} = \underset{j \leq c}{\arg\max} \, F_{k j}, \end{matrix}

(3)

where $k = l + 1, \dots, n$.
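As an illustration of Equations (2) and (3), the sketch below applies them to a hypothetical six-node chain graph whose two end points are labeled. The function name, the toy graph, and the 0-based class indices are assumptions for this example; a linear solve is used in place of an explicit inverse, which gives the same result.

```python
import numpy as np

def hem_propagate(W, Y_l):
    """Equation (2) followed by Equation (3); labeled points come first in W."""
    l = Y_l.shape[0]
    degrees = W.sum(axis=1)
    D_uu = np.diag(degrees[l:])                # degrees of the unlabeled points
    W_uu = W[l:, l:]                           # unlabeled-to-unlabeled weights
    W_ul = W[l:, :l]                           # unlabeled-to-labeled weights
    F_u = np.linalg.solve(D_uu - W_uu, W_ul @ Y_l)
    return F_u, np.argmax(F_u, axis=1)         # class indices are 0-based here

# Hypothetical chain 0 - 2 - 3 - 4 - 5 - 1; nodes 0 and 1 are labeled.
W = np.zeros((6, 6))
for i, j in [(0, 2), (2, 3), (3, 4), (4, 5), (5, 1)]:
    W[i, j] = W[j, i] = 1.0
Y_l = np.eye(2)                                # node 0 -> class 0, node 1 -> class 1
F_u, labels = hem_propagate(W, Y_l)
```

On this chain the soft labels interpolate linearly between the two labeled ends, so the two unlabeled nodes nearest each end take that end's class.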

Since HEM is unable to handle noisy data, LGC in [4] generalizes HEM with a finite weight $γ > 0$, which leads to the following loss function:

\begin{matrix} J_{LGC} (F) := tr (F^{T} L F) + γ \sum_{i = 1}^{n} {∥ F_{i} - Y_{i} ∥}_{F}^{2}, \end{matrix}

(4)

where ${∥ \cdot ∥}_{F}$ represents the Frobenius norm. The closed-form solution for LGC is

\begin{matrix} F^{*} = {(D - \frac{1}{1 + γ} W)}^{- 1} Y . \end{matrix}

(5)

To reduce the high computational complexities of HEM and LGC, AGR in [11,12] minimizes

\begin{matrix} \min_{F_{q}} \frac{1}{2} {∥ Z_{l q} F_{q} - Y_{l} ∥}_{F}^{2} + \frac{γ}{2} tr (F_{q}^{T} Z_{q}^{T} L Z_{q} F_{q}), \end{matrix}

(6)

where $Z_{l q} \in R^{l \times m}$ and $Z_{q} \in R^{n \times m}$ are matrices of sample-adaptive weights. AGR is much faster than HEM and LGC, because it employs a small number of anchor points. However, it requires the linear assumption

\begin{matrix} F_{l + u} = Z_{q} F_{q}, \end{matrix}

(7)

where $F_{l + u} \in R^{n \times c}$. This assumption implies that the labels of the raw data can be linearly represented by the labels of a limited number of anchor points. Given that the ratio $n / m$ is typically very high, it becomes challenging to accurately fit all the raw data using a linear projection.

In the subsequent section, we introduce SSLCFBG, which aims to preserve the precision of HEM and LGC while maintaining the efficiency of AGR.

2. Materials and Methods

This section presents an efficient and effective algorithm, SSLCFBG, designed to address a general GSSL problem using a bipartite graph. On the one hand, it runs significantly faster than HEM and LGC. On the other hand, its label propagation is faster and more precise than that of AGR, as it does not rely on the linear assumption inherent in AGR.

2.1. General Graph-Based Model

The GSSL problem can be formulated in different ways, such as those in [2,4,5]. In this paper, we consider a general model (loss function) as follows:

\begin{matrix} J (F) = \sum_{i, j = 1}^{n + m} w_{i j} ∥ F_{i} - F_{j} ∥_{F}^{2} + \sum_{i = 1}^{n + m} λ_{i} {∥ F_{i} - Y_{i} ∥}_{F}^{2}, \end{matrix}

(8)

where the parameter $λ_{i} \geq 0$ trades off label smoothness against consistency with the given labels. In particular, the loss functions of HEM and LGC can be interpreted as special cases of Equation (8):

If we set $m = 0$, $λ_{i} = + \infty$ for $i = 1, \dots, l$, and $λ_{i} = 0$ otherwise, then Equation (8) is the loss function of HEM.
If we let $λ_{1} = \dots = λ_{n} > 0$ and $m = 0$, then Equation (8) becomes the loss function of LGC.

For convenience, Equation (8) can be rewritten in matrix-form,

\begin{matrix} J (F) = tr (F^{T} L F) + tr ({(F - Y)}^{T} Λ (F - Y)), \end{matrix}

(9)

where

Λ = diag (λ_{1}, \dots, λ_{n + m})

is a diagonal matrix.
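The correspondence between the element-wise form in Equation (8) and the trace form in Equation (9) can be checked numerically. The sketch below uses random toy matrices (all values are assumptions). For a symmetric W, the double sum over ordered pairs counts every edge twice, so it equals $2 \, tr (F^{T} L F)$; reading this constant factor as absorbed into the scaling of the weights or of $λ_{i}$ is an interpretation made here, not something the source states.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 7, 3                                    # toy sizes: N stands for n + m
A = rng.random((N, N))
W = (A + A.T) / 2                              # symmetric affinity
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian L = D - W
F = rng.random((N, c))
Y = rng.random((N, c))
lam = rng.random(N)

# Smoothness term: double sum over ordered pairs versus the trace form.
diff = F[:, None, :] - F[None, :, :]
pairwise = np.sum(W * np.sum(diff ** 2, axis=2))
trace_term = np.trace(F.T @ L @ F)

# Fitting term: weighted row-wise sum versus the trace form.
fit_sum = np.sum(lam * np.sum((F - Y) ** 2, axis=1))
fit_trace = np.trace((F - Y).T @ np.diag(lam) @ (F - Y))
```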

2.2. Naive Label Propagation

Due to its convexity with respect to F, the minimum of Equation (9) is attained when $\frac{\partial J (F)}{\partial F} = 2 L F + 2 Λ (F - Y) = 0$, which leads to the closed-form solution

\begin{matrix} F^{*} = {(L + Λ)}^{- 1} Λ Y . \end{matrix}

(10)

Then, we apply Equation (3) to assign labels.

However, minimizing Equation (9) efficiently presents a significant challenge. The crux lies in Equation (10), where computing the inverse of an $(n + m) \times (n + m)$ matrix requires $O ({(n + m)}^{3})$ operations. Likewise, the closed-form solutions for HEM and LGC, presented in Equation (2) and Equation (5), respectively, also entail cubic computational complexities.

Consequently, it is highly desirable to reduce the computational cost associated with label propagation.
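For reference, the naive solution in Equation (10) can be written in a few lines of NumPy. This is a minimal sketch on a small random graph; the function name, the toy data, and the choice $λ_{i} = 1$ for all vertices are assumptions for illustration. The dense solve over all vertices is the cubic-cost step discussed above.

```python
import numpy as np

def naive_propagate(W, Y, lam):
    """Equation (10): F* = (L + Lambda)^{-1} Lambda Y, via a dense linear solve."""
    L = np.diag(W.sum(axis=1)) - W
    Lam = np.diag(lam)
    return np.linalg.solve(L + Lam, Lam @ Y)   # cubic in the number of vertices

rng = np.random.default_rng(0)
N, c = 8, 2                                    # toy sizes: N stands for n + m
A = rng.random((N, N))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
Y = np.zeros((N, c))
Y[0, 0] = 1.0                                  # vertex 0 labeled with class 0
Y[1, 1] = 1.0                                  # vertex 1 labeled with class 1
lam = np.ones(N)                               # hypothetical trade-off weights

F = naive_propagate(W, Y, lam)

# The stationarity condition 2 L F + 2 Lambda (F - Y) = 0 should hold at F*.
L = np.diag(W.sum(axis=1)) - W
gradient = 2 * L @ F + 2 * np.diag(lam) @ (F - Y)
```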

2.3. Large-Scale Graph Construction

In this subsection, we employ a bipartite graph to expedite the graph construction phase. An illustrative example of constructing such a bipartite graph is provided in Figure 1. Figure 1a shows the original fully connected graph, comprising 8 data points and 28 edges. In Figure 1b, we introduce three representative points to represent the 8 data points; for instance, ‘A’ is designated to represent data points ‘1’ and ‘2’. This bipartite graph requires only 8 edges. Figure 1c displays just the 3 representative points connected by 3 edges, as opposed to the 8 data points and 28 edges in Figure 1a.

The affinity matrix of a GSSL problem based on a bipartite graph is sparse and block anti-diagonal, i.e.,

\begin{matrix} W = \begin{bmatrix} 0 & Z \\ Z^{T} & 0 \end{bmatrix} \in R^{(n + m) \times (n + m)}, \end{matrix}

(11)

where $Z = {[W_{l q}^{T}, W_{u q}^{T}]}^{T} \in R^{n \times m}$, $W_{l q} \in R^{l \times m}$ and $W_{u q} \in R^{u \times m}$. Meanwhile, we assume that Z is element-wise nonnegative and row-wise normalized, i.e., $Z_{i j} \geq 0$ and $\sum_{j = 1}^{m} Z_{i j} = 1$. Since $m \ll n$, a large portion of the elements of W in Equation (11) are zeros, which reduces the computational cost of constructing the graph. To make it more efficient still, it is reasonable to assume that Z is sparse, since the manifold assumption of GSSL implies that we can set $Z_{i j} = 0$ whenever data point $x_{i}$ is far from representative point $q_{j}$.

The matrix Z can be determined by Nadaraya–Watson kernel estimation [11,17]:

\begin{matrix} Z_{i j} = \frac{K (x_{i}, q_{j}, σ)}{\sum_{k \in A_{i}} K (x_{i}, q_{k}, σ)}, \end{matrix}

(12)

where the set $A_{i}$ consists of the indices of the s nearest representative points of $x_{i}$; Equation (12) applies for $j \in A_{i}$, and $Z_{i j} = 0$ otherwise. The parameter s is given by the user and defines the number of nonzero entries in each row of Z. $K (x_{i}, q_{j}, σ)$ represents a kernel function; for instance, we can use the following Gaussian kernel:

\begin{matrix} K (x_{i}, q_{j}, σ) = exp (\frac{- ∥ x_{i} - q_{j} ∥^{2}}{2 σ^{2}}) . \end{matrix}

(13)

Representative points can be chosen using various methods, including random sampling, k-means clustering, or Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [18]. Owing to space constraints, we opt for k-means in this paper for its simplicity and efficacy. Nevertheless, the proposed SSLCFBG framework can be adapted to incorporate other methods for selecting representative points.
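The construction of Z in Equations (12) and (13) can be sketched as follows. The function name and the toy data are assumptions; for brevity the representative points are drawn by random sampling (one of the options mentioned above) rather than by k-means, and Z is stored densely although a sparse format would be the natural choice at scale.

```python
import numpy as np

def build_Z(X, Q, s, sigma):
    """Row-normalized weights to the s nearest representative points (Eqs. 12-13)."""
    n, m = X.shape[0], Q.shape[0]
    # Squared distances between every data point and every representative point.
    sq_dist = np.sum((X[:, None, :] - Q[None, :, :]) ** 2, axis=2)
    # Indices of the s nearest representative points for each row (the set A_i).
    nearest = np.argpartition(sq_dist, s - 1, axis=1)[:, :s]
    rows = np.arange(n)[:, None]
    # Gaussian kernel values; a very small sigma could underflow to zero here.
    K = np.exp(-sq_dist[rows, nearest] / (2.0 * sigma ** 2))
    Z = np.zeros((n, m))
    Z[rows, nearest] = K / K.sum(axis=1, keepdims=True)
    return Z

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                       # hypothetical raw data
Q = X[rng.choice(50, size=5, replace=False)]       # randomly sampled representatives
Z = build_Z(X, Q, s=2, sigma=1.0)
```

Each row of Z then has exactly s nonzero entries that sum to one, which matches the nonnegativity and row-normalization assumptions placed on Z.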

2.4. Label Propagation and SSLCFBG

This subsection further improves the efficiency of the label propagation step by leveraging the bipartite graph, and then summarizes the proposed SSLCFBG algorithm. First of all, the closed-form solution for label propagation in SSLCFBG is established through the following theorem.

Theorem 1.

If the GSSL problem has

1. an undirected weighted graph $G (V, E, W)$;
2. a label indicator matrix $Y = {[Y_{l}^{T}, Y_{u}^{T}, Y_{q}^{T}]}^{T}$ where $Y_{u} = 0$ and $Y_{q} = 0$;
3. an affinity matrix W defined as in Equation (11),

then we have the following closed-form solution for label propagation:

\begin{matrix} F_{u}^{*} = I_{α u} W_{u q} {\hat{Θ}}^{- 1} I_{α m} W_{l q}^{T} I_{β l} Y_{l} \end{matrix}

(14)

where $I_{α} = diag (I_{α l}, I_{α u}, I_{α m}) \in R^{(n + m) \times (n + m)}$ is a diagonal matrix whose diagonal elements satisfy $α_{i i} = \frac{1}{1 + λ_{i}}$, $I_{β} = I - I_{α} = diag (I_{β l}, I_{β u}, I_{β m})$, and

\begin{matrix} \hat{Θ} = I - I_{α m} W_{l q}^{T} I_{α l} W_{l q} - I_{α m} W_{u q}^{T} I_{α u} W_{u q} . \end{matrix}

(15)

Proof.

According to Equation (10), we have

\begin{matrix} F^{*} = {(L + Λ)}^{- 1} Λ Y = {(I - I_{α} W)}^{- 1} I_{β} Y = [\begin{matrix} I + B_{α l} Θ^{- 1} B_{α m} & - B_{α l} Θ^{- 1} \\ - Θ^{- 1} B_{α m} & Θ^{- 1} \end{matrix}] I_{β} Y \end{matrix}

(16)

where $B_{α l} = [0, - I_{α l} W_{l q}]$, $B_{α m} = \begin{bmatrix} 0 \\ - I_{α m} W_{l q}^{T} \end{bmatrix}$ and

\begin{matrix} Θ^{- 1} = {[\begin{matrix} I & - I_{α u} W_{u q} \\ - I_{α m} W_{u q}^{T} & I - I_{α m} W_{l q}^{T} I_{α l} W_{l q} \end{matrix}]}^{- 1} = [\begin{matrix} I + I_{α u} W_{u q} {\hat{Θ}}^{- 1} I_{α m} W_{u q}^{T} & I_{α u} W_{u q} {\hat{Θ}}^{- 1} \\ {\hat{Θ}}^{- 1} I_{α m} W_{u q}^{T} & {\hat{Θ}}^{- 1} \end{matrix}] . \end{matrix}

(17)

Then, we can expand Equation (16) to obtain

\begin{matrix} F_{u}^{*} = I_{α u} W_{u q} {\hat{Θ}}^{- 1} I_{α m} W_{l q}^{T} I_{β l} Y_{l} + (I + I_{α u} W_{u q} {\hat{Θ}}^{- 1} I_{α m} W_{u q}^{T}) I_{β u} Y_{u} + I_{α u} W_{u q} {\hat{Θ}}^{- 1} I_{β m} Y_{q} . \end{matrix}

(18)

Utilizing $Y_{u} = 0$ and $Y_{q} = 0$ results in Equation (14). □
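The block-inversion argument can be checked numerically. The sketch below evaluates Equations (14) and (15) on random toy data and compares the result with a direct solve of the full system $(I - I_{α} W) F = I_{β} Y$ from the middle expression of Equation (16). The function name, the toy sizes, and the particular $λ_{i}$ values are assumptions chosen only so that the small example is well conditioned.

```python
import numpy as np

def sslcfbg_propagate(W_lq, W_uq, Y_l, a_l, a_u, a_m):
    """Eqs. (14)-(15); a_l, a_u, a_m are the diagonals of I_alpha's three blocks."""
    m = W_lq.shape[1]
    b_l = 1.0 - a_l                                    # diagonal of I_{beta l}
    theta_hat = (np.eye(m)
                 - a_m[:, None] * (W_lq.T @ (a_l[:, None] * W_lq))
                 - a_m[:, None] * (W_uq.T @ (a_u[:, None] * W_uq)))
    rhs = a_m[:, None] * (W_lq.T @ (b_l[:, None] * Y_l))
    F_q = np.linalg.solve(theta_hat, rhs)              # only an m x m system
    return a_u[:, None] * (W_uq @ F_q)

rng = np.random.default_rng(0)
l, u, m, c = 4, 26, 5, 2                               # toy sizes
n = l + u
Z = rng.random((n, m))
Z /= Z.sum(axis=1, keepdims=True)                      # row-normalized, nonnegative
W_lq, W_uq = Z[:l], Z[l:]
Y_l = np.eye(c)[[0, 0, 1, 1]]
lam = np.concatenate([np.full(l, 9.0), np.full(u, 1.0), np.full(m, 19.0)])
alpha = 1.0 / (1.0 + lam)
F_u = sslcfbg_propagate(W_lq, W_uq, Y_l, alpha[:l], alpha[l:n], alpha[n:])

# Reference: solve the full (n + m) x (n + m) system directly.
W = np.block([[np.zeros((n, n)), Z], [Z.T, np.zeros((m, m))]])
Y = np.vstack([Y_l, np.zeros((u + m, c))])
F_ref = np.linalg.solve(np.eye(n + m) - alpha[:, None] * W,
                        (1.0 - alpha)[:, None] * Y)
```

The rows of `F_ref` that belong to the unlabeled points should agree with `F_u`, while the closed form only ever solves an m-by-m system.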

Moreover, based on Equation (14), a specific solution can be derived by taking additional information into account. For example, we can use the harmonic property of HEM in [2] by setting $α_{i}$ and $β_{i}$ as follows:

\begin{matrix} α_{i} = \frac{1}{1 + λ_{i}} = \begin{cases} 0, & i \in I_{l} \\ \frac{1}{1 + 0} = 1, & i \in I_{u + m} \end{cases}, \end{matrix}

(19)

\begin{matrix} β_{i} = \begin{cases} 1, & i \in I_{l} \\ 0, & i \in I_{u + m} \end{cases}, \end{matrix}

(20)

where $I = \{1, \dots, n + m\} = I_{l} \cup I_{u + m}$ denotes the index set of $V = X \cup Q$, $I_{l}$ is the index set of the labeled data, and $I_{u + m}$ is the index set of the unlabeled data and representative points. These choices correspond to $λ_{i} = + \infty$ for labeled data and $λ_{i} = 0$ otherwise, so that Equation (15) reduces to $\hat{Θ} = I - W_{u q}^{T} W_{u q}$. Then, the closed-form solution for HEM with a bipartite graph is $F_{l}^{*} = Y_{l}$ and

\begin{matrix} F_{u}^{*} = W_{u q} {(I - W_{u q}^{T} W_{u q})}^{- 1} W_{l q}^{T} Y_{l}, \end{matrix}

(21)

whose computational complexity, $O (m^{3})$, is considerably lower than the $O (u^{3})$ complexity of HEM, under the assumption that $u \gg m$.
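A literal transcription of Equation (21) is short enough to show in full. The sketch below is a minimal example on random toy data; the function name and the data are assumptions, it presumes that $I - W_{u q}^{T} W_{u q}$ is nonsingular, and it makes no claim about classification accuracy. It only exposes the intermediate soft labels of the representative points so that the two block equations behind Equation (21) can be inspected.

```python
import numpy as np

def hem_bipartite(W_lq, W_uq, Y_l):
    """Equation (21), returning F_u and the intermediate representative-point labels."""
    m = W_lq.shape[1]
    # Solve the m x m system instead of forming the inverse explicitly.
    F_q = np.linalg.solve(np.eye(m) - W_uq.T @ W_uq, W_lq.T @ Y_l)
    return W_uq @ F_q, F_q

rng = np.random.default_rng(1)
l, u, m, c = 4, 26, 5, 2                       # toy sizes
Z = rng.random((l + u, m))
Z /= Z.sum(axis=1, keepdims=True)              # row-normalized, nonnegative
W_lq, W_uq = Z[:l], Z[l:]
Y_l = np.eye(c)[[0, 0, 1, 1]]

F_u, F_q = hem_bipartite(W_lq, W_uq, Y_l)
```

By construction, `F_u` equals `W_uq @ F_q`, and `F_q` satisfies `F_q = W_lq.T @ Y_l + W_uq.T @ F_u`, which are the unlabeled and representative-point rows of the system with $α_{i}$ and $β_{i}$ set as in Equations (19) and (20).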

Lastly, the proposed SSLCFBG algorithm is encapsulated in Algorithm 1, with its computational complexity to be discussed in the subsequent subsection.

Algorithm 1: Semi-Supervised Learning with Close-Form label propagation using Bipartite Graph (SSLCFBG)

Input: m, s, $Y_{l}$, $λ_{1, \dots, n + m}$

Output: $F_{u}^{*}$

1 Select m representative points by k-means.

2 Construct the bipartite graph and compute the corresponding $Z = {[W_{l q}^{T}, W_{u q}^{T}]}^{T}$, using Equation (12) to determine the elements of Z.

3 Compute $I_{α l}$, $I_{α u}$, $I_{α m}$ and $I_{β l}$.

4 Compute $\hat{Θ}$ using Equation (15).

5 Compute the soft label matrix for unlabeled data using Equation (14).

6 Determine the label of every unlabeled data point by Equation (3).

2.5. Complexity Analysis

The computational complexity of Algorithm 1 is governed by four main steps; the remaining two steps (the third and the final one) each cost only $O (n)$. Specifically: (1) The first step incurs a cost of $O (d m n T)$, where T denotes the number of k-means iterations and is typically small, thus simplifying the complexity of this step to $O (d m n)$. (2) The second step has a cost of $O (s m n)$, with s representing the number of nearest representative points to which each data point is connected. (3) The fourth step costs $O (m^{2} n)$, given that in SSL problems we have $u > l$ and $u \approx n$. (4) The fifth step requires $O (n m c)$ operations.

Consequently, the overall computational complexity of SSLCFBG is $O (\max (m, d) \cdot m n)$, assuming $m \geq c$ and $m \geq s$. Moreover, since in SSL problems we typically have $n \gg d$ and $n \gg m$, the complexity of the proposed SSLCFBG is nearly linear with respect to the number of data points, making it more efficient than other GSSL algorithms. We will conduct experiments in the following section to elucidate the advantages of the proposed algorithm.
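To give a feel for the gap between the two growth rates, the following back-of-the-envelope calculation plugs in the MNIST setting used later in the experiments (n = 70,000 points, d = 784 features, m = 1000 representative points). Constant factors are ignored, so the numbers are only order-of-magnitude operation counts under that assumption, not running-time predictions.

```python
n, m, d = 70_000, 1_000, 784         # MNIST size, representative points, feature dimension

dense_inverse = n ** 3               # O(n^3) label propagation of a full-graph method
sslcfbg = max(m, d) * m * n          # O(max(m, d) * m * n) for SSLCFBG

ratio = dense_inverse / sslcfbg
print(f"{dense_inverse:.2e} vs {sslcfbg:.2e} (ratio {ratio:.0f})")
```

With these values the ratio simplifies to $n^{2} / m^{2} = 4900$, since $\max (m, d) = m$ here.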

3. Results

In this section, we conduct extensive experiments on six real-world datasets to compare our algorithm with five state-of-the-art and baseline approaches.

3.1. Setup and Datasets

In our experimental setup, labeled data are randomly selected from a dataset, with the remaining instances treated as unlabeled. The reported results are the average of 20 independent trials. Unless specified otherwise, we construct a set of 1000 representative points. For the sake of simplicity, we employ Equation (21) during the label propagation phase.

All experiments are conducted on a desktop computer equipped with a 3.4 GHz Intel CPU and 16 GB of RAM, using Matlab for implementation. The running time is calculated in Matlab by aggregating the costs associated with graph construction and label propagation.

We carry out experiments on six real-world datasets, each described as follows:

Handwritten (HW) [19]. Named the “Multiple Features Data Set” in the UCI machine learning repository, it consists of 2000 data points, each of which is a handwritten digit from 0 to 9 (10 classes). We use 64-dimensional Karhunen–Loève coefficients [20] as features.
Caltech-101 (CT-101) [21]. It consists of 9144 images of objects across 101 categories. Each image has a size of about $300 \times 200$ pixels. Similar to [22], we also consider two sub-datasets: Caltech-7 (CT-7) and Caltech-20 (CT-20). Caltech-7 consists of 1474 images in seven classes. Caltech-20 consists of 2386 images in 20 classes. We use a Histogram of Oriented Gradients [23] as the feature, with a dimension of 1984.
USPS [24]. The training part of the USPS dataset is used in our experiments. It consists of 10 classes of digits ranging from 0 to 9, with 7291 samples in total. We use all $16 \times 16$ pixels in each sample as a feature vector with a dimension of 256.
NUS-WIDE-OBJECT (NUS) [25]. It consists of 30,000 images in 31 object categories, of which 17,927 are training images and the other 12,073 are testing images. We first combine all 30,000 images, then randomly choose our labeled and unlabeled data from them. We use the color correlogram in HSV color space [26] as the feature, with a dimension of 144.
MNIST [27]. It consists of 70,000 grayscale images of handwritten digits, each of size $28 \times 28$. Its 10 classes range from 0 to 9. We randomly choose labeled data, while the remaining images are used as unlabeled data. All $28 \times 28$ pixels in each sample are used as a feature vector with a dimension of 784.
Animals with Attributes (AwA) [28]. This dataset consists of 30,475 images in 50 animal classes, each of which contains at least 92 data points. With semantic attributes assigned to each class, animals can be uniquely characterized by their attribute vector. We employ rgSIFT [29] as the feature, with a dimension of 2000.

3.2. Experimental Results

In this subsection, we report our experimental results in two groups: (1) comparisons with the baseline algorithms, 1NN and SVM [30]; and (2) comparisons with the state-of-the-art algorithms, HEM [2], LGC [4], GGSSL [5] and AGR [12].

Meanwhile, the six real-world datasets are categorized into two subgroups: (1) small datasets: Handwritten, Caltech-101 (with three subsets: Caltech-7, Caltech-20, and Caltech-101) and USPS; and (2) large datasets: NUS-WIDE-OBJECT, MNIST, and AwA.

3.2.1. Comparison to Baseline Methods

We first compare the proposed SSLCFBG with two baseline algorithms.

On small datasets, as detailed in Table 1 and Table 2, the accuracy of the proposed algorithm surpasses that of both 1NN and SVM, particularly in scenarios with a limited number of labeled data. This is because 1NN requires a larger number of labeled neighbors to accurately determine labels for unlabeled data, while SVM struggles with low accuracy because it attempts to separate data using hyperplanes, which can easily deviate from the correct position when labeled data are scarce. The proposed SSLCFBG inherits the high accuracy characteristic of graph-based methods. In terms of speed, SSLCFBG operates more slowly than 1NN and SVM, because we employ a relatively large number (i.e., 1000) of representative points to maintain consistency with other experiments in this paper. The speed of SSLCFBG can be increased by reducing the number of representative points, as will be demonstrated in subsequent experiments.

On the larger MNIST dataset, SSLCFBG achieves significantly higher accuracy than 1NN and SVM when fewer labeled data are available, as shown in the left sub-figure of Figure 2. In the right sub-figure, the running times of SVM and 1NN increase markedly as the number of available labeled data grows, limiting their applicability in scenarios with a large volume of labeled data. Intriguingly, SSLCFBG maintains a nearly constant running time, independent of the number of labeled data, due to the use of a fixed number of representative points.

Based on these findings, our algorithm demonstrates a favorable balance between efficiency and effectiveness when compared to baseline methods.

3.2.2. Comparisons to State-of-the-Art Methods

As highlighted in our theoretical analysis, the high computational cost poses a significant barrier for many GSSL algorithms. In this subsection, we aim to demonstrate the effectiveness and efficiency of our algorithm in comparison to other state-of-the-art GSSL algorithms, excluding baseline methods that are either computationally intensive or lacking in accuracy.

The accuracy and running times for small and large datasets are presented in Table 3 and Table 4, respectively. We observe that the proposed SSLCFBG yields the best outcomes in terms of both effectiveness and efficiency across different datasets when compared to all other algorithms.

Regarding accuracy, as shown in Table 3, the proposed algorithm outperforms HEM, LGC, and GGSSL. This is because it not only leverages the advantages of GSSL algorithms, but also employs representative points to capture the underlying manifold structure of the raw data. These points are generated using k-means, which serves as a filtering mechanism to eliminate noisy data. Additionally, SSLCFBG demonstrates greater accuracy than AGR, as it does not impose any linear assumptions on the representative points or raw data, as discussed in our theoretical analysis.

In terms of efficiency, as indicated in Table 4, the proposed algorithm is considerably faster than HEM, LGC, and GGSSL, in line with our theoretical analysis. This is due to two reasons: (1) label propagation in HEM, LGC, and GGSSL involves computing the inverses of $n \times n$ matrices, whereas SSLCFBG only requires the inverse of an $m \times m$ matrix, with $m \ll n$; and (2) the bipartite graph used in graph construction is highly sparse. Furthermore, our algorithm is slightly faster than AGR, as it does not require additional time to accommodate the linear assumption.

To illustrate the limitations of the linear assumption in AGR, we compare its accuracy to that of our algorithm on the USPS dataset, as depicted in Figure 3. The number of representative points ranges from 100 to 1000, while the number of labeled data is fixed at 20. As evident from Figure 3, the proposed SSLCFBG consistently outperforms AGR. Therefore, the proposed algorithm is more effective than AGR, as it is not constrained by linear assumptions.

These experiments collectively highlight the advantages of the proposed SSLCFBG in managing big data with respect to both efficiency and accuracy. By “big data”, we refer to the following scenarios: (1) an increased number of data points, (2) data with higher dimensions, (3) a greater number of labeled data, or (4) any combination of these factors. The benefits of SSLCFBG are expected to be even more pronounced on larger datasets, making it well suited to big data applications.

4. Conclusions

In this paper, we introduce a novel large-scale semi-supervised learning framework that utilizes a bipartite graph. The label propagation within this framework exhibits nearly linear computational complexity with respect to the number of data points. In this bipartite graph, a limited number of representative points are connected to a larger set of raw data points, resulting in a sparse and anti-diagonal affinity matrix that accelerates both graph construction and label propagation. The abundance of zero entries in the affinity matrix facilitates rapid graph construction. Moreover, because the structure of the affinity matrix simplifies the closed-form solution, our algorithm needs to invert only a much smaller matrix than traditional GSSL algorithms do. Compared to existing fast algorithms such as AGR, our proposed algorithm is consistently more efficient and accurate in our experiments, as it does not require any linear assumptions. Extensive experiments conducted on six real-world datasets showcase the advantages of the proposed SSLCFBG over five state-of-the-art and baseline algorithms, both in terms of efficiency and effectiveness.

Author Contributions

Conceptualization, Z.P.; methodology, Z.P.; software, Z.P.; validation, Z.P., G.Z. and W.H.; formal analysis, Z.P.; investigation, Z.P.; resources, G.Z. and W.H.; data curation, Z.P.; writing—original draft preparation, Z.P.; writing—review and editing, Z.P., G.Z. and W.H.; visualization, Z.P.; supervision, G.Z. and W.H.; project administration, G.Z. and W.H.; funding acquisition, G.Z. and W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Platform Project of the Education Department of Guangdong Province (No. 2021KCXTD038, No. 2022KSYS003), the Discipline Construction and Promotion Project of Guangdong Province (No. 2022ZDJS065), Education and Teaching Reform Project of Hanshan Normal University (No. PX-161241546).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors would like to thank the referees and the editors for their very useful suggestions, which significantly improved this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Notations used in the paper:

Notations	Descriptions
G	A Graph
$V$	The set of vertices in a graph
i	An index (e.g., of a vertex or a data point)
$v_{i}$	The i-th vertex
$E$	The set of edges in a graph
$e_{i j}$	The edge linked between vertices of $v_{i}$ and $v_{j}$
$Q$	The set of representative points (or anchor points)
$q_{j}$	The j-th representative point
W	The affinity matrix (or weight matrix) of a graph
$w_{i j}$	The weight associated with edge $e_{i j}$
D	The degree matrix of a graph
$d_{i i}$	The degree of data point $x_{i}$
L	The graph Laplacian matrix
$X$	The set of data
$x_{i}$	The i-th data point
$L$	The set of labels
$y_{i}$	The label for data point $x_{i}$
Y	The label indicator matrix
$Y_{i j}$	$Y_{i j} = 1$ if $x_{i}$ is labeled as $y_{i} = j$, otherwise $Y_{i j} = 0$
F	The soft label matrix
$F$	The set of $(n + m) \times c$ matrices with nonnegative entries
Z	The matrix of sample-adaptive weights
$K (x_{i}, q_{j}, σ)$	The kernel function
n	The number of data points
m	The number of representative points
c	The number of different types of labels
l	The number of labeled points
u	The number of unlabeled points
d	The dimension of a data point
$γ$	The finite weight in LGC [4]
$σ$	The standard deviation
$λ_{1}, \dots, λ_{n + m}$	The parameters that trade off label smoothness and label consistency
$Λ$	$Λ = diag (λ_{1}, \dots, λ_{n + m})$ is a diagonal matrix
$tr (\cdot)$	The matrix trace
${∥ \cdot ∥}_{F}$	The Frobenius norm
$\hat{Θ}$ and $Θ^{- 1}$	The intermediate matrices
$I_{α}$	A diagonal matrix with $α_{i i} = \frac{1}{1 + λ_{i}}$ and $I_{α} = diag (I_{α l}, I_{α u}, I_{α m})$
$I_{α l}$	Sub-matrix of $I_{α}$ corresponding to labeled data
$I_{α u}$	Sub-matrix of $I_{α}$ corresponding to unlabeled data
$I_{α m}$	Sub-matrix of $I_{α}$ corresponding to representative points
$I_{β}$	$I_{β} = I - I_{α} = diag (I_{β l}, I_{β u}, I_{β m})$
$I_{β l}, I_{β u}, I_{β m}$	Sub-matrices of $I_{β}$, defined analogously to $I_{α l}$, $I_{α u}$, and $I_{α m}$
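
As a small illustration of the notation above, the following sketch builds the label indicator matrix $Y$ and the diagonal matrices $I_{α}$ and $I_{β}$ from their definitions. The row ordering (labeled points, then unlabeled points, then representative points), the zero rows of $Y$ for unlabeled and representative points, the zero-based class indices, and the values of $λ_{i}$ are assumptions made only for this example.

```python
import numpy as np

def label_indicator(labels, n_rows, c):
    """Y[i, j] = 1 if row i is labeled with class j (zero-based), otherwise 0."""
    Y = np.zeros((n_rows, c))
    Y[np.arange(len(labels)), labels] = 1.0  # only the first len(labels) rows are labeled
    return Y

def alpha_beta(lam):
    """I_alpha = diag(1 / (1 + lambda_i)) and I_beta = I - I_alpha."""
    I_alpha = np.diag(1.0 / (1.0 + lam))
    I_beta = np.eye(len(lam)) - I_alpha
    return I_alpha, I_beta

l, u, m, c = 2, 3, 2, 3                      # labeled, unlabeled, representative, classes
labels = np.array([0, 2])                    # hypothetical labels of the l labeled points
Y = label_indicator(labels, l + u + m, c)

# Hypothetical trade-off parameters, one per row of the (n + m) x c soft label matrix.
lam = np.array([1.0, 1.0, 3.0, 3.0, 3.0, 0.0, 0.0])
I_alpha, I_beta = alpha_beta(lam)

# Sub-matrices for labeled data, unlabeled data, and representative points.
I_alpha_l = I_alpha[:l, :l]
I_alpha_u = I_alpha[l:l + u, l:l + u]
I_alpha_m = I_alpha[l + u:, l + u:]
print(np.diag(I_alpha), np.diag(I_beta))
```

Note that the diagonal entries of $I_{β}$ are $λ_{i} / (1 + λ_{i})$, so each pair of diagonal entries of $I_{α}$ and $I_{β}$ sums to one.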

References

1. Wang, Q.; Shen, B.; Wang, S.; Li, L.; Si, L. Binary Codes Embedding for Fast Image Tagging with Incomplete Labels. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; Volume 8690, pp. 425–439.
2. Zhu, X.; Ghahramani, Z.; Lafferty, J. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-2003), Washington, DC, USA, 21–24 August 2003.
3. Song, Z.; Yang, X.; Xu, Z.; King, I. Graph-Based Semi-Supervised Learning: A Comprehensive Review. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 8174–8194.
4. Zhou, D.; Bousquet, O.; Lal, T.; Weston, J.; Scholkopf, B. Learning with Local and Global Consistency. Adv. Neural Inf. Process. Syst. 2003, 16, 321–328.
5. Nie, F.; Xiang, S.; Liu, Y.; Zhang, C. A General Graph-based Semi-supervised Learning with Novel Class Discovery. Neural Comput. Appl. 2010, 19, 549–555.
6. Delalleau, O.; Bengio, Y.; Le Roux, N. Efficient Non-Parametric Function Induction in Semi-Supervised Learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS-2005), Savannah, GA, USA, 6–8 January 2005.
7. Kumar, S.; Mohri, M.; Talwalkar, A. Sampling techniques for the Nystrom method. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS-2009), Clearwater Beach, FL, USA, 16–18 April 2009.
8. Tsang, I.; Kwok, J. Large-scale sparsified manifold regularization. Adv. Neural Inf. Process. Syst. 2006, 19, 1401–1408.
9. Zhu, X.; Lafferty, J. Harmonic mixtures: Combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005.
10. Fergus, R.; Weiss, Y.; Torralba, A. Semi-supervised learning in gigantic image collections. Adv. Neural Inf. Process. Syst. 2010, 522–530.
11. Liu, W.; He, J.; Chang, S. Large Graph Construction for Scalable Semi-Supervised Learning. In Proceedings of the International Conference on Machine Learning (ICML-2010), Haifa, Israel, 21–24 June 2010.
12. Liu, W.; Wang, J.; Chang, S.F. Robust and Scalable Graph-Based Semisupervised Learning. Proc. IEEE 2012, 100, 2624–2638.
13. He, F.; Nie, F.; Wang, R.; Hu, H.; Jia, W.; Li, X. Fast Semi-Supervised Learning With Optimal Bipartite Graph. IEEE Trans. Knowl. Data Eng. 2021, 33, 3245–3257.
14. Wang, Z.; Zhang, L.; Wang, R.; Nie, F.; Li, X. Semi-Supervised Learning via Bipartite Graph Construction with Adaptive Neighbors. IEEE Trans. Knowl. Data Eng. 2023, 35, 5257–5268.
15. Rowshan, Y. The m-Bipartite Ramsey Number of the K_2,2 Versus K_6,6. Contrib. Math. 2022, 5, 36–42.
16. Luxburg, U. A Tutorial on Spectral Clustering. Stat. Comput. 2007, 17, 395–416.
17. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009.
18. Ester, M.; Kriegel, H.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996.
19. Bache, K.; Lichman, M. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu (accessed on 1 November 2013).
20. Anwar, Z.; Chua, S.; Mowe, R. The MPH Cookbook; MPH Distributors (S): Singapore, 1978.
21. Li, F.; Fergus, R.; Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 2007, 106, 59–70.
22. Dueck, D.; Frey, B. Non-metric affinity propagation for unsupervised image categorization. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8.
23. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
24. Hull, J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 550–554.
25. Chua, T.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In Proceedings of the ACM Conference on Image and Video Retrieval (CIVR’09), Santorini Island, Greece, 8–10 July 2009.
26. Huang, J.; Kumar, S.R.; Mitra, M.; Zhu, W.J.; Zabih, R. Image indexing using color correlograms. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 762–768.
27. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
28. Lampert, C.; Nickisch, H.; Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 951–958.
29. van de Sande, K.; Gevers, T.; Snoek, C. Evaluating Color Descriptors for Object and Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1582–1596.
30. Chang, C.C.; Lin, C.J. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27:1–27:27.

Figure 1. Bipartite Graph Construction in GSSL Problem. (a) The original fully connected graph comprises 8 data points (circles) and 28 edges (lines). (b) The bipartite graph connects two distinct sets: one set includes three representative points (squares) labeled ‘A’, ‘B’, and ‘C’, while the other set contains the 8 data points. Each representative point is associated with several data points of the same color. (c) Label propagation can be performed across the 3 representative points and along the 3 connecting edges.

Figure 2. Comparisons between the proposed method and baseline methods on the MNIST dataset, with the quantity of labeled data ranging from 20 to 4500. Left: accuracy. Right: running time.

Figure 3. The accuracy of AGR and the proposed SSLCFBG for varying numbers of representative points on the USPS dataset, where the number of labeled data is set to 20.

Table 1. Accuracy of the proposed method and baseline methods on HW dataset.

Methods	# of Labels = 20 ^†	# of Labels = 1000
1NN	0.7337 ± 0.0420	0.9663 ± 0.0049
SVM	0.7565 ± 0.0396	0.9607 ± 0.0060
Proposed	0.9385 ± 0.0237	0.9725 ± 0.0046

^† “# of Labels” denotes the number of labeled data.

Table 2. Running time of the proposed method and baseline methods on HW dataset.

Methods	# of Labels = 20 ^†	# of Labels = 1000
1NN	0.0012 ± 0.0006	0.0162 ± 0.0014
SVM	0.0056 ± 0.0015	0.0524 ± 0.0019
Proposed	0.1341 ± 0.0023	0.1389 ± 0.0002

^† “# of Labels” denotes the number of labeled data.

Table 3. Accuracy of the proposed algorithm and other graph-based algorithms.

		Middle Size Datasets					Large Size Datasets
No.	Methods^‡	HW	CT-7	CT-20	CT-101	USPS	NUS	MNIST	AwA
1	HEM	0.8970	0.8587	0.7565	0.3256	0.1719	0.1629	0.1844	0.0600
2	GLC	0.8991	0.7519	0.6551	0.2804	0.1513	0.1607	0.1817	0.0347
3	GGSSL	0.9064	0.8587	0.7520	0.2895	0.1513	0.1587	0.1795	0.0231
4	AGR	0.9214	0.8923	0.8072	0.3882	0.8515	0.1828	0.8494	0.0660
5	Proposed	0.9385	0.9017	0.8202	0.3990	0.8613	0.1923	0.8591	0.0700

^‡ The numbers of labeled data are HW = 20, CT-7 = 20, CT-20 = 50, CT-101 = 200, USPS = 20, NUS = 1000, MNIST = 20, and AwA = 1000.

Table 4. Running time of the proposed algorithm and other graph-based algorithms.

		Middle Size Datasets					Large Size Datasets
No.	Methods	HW	CT-7	CT-20	CT-101	USPS	NUS	MNIST	AwA
1	HEM	0.1411	13.9983	42.3065	955.9894	17.2055	262.4838	13343.1989	5795.7914
2	GLC	0.1341	13.9624	42.2709	964.4532	17.2987	177.7972	12964.3791	6322.7574
3	GGSSL	0.1344	13.9637	42.2850	961.0679	17.2555	180.9809	13167.3352	6133.7992
4	AGR	0.1532	0.6570	1.3209	6.7802	1.4397	3.9459	24.2268	295.9069
5	Proposed	0.1341	0.6349	1.3022	6.7387	1.4175	3.9206	24.1960	295.8366

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

