1. Introduction
In recent decades, the volume of user-generated content has grown explosively. As an illustration, millions of new photos are added to Flickr daily, in addition to its already substantial collection of several billion images [1]. This massive influx of multimedia data presents novel challenges and vast opportunities for the fields of machine learning and data mining. Given the infeasibility of manually annotating such extensive datasets, Semi-supervised Learning (SSL) emerges as a pivotal approach, utilizing abundant unlabeled data in conjunction with a relatively small amount of labeled data. This paper focuses on one principal category of SSL techniques: Graph-based SSL (GSSL) [2,3].
GSSL is an appealing approach due to its superior performance, despite its cubic computational complexity. In GSSL, we assume that nearby data points share similar labels. Based on a small number of given labels, GSSL constructs an affinity graph and uses it to perform closed-form label propagation, resulting in high accuracy. Representative GSSL methods include Harmonic Energy Minimization (HEM) [2], Local and Global Consistency (LGC) [4], and General Graph Semi-Supervised Learning (GGSSL) [5]. However, the computational complexity of their label propagation is cubic (i.e., $O(n^3)$) for $n$ data points, as it requires computing the inverse of the graph Laplacian. In addition, constructing their dense affinity graphs requires a number of operations that grows at least quadratically with $n$, which is impractical for large-scale datasets. Therefore, fast GSSL algorithms are highly desirable.
Several studies have sought more efficient GSSL algorithms. In [6], label propagation is first carried out on a randomly selected subset, and then on a truncated graph Laplacian constructed by connecting the selected subset with the remaining data points. In [7], the Nyström method is used to approximate the eigenfunction with that of a randomly selected subset. However, both of these methods suffer performance degradation, as they do not take into account the topological structure of the dataset during approximation. In [8], a fast SSL algorithm is proposed by sparsifying manifold regularization. In [9], a method combining the parametric mixture model with the harmonic property is proposed to handle raw data and label propagation. Based on spectral graph theory, an efficient method is proposed in [10] to approximate the eigenvectors of the normalized graph Laplacian. However, its assumptions, such as dimension independence and 1-D uniform distributions, are often difficult to satisfy in real-world datasets. In [11,12], Anchor Graph Regularization (AGR) is introduced, which considers manifold regularization by employing a low-rank approximation of the original graph. Nevertheless, it necessitates a linear transformation between raw data and anchor points. Evidently, the efficiency of these algorithms comes at the expense of accuracy. In [13], Optimal Bipartite Graph Semi-Supervised Learning (OBGSSL) is introduced, which learns a new bipartite graph iteratively instead of fixing the input data graph. In [14], semi-supervised learning via Bipartite graph Construction and Adaptive Neighbors (BCAN) considers both bipartite graph construction and label propagation simultaneously. Since the bipartite graph [15] must be updated iteratively until convergence in both OBGSSL and BCAN, their accuracy and speed are either inferior or only marginally superior to those of AGR, as evidenced by their own experimental results in [13,14]. This inconsistent performance relative to AGR motivates us to develop our algorithm, which consistently outperforms AGR, as confirmed by the experimental results presented at the end of this paper.
In this paper, we introduce an efficient and effective algorithm termed Semi-Supervised Learning with Close-Form label propagation using Bipartite Graph (SSLCFBG). Here, closed-form label propagation means that the soft label matrix for unlabeled data can be computed analytically through an equation, rather than by an iterative algorithm. The algorithm is based on a bipartite graph, which uses a small set of representative points to represent a large set of raw data. Label propagation is subsequently performed on the graph Laplacian of this structure. Owing to the unique structure of its affinity matrix, the proposed SSLCFBG is efficient in both the graph construction and label propagation steps. Firstly, constructing its sparse graph costs less than constructing the dense graph of traditional GSSL algorithms, because only $m$ representative points are involved, where $m$ satisfies $m \ll n$. Secondly, the block structure of the graph Laplacian accelerates closed-form label propagation through an efficient matrix inversion, with a computational complexity that scales nearly linearly with the number of data points $n$. The advantages of the proposed SSLCFBG in terms of speed and accuracy are validated by experiments on six real-world datasets, where it outperforms five state-of-the-art and baseline algorithms.
Next, we will first elucidate our notations and then briefly delve into the graph construction of GSSL and the label propagation process in related algorithms.
1.1. Notations
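Let $X = \{x_1, \ldots, x_n\}$ be the dataset, where $x_i \in \mathbb{R}^d$ and $i = 1, \ldots, n$. The label set is $\{1, \ldots, c\}$. Without loss of generality, we assume the first $l$ data points are labeled as $y_1, \ldots, y_l$ and the remaining data points are unlabeled. Similar to other SSL problems, we assume $l \ll n$. $\{u_1, \ldots, u_m\}$ denotes the set of representative points (or anchor points for AGR) with $m \ll n$. A soft label matrix is defined as $F \in \mathbb{R}_{+}^{(n+m) \times c}$, where $\mathbb{R}_{+}^{(n+m) \times c}$ denotes the set of matrices with nonnegative entries. The label indicator matrix is defined as $Y \in \mathbb{R}^{(n+m) \times c}$ with $Y_{ij} = 1$ if $x_i$ is labeled as $j$, and $Y_{ij} = 0$ otherwise. For convenience, we express $F$ and $Y$ in sub-matrices: $F = [F_l; F_u; F_r]$, $Y = [Y_l; Y_u; Y_r]$, where $F_l$, $F_u$ and $F_r$ correspond to the rows for labeled, unlabeled and the representative points in $F$, respectively. Similarly, we define $Y_l$, $Y_u$ and $Y_r$ for $Y$.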
1.2. Graph Construction in GSSL
The first step of GSSL is to construct an undirected graph $G = (V, E)$, where $V$ is a set including raw data and representative points. The set $E$ consists of edges between vertices. Each edge $e_{ij} \in E$ has an associated weight $W_{ij}$; these weights form an affinity matrix $W$, where $W_{ij} = 0$ means no direct connection between vertices $i$ and $j$.
There are different ways to construct the affinity matrix. For example, we can use $k$-nearest neighbors (kNN) to connect any pair of points if one is among the $k$ nearest neighbors of the other [16]. The computational complexity of kNN in full graph construction grows quadratically with $n$, making it unsuitable for real-world datasets with large $n$. In this paper, we propose SSLCFBG with a fast graph construction. Meanwhile, the graph Laplacian is defined as $L = D - W$, where $W$ is the affinity matrix and $D$ is a diagonal matrix with diagonal elements $D_{ii} = \sum_{j} W_{ij}$.
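As a rough illustration of why full kNN graph construction is expensive, the following sketch builds a dense kNN affinity matrix and its Laplacian with NumPy. It assumes the unnormalized Laplacian $L = D - W$ with $D_{ii} = \sum_j W_{ij}$, binary edge weights, and brute-force neighbor search; the function names `knn_affinity` and `graph_laplacian` are ours and are not taken from the paper.

```python
import numpy as np

def knn_affinity(X, k):
    """Dense symmetric kNN affinity with 0/1 weights (an illustrative choice)."""
    n = X.shape[0]
    # All pairwise squared Euclidean distances: n^2 entries, the bottleneck.
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist2, axis=1)[:, :k]  # k nearest neighbors per row
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, nearest.ravel()] = 1.0
    # Connect i and j if either one is among the k nearest neighbors of the other.
    return np.maximum(W, W.T)

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    D = np.diag(W.sum(axis=1))
    return D - W

# Toy data: two well-separated pairs of points on a line.
X = np.array([[0.0], [0.1], [5.0], [5.2]])
W = knn_affinity(X, k=1)
L = graph_laplacian(W)
```

Both the memory footprint and the distance computation scale with $n^2$ here, which is the cost that a bipartite graph with few representative points is meant to avoid.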
1.3. Related Algorithms
After constructing the graph in the above subsection, we now discuss the label propagation of three closely related algorithms: HEM, LGC and AGR.
Considering the given labels as additional constraints, HEM in [2] aims to minimize
$$\mathrm{tr}(F^{\top} L F) \tag{1}$$
subject to $F_l = Y_l$, where $\mathrm{tr}(\cdot)$ is the matrix trace. Here, label consistency of the given labels is guaranteed by the explicit constraint $F_l = Y_l$. The closed-form label propagation of HEM is
$$F_u = (D_{uu} - W_{uu})^{-1} W_{ul} Y_l, \tag{2}$$
where $D$ is the degree matrix in the graph Laplacian, while the subscripts $l$ and $u$ correspond to labeled and unlabeled data. $W_{uu}$ is the weight for unlabeled data, and $W_{ul}$ is the weight between labeled and unlabeled data. For every data point $x_i$, its label $y_i$ can be estimated through
$$y_i = \arg\max_{j} F_{ij}, \tag{3}$$
where $j \in \{1, \ldots, c\}$.
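The harmonic solution can be sketched in a few lines. This is a minimal example, assuming the standard harmonic form from [2], $F_u = (D_{uu} - W_{uu})^{-1} W_{ul} Y_l$, followed by a row-wise argmax; the function names and the toy graph are ours and purely illustrative.

```python
import numpy as np

def harmonic_label_propagation(W, Y_l):
    """Solve (D_uu - W_uu) F_u = W_ul Y_l, with the first l points labeled."""
    l = Y_l.shape[0]
    D = np.diag(W.sum(axis=1))
    L_uu = D[l:, l:] - W[l:, l:]   # Laplacian block for unlabeled points
    W_ul = W[l:, :l]               # weights between unlabeled and labeled points
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(L_uu, W_ul @ Y_l)

def assign_labels(F):
    """Pick the class with the largest soft label in every row."""
    return np.argmax(F, axis=1)

# Toy graph: points 0 and 1 are labeled (classes 0 and 1), points 2 and 3 are not.
W = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 0.5, 0.0]])
Y_l = np.array([[1.0, 0.0],
                [0.0, 1.0]])
F_u = harmonic_label_propagation(W, Y_l)
labels_u = assign_labels(F_u)
```

The linear system has one row per unlabeled point, so for a dense graph its solution cost is cubic in the number of unlabeled points.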
Since HEM is unable to handle noisy data, LGC in [4] generalizes HEM with a finite weight of $\mu$, which has the following loss function:
where $\mathrm{tr}(\cdot)$ is the matrix trace, and $\|\cdot\|_F$ represents the Frobenius norm. The closed-form solution for LGC is
To reduce the high computational complexities of HEM and LGC, AGR in [11,12] minimizes
where
and
are matrices of sample-adaptive weights. AGR is much faster than HEM and LGC, because it employs a small number of anchor points. However, it requires a linear assumption of
where
. This assumption implies that the labels of the raw data can be linearly represented by the labels of a limited number of anchor points. Given that the ratio $n/m$ of raw data points to anchor points is typically very high, it becomes challenging to accurately fit all the raw data using a linear projection.
In the subsequent section, we introduce SSLCFBG, which aims to preserve the precision of HEM and LGC while maintaining the efficiency of AGR.
2. Materials and Methods
This section presents an efficient and effective algorithm, SSLCFBG, designed to address a general GSSL problem using a bipartite graph. On the one hand, it converges significantly faster than HEM and LGC. On the other hand, the proposed algorithm’s label propagation is faster and more precise, as it does not rely on the linear assumption inherent in AGR.
2.1. General Graph-Based Model
The GSSL problem can be formulated in different ways, such as those in [2,4,5]. In this paper, we consider a general model (loss function) as follows:
where parameter
trades off label smoothness and consistency. In particular, the loss functions of HEM and LGC can be interpreted as special cases of Equation (8), if
We set
,
for
, and
otherwise. Then, Equation (
8) is the loss function of HEM.
We let
, and
. Then, Equation (
8) becomes the loss function of LGC.
For convenience, Equation (
8) can be rewritten in matrix-form,
where
is a diagonal matrix.
2.2. Naive Label Propagation
Due to its convexity with respect to $F$, the minimum of Equation (9) is attained when
, which leads to the closed-form solution of
Then, we apply Equation (
3) to assign labels.
However, when it comes to computational complexity, efficiently minimizing Equation (
9) presents a significant challenge. The crux lies in Equation (
10), where computing the inverse of an
matrix requires
operations. Likewise, the closed-form solutions for HEM and LGC, as presented in Equation (2) and Equation (5), respectively, also entail cubic computational complexities.
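To make the cost of naive propagation concrete, the sketch below solves a dense system of the kind described above. It assumes, as our own working hypothesis rather than the paper's exact formulation, the objective $\mathrm{tr}(F^{\top} L F) + \mathrm{tr}((F - Y)^{\top} U (F - Y))$ with a diagonal weight matrix $U$, whose stationarity condition is $(L + U) F = U Y$. The name `naive_label_propagation` and the weight vector `mu` are hypothetical.

```python
import numpy as np

def naive_label_propagation(W, Y, mu):
    """Dense solve of (L + U) F = U Y, where U = diag(mu) (assumed objective)."""
    L = np.diag(W.sum(axis=1)) - W   # unnormalized graph Laplacian
    U = np.diag(mu)                  # per-point fitting weights
    # One dense factorization of a square system: cubic in the number of vertices.
    return np.linalg.solve(L + U, U @ Y)

# Toy graph: labeled points 0 and 1 are both connected to unlabeled point 2.
W = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
mu = np.array([100.0, 100.0, 0.0])   # large weight on labeled points, none on unlabeled
F = naive_label_propagation(W, Y, mu)
```

In this toy case the unlabeled point sits symmetrically between the two labeled points, so its soft label is split evenly between the two classes.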
Consequently, it is highly desirable to reduce the computational cost associated with label propagation.
2.3. Large-Scale Graph Construction
In this subsection, we employ a bipartite graph to expedite the graph construction phase. An illustrative example of constructing such a bipartite graph is provided in Figure 1. Figure 1a shows the original fully connected graph, comprising 8 data points and 28 edges. In Figure 1b, we introduce three representative points to represent the 8 data points; for instance, ‘A’ is designated to represent data points ‘1’ and ‘2’. This bipartite graph requires only 8 edges. Figure 1c displays just the 3 representative points connected by 3 edges, as opposed to the 8 data points and 28 edges in Figure 1a.
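The edge counts quoted for Figure 1 follow from simple counting, which the sketch below checks. It is illustrative only: the helper names are ours, and the assignment of points '3' to '8' to representatives 'B' and 'C' is hypothetical, since the text only states that 'A' represents '1' and '2'.

```python
def complete_graph_edges(num_vertices):
    """Number of edges in a fully connected undirected graph."""
    return num_vertices * (num_vertices - 1) // 2

def bipartite_edges(assignment):
    """Number of edges when each data point links only to its listed representatives."""
    return sum(len(reps) for reps in assignment.values())

# Hypothetical assignment; only 'A' -> {1, 2} is stated in the text.
assignment = {
    1: ["A"], 2: ["A"],
    3: ["B"], 4: ["B"], 5: ["B"],
    6: ["C"], 7: ["C"], 8: ["C"],
}

full_edges = complete_graph_edges(8)        # fully connected graph on 8 data points
rep_edges = complete_graph_edges(3)         # fully connected graph on 3 representatives
bip_edges = bipartite_edges(assignment)     # one edge per data point in this example
```

The fully connected graph grows quadratically with the number of points, whereas the bipartite graph grows linearly when every point connects to a fixed number of representatives.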
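The affinity matrix of the GSSL problem based on a bipartite graph is sparse and block anti-diagonal, i.e.,
$$W = \begin{bmatrix} 0 & Z \\ Z^{\top} & 0 \end{bmatrix}, \tag{11}$$
where $W \in \mathbb{R}^{(n+m) \times (n+m)}$, $Z \in \mathbb{R}^{n \times m}$ and $0$ denotes a zero block of appropriate size. Meanwhile, we assume that $Z$ is element-wise nonnegative and row-wise normalized, i.e., $Z_{ij} \geq 0$ and $\sum_{j=1}^{m} Z_{ij} = 1$. Since $m \ll n$, a large portion of the elements in $W$ of Equation (11) are zeros, which reduces the computational complexity of constructing the graph. To make it more efficient, it is reasonable to assume that $Z$ is sparse, since the manifold assumption of GSSL implies that it is good enough to have $Z_{ij} = 0$ for data point $x_i$ if it is far from a representative point $u_j$.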
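The matrix $Z$ can be determined by Nadaraya–Watson kernel estimation [11,17]:
$$Z_{ij} = \frac{K_h(x_i, u_j)}{\sum_{j' \in \langle i \rangle} K_h(x_i, u_{j'})} \ \text{ if } j \in \langle i \rangle, \qquad Z_{ij} = 0 \ \text{ otherwise}, \tag{12}$$
where the set $\langle i \rangle$ consists of the indices of the $s$ nearest representative points of $x_i$. Parameter $s$ is given by the user to define the number of nonzero entries in each row of $Z$. $K_h(\cdot, \cdot)$ represents the kernel function with bandwidth $h$; for instance, we can use the following Gaussian kernel:
$$K_h(x_i, u_j) = \exp\left(-\frac{\|x_i - u_j\|^2}{2h^2}\right). \tag{13}$$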
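The construction of $Z$ can be sketched as follows. This is a minimal dense NumPy version, assuming the Gaussian kernel $\exp(-\|x - u\|^2 / (2h^2))$ with a user-chosen bandwidth `h` and normalization over the `s` nearest representative points; the function name `build_Z` and the toy data are ours. A practical implementation would store $Z$ in a sparse format.

```python
import numpy as np

def build_Z(X, reps, s, h):
    """Row-normalized kernel weights to the s nearest representative points."""
    n, m = X.shape[0], reps.shape[0]
    # Squared distances between every data point and every representative: n x m.
    dist2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist2, axis=1)[:, :s]     # indices of the s nearest reps
    rows = np.arange(n)[:, None]
    # Gaussian kernel on the selected entries only.
    # Note: if h is tiny relative to the distances, K can underflow to zero.
    K = np.exp(-dist2[rows, nearest] / (2.0 * h ** 2))
    Z = np.zeros((n, m))
    Z[rows, nearest] = K / K.sum(axis=1, keepdims=True)  # each row sums to one
    return Z

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
reps = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 0.0]])
Z = build_Z(X, reps, s=2, h=1.0)
```

Only `s` entries per row are nonzero, so the number of stored weights is proportional to the number of data points times `s` rather than to the square of the number of data points.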
Representative points can be chosen using various methods, including random sampling, k-means clustering, or Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [18]. Owing to space constraints, we opt for k-means in this paper for its simplicity and efficacy. Nevertheless, the proposed SSLCFBG framework can be adapted to incorporate other methods for selecting representative points.
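For readers unfamiliar with k-means as a selector of representative points, a bare-bones Lloyd iteration is sketched below. It is not the implementation used in our experiments: the initialization from the first `m` data points, the fixed iteration count, and the function name `kmeans_representatives` are simplifying assumptions made for illustration.

```python
import numpy as np

def kmeans_representatives(X, m, num_iters=10):
    """Plain Lloyd iterations; returns m cluster centers as representative points."""
    # Simplistic initialization (assumption): use the first m data points.
    centers = X[:m].astype(float).copy()
    for _ in range(num_iters):
        # Assign every point to its nearest center.
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist2.argmin(axis=1)
        # Move every center to the mean of its assigned points.
        for j in range(m):
            members = X[assign == j]
            if len(members) > 0:   # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

# Two well-separated square clusters of four points each.
X = np.array([[0.0, 0.0], [10.0, 10.0],
              [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
centers = kmeans_representatives(X, m=2)
```

Each iteration compares every data point with every center, so the cost per iteration is proportional to the product of the number of points and the number of centers.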
2.4. Label Propagation and SSLCFBG
This subsection leverages the bipartite graph to further improve the efficiency of the label propagation step, and then summarizes the proposed SSLCFBG algorithm. First of all, the closed-form solution for label propagation in our SSLCFBG algorithm is established through the following theorem.
Theorem 1. If the GSSL problem has
- 1.
Undirected weighted graph as ;
- 2.
Label indicator matrix where and ;
- 3.
Affinity matrix W is defined as Equation (11).
Then, we have the following closed-form solution for label propagation: where is a diagonal matrix, whose nonzero element satisfies . , and
Proof. According to Equation (10), we have
where
,
and
Then, we can expand Equation (
16) to have the following equation:
Utilizing
and
results in Equation (
14). □
Moreover, based on Equation (14), a specific solution can be derived by taking additional information into account. For example, we can utilize the harmonic property of HEM in [2], if set
,
and
are as follows:
where
denotes the index set of
.
is the index set for labeled data.
means the index set for unlabeled data and representative points. Then, the closed-form solution for HEM with bipartite graph is
and
whose computational complexity is considerably lower than the $O(n^3)$ complexity of HEM, under the assumption that $m \ll n$.
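The source of the speed-up can be illustrated with generic block elimination (a Schur complement). The sketch below is not a transcription of Equation (14); it assumes the objective $\mathrm{tr}(F^{\top} L F) + \mathrm{tr}((F - Y)^{\top} U (F - Y))$ with diagonal $U$, the block anti-diagonal $W$ built from a row-normalized $Z$, and no fitting term on the representative points. All names (`bipartite_label_propagation`, `dense_reference`, `mu_x`) are ours.

```python
import numpy as np

def bipartite_label_propagation(Z, Y_x, mu_x):
    """Solve (L + U) F = U Y by eliminating the data-point block first."""
    # Rows of Z sum to one, so the data-point block of L + U is the diagonal I + U_x.
    a_inv = 1.0 / (1.0 + mu_x)
    lam = Z.sum(axis=0)                    # degrees of the representative points
    ZtA = Z.T * a_inv                      # Z^T (I + U_x)^{-1}, shape m x n
    S = np.diag(lam) - ZtA @ Z             # m x m Schur complement
    rhs = ZtA @ (mu_x[:, None] * Y_x)
    F_r = np.linalg.solve(S, rhs)          # only an m x m system is solved
    F_x = a_inv[:, None] * (mu_x[:, None] * Y_x + Z @ F_r)
    return F_x, F_r

def dense_reference(Z, Y_x, mu_x):
    """Same solution via one dense (n + m) x (n + m) solve, for comparison."""
    n, m = Z.shape
    W = np.block([[np.zeros((n, n)), Z], [Z.T, np.zeros((m, m))]])
    L = np.diag(W.sum(axis=1)) - W
    U = np.diag(np.concatenate([mu_x, np.zeros(m)]))
    Y = np.vstack([Y_x, np.zeros((m, Y_x.shape[1]))])
    return np.linalg.solve(L + U, U @ Y)

Z = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.3, 0.7]])
Y_x = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
mu_x = np.array([100.0, 100.0, 0.0, 0.0])  # the first two points are labeled
F_x, F_r = bipartite_label_propagation(Z, Y_x, mu_x)
F_dense = dense_reference(Z, Y_x, mu_x)
```

Because the data-point block is diagonal, the only matrix that must be factorized has one row per representative point, which is why the cost in the number of data points stays close to linear when the number of representative points is small.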
Lastly, the proposed SSLCFBG algorithm is encapsulated in Algorithm 1, with its computational complexity to be discussed in the subsequent subsection.
Algorithm 1: Semi-Supervised Learning with Close-Form label propagation using Bipartite Graph (SSLCFBG) |
Input: m, s, , |
Output: |
1 Selection of m representative points by k-means. |
2 Construct bipartite graph then compute the corresponding . Here, we use Equation (12) to decide the elements of Z. |
3 Compute , and . |
4 Compute using Equation (15). |
5 Compute the soft label matrix for unlabeled data using Equation (14) |
6 Determine the label for every data by Equation (3). |
2.5. Complexity Analysis
The computational complexity of Algorithm 1 is governed by four main steps, which have a cost of , with the exception of the third and final steps. Specifically: (1) the first step incurs a cost of , where T is typically small and denotes the number of iterations for k-means, thus simplifying the complexity of this step to . (2) The second step has a cost of , with s representing the number of nearest representative points to be connected to. (3) The fourth step costs , given that in SSL problems, we have and . And (4) The fifth step requires operations.
Consequently, the overall computational complexity of SSLCFBG is , assuming and . Moreover, since in SSL problems we typically have and , the complexity of the proposed SSLCFBG is nearly linear with respect to the number of data points, making it more efficient than other GSSL algorithms. We will conduct experiments in the following section to elucidate the advantages of the proposed algorithm.