Article

Semi-Supervised Learning with Close-Form Label Propagation Using a Bipartite Graph

by Zhongxing Peng, Gengzhong Zheng * and Wei Huang *
School of Computer Information Engineering, Hanshan Normal University, Chaozhou 521041, China
* Authors to whom correspondence should be addressed.
Symmetry 2024, 16(10), 1312; https://doi.org/10.3390/sym16101312
Submission received: 26 August 2024 / Revised: 25 September 2024 / Accepted: 2 October 2024 / Published: 4 October 2024
(This article belongs to the Section Computer)

Abstract:
In this paper, we introduce an efficient and effective algorithm for Graph-based Semi-Supervised Learning (GSSL). Unlike other GSSL methods, the proposed algorithm achieves efficiency by constructing a bipartite graph, which connects a small number of representative points to a large volume of raw data while capturing their underlying manifold structure. This bipartite graph, whose affinity matrix is sparse, symmetric, and block anti-diagonal, serves as a low-rank approximation of the original graph. Consequently, our algorithm accelerates both the graph construction and label propagation steps. In particular, it computes the soft label matrix for unlabeled data through a closed-form solution, reducing the computational complexity of label propagation from cubic to approximately linear in the number of data points and thereby gaining additional acceleration. Comprehensive experiments performed on six real-world datasets demonstrate the efficiency and effectiveness of our algorithm in comparison with five state-of-the-art algorithms.

1. Introduction

In recent decades, the volume of user-generated content has experienced explosive growth. As an illustration, millions of new photos are added to Flickr daily, in addition to its already substantial collection of several billion images [1]. This massive influx of multimedia data presents novel challenges and vast opportunities for the fields of machine learning and data mining. Given the infeasibility of manually annotating such extensive datasets, Semi-supervised Learning (SSL) emerges as a pivotal approach, utilizing abundant unlabeled data in conjunction with a relatively small amount of labeled data. This paper focuses on one principal category of SSL techniques: Graph-based SSL (GSSL) [2,3].
GSSL is an appealing approach due to its superior performance, despite its cubic computational complexity. GSSL assumes that nearby data points share similar labels. Based on a small number of given labels, GSSL constructs an affinity graph and uses it to perform closed-form label propagation, resulting in high accuracy. Representative GSSL methods include Harmonic Energy Minimization (HEM) [2], Local and Global Consistency (LGC) [4], and General Graph Semi-Supervised Learning (GGSSL) [5]. However, the computational complexity of their label propagation is cubic (i.e., $O(n^3)$ for $n$ data points), as it requires computing the inverse of the graph Laplacian. In addition, $O(n^2)$ operations are needed to construct their dense affinity graphs, which is impractical for large-scale datasets. Therefore, fast GSSL algorithms are highly desirable.
Several studies have sought efficient GSSL algorithms. In [6], label propagation is first carried out on a randomly selected subset, and then on a truncated graph Laplacian constructed by connecting the selected subset with the remaining data points. In [7], the Nyström method is used to approximate the eigenfunctions with those of a randomly selected subset. However, both of these methods suffer performance degradation, as they do not take the topological structure of the dataset into account during approximation. In [8], a fast SSL algorithm is proposed by sparsifying manifold regularization. In [9], a method combining a parametric mixture model with the harmonic property is proposed to handle raw data and label propagation. Based on spectral graph theory, an efficient method is proposed in [10] to approximate the eigenvectors of the normalized graph Laplacian. However, its assumptions, such as dimension independence and 1-D uniform distributions, are often difficult to satisfy in many real-world datasets. In [11,12], Anchor Graph Regularization (AGR) is introduced, which enforces manifold regularization by employing a low-rank approximation of the original graph. Nevertheless, it necessitates a linear transformation between raw data and anchor points. Evidently, the efficiency of these algorithms comes at the expense of accuracy. In [13], Optimal Bipartite Graph Semi-Supervised Learning (OBGSSL) is introduced, which learns a new bipartite graph iteratively instead of fixing the input data graph. In [14], semi-supervised learning via Bipartite graph Construction and Adaptive Neighbors (BCAN) considers bipartite graph construction and label propagation simultaneously. Since the bipartite graph [15] must be updated iteratively until convergence in both OBGSSL and BCAN, their accuracy and speed are either inferior or only marginally superior to AGR, as evidenced by the experimental results in [13,14]. These inconsistent performances relative to AGR motivate us to develop our algorithm, which consistently outperforms AGR, as confirmed by the experimental results presented at the end of this paper.
In this paper, we introduce an efficient and effective algorithm termed Semi-Supervised Learning with Close-Form label propagation using a Bipartite Graph (SSLCFBG). Here, closed-form label propagation means that the soft label matrix $F_u$ for unlabeled data can be computed analytically through an equation, rather than by an iterative algorithm. The algorithm is based on a bipartite graph, which uses a small set of representative points to represent a large set of raw data. Label propagation is subsequently performed on the graph Laplacian of this structure. Owing to the unique structure of its affinity matrix, the proposed SSLCFBG is efficient in both the graph construction and label propagation steps. Firstly, constructing its sparse graph incurs a cost of $O(nm)$, as opposed to the $O(n^2)$ cost of traditional GSSL algorithms, where $m$ denotes the number of representative points and satisfies $m \ll n$. Secondly, the block structure of the graph Laplacian enables accelerated closed-form label propagation through an efficient matrix inversion, with a computational complexity that scales nearly linearly with the number of data points $n$. The advantages of the proposed SSLCFBG in terms of speed and accuracy are validated by experiments conducted on six real-world datasets, where it outperforms five state-of-the-art and baseline algorithms.
Next, we first elucidate our notation and then briefly review graph construction in GSSL and the label propagation of related algorithms.

1.1. Notations

Let $X = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_n\}$ be the dataset, where $x_i \in \mathbb{R}^d$ for $i = 1, \ldots, n$. The label set is $\mathcal{L} = \{1, \ldots, c\}$. Without loss of generality, we assume that the first $l$ data points $x_i$ ($i \leq l$) are labeled as $y_i \in \mathcal{L}$ and the remaining data points $x_u$ ($l+1 \leq u \leq n$) are unlabeled. As in other SSL problems, we assume $l \ll u$. $Q = \{q_1, \ldots, q_m\} \subset \mathbb{R}^d$ denotes the set of representative points (or anchor points in AGR) with $m \ll n$. A soft label matrix is defined as $F \in \mathcal{F}$, where $\mathcal{F}$ denotes the set of $(n+m) \times c$ matrices with nonnegative entries. The label indicator matrix is defined as $Y \in \mathbb{R}^{(n+m) \times c}$ with $Y_{ij} = 1$ if $x_i$ is labeled as $y_i = j$, and $Y_{ij} = 0$ otherwise. For convenience, we express $F$ and $Y$ in sub-matrices: $F = [F_l^T, F_u^T, F_q^T]^T$ and $Y = [Y_l^T, Y_u^T, Y_q^T]^T$, where $F_l$, $F_u$, and $F_q$ correspond to the rows of $F$ for the labeled, unlabeled, and representative points, respectively. Similarly, we define $Y_l$, $Y_u$, and $Y_q$ for $Y$.

1.2. Graph Construction in GSSL

The first step of GSSL is to construct an undirected graph $G(V, E, W)$, where $V = X \cup Q$ is a set including the raw data and representative points. The set $E \subseteq V \times V$ consists of edges between vertices. Each edge $e_{ij} \in E$ is associated with a weight $w_{ij} \geq 0$, which forms an affinity matrix $W \in \mathbb{R}^{(n+m) \times (n+m)}$; $w_{ij} = 0$ means there is no direct connection between $v_i$ and $v_j$.
There are different ways to construct the affinity matrix. For example, k-nearest neighbor (kNN) connects any pair of points if one is among the $k$ nearest neighbors of the other [16]. The computational complexity of kNN for full graph construction is $O((n+m)^2)$, making it unsuitable for real-world datasets with large $n$. In this paper, we propose SSLCFBG with a fast graph construction. Meanwhile, the graph Laplacian is defined as $L = D - W$, where $L \in \mathbb{R}^{(n+m) \times (n+m)}$ and $D$ is a diagonal matrix with diagonal elements $d_{ii} = \sum_j w_{ij}$.
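For concreteness, the following is a minimal NumPy sketch of the kNN graph construction and graph Laplacian described above. The Gaussian weighting, the width sigma, and the symmetrization rule are illustrative assumptions rather than choices fixed by this paper, and the pairwise distance computation is exactly the quadratic bottleneck noted above.

import numpy as np

def knn_affinity(X, k=5, sigma=1.0):
    """Build a symmetric kNN affinity matrix W and graph Laplacian L = D - W."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances (O(n^2) time and memory).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]          # k nearest neighbors of x_i, excluding itself
        W[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                       # connect i, j if either is a kNN of the other
    D = np.diag(W.sum(axis=1))                   # degree matrix, d_ii = sum_j w_ij
    return W, D - W                              # affinity matrix and Laplacian L = D - W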

1.3. Related Algorithms

Having constructed the graph as described in the above subsection, we now discuss the label propagation of three closely related algorithms: HEM, LGC, and AGR.
Considering the given labels as additional constraints, HEM [2] aims to minimize
$$J_{\mathrm{HEM}}(F) := \mathrm{tr}(F^T L F), \quad \mathrm{s.t.}\ F_i = Y_i, \tag{1}$$
for $i = 1, \ldots, l$, where $\mathrm{tr}(\cdot)$ is the matrix trace. Here, consistency with the given labels is guaranteed by the explicit constraints $F_i = Y_i$. The closed-form label propagation of HEM is
$$F_u^* = (D_{uu} - W_{uu})^{-1} W_{ul} Y_l, \tag{2}$$
where $D = \mathrm{diag}(D_{ll}, D_{uu})$ is the degree matrix in the graph Laplacian, with $D_{ll}$ and $D_{uu}$ corresponding to labeled and unlabeled data. $W_{uu} \in \mathbb{R}^{u \times u}$ contains the weights among the unlabeled data, and $W_{ul} = W_{lu}^T \in \mathbb{R}^{u \times l}$ contains the weights between unlabeled and labeled data. For every data point $x_k$, its label $y_k$ can be estimated through
$$y_k = \arg\max_{j \leq c} F_{kj}, \tag{3}$$
where $k = l+1, \ldots, n$.
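The two formulas above translate directly into code. Below is a minimal NumPy sketch of HEM's label propagation, Equations (2) and (3); it assumes the first $l$ rows of $W$ correspond to the labeled points and that $Y_l$ is one-hot, which are ordering conventions rather than requirements of HEM itself.

import numpy as np

def hem_propagate(W, Y_l, l):
    """W: (n, n) affinity matrix; Y_l: (l, c) one-hot labels; l: number of labeled points."""
    D = np.diag(W.sum(axis=1))
    D_uu, W_uu = D[l:, l:], W[l:, l:]
    W_ul = W[l:, :l]                             # weights between unlabeled and labeled data
    # Equation (2): F_u* = (D_uu - W_uu)^{-1} W_ul Y_l, computed via a linear solve.
    F_u = np.linalg.solve(D_uu - W_uu, W_ul @ Y_l)
    # Equation (3): assign each unlabeled point the class with the largest soft label.
    return F_u.argmax(axis=1)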
Since HEM is unable to handle noisy data, LGC [4] generalizes HEM with a finite weight $\gamma > 0$ and the following loss function:
$$J_{\mathrm{LGC}}(F) := \mathrm{tr}(F^T L F) + \gamma \sum_{i=1}^{n} \|F_i - Y_i\|_F^2, \tag{4}$$
where $\|\cdot\|_F$ represents the Frobenius norm. The closed-form solution for LGC is
$$F^* = \left(D - \frac{1}{1+\gamma} W\right)^{-1} Y. \tag{5}$$
To reduce the high computational complexities of HEM and LGC, AGR [11,12] minimizes
$$\min_{F_q}\ \frac{1}{2} \|Z_{lq} F_q - Y_l\|_F^2 + \frac{\gamma}{2} \mathrm{tr}\left(F_q^T Z_q^T L Z_q F_q\right), \tag{6}$$
where $Z_{lq} \in \mathbb{R}^{l \times m}$ and $Z_q \in \mathbb{R}^{n \times m}$ are matrices of sample-adaptive weights. AGR is much faster than HEM and LGC because it employs a small number of anchor points. However, it requires the linear assumption
$$F_{l+u} = Z_q F_q, \tag{7}$$
where $F_{l+u} \in \mathbb{R}^{n \times c}$. This assumption implies that the labels of the raw data can be linearly represented by the labels of a limited number of anchor points. Given that the ratio $n/m$ is typically very high, it becomes challenging to accurately fit all the raw data using a linear projection.
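For illustration, setting the gradient of Equation (6) to zero yields the minimizer $F_q^* = (Z_{lq}^T Z_{lq} + \gamma Z_q^T L Z_q)^{-1} Z_{lq}^T Y_l$, after which Equation (7) recovers the labels of the raw data. The sketch below implements this reading of AGR; it is not the authors' released code, and it assumes the reduced Laplacian $L_r = Z_q^T L Z_q$ has been precomputed.

import numpy as np

def agr_propagate(Z, L_r, Y_l, l, gamma=0.01):
    """Z: (n, m) sample-adaptive weights; L_r: (m, m) reduced Laplacian; Y_l: (l, c) one-hot labels."""
    Z_lq = Z[:l]                                  # rows of Z for the labeled data
    # Stationarity of Equation (6): (Z_lq^T Z_lq + gamma * L_r) F_q = Z_lq^T Y_l.
    F_q = np.linalg.solve(Z_lq.T @ Z_lq + gamma * L_r, Z_lq.T @ Y_l)
    return Z @ F_q                                # Equation (7): F_{l+u} = Z_q F_q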
In the subsequent section, we introduce SSLCFBG, which aims to preserve the precision of HEM and LGC while maintaining the efficiency of AGR.

2. Materials and Methods

This section presents an efficient and effective algorithm, SSLCFBG, designed to address a general GSSL problem using a bipartite graph. On the one hand, it runs significantly faster than HEM and LGC. On the other hand, its label propagation is faster and more precise than AGR's, as it does not rely on the linear assumption inherent in AGR.

2.1. General Graph-Based Model

The GSSL problem can be formulated in different ways, such as those in [2,4,5]. In this paper, we consider a general model (loss function) as follows:
$$J(F) = \sum_{i,j=1}^{n+m} w_{ij} \|F_i - F_j\|_F^2 + \sum_{i=1}^{n+m} \lambda_i \|F_i - Y_i\|_F^2, \tag{8}$$
where the parameter $\lambda_i > 0$ trades off label smoothness and consistency. In particular, the loss functions of HEM and LGC can be interpreted as special cases of Equation (8):
  • If we set $m = 0$, $\lambda_i = +\infty$ for $i = 1, \ldots, l$, and $\lambda_i = 0$ otherwise, then Equation (8) is the loss function of HEM.
  • If we let $\lambda_1 = \cdots = \lambda_n > 0$ and $m = 0$, then Equation (8) becomes the loss function of LGC.
For convenience, Equation (8) can be rewritten in matrix form:
$$J(F) = \mathrm{tr}(F^T L F) + \mathrm{tr}\left((F - Y)^T \Lambda (F - Y)\right), \tag{9}$$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{n+m})$ is a diagonal matrix.

2.2. Naive Label Propagation

Due to its convexity with respect to $F$, the minimum of Equation (9) is attained when $\frac{\partial J(F)}{\partial F} = 2LF + 2\Lambda(F - Y) = 0$, which leads to the closed-form solution
$$F^* = (L + \Lambda)^{-1} \Lambda Y. \tag{10}$$
Then, we apply Equation (3) to assign labels.
However, when it comes to computational complexity, efficiently minimizing Equation (9) presents a significant challenge. The crux lies in Equation (10), where computing the inverse of an $(n+m) \times (n+m)$ matrix requires $O((n+m)^3)$ operations. Likewise, the closed-form solutions for HEM and LGC, presented in Equations (2) and (5), respectively, also entail cubic computational complexities.
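As a reference point, a direct implementation of Equation (10) takes only a few lines; the dense $(n+m) \times (n+m)$ solve below is precisely the cubic bottleneck just described, which Sections 2.3 and 2.4 are designed to remove.

import numpy as np

def naive_propagate(W, lam, Y):
    """W: (n+m, n+m) affinity matrix; lam: (n+m,) weights lambda_i; Y: (n+m, c) label indicator."""
    L = np.diag(W.sum(axis=1)) - W                # graph Laplacian L = D - W
    # Equation (10): F* = (L + Lambda)^{-1} Lambda Y -- an O((n+m)^3) dense solve.
    return np.linalg.solve(L + np.diag(lam), np.diag(lam) @ Y)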
Consequently, it is highly desirable to reduce the computational cost associated with label propagation.

2.3. Large-Scale Graph Construction

In this subsection, we employ a bipartite graph to expedite the graph construction phase. An illustrative example of constructing such a bipartite graph is provided in Figure 1. Figure 1a shows the original fully connected graph, comprising 8 data points and 28 edges. In Figure 1b, we introduce three representative points to represent the 8 data points; for instance, ‘A’ is designated to represent data points ‘1’ and ‘2’. This bipartite graph requires only 8 edges. Figure 1c displays just the 3 representative points connected by 3 edges, as opposed to the 8 data points and 28 edges in Figure 1a.
The affinity matrix of the GSSL problem based on a bipartite graph is sparse and block anti-diagonal, i.e.,
$$W = \begin{bmatrix} 0 & Z \\ Z^T & 0 \end{bmatrix} \in \mathbb{R}^{(n+m) \times (n+m)}, \tag{11}$$
where $Z = [W_{lq}^T, W_{uq}^T]^T \in \mathbb{R}^{n \times m}$, $W_{lq} \in \mathbb{R}^{l \times m}$, and $W_{uq} \in \mathbb{R}^{u \times m}$. Meanwhile, we assume that $Z$ is element-wise nonnegative and row-wise normalized, i.e., $Z_{ij} \geq 0$ and $\sum_{j=1}^{m} Z_{ij} = 1$. Since $m \ll n$, a large portion of the elements in $W$ of Equation (11) are zeros, which reduces the computational complexity of constructing the graph. To make it even more efficient, it is reasonable to assume that $Z$ is sparse, since the manifold assumption of GSSL implies that it suffices to set $Z_{ij} = 0$ for data point $x_i$ if it is far from representative point $q_j$.
The matrix $Z$ can be determined by Nadaraya–Watson kernel estimation [11,17]:
$$Z_{ij} = \frac{K(x_i, q_j, \sigma)}{\sum_{k \in A_i} K(x_i, q_k, \sigma)}, \tag{12}$$
where the set $A_i$ consists of the indices of the $s$ nearest representative points of $x_i$. The parameter $s$ is specified by the user and determines the number of nonzero entries in each row of $Z$. $K(x_i, q_j, \sigma)$ denotes a kernel function; for instance, we can use the following Gaussian kernel:
$$K(x_i, q_j, \sigma) = \exp\left(-\frac{\|x_i - q_j\|^2}{2\sigma^2}\right). \tag{13}$$
Representative points can be chosen using various methods, including random sampling, k-means clustering, or Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [18]. Owing to space constraints, we opt for k-means in this paper for its simplicity and efficacy. Nevertheless, the proposed SSLCFBG framework can be adapted to incorporate other methods for selecting representative points.
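The construction just described fits in a short routine. The following is a minimal sketch of Equations (12) and (13) with k-means representative points; the use of scikit-learn's KMeans and the default values of s and sigma are implementation assumptions, not prescriptions of this paper.

import numpy as np
from sklearn.cluster import KMeans

def build_Z(X, m=1000, s=3, sigma=1.0, seed=0):
    """Return Q: (m, d) representative points and Z: (n, m) sparse, row-normalized weights."""
    Q = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
    d2 = ((X[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # squared distances to all q_j, O(nm)
    Z = np.zeros((X.shape[0], m))
    for i in range(X.shape[0]):
        A_i = np.argsort(d2[i])[:s]                       # indices of the s nearest representatives
        K = np.exp(-d2[i, A_i] / (2 * sigma ** 2))        # Gaussian kernel of Equation (13)
        Z[i, A_i] = K / K.sum()                           # row-normalized weights of Equation (12)
    return Q, Z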

2.4. Label Propagation and SSLCFBG

This subsection further accelerates the label propagation step by leveraging the bipartite graph, after which we summarize the proposed SSLCFBG algorithm. First of all, the closed-form solution for label propagation in our SSLCFBG algorithm is established through the following theorem.
Theorem 1.
If the GSSL problem has
1. an undirected weighted graph $G(V, E, W)$;
2. a label indicator matrix $Y = [Y_l^T, Y_u^T, Y_q^T]^T$ with $Y_u = 0$ and $Y_q = 0$; and
3. an affinity matrix $W$ defined as in Equation (11),
then the closed-form solution for label propagation is
$$F_u^* = I_{\alpha u} W_{uq} \hat{\Theta}^{-1} I_{\alpha m} W_{lq}^T I_{\beta l} Y_l, \tag{14}$$
where $I_\alpha = \mathrm{diag}(I_{\alpha l}, I_{\alpha u}, I_{\alpha m}) \in \mathbb{R}^{(n+m) \times (n+m)}$ is a diagonal matrix whose nonzero elements satisfy $\alpha_{ii} = \frac{1}{1+\lambda_i}$, $I_\beta = I - I_\alpha = \mathrm{diag}(I_{\beta l}, I_{\beta u}, I_{\beta m})$, and
$$\hat{\Theta} = I - I_{\alpha m} W_{lq}^T I_{\alpha l} W_{lq} - I_{\alpha m} W_{uq}^T I_{\alpha u} W_{uq}. \tag{15}$$
Proof. 
According to Equation (10), we have
$$F^* = (L + \Lambda)^{-1} \Lambda Y = (I - I_\alpha W)^{-1} I_\beta Y = \begin{bmatrix} I + B_{\alpha l} \Theta_1 B_{\alpha m} & B_{\alpha l} \Theta_1 \\ \Theta_1 B_{\alpha m} & \Theta_1 \end{bmatrix} I_\beta Y, \tag{16}$$
where $B_{\alpha l} = [0, I_{\alpha l} W_{lq}]$, $B_{\alpha m} = \begin{bmatrix} 0 \\ I_{\alpha m} W_{lq}^T \end{bmatrix}$, and
$$\Theta_1 = \begin{bmatrix} I & -I_{\alpha u} W_{uq} \\ -I_{\alpha m} W_{uq}^T & I - I_{\alpha m} W_{lq}^T I_{\alpha l} W_{lq} \end{bmatrix}^{-1} = \begin{bmatrix} I + I_{\alpha u} W_{uq} \hat{\Theta}^{-1} I_{\alpha m} W_{uq}^T & I_{\alpha u} W_{uq} \hat{\Theta}^{-1} \\ \hat{\Theta}^{-1} I_{\alpha m} W_{uq}^T & \hat{\Theta}^{-1} \end{bmatrix}. \tag{17}$$
Then, we can expand Equation (16) to obtain
$$F_u^* = I_{\alpha u} W_{uq} \hat{\Theta}^{-1} I_{\alpha m} W_{lq}^T I_{\beta l} Y_l + \left(I + I_{\alpha u} W_{uq} \hat{\Theta}^{-1} I_{\alpha m} W_{uq}^T\right) I_{\beta u} Y_u + I_{\alpha u} W_{uq} \hat{\Theta}^{-1} I_{\beta m} Y_q. \tag{18}$$
Utilizing Y u = 0 and Y q = 0 results in Equation (14).    □
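As an independent sanity check (ours, not part of the paper), the block formula of Equation (14) can be compared numerically with a dense solve of $(I - I_\alpha W) F = I_\beta Y$ on a small random bipartite graph. The sizes below are arbitrary, and the $\lambda_i$ are kept large enough that every inverse involved is well conditioned.

import numpy as np

rng = np.random.default_rng(0)
l, u, m, c = 4, 8, 4, 2
n = l + u
Z = rng.random((n, m))
Z /= Z.sum(axis=1, keepdims=True)                # row-normalized, as assumed below Equation (11)
W_lq, W_uq = Z[:l], Z[l:]
W = np.zeros((n + m, n + m))
W[:n, n:], W[n:, :n] = Z, Z.T                    # block anti-diagonal W of Equation (11)
lam = 5.0 + rng.random(n + m)                    # large lambda_i for a well-conditioned check
Ia = np.diag(1.0 / (1.0 + lam))                  # alpha_ii = 1 / (1 + lambda_i)
Ib = np.eye(n + m) - Ia                          # I_beta = I - I_alpha
Y = np.zeros((n + m, c))
Y[:l] = np.eye(c)[rng.integers(0, c, l)]         # one-hot Y_l; Y_u = 0 and Y_q = 0

F_dense = np.linalg.solve(np.eye(n + m) - Ia @ W, Ib @ Y)   # dense solve of (I - I_alpha W) F = I_beta Y
Ial, Iau, Iam = Ia[:l, :l], Ia[l:n, l:n], Ia[n:, n:]
Theta_hat = (np.eye(m) - Iam @ W_lq.T @ Ial @ W_lq
             - Iam @ W_uq.T @ Iau @ W_uq)                    # Equation (15)
F_u = Iau @ W_uq @ np.linalg.solve(Theta_hat, Iam @ W_lq.T @ Ib[:l, :l] @ Y[:l])  # Equation (14)
assert np.allclose(F_u, F_dense[l:n])            # the unlabeled rows agree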
Moreover, based on Equation (14), a specific solution can be derived once additional information is taken into account. For example, we can utilize the harmonic property of HEM [2] by setting $\hat{\Theta} = I - W_{uq}^T W_{uq}$ and choosing $\alpha_i$ and $\beta_i$ as follows:
$$\alpha_i = \begin{cases} \dfrac{1}{1+\lambda_i} = 0, & i \in I_l \ (\lambda_i = +\infty), \\[4pt] \dfrac{1}{1+0} = 1, & i \in I_{u+m}, \end{cases} \tag{19}$$
$$\beta_i = \begin{cases} 1, & i \in I_l, \\ 0, & i \in I_{u+m}, \end{cases} \tag{20}$$
where $I = \{1, \ldots, n+m\} = I_l \cup I_{u+m}$ denotes the index set of $V = X \cup Q$: $I_l$ is the index set of the labeled data, and $I_{u+m}$ is the index set of the unlabeled data and representative points. Then, the closed-form solution for HEM with a bipartite graph is $F_l^* = Y_l$ and
$$F_u^* = W_{uq} \left(I - W_{uq}^T W_{uq}\right)^{-1} W_{lq}^T Y_l, \tag{21}$$
whose computational complexity, $O(m^3)$, is considerably lower than the $O(u^3)$ complexity of HEM, under the assumption that $u \gg m$.
Lastly, the proposed SSLCFBG algorithm is encapsulated in Algorithm 1, with its computational complexity to be discussed in the subsequent subsection.
Algorithm 1: Semi-Supervised Learning with Close-Form label propagation using Bipartite Graph (SSLCFBG)
   Input: $m$, $s$, $Y_l$, $\lambda_1, \ldots, \lambda_{n+m}$
   Output: $F_u^*$
1. Select $m$ representative points by k-means.
2. Construct the bipartite graph and compute the corresponding $Z = [W_{lq}^T, W_{uq}^T]^T$, using Equation (12) to determine the elements of $Z$.
3. Compute $I_{\alpha u}$, $I_{\alpha m}$, and $I_{\beta l}$.
4. Compute $\hat{\Theta}$ using Equation (15).
5. Compute the soft label matrix for the unlabeled data using Equation (14).
6. Determine the label for every data point using Equation (3).
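For illustration, the following is a compact end-to-end sketch of Algorithm 1 under the harmonic specialization of Equation (21), which is also the setting used in our experiments. It reuses build_Z from the sketch in Section 2.3; the function name, the defaults, and the assumption that $I - W_{uq}^T W_{uq}$ is invertible are our own conventions rather than released code.

import numpy as np

def sslcfbg(X, y_l, l, m=1000, s=3, sigma=1.0):
    """X: (n, d) data whose first l rows are labeled; y_l: (l,) integer labels in {0, ..., c-1}."""
    _, Z = build_Z(X, m=m, s=s, sigma=sigma)     # steps 1-2: representative points and Z
    W_lq, W_uq = Z[:l], Z[l:]
    c = y_l.max() + 1
    Y_l = np.eye(c)[y_l]                         # one-hot label indicator Y_l
    Theta_hat = np.eye(m) - W_uq.T @ W_uq        # steps 3-4 under Equations (19)-(20)
    # Step 5, Equation (21): only an m x m system is solved, not an n x n one.
    F_u = W_uq @ np.linalg.solve(Theta_hat, W_lq.T @ Y_l)
    return F_u.argmax(axis=1)                    # step 6, Equation (3): hard labels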

2.5. Complexity Analysis

The computational complexity of Algorithm 1 is dominated by four of its six steps; the third and sixth steps are negligible in comparison. Specifically: (1) the first step incurs a cost of $O(dmnT)$, where $T$ denotes the number of k-means iterations and is typically small, so the complexity of this step simplifies to $O(dmn)$; (2) the second step costs $O(smn)$, with $s$ representing the number of nearest representative points each data point connects to; (3) the fourth step costs $O(m^2 n)$, given that in SSL problems we have $u > l$ and $u \approx n$; and (4) the fifth step requires $O(nmc)$ operations.
Consequently, the overall computational complexity of SSLCFBG is $O(\max(m, d) \cdot mn)$, assuming $m \gg c$ and $m \gg s$. Moreover, since in SSL problems we typically have $n \gg d$ and $n \gg m$, the complexity of the proposed SSLCFBG is nearly linear in the number of data points, making it more efficient than other GSSL algorithms. We conduct experiments in the following section to elucidate the advantages of the proposed algorithm.

3. Results

In this section, we conduct extensive experiments on six real-world datasets to compare our algorithm with five state-of-the-art and baseline approaches.

3.1. Setup and Datasets

In our experimental setup, labeled data are randomly selected from a dataset, with the remaining instances treated as unlabeled. The reported results are the average of 20 independent trials. Unless specified otherwise, we construct a set of 1000 representative points. For the sake of simplicity, we employ Equation (21) during the label propagation phase.
All experiments are conducted on a desktop computer equipped with a 3.4 GHz Intel CPU and 16 GB of RAM, using Matlab for implementation. The running time is calculated in Matlab by aggregating the costs associated with graph construction and label propagation.
We carry out experiments on six real-world datasets, each described as follows:
  • Handwritten (HW), named “Multiple Features Data Set” in the UCI machine learning repository [19]. It consists of 2000 data points, each a handwritten digit from 0 to 9, giving 10 classes. We use 64-dimensional Karhunen–Loève coefficients [20] as features.
  • Caltech-101 (CT-101) [21]. Consists of 9144 images of objects across 101 categories, each roughly 300 × 200 pixels. Similar to [22], we also consider two sub-datasets: Caltech-7 (CT-7), with 1474 images in seven classes, and Caltech-20 (CT-20), with 2386 images in 20 classes. We use Histogram of Oriented Gradients features [23] with dimension 1984.
  • USPS [24]. The training part of the USPS dataset is used in our experiments. It consists of 10 classes of digits from 0 to 9, with 7291 samples in total. We use all 16 × 16 pixels of each sample as a feature vector of dimension 256.
  • NUS-WIDE-OBJECT (NUS) [25]. It consists of 31 object categories with 30,000 images in total, of which 17,927 are training images and the remaining 12,073 are testing images. We first combine all 30,000 images and then randomly choose our labeled and unlabeled data from them. We use the color correlogram in HSV color space [26] as the feature, with dimension 144.
  • MNIST [27]. It consists of 70,000 grayscale handwritten images of size 28 × 28, with 10 classes ranging from 0 to 9. We randomly choose labeled data, while the remaining images are used as unlabeled data. All 28 × 28 pixels of each sample form a feature vector of dimension 784.
  • Animals with Attributes (AwA) [28]. This dataset consists of 30,475 images across 50 animal classes, each containing at least 92 data points. Semantic attributes are assigned to each class, so every class is uniquely characterized by its attribute vector. We employ rgSIFT [29] as the feature, with dimension 2000.

3.2. Experimental Results

In this subsection, we report our experimental results in two groups: (1) comparisons with the baseline algorithms 1NN and SVM [30]; and (2) comparisons with the state-of-the-art algorithms HEM [2], LGC [4], GGSSL [5], and AGR [12].
Meanwhile, the six real-world datasets are categorized into two subgroups: (1) small-size datasets: Handwritten, Caltech-101 (with three subsets: Caltech-7, Caltech-20, and Caltech-101), and USPS; and (2) large-size datasets: NUS-WIDE-OBJECT, MNIST, and AwA.

3.2.1. Comparison to Baseline Methods

We first compare the proposed SSLCFBG with two baseline algorithms.
On small datasets, as detailed in Table 1 and Table 2, the accuracy of the proposed algorithm surpasses that of both 1NN and SVM, particularly in scenarios with a limited number of labeled data. This is because 1NN requires a larger number of labeled neighbors to accurately determine labels for unlabeled data, while SVM struggles with low accuracy because it attempts to separate data using hyperplanes, which can easily deviate from the correct position when labeled data are scarce. The proposed SSLCFBG inherits the high accuracy characteristic of graph-based methods. In terms of speed, SSLCFBG operates more slowly than 1NN and SVM, because we employ a relatively large number (i.e., 1000) of representative points to maintain consistency with the other experiments in this paper. Evidently, the speed of SSLCFBG can be increased by reducing the number of representative points, as demonstrated in subsequent experiments.
On the larger MNIST dataset, SSLCFBG achieves significantly higher accuracy than 1NN and SVM when fewer labeled data are available, as shown in the left sub-figure of Figure 2. In the right sub-figure, the running times of SVM and 1NN increase markedly as the number of available labeled data grows, limiting their applicability in scenarios with a large volume of labeled data. Intriguingly, SSLCFBG maintains a nearly constant running time, independent of the number of labeled data, due to the use of a fixed number of representative points.
Based on these findings, our algorithm demonstrates a favorable balance between efficiency and effectiveness when compared to baseline methods.

3.2.2. Comparisons to State-of-the-Art Methods

As highlighted in our theoretical analysis, the high computational cost poses a significant barrier for many GSSL algorithms. In this subsection, we aim to demonstrate the effectiveness and efficiency of our algorithm in comparison to other state-of-the-art GSSL algorithms, excluding baseline methods that are either computationally intensive or lacking in accuracy.
The accuracy and running times for small and large datasets are presented in Table 3 and Table 4, respectively. We observe that the proposed SSLCFBG yields the best outcomes in terms of both effectiveness and efficiency across different datasets when compared to all other algorithms.
Regarding accuracy, as shown in Table 3, the proposed algorithm outperforms HEM, LGC, and GGSSL. This is because it not only leverages the advantages of GSSL algorithms but also employs representative points to capture the underlying manifold structure of the raw data. These points are generated using k-means, which serves as a filtering mechanism that suppresses noisy data. Additionally, SSLCFBG demonstrates greater accuracy than AGR, as it does not impose any linear assumptions on the representative points or raw data, as discussed in our theoretical analysis.
In terms of efficiency, as indicated in Table 4, the proposed algorithm is considerably faster than HEM, LGC, and GGSSL, in line with our theoretical analysis. This is due to two reasons: (1) label propagation in HEM, LGC, and GGSSL involves computing the inverses of $n \times n$ matrices, whereas SSLCFBG only requires the inverse of an $m \times m$ matrix, with $m \ll n$; and (2) the bipartite graph used in graph construction is highly sparse. Furthermore, our algorithm is slightly faster than AGR, as it does not require additional time to accommodate the linear assumption.
To illustrate the limitations of the linear assumption in AGR, we compare its accuracy to that of our algorithm on the USPS dataset, as depicted in Figure 3. The number of representative points ranges from 100 to 1000, and the number of labeled data is set at 20. As evident from Figure 3, the proposed SSLCFBG consistently outperforms AGR. Therefore, the proposed algorithm is more effective than AGR, as it is not constrained by linear assumptions.
These experiments collectively highlight the advantages of the proposed SSLCFBG in managing large-scale data with respect to both efficiency and accuracy. By “big data”, we refer to the following scenarios: (1) an increased number of data points, (2) data with higher dimensions, (3) a greater number of labeled data, or (4) any combination of these factors. The benefits of SSLCFBG will be even more pronounced in larger datasets, making it well-suited for big data applications.

4. Conclusions

In this paper, we introduce a novel large-scale semi-supervised learning framework that utilizes a bipartite graph. The label propagation within this framework exhibits nearly linear computational complexity with respect to the number of data points. We attribute its success to the construction of a sparse, block anti-diagonal affinity matrix: in the bipartite graph, a limited number of representative points are connected to a larger set of raw data points, which accelerates both graph construction and label propagation. The abundance of zero entries in the affinity matrix enables rapid graph construction, while its block structure simplifies the closed-form solution, so the matrix inverse required by our algorithm is significantly smaller than in traditional GSSL algorithms. Compared to existing fast algorithms such as AGR, the proposed algorithm is consistently more efficient and accurate, as it does not require any linear assumptions. Extensive experiments conducted on six real-world datasets showcase the advantages of the proposed SSLCFBG over five state-of-the-art and baseline algorithms, in terms of both efficiency and effectiveness.

Author Contributions

Conceptualization, Z.P.; methodology, Z.P.; software, Z.P.; validation, Z.P., G.Z. and W.H.; formal analysis, Z.P.; investigation, Z.P.; resources, G.Z. and W.H.; data curation, Z.P.; writing—original draft preparation, Z.P.; writing—review and editing, Z.P., G.Z. and W.H.; visualization, Z.P.; supervision, G.Z. and W.H.; project administration, G.Z. and W.H.; funding acquisition, G.Z. and W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Platform Project of the Education Department of Guangdong Province (No. 2021KCXTD038, No. 2022KSYS003), the Discipline Construction and Promotion Project of Guangdong Province (No. 2022ZDJS065), Education and Teaching Reform Project of Hanshan Normal University (No. PX-161241546).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors would like to thank the referees and the editors for their very useful suggestions, which significantly improved this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Notations used in the paper:
$G$: a graph
$V$: the set of vertices in a graph
$i$: an index
$v_i$: the $i$-th vertex
$E$: the set of edges in a graph
$e_{ij}$: the edge linking vertices $v_i$ and $v_j$
$Q$: the set of representative points (or anchor points)
$q_j$: the $j$-th representative point
$W$: the affinity matrix (or weight matrix) of a graph
$w_{ij}$: the weight associated with edge $e_{ij}$
$D$: the degree matrix of a graph
$d_{ii}$: the degree of data point $x_i$
$L$: the graph Laplacian matrix
$X$: the set of data
$x_i$: the $i$-th data point
$\mathcal{L}$: the set of labels
$y_i$: the label of data point $x_i$
$Y$: the label indicator matrix
$Y_{ij}$: $Y_{ij} = 1$ if $x_i$ is labeled as $y_i = j$; otherwise $Y_{ij} = 0$
$F$: the soft label matrix
$\mathcal{F}$: the set of $(n+m) \times c$ matrices with nonnegative entries
$Z$: the matrix of sample-adaptive weights
$K(x_i, q_j, \sigma)$: the kernel function
$n$: the number of data points
$m$: the number of representative points
$c$: the number of different label types
$l$: the number of labeled points
$u$: the number of unlabeled points
$d$: the dimension of a data point
$\gamma$: the finite weight in LGC [4]
$\sigma$: the standard deviation of the Gaussian kernel
$\lambda_1, \ldots, \lambda_{n+m}$: the parameters trading off label smoothness and consistency
$\Lambda$: the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{n+m})$
$\mathrm{tr}(\cdot)$: the matrix trace
$\|\cdot\|_F$: the Frobenius norm
$\hat{\Theta}$ and $\Theta_1$: intermediate matrices
$I_\alpha$: a diagonal matrix with $\alpha_{ii} = \frac{1}{1+\lambda_i}$ and $I_\alpha = \mathrm{diag}(I_{\alpha l}, I_{\alpha u}, I_{\alpha m})$
$I_{\alpha l}$, $I_{\alpha u}$, $I_{\alpha m}$: sub-matrices of $I_\alpha$ for labeled data, unlabeled data, and representative points, respectively
$I_\beta$: $I_\beta = I - I_\alpha = \mathrm{diag}(I_{\beta l}, I_{\beta u}, I_{\beta m})$
$I_{\beta l}$, $I_{\beta u}$, $I_{\beta m}$: sub-matrices of $I_\beta$, defined analogously to $I_{\alpha l}$, $I_{\alpha u}$, and $I_{\alpha m}$

References

  1. Wang, Q.; Shen, B.; Wang, S.; Li, L.; Si, L. Binary Codes Embedding for Fast Image Tagging with Incomplete Labels. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; Volume 8690, pp. 425–439. [Google Scholar] [CrossRef]
  2. Zhu, X.; Ghahramani, Z.; Lafferty, J. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-2003), Washington, DC, USA, 21–24 August 2003. [Google Scholar]
  3. Song, Z.; Yang, X.; Xu, Z.; King, I. Graph-Based Semi-Supervised Learning: A Comprehensive Review. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 8174–8194. [Google Scholar] [CrossRef] [PubMed]
  4. Zhou, D.; Bousquet, O.; Lal, T.; Weston, J.; Scholkopf, B. Learning with Local and Global Consistency. Adv. Neural Inf. Process. Syst. 2003, 16, 321–328. [Google Scholar]
  5. Nie, F.; Xiang, S.; Liu, Y.; Zhang, C. A General Graph-based Semi-supervised Learning with Novel Class Discovery. Neural Comput. Appl. 2010, 19, 549–555. [Google Scholar] [CrossRef]
  6. Delalleau, O.; Bengio, Y.; Le Roux, N. Efficient Non-Parametric Function Induction in Semi-Supervised Learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS-2005), Savannah, GA, USA, 6–8 January 2005. [Google Scholar]
  7. Kumar, S.; Mohri, M.; Talwalkar, A. Sampling techniques for the Nystrom method. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS-2009), Clearwater Beach, FL, USA, 16–18 April 2009. [Google Scholar]
  8. Tsang, I.; Kwok, J. Large-scale sparsified manifold regularization. Adv. Neural Inf. Process. Syst. 2006, 19, 1401–1408. [Google Scholar]
  9. Zhu, X.; Lafferty, J. Harmonic mixtures: Combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005. [Google Scholar]
  10. Fergus, R.; Weiss, Y.; Torralba, A. Semi-supervised learning in gigantic image collections. Adv. Neural Inf. Process. Syst. 2010, 522–530. [Google Scholar]
  11. Liu, W.; He, J.; Chang, S. Large Graph Construction for Scalable Semi-Supervised Learning. In Proceedings of the International Conference on Machine Learning (ICML-2010), Haifa, Israel, 21–24 June 2010. [Google Scholar]
  12. Liu, W.; Wang, J.; Chang, S.F. Robust and Scalable Graph-Based Semisupervised Learning. Proc. IEEE 2012, 100, 2624–2638. [Google Scholar] [CrossRef]
  13. He, F.; Nie, F.; Wang, R.; Hu, H.; Jia, W.; Li, X. Fast Semi-Supervised Learning With Optimal Bipartite Graph. IEEE Trans. Knowl. Data Eng. 2021, 33, 3245–3257. [Google Scholar] [CrossRef]
  14. Wang, Z.; Zhang, L.; Wang, R.; Nie, F.; Li, X. Semi-Supervised Learning via Bipartite Graph Construction with Adaptive Neighbors. IEEE Trans. Knowl. Data Eng. 2023, 35, 5257–5268. [Google Scholar] [CrossRef]
  15. Rowshan, Y. The m-Bipartite Ramsey Number of the K2,2 Versus K6,6. Contrib. Math. 2022, 5, 36–42. [Google Scholar]
  16. Luxburg, U. A Tutorial on Spectral Clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
  17. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
  18. Ester, M.; Kriegel, H.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996. [Google Scholar]
  19. Bache, K.; Lichman, M. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu (accessed on 1 November 2013).
  20. Anwar, Z.; Chua, S.; Mowe, R. The MPH Cookbook; MPH Distributors (S): Singapore, 1978. [Google Scholar]
  21. Li, F.; Fergus, R.; Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 2007, 106, 59–70. [Google Scholar]
  22. Dueck, D.; Frey, B. Non-metric affinity propagation for unsupervised image categorization. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar] [CrossRef]
  23. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar] [CrossRef]
  24. Hull, J. A database for handwritten text recognition research. Pattern Anal. Mach. Intell. IEEE Trans. 1994, 16, 550–554. [Google Scholar] [CrossRef]
  25. Chua, T.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In Proceedings of the ACM Conference on Image and Video Retrieval (CIVR’09), Santorini Island, Greece, 8–10 July 2009. [Google Scholar]
  26. Huang, J.; Kumar, S.R.; Mitra, M.; Zhu, W.J.; Zabih, R. Image indexing using color correlograms. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 762–768. [Google Scholar]
  27. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  28. Lampert, C.; Nickisch, H.; Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 951–958. [Google Scholar] [CrossRef]
  29. van de Sande, K.; Gevers, T.; Snoek, C. Evaluating Color Descriptors for Object and Scene Recognition. Pattern Anal. Mach. Intell. IEEE Trans. 2010, 32, 1582–1596. [Google Scholar] [CrossRef] [PubMed]
  30. Chang, C.C.; Lin, C.J. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27:1–27:27. [Google Scholar] [CrossRef]
Figure 1. Bipartite Graph Construction in GSSL Problem. (a) The original fully connected graph comprises 8 data points (circles) and 28 edges (lines). (b) The bipartite graph connects two distinct sets: one set includes three representative points (squares) labeled ‘A’, ‘B’, and ‘C’, while the other set contains the 8 data points. Each representative point is associated with several data points of the same color. (c) Label propagation can be performed across the 3 representative points and along the 3 connecting edges.
Figure 2. Comparisons between the proposed method and baseline methods on the MNIST dataset, with the quantity of labeled data ranging from 20 to 4500. Left: accuracy. Right: running time.
Figure 3. The accuracy of AGR and the proposed SSLCFBG, in relation to varying numbers of representative points, on the USPS dataset, where the number of labeled data is set at 20.
Table 1. Accuracy of the proposed method and baseline methods on HW dataset.
Methods | # of Label = 20 | # of Label = 1000
1NN | 0.7337 ± 0.0420 | 0.9663 ± 0.0049
SVM | 0.7565 ± 0.0396 | 0.9607 ± 0.0060
Proposed | 0.9385 ± 0.0237 | 0.9725 ± 0.0046
“# of Label” denotes “the number of labeled data”.
Table 2. Running time of the proposed method and baseline methods on HW dataset.
Methods | # of Label = 20 | # of Label = 1000
1NN | 0.0012 ± 0.0006 | 0.0162 ± 0.0014
SVM | 0.0056 ± 0.0015 | 0.0524 ± 0.0019
Proposed | 0.1341 ± 0.0023 | 0.1389 ± 0.0002
“# of Label” denotes “the number of labeled data”.
Table 3. Accuracy of the proposed algorithm and other graph-based algorithms. HW, CT-7, CT-20, CT-101, and USPS are middle-size datasets; NUS, MNIST, and AwA are large-size datasets.
No. | Methods | HW | CT-7 | CT-20 | CT-101 | USPS | NUS | MNIST | AwA
1 | HEM | 0.8970 | 0.8587 | 0.7565 | 0.3256 | 0.1719 | 0.1629 | 0.1844 | 0.0600
2 | LGC | 0.8991 | 0.7519 | 0.6551 | 0.2804 | 0.1513 | 0.1607 | 0.1817 | 0.0347
3 | GGSSL | 0.9064 | 0.8587 | 0.7520 | 0.2895 | 0.1513 | 0.1587 | 0.1795 | 0.0231
4 | AGR | 0.9214 | 0.8923 | 0.8072 | 0.3882 | 0.8515 | 0.1828 | 0.8494 | 0.0660
5 | Proposed | 0.9385 | 0.9017 | 0.8202 | 0.3990 | 0.8613 | 0.1923 | 0.8591 | 0.0700
The numbers of labeled data are: HW = 20, CT-7 = 20, CT-20 = 50, CT-101 = 200, USPS = 20, NUS = 1000, MNIST = 20, AwA = 1000.
Table 4. Running time of the proposed algorithm and other graph-based algorithms. HW, CT-7, CT-20, CT-101, and USPS are middle-size datasets; NUS, MNIST, and AwA are large-size datasets.
No. | Methods | HW | CT-7 | CT-20 | CT-101 | USPS | NUS | MNIST | AwA
1 | HEM | 0.1411 | 13.9983 | 42.3065 | 955.9894 | 17.2055 | 262.4838 | 13343.1989 | 5795.7914
2 | LGC | 0.1341 | 13.9624 | 42.2709 | 964.4532 | 17.2987 | 177.7972 | 12964.3791 | 6322.7574
3 | GGSSL | 0.1344 | 13.9637 | 42.2850 | 961.0679 | 17.2555 | 180.9809 | 13167.3352 | 6133.7992
4 | AGR | 0.1532 | 0.6570 | 1.3209 | 6.7802 | 1.4397 | 3.9459 | 24.2268 | 295.9069
5 | Proposed | 0.1341 | 0.6349 | 1.3022 | 6.7387 | 1.4175 | 3.9206 | 24.1960 | 295.8366
