1. Introduction
As technology progresses, particularly with the pervasive adoption of smartphones, the volume of data surrounding us is increasing at an exponential rate. Consequently, processing and analyzing these vast data effectively has become critically important. For instance, rapidly distinguishing the images in a mobile device's photo gallery places significant demands on both memory capacity and computational power. It is therefore imperative to develop algorithms that minimize memory usage while ensuring rapid solution speeds and achieving satisfactory learning outcomes.
Matrix factorization (MF) plays an important role in fields such as machine learning and data analysis, as it can learn a low-dimensional representation from high-dimensional data. Many classical matrix factorization methods have been proposed, including non-negative matrix factorization [1,2], singular value decomposition (SVD) [3], principal component analysis (PCA) [4], and concept factorization [5].
Non-negative matrix factorization (NMF) has become popular in recent years. It factorizes data into a set of non-negative bases and the representations of the data under these bases. Despite its impressive performance and interpretability, NMF can only handle non-negative data due to its non-negativity constraints. Moreover, since, at the same low rank, the reconstruction error of NMF is usually larger than that of SVD, Ref. [6] proposed a low-rank matrix factorization with orthogonal constraints.
Matrix factorization considers global information while ignoring local manifolds. Many graph-based matrix factorization methods have been proposed to address this problem. Cai et al. proposed GNMF [7] and LCCF [8], which consider the local manifold. Yi et al. proposed NMF-LCAG [9], which considers both reconstruction and local information. Peng et al. proposed GCCF [10], which uses the maximum correntropy criterion (MCC) to guard against outliers. Wang et al. proposed CHNMF [11], which uses a hypergraph to explore high-order geometric information.
The above methods are all unsupervised and cannot exploit known label information, yet using even a small amount of labeled data can lead to a significant improvement in performance. Liu et al. proposed CNMF [12] and CCF [13], which constrain samples of the same class to share the same representation. Zhang et al. proposed NMFCC [14], which uses the label information to construct must-link and cannot-link matrices. In fact, an unsupervised graph-based MF method can be transformed into a semi-supervised one by incorporating must-link and cannot-link matrices. Peng et al. proposed CSNMF [15] and CSCF [16], which assign adaptive neighbors to construct an appropriate adjacency matrix. Zhou et al. proposed CLMF [17], which uses sparsity-induced similarity (SIS) to adaptively learn the adjacency matrix.
A special type of matrix factorization, called symmetric matrix factorization (SMF), directly factorizes the graph matrix into two identical low-rank matrices. Since the graph contains only local relations between data, its result contains only local information. Zhang et al. proposed SNMFCC [14], which incorporates the must-link and cannot-link matrices into the adjacency matrix. Wu et al. proposed PCPSNMF [18], which adaptively and simultaneously learns the adjacency matrix and propagates the initial pairwise constraints among the data. Chavoshinejad et al. proposed S⁴NMF [30], which introduces self-supervised information into semi-supervised symmetric NMF. Yin et al. proposed HSSNMF [19], which uses a hypergraph-based pairwise constraint propagation algorithm to capture high-order information. It is important to note that the essence of SMF is the problem $\min_{H}\|A - HH^{\top}\|_F^2$, which is non-convex.
Graph-based and symmetric matrix factorization methods are unsuitable for handling excessive amounts of data because they require the storage and computation of an $n$-order matrix. To address this issue, bipartite graph-based methods have been proposed. Liu et al. [20] proposed the local anchor embedding (LAE) algorithm. Wang et al. proposed EAGR [21], which uses FLAE, an improved LAE algorithm that modifies the estimation of the local weights and the construction of the adjacency matrix. Nie et al. proposed BGSSL [22], which applies a bipartite graph to graph-based semi-supervised learning (GSSL). It should be noted that it is difficult to apply an $n \times m$-order bipartite graph to SMF directly, as SMF requires an $n$-order full adjacency matrix.
In response to the substantial memory demands and high computational complexity of prior methodologies, this paper proposes a novel fast global local matrix factorization (FGLMF) model aimed at enhancing performance while addressing these critical considerations. The method has the following benefits.
A novel method using only matrix factorization is proposed. It simultaneously learns the global and local information between data, and it is easily interpretable.
A bipartite graph is introduced into the symmetric matrix factorization. The computational complexity of the proposed method scales only linearly with the number of anchors, which is lower than that of other bipartite graph-based methods.
The proposed SMF is convex thanks to the use of the label information. Therefore, every local minimizer is a global minimizer, and the model can be solved very quickly.
3. Proposed Method
In this section, a new fast global local MF model is proposed. First, a novel method is introduced to construct a naturally normalized full adjacency matrix from the bipartite graph. Thanks to this, the $n$-order full adjacency matrix never has to be represented explicitly, which effectively reduces the computational and spatial complexity. For local information, a bipartite graph-based semi-supervised symmetric matrix factorization framework is proposed, which is convex and unconstrained, thus enabling a fast solution while ensuring a globally optimal solution. For global information, low-rank matrix factorization is utilized and solved rapidly in one step; avoiding non-negativity constraints means that data containing negative entries can also be handled effectively.
3.1. Methodology
Due to the requirement of a full adjacency matrix in symmetric matrix factorization, it is not practical to use a bipartite graph directly. A concept of constructing the full adjacency matrix using anchor points is therefore proposed, as shown in Figure 1. By connecting samples through the anchor points they share, the adjacency relationship between samples is obtained. The new full adjacency matrix is as follows:
$$A = Z\Lambda^{-1}Z^{\top}, \qquad (6)$$
where $Z \in \mathbb{R}^{n \times m}$ is the adjacency matrix of the bipartite graph between the $n$ samples and the $m$ anchors, and $\Lambda \in \mathbb{R}^{m \times m}$ is a diagonal matrix with $\Lambda_{jj} = \sum_{i=1}^{n} Z_{ij}$. According to [24], Equation (6) can be viewed as a non-weighted hypergraph; accordingly, Figure 1 can be considered as connecting samples through hyperedges, each of which connects not just two vertices but multiple vertices. In addition, because each row of the adjacency matrix of a bipartite graph sums to one, the fully connected matrix generated by Equation (6) is self-normalized:
Theorem 1. For a matrix $Z$ with $Z\mathbf{1}_m = \mathbf{1}_n$, the degree matrix of $A = Z\Lambda^{-1}Z^{\top}$ is the identity matrix.
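For illustration, the following NumPy sketch constructs a row-stochastic bipartite graph and numerically checks the self-normalization property of Theorem 1. The Gaussian-weighted k-nearest-anchor construction and all names here are illustrative assumptions; the paper's Equation (3) may use a different weighting.

```python
import numpy as np

def bipartite_graph(X, anchors, k=5, sigma=1.0):
    # Connect each sample to its k nearest anchors with Gaussian weights,
    # then normalize each row to sum to one (a common anchor-graph choice).
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)   # (n, m) squared distances
    idx = np.argsort(d2, axis=1)[:, :k]                         # k nearest anchors per sample
    Z = np.zeros_like(d2)
    rows = np.arange(X.shape[0])[:, None]
    Z[rows, idx] = np.exp(-d2[rows, idx] / (2 * sigma**2))
    return Z / Z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
anchors = X[rng.choice(200, size=20, replace=False)]            # stand-in for k-means anchors
Z = bipartite_graph(X, anchors)
A = Z @ np.diag(1.0 / Z.sum(axis=0)) @ Z.T                      # Equation (6): A = Z Λ^{-1} Zᵀ
print(np.allclose(A.sum(axis=1), 1.0))                          # Theorem 1: all degrees equal 1 -> True
```

Note that the $n$-order matrix $A$ is materialized here only to verify Theorem 1; the proposed method never forms it explicitly.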
Therefore, the full adjacency matrix generated by Equation (6) can capture high-order relationships between data, does not require additional normalization, and significantly reduces computational complexity, since only the $n \times m$ bipartite graph has to be stored and computed. The bipartite graph-based symmetric matrix factorization can thus be expressed in the following form:
$$\min_{V}\ \big\|Z\Lambda^{-1}Z^{\top} - V^{\top}V\big\|_F^2.$$
Traditional symmetric matrix factorization requires incorporating the label information into the full adjacency matrix, which involves additional computation and necessitates explicitly forming the $n$-order matrix. Therefore, this paper incorporates the label information into the low-dimensional representation matrix as one-hot constraints. Specifically, let $V = B + FP$, where $B \in \mathbb{R}^{c \times n}$ and $P \in \mathbb{R}^{u \times n}$ are auxiliary matrices, $F \in \mathbb{R}^{c \times u}$ is the low-dimensional representation of the $u$ unlabeled samples, and $I$ is the identity matrix. $B$ records the label information, where $c$ is the number of classes and $n$ is the number of samples. If the $j$-th sample is labeled as the $i$-th class, $B_{ij} = 1$; otherwise, $B_{ij} = 0$. If the $j$-th column of $B$ is a zero vector and the $j$-th sample is the $i$-th unlabeled one, $P_{ij} = 1$; otherwise, $P_{ij} = 0$; hence, when the labeled samples are ordered first, $P = [\mathbf{0}\ \ I]$. For example, if there are eight samples with three classes, and $x_1$ is labeled with class I, $x_2$ and $x_3$ are labeled with class II, and $x_4$ is labeled with class III, then $B$ becomes
$$B = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
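As a concrete check of this construction, the sketch below builds $B$ and $P$ from a partial label vector. The symbols follow the notation assumed above; the helper name and the encoding of unlabeled samples as $-1$ are illustrative.

```python
import numpy as np

def one_hot_constraints(labels, c):
    # labels[j] in {0..c-1} for labeled samples, -1 for unlabeled ones.
    n = len(labels)
    unlabeled = [j for j in range(n) if labels[j] < 0]
    B = np.zeros((c, n))
    P = np.zeros((len(unlabeled), n))
    for j, y in enumerate(labels):
        if y >= 0:
            B[y, j] = 1.0               # one-hot column for a labeled sample
    for i, j in enumerate(unlabeled):
        P[i, j] = 1.0                   # select the i-th unlabeled sample
    return B, P

# The eight-sample example above: classes I, II, II, III, then four unlabeled.
B, P = one_hot_constraints([0, 1, 1, 2, -1, -1, -1, -1], c=3)
print(B.astype(int))
print((B @ B.T).astype(int))            # diag(1, 2, 1): per-class counts of labeled samples
```

The product $BB^{\top} = \mathrm{diag}(1, 2, 1)$ recovers the per-class counts of labeled samples, which is exactly the diagonal matrix $L$ that appears in the optimization below.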
Imposing a one-hot constraint on the labeled samples is justifiable, as these samples do not necessitate further distinction. The reformed SMF is
$$\min_{F}\ \big\|Z\Lambda^{-1}Z^{\top} - (B + FP)^{\top}(B + FP)\big\|_F^2. \qquad (9)$$
This reduces computation and, more importantly, transforms the model from non-convex to convex. Therefore, there is no need to add constraints to $F$, and any local minimizer is a global minimizer [25]. An explanation of the model's convexity is provided in Theorem 2.
Theorem 2. The objective function of Equation (9) is convex.
Equation (9) only considers the local information between samples; it is also crucial to consider the global information, which MF can effectively capture. Thus, the following MF framework is proposed:
$$\min_{U, F}\ \|X - UV\|_F^2 + \lambda\,\big\|Z\Lambda^{-1}Z^{\top} - V^{\top}V\big\|_F^2, \quad \text{s.t.}\ U^{\top}U = I,$$
where $X \in \mathbb{R}^{d \times n}$ is the data matrix, $U \in \mathbb{R}^{d \times c}$ is the basis matrix, $V = B + FP$, and $\lambda$ balances the global (MF) and local (SMF) terms.
The proof of Theorem 2 is given in Appendix B.
3.2. Optimization
Considering $V = B + FP$, and noting that $BP^{\top} = \mathbf{0}$ and $PP^{\top} = I$, there is
$$VV^{\top} = BB^{\top} + FF^{\top} = L + FF^{\top},$$
where $L$ is a diagonal matrix with $L_{ii} = l_i$, and $l_i$ is the number of samples labeled with the $i$-th class.
Firstly, expanding the MF term, there is
$$\|X - UV\|_F^2 = \langle X, X\rangle - 2\langle X, UV\rangle + \langle UV, UV\rangle,$$
where $\langle \cdot, \cdot \rangle$ is the inner product of matrices, representing the sum of element-wise products. Discarding all terms unrelated to $U$ and $F$, and using $U^{\top}U = I$, we obtain
$$-2\langle X, UV\rangle + \|V\|_F^2. \qquad (13)$$
Secondly, expanding the SMF term, there is
$$\big\|Z\Lambda^{-1}Z^{\top} - V^{\top}V\big\|_F^2 = \big\|Z\Lambda^{-1}Z^{\top}\big\|_F^2 - 2\big\langle Z\Lambda^{-1}Z^{\top}, V^{\top}V\big\rangle + \big\|V^{\top}V\big\|_F^2.$$
Discarding all terms unrelated to $F$, and noting that $\langle Z\Lambda^{-1}Z^{\top}, V^{\top}V\rangle = \mathrm{tr}\big(\Lambda^{-1}(VZ)^{\top}(VZ)\big)$ and $\|V^{\top}V\|_F^2 = \|VV^{\top}\|_F^2 = \|L + FF^{\top}\|_F^2$, we obtain
$$-2\,\mathrm{tr}\big(\Lambda^{-1}(VZ)^{\top}(VZ)\big) + \big\|L + FF^{\top}\big\|_F^2, \qquad (15)$$
so the $n$-order matrix $Z\Lambda^{-1}Z^{\top}$ never has to be formed explicitly.
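The practical payoff of Equation (15) is that the SMF term touches the data only through the $c \times m$ product $VZ$ and the $c \times c$ matrix $VV^{\top}$. A minimal sketch under the same assumed notation (the function name is illustrative):

```python
import numpy as np

def smf_objective(F, B, P, Z, lam_diag):
    # Equation (15): -2 tr(Λ^{-1}(VZ)ᵀ(VZ)) + ||L + FFᵀ||_F², evaluated without
    # ever forming the n x n matrix Z Λ^{-1} Zᵀ.  lam_diag holds the diagonal of Λ.
    V = B + F @ P                                   # c x n, labeled columns stay one-hot
    VZ = V @ Z                                      # c x m -- the only product touching all n samples
    term1 = -2.0 * np.einsum('ij,ij,j->', VZ, VZ, 1.0 / lam_diag)
    G = V @ V.T                                     # c x c, equals L + FFᵀ
    return term1 + (G * G).sum()
```

Both terms cost $O(nmc)$ and $O(nc^2)$ time, respectively, i.e., linear in $n$ and in the number of anchors.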
3.2.1. Fix $U$, Optimize $F$
According to Equations (13) and (15), the gradient of the objective with respect to $F$ is
$$\nabla_F = 2\big(F - U^{\top}XP^{\top}\big) + 4\lambda\Big(\big(L + FF^{\top}\big)F - (VZ)\Lambda^{-1}(PZ)^{\top}\Big),$$
which can likewise be evaluated in time linear in $n$ and in the number of anchors.
Due to the absence of constraints on $F$, it can be optimized using state-of-the-art unconstrained optimization algorithms. Here, CG_DESCENT 6.8 (https://people.clas.ufl.edu/hager/software/, accessed on 1 September 2024) [26,27,28] is employed. It is important to note that traditional matrix factorization methods employ multiplicative updating rules (MURs) for optimization. While these rules can ensure a decrease in the objective function, they do not guarantee convergence. Furthermore, as the objective function of $F$ is convex, it can be solved rapidly.
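The original CG_DESCENT solver is a C library; as an illustrative stand-in, the sketch below runs SciPy's conjugate-gradient method on the flattened $F$. The analytic gradient above could be supplied via `jac`, but a finite-difference gradient keeps the sketch short; all names are the sketch's own.

```python
import numpy as np
from scipy.optimize import minimize

def solve_F(F0, B, P, Z, lam_diag, UtX, lam=100.0):
    # F-step: with U fixed and UᵀU = I, minimize Equation (13) + λ·Equation (15)
    # over the unconstrained variable F.  smf_objective is the sketch above.
    c, u = F0.shape

    def fun(f):
        V = B + f.reshape(c, u) @ P
        mf = -2.0 * (UtX * V).sum() + (V * V).sum()   # Equation (13), up to a constant
        return mf + lam * smf_objective(f.reshape(c, u), B, P, Z, lam_diag)

    return minimize(fun, F0.ravel(), method='CG').x.reshape(c, u)
```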
3.2.2. Fix $F$, Optimize $U$
According to Equation (13), the only term involving $U$ is $-2\langle X, UV\rangle$. Using the SVD $XV^{\top} = Q\Sigma R^{\top}$, we have
$$-2\langle X, UV\rangle = -2\,\mathrm{tr}\big(\Sigma R^{\top}U^{\top}Q\big). \qquad (17)$$
Because $Q$, $R$, and $U$ all have orthonormal columns, the diagonal entries of $R^{\top}U^{\top}Q$ are not larger than $1$. When all the diagonal elements of $R^{\top}U^{\top}Q$ equal $1$, Equation (17) has been minimized. Therefore, we have $R^{\top}U^{\top}Q = I$, i.e., $U = QR^{\top}$.
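This is the classical orthogonal Procrustes update; a minimal sketch using NumPy's thin SVD, with names as above:

```python
import numpy as np

def update_U(X, V):
    # U-step: with V fixed, the minimizer of ||X - UV||_F² over UᵀU = I is
    # U = Q Rᵀ, where X Vᵀ = Q Σ Rᵀ is a thin SVD (a d x c problem only).
    Q, _, Rt = np.linalg.svd(X @ V.T, full_matrices=False)
    return Q @ Rt
```

Because $XV^{\top}$ is only $d \times c$, this step is independent of the number of samples once $XV^{\top}$ has been formed.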
The pseudocode for FGLMF is presented in Algorithm 1.
Algorithm 1 FGLMF
1: Input: data matrix $X$, constraint matrices $B$ and $P$, parameter $\lambda$.
2: Output: clustering indicator matrix (representation matrix) $V$.
3: Initialize the anchor point set by k-means, and generate the bipartite graph $Z$ by Equation (3).
4: while not convergent do
5:  Update $F$ by CG_DESCENT 6.8.
6:  Compute the SVD of $XV^{\top}$.
7:  Update $U = QR^{\top}$.
8:  Let $V = B + FP$.
9: end while
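Putting the pieces together, the following sketch mirrors Algorithm 1 using the helper functions defined above. A fixed iteration count stands in for the paper's convergence test; all names are the sketches' own, not the paper's code.

```python
import numpy as np

def fglmf(X, Z, labels, c, lam=100.0, iters=10):
    # Alternate the CG step for F (Section 3.2.1) and the SVD step for U
    # (Section 3.2.2), as in Algorithm 1.
    lam_diag = Z.sum(axis=0)                       # diagonal of Λ
    B, P = one_hot_constraints(labels, c)          # one-hot label constraints (earlier sketch)
    F = np.zeros((c, P.shape[0]))
    U = update_U(X, B + F @ P)                     # initial orthogonal basis
    for _ in range(iters):
        F = solve_F(F, B, P, Z, lam_diag, U.T @ X, lam)
        U = update_U(X, B + F @ P)
    V = B + F @ P
    return V.argmax(axis=0)                        # read the cluster from each column of V
```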
3.3. Computational Complexity
The computational complexity of FGLMF is mainly divided into six parts.
Computing the anchors by k-means.
Computing the distances between the samples and the anchors.
Constructing the bipartite graph, where k is the number of nearest neighbors.
Evaluating the objective function with respect to $F$.
Evaluating the gradient with respect to $F$.
Computing the SVD of $XV^{\top}$.
Since the numbers of anchors and classes are far smaller than the number of samples, the total computational complexity is linear in $n$ and in the number of anchors. It should be noted that other bipartite graph-based methods, such as BGSSL and EAGR, scale quadratically in the number of anchors.
4. Experiments
4.1. Compared Methods
To demonstrate the efficiency and effectiveness of the proposed FGLMF, we choose the following methods for comparison:
TSVD: Truncated singular value decomposition [29], which gives the best approximation at a preset rank.
SemiGNMF: Semi-supervised graph-regularized non-negative matrix factorization [7].
CSNMF: Correntropy-based semi-supervised NMF [15].
CSCF: Correntropy-based semi-supervised concept factorization [16].
CLMF: Correntropy-based low-rank matrix factorization [17].
PCPSNMF: Pairwise constraint propagation-induced SNMF [18].
S⁴NMF: Self-supervised semi-supervised non-negative matrix factorization [30].
HSSNMF: Hypergraph-based semi-supervised symmetric non-negative matrix factorization [19].
EAGR: Efficient anchor graph regularization [21].
BGSSL: Bipartite graph semi-supervised learning [22].
Specifically, TSVD is an unsupervised MF method used as the baseline, and the rest are semi-supervised methods. Methods (1)–(5) are matrix factorization methods that factorize the samples, (6)–(8) are matrix factorization methods that factorize the full adjacency matrix, and (9)–(10) are bipartite graph-based methods.
4.3. Experiment Settings
To ensure the fairness of the experiments, all experiments in this paper were performed on a personal computer with an Intel i7-6800k CPU and 16 GB RAM. The parameters of all the compared methods were set according to the papers in which they are described. The number of nearest neighbors was set to five for all graph-based methods. For FGLMF, we fixed the parameter $\lambda$ at 100. For the COIL20 and YaleB datasets, the number of anchor points was set to 500 and 1500, respectively, and for the remaining datasets, the number of anchor points was set to 2000. According to [22], we set the dimension of the low-dimensional representation of BGSSL to the number of ground-truth classes plus one, and for the remaining methods, the dimension of the low-dimensional representation was set to the number of ground-truth classes of each dataset. For TSVD, GNMF, CSNMF, and CSCF, we obtained the results by running k-means on the low-dimensional representation, while for the remaining methods, we read the cluster of each sample directly from its low-dimensional representation. The stop rule of CG_DESCENT was set to a fixed tolerance. We randomly selected 30% of the samples as labeled samples.
Four commonly used metrics were adopted: accuracy (ACC), normalized mutual information (NMI), adjusted Rand index (ARI), and F-score. Accuracy calculates the percentage of correctly assigned labels by comparing the real label information with the learned label information. NMI measures the amount of shared information between two statistical distributions. ARI evaluates the agreement between the obtained labels and the true labels, while the F-score determines clustering accuracy through a weighted average of precision and recall. All four metrics range from 0 to 1, with higher values indicating better clustering quality. Since these metrics emphasize different aspects of clustering, we report results across all of them to conduct a comprehensive evaluation. We conducted ten random trials for all methods and report the average results for each dataset.
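The paper does not list its metric implementations; standard choices are scikit-learn for NMI and ARI and Hungarian matching for ACC, sketched below (the function name is the sketch's own, and integer label arrays are assumed):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    # ACC: best one-to-one match between predicted clusters and true labels
    # found with the Hungarian algorithm.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = int(max(y_pred.max(), y_true.max())) + 1
    w = np.zeros((D, D), dtype=int)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    row, col = linear_sum_assignment(-w)      # maximize the total matched count
    return w[row, col].sum() / len(y_true)

# NMI and ARI come directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)
```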
4.4. Experiment Performance
Table 2, Table 3, Table 4 and Table 5 show the clustering results of the various methods, with the best results highlighted. Due to negative values in the USPS data, both GNMF and CSNMF are unsuitable for this dataset. It can be observed that the proposed FGLMF achieves the best results in most cases. Additionally, on the YaleB data, CLMF demonstrates the best performance and, compared with the other bipartite graph-based methods, the performance of FGLMF is significantly enhanced. It should be noted that, similar to FGLMF, CLMF is a method that considers both global and local information, which indicates that considering global and local information on facial data can yield better results. However, CLMF can only handle a small amount of data due to its large demand for computational resources. FGLMF performs better than the other MF methods in most cases, even though it is an anchor-based method. This might be because the other MF methods carry non-negativity constraints and use MURs for optimization, which only guarantee a decrease in the objective function, not convergence. Since a convex, unconstrained SMF is proposed in this paper, convergence to the global optimal solution is guaranteed for any algorithm that guarantees a local optimal solution. Furthermore, the most advanced unconstrained optimization algorithms are often much faster than MURs.
We also compare the average performance of SemiGNMF, PCPSNMF, HSSNMF, EAGR, BGSSL, and FGLMF under different labeling ratios in Figure 2. As the labeled ratio increases, the clustering performance improves in most cases. On the USPS dataset, the accuracy of the remaining MF methods fluctuates to different degrees, and their performance sometimes decreases as the labeling ratio increases. On the one hand, the improvement of all methods is not significant, indicating that the labeling information has reached a bottleneck. On the other hand, the other MF methods cannot guarantee convergence and may therefore require more iterations, which would further increase their already high computational cost. In most cases, the proposed FGLMF shows good performance, demonstrating its effectiveness.
4.5. Time Consumption
The average time taken by each method is presented in Table 6. The time associated with anchor-based methods can be divided into two components: the first is the time spent on the k-means algorithm, while the second depends on the model. It is noteworthy that numerous efficient k-means algorithms are currently available, such as [31]. However, since this article does not primarily focus on k-means algorithms, we opted for the built-in k-means algorithm provided by the scikit-learn library [32] in Python to ensure fairness. FGLMF is the fastest among the anchor-based methods, especially on large-scale datasets, where the improvement is particularly evident. It should be noted that both FGLMF and TSVD require an SVD. However, FGLMF performs the SVD on a $d \times c$ matrix, while TSVD needs to perform the SVD on a $d \times n$ matrix, where $c$ is much smaller than $d$; this explains why FGLMF is faster than TSVD in the solution process. Unfortunately, FGLMF requires k-means to find the anchors, which makes its total elapsed time longer than that of TSVD. Nevertheless, FGLMF is still the second fastest among all the methods.
4.6. Sensitivity to the Parameter $\lambda$
FGLMF has a parameter $\lambda$, and the selection of this parameter is examined in this section. Figure 3 presents the average clustering accuracy of FGLMF under different $\lambda$ values. It can be observed that either an excessively high or an excessively low $\lambda$ degrades the performance of FGLMF. A $\lambda$ that is too low neglects the local structure and only considers global information, reducing clustering performance. An overly high $\lambda$ ignores the global information, likewise reducing clustering performance. When $\lambda$ takes values of 10 and 100, FGLMF exhibits the best performance across all datasets.
4.7. Effect of Anchors
To verify the sensitivity of the proposed method to anchor points, we conducted tests on each dataset using different numbers of anchor points, as presented in Figure 4. With an increasing number of anchor points, the clustering performance improves. Among the datasets, YaleB is the most sensitive to the number of anchor points, whereas USPS and MNIST are not sensitive to it. This might be attributed to the complexity of the data. Both USPS and MNIST are handwritten digit datasets, and a small number of anchor points can represent the entire dataset. However, YaleB is a dataset of human faces with different expressions and shadows, and a relatively large number of anchor points is necessary to represent the entire dataset. In addition, the solution time tends to increase with the number of anchor points, although this increase is not significant: the cost of FGLMF grows only linearly with the number of anchor points, rather than quadratically as in other methods. This indicates that FGLMF is suitable for situations with a large number of anchor points. As mentioned above, for more complex datasets, a small number of anchor points may represent only part of the dataset, and it is sometimes necessary to increase the number of anchor points to improve performance.
4.8. Bases
To verify the degree of global information extraction by the proposed method, we present the first ten bases extracted by FGLMF for each dataset in Figure 5. It can be observed that each base represents its respective category relatively clearly. It should be noted that each category of the Letters dataset contains both uppercase and lowercase letters. FGLMF does an excellent job of extracting the facial features of each person in the YaleB dataset, which also explains why the proposed method performs much better than other bipartite graph-based methods on this dataset.
4.9. Generated Graph
To validate the graph generated by Equation (6), we compare it with a graph generated using a Gaussian kernel on the COIL20 dataset, as shown in Figure 6. Figure 6a displays the adjacency matrix of the bipartite graph generated using 500 anchors and five nearest neighbors. It can be observed that the points are scattered randomly, and no special structure of the adjacency matrix is apparent. Figure 6b presents the full adjacency matrix generated from the adjacency matrix in Figure 6a, showing a clear block structure, i.e., connections between samples of the same class. Figure 6c illustrates the graph generated using the Gaussian kernel, which exhibits a structure similar to that in Figure 6b, validating the feasibility of our approach. Furthermore, Figure 6c emphasizes connections with the nearest samples, while Figure 6b captures high-order relationships between the data, i.e., connections within classes.
4.10. Convergence Study
To verify the convergence of FGLMF, the convergence curves are presented in Figure 7. The figures show the convergence curves of five iterations. The colored broken lines represent the inner loop of CG_DESCENT, and the blank spaces between the dots of different colors represent the process of solving $U$ using SVD. If there is only one dot of a specific color, it indicates that the solution has already reached the target accuracy and CG_DESCENT no longer works; at this point, neither $F$ nor $U$ will continue to be optimized. It should be noted that the algorithm should be terminated when the number of iterations of CG_DESCENT is 0; we provide the results of five iterations here to make the figures easier to understand. A complete iteration can be considered as starting from the dot of one color and ending at the dot of another color.
The figure shows that FGLMF reached the best result after three iterations on USPS, while only two iterations were required to reach the best result on other datasets. Additionally, in an internal loop, only a few iterations are needed to reach the optimal solution. This is attributed to the convexity of the proposed SMF, which eliminates the non-negative constraints and enables the algorithm to solve the problem rapidly. Furthermore, such a small number of iterations further explains why FGLMF requires so little time. Since the first iteration was close to the optimal value, CG_DESCENT requires only a few iterations after the first iteration to reach the optimal value.
To better demonstrate the convergence of FGLMF, we have also magnified the regions before and after the first iteration in the subgraphs. It can be observed that the objective function continues to decline. After the update of $U$, the objective function undergoes a significant drop. Additionally, based on the decline of the objective function after updating $U$, it can also be seen that $F$ is already very close to the optimal solution.
5. Conclusions
Traditional matrix factorization methods face challenges related to high computational complexity and substantial memory requirements. This paper proposes a fast global–local matrix factorization model that employs low-rank matrix factorization to account for global information while utilizing symmetric matrix factorization (SMF) to capture local information. Compared with conventional approaches, the computational complexity is substantially reduced, significantly alleviating both the computational burden and the memory demands. Furthermore, we introduce a convex, unconstrained symmetric matrix factorization method for bipartite graphs and provide an analysis and proof of its convexity. Thanks to the proposed symmetric matrix factorization technique, our method demonstrates considerable advantages in performance.
Limitations and future work: Despite the reduction in computational and spatial complexity, the proposed method struggles to handle extremely large-scale datasets because it still needs to store the $n \times m$ bipartite graph; when $n$ becomes excessively large, there may be instances of insufficient memory. In future work, it would be beneficial to explore optimization techniques such as mini-batch stochastic gradient descent to alleviate the model's memory constraints.