Article

A Block Coordinate Descent-Based Projected Gradient Algorithm for Orthogonal Non-Negative Matrix Factorization

1 Institute for Data Science, School of Engineering, University of Applied Sciences and Arts Northwestern Switzerland, 5210 Windisch, Switzerland
2 Faculty of Mechanical Engineering, University of Ljubljana, Aškerčeva ulica 6, SI-1000 Ljubljana, Slovenia
3 Institute of Mathematics, Physics and Mechanics, Jadranska 19, SI-1000 Ljubljana, Slovenia
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(5), 540; https://doi.org/10.3390/math9050540
Submission received: 8 December 2020 / Revised: 15 February 2021 / Accepted: 24 February 2021 / Published: 4 March 2021
(This article belongs to the Section Mathematics and Computer Science)

Abstract

This article applies the projected gradient (PG) method to a non-negative matrix factorization problem (NMF) in which one or both matrix factors must have orthonormal columns or rows. We penalize the orthonormality constraints and apply the PG method via a block coordinate descent approach: at any given time, one matrix factor is fixed and the other is updated by moving along the steepest descent direction computed from the penalized objective function and projecting onto the space of non-negative matrices. Our method is tested on two sets of synthetic data for various values of the penalty parameters. The performance is compared to the well-known multiplicative update (MU) method of Ding (2006) and to a modified, globally convergent variant of the MU algorithm recently proposed by Mirzal (2014). We provide extensive numerical results, coupled with appropriate visualizations, which demonstrate that our method is very competitive and usually outperforms the other two methods.

1. Introduction

1.1. Motivation

Many machine learning applications require processing large and high-dimensional data. The data could be images, videos, kernel matrices, spectral graphs, etc., represented as an m × n matrix R. The data size and the amount of redundancy increase rapidly when m and n grow. To make the analysis and the interpretation easier, it is favorable to obtain a compact and concise low-rank approximation of the original data R. Such low-rank approximations are known to be very efficient in a wide range of applications, such as text mining [1,2,3], document classification [4], clustering [5,6], spectral data analysis [1,7], face recognition [8], and many more.
There exist many different low-rank approximation methods. For instance, two well-known strategies broadly used for data analysis are singular value decomposition (SVD) [9] and principal component analysis (PCA) [10]. Much real-world data is non-negative, and the related hidden parts express physical features only when the non-negativity holds. The factorizing matrices in SVD or PCA can have negative entries, making it hard or impossible to put a physical interpretation on them. Non-negative matrix factorization was introduced as an attempt to overcome this drawback, i.e., to provide the desired low-rank non-negative matrix factors.

1.2. Problem Formulation

A non-negative matrix factorization problem (NMF) is the problem of factorizing the input non-negative matrix R into the product of two lower-rank non-negative matrices G and H:
\[ R \approx GH, \tag{1} \]
where $R \in \mathbb{R}_+^{m \times n}$ usually corresponds to the data matrix, $G \in \mathbb{R}_+^{m \times p}$ represents the basis matrix, and $H \in \mathbb{R}_+^{p \times n}$ is the coefficient matrix. By p we denote the number of factors, for which it is desired that $p \ll \min(m, n)$. If we consider each of the n columns of R to be a sample of m-dimensional vector data, the factorization represents each instance (column) as a non-negative linear combination of the columns of G, where the coefficients correspond to the columns of H. The columns of G can therefore be interpreted as the p pieces that constitute the data R. To compute G and H, condition (1) is usually rewritten as a minimization problem using the Frobenius norm:
\[ \min_{G,H}\ f(G,H) = \tfrac{1}{2}\,\|R - GH\|_F^2, \qquad G \ge 0,\ H \ge 0. \tag{NMF} \]
It has been demonstrated in certain applications that the performance of the standard NMF in (NMF) can often be improved by adding auxiliary constraints, which could be sparseness, smoothness, or orthogonality. Orthogonal NMF (ONMF) was introduced by Ding et al. [11]. To improve the clustering capability of the standard NMF, they imposed orthogonality constraints on the columns of G or on the rows of H. Considering the orthogonality on the columns of G, it is formulated as follows:
\[ \min_{G,H}\ f(G,H) = \tfrac{1}{2}\,\|R - GH\|_F^2, \qquad \text{s.t. } G \ge 0,\ H \ge 0,\ G^TG = I. \tag{ONMF} \]
If we enforce orthogonality on the columns of G and on rows of H, we obtain the bi-orthogonal ONMF (bi-ONMF), which is formulated as
\[ \min_{G,H}\ f(G,H) = \tfrac{1}{2}\,\|R - GH\|_F^2, \qquad \text{s.t. } G \ge 0,\ H \ge 0,\ G^TG = I,\ HH^T = I, \tag{bi-ONMF} \]
where I denotes the identity matrix.
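To make the role of the constraints concrete, the following small MATLAB sketch (an illustration constructed by us, not taken from the original text) builds a 4 × 4 matrix R that admits an exact bi-orthonormal factorization and verifies that both constraints of (bi-ONMF) hold:

    G = [1 0; 1 0; 0 1; 0 1] / sqrt(2);   % non-negative, G'*G = I (orthonormal columns)
    H = [1 1 0 0; 0 0 1 1] / sqrt(2);     % non-negative, H*H' = I (orthonormal rows)
    R = G * H;                            % for this R the optimal value of (bi-ONMF) is 0
    disp(norm(G'*G - eye(2), 'fro'))      % 0
    disp(norm(H*H' - eye(2), 'fro'))      % 0
    disp(0.5 * norm(R - G*H, 'fro')^2)    % 0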
While the classic non-negative matrix factorization problem (NMF) has received great attention in the recent decade, see also the recent book [12], and several methods have been devised to compute approximate optimal solutions, the problems with orthogonality constraints, (ONMF)–(bi-ONMF), have been studied much less and the list of available methods is much shorter. Most of them are related to the fixed point method and to some variant of update rules. Meeting both orthogonality constraints in (bi-ONMF), which is relevant for co-clustering of the data, is particularly challenging, and very limited research has been done in this direction, especially with methods that are not related to the fixed point approach.

1.3. Related Work

The NMF was first studied by Paatero et al. [13,14] and was made popular by Lee and Seung [15,16]. There are several different existing methods to solve (NMF). The most used approach to minimize (NMF) is the simple MU method proposed by Lee and Seung [15,16]. In Chu et al. [17], several gradient-type approaches have been mentioned. Chu et al. reformulated (NMF) as an unconstrained optimization problem and then applied the standard gradient descent method. Considering both G and H as variables in (NMF), it is obvious that f(G, H) is a non-convex function. However, considering G and H separately, we obtain two convex sub-problems. Accordingly, a block-coordinate descent (BCD) approach [16] is applied to obtain values for G and H that correspond to a local minimum of f(G, H). Generally, the scheme adopted by BCD algorithms is to recurrently update blocks of variables only, while the remaining variables are fixed. NMF methods which adopt this optimization technique are, e.g., the MU rule [15], the active-set-like method [18], or the PG method for NMF [19]. In [19], two PG methods were proposed for the standard NMF. The first one is an alternating least squares (ALS) method using projected gradients: H is fixed first and a new G is obtained by PG; then, with G fixed at the new value, the PG method looks for a new H. The objective function in each least squares problem is quadratic. This enabled the author to use the Taylor expansion of the objective function to obtain a condition equivalent to the Armijo rule, while checking the sufficient decrease of the objective function as a termination criterion in a step-size selection procedure. The other method proposed in [19] is a direct application of the PG method to (NMF). There is also a hierarchical ALS method for NMF, originally proposed in [20,21] as an improvement to the ALS method, which consists of a BCD method with single component vectors as coordinate blocks.
As the original ONMF algorithms in [5,6] and their variants [22,23,24] are all based on the MU rule, there is no convergence guarantee for these algorithms. For example, Ding et al. [11] only prove that the successive updates of the orthogonal factors will converge to a local minimum of the problem. Because the orthogonality constraints cannot be rewritten into a non-negatively constrained ALS framework, convergent algorithms for the standard NMF (e.g., see [19,25,26,27]) cannot be used for solving the ONMF problems. Thus, no convergent algorithm was available for ONMF until recently. Mirzal [28] developed a convergent algorithm for ONMF, designed by generalizing the work of Lin [29], in which a convergent algorithm was provided for the standard NMF based on a modified version of the additive update (AU) technique of Lee [16]. Mirzal [28] provides a global convergence proof for his algorithm solving the ONMF problem: he first proves the non-increasing property of the objective function evaluated along the sequence of iterates, secondly he shows that every limit point of the generated sequence is a stationary point, and finally he proves that the sequence of iterates possesses a limit point. In more recent literature, NMF is used in very applicative areas. In [30], a procedure for mining biologically meaningful biomarkers from microarray datasets of different tumor histotypes is illustrated. The proposed methodology automatically identifies a subset of potentially informative genes from microarray data matrices which differ in the number of rows (genes) and columns (patients); it integrates NMF to allow the analysis of omics input data with different row sizes. In [31], the authors propose a correntropy-based orthogonal non-negative matrix tri-factorization algorithm, which is robust to noisy data contaminated by non-Gaussian noise and outliers. In contrast to previous NMF algorithms, this algorithm first applies correntropy, which is defined as a measure of similarity between two random variables, to the non-negative matrix tri-factorization problem to measure the similarity, and preserves double orthogonality conditions and dual graph regularization. The authors then adapt the half-quadratic technique, which is based on conjugate function theory, to solve the resulting optimization problem and derive the multiplicative update rules. In [32], the blind audio source separation problem, which consists of isolating and extracting each of the sources, is studied. To perform this task, the authors use NMF based on the Kullback-Leibler and Itakura-Saito β-divergences as a standard technique that itself uses the time-frequency representation of the signal. The new NMF model is based on the minimization of β-divergences along with a penalty term that promotes the columns of the dictionary matrix to have a small volume. In [33], the authors use NMF to analyze microarray data, which are a kind of numerical non-negative data used to collect gene expression profiles. Since the number of genes in DNA is huge, such data are usually high-dimensional and therefore require dimensionality reduction and clustering techniques to extract useful information. The authors use NMF for dimensionality reduction to simplify the data and the relations in the data.
To improve the sparseness of the basis matrix in incremental NMF, the authors of [34] present a new method, the orthogonal incremental NMF algorithm, which combines the orthogonality constraint with incremental learning. This approach adopts batch updates in the process of incremental learning.

1.4. Our Contribution

In this paper, we consider the penalty reformulation of (bi-ONMF), i.e., we add the orthogonality constraints, multiplied by penalty parameters, to the objective function and thus obtain penalized reformulations of (ONMF) and (bi-ONMF). The main contributions are:
  • We develop an algorithm for (ONMF) and (bi-ONMF), which is essentially a BCD algorithm, in the literature also known as alternating minimization, coordinate relaxation, the Gauss-Seidel method, subspace correction, domain decomposition, etc., see e.g., [35,36]. For each block optimization, we use a PG method and Armijo rule to find a suitable step-size.
  • We construct synthetic data sets of instances for (ONMF) and (bi-ONMF), for which we know the optimum value by construction.
  • We use MATLAB [37] to implement our algorithm and two well-known (MU-based) algorithms: the algorithm of Ding [11] and of Mirzal [28]. The code is available upon request.
  • The implemented algorithms are compared on the constructed synthetic data sets in terms of: (i) the accuracy of the reconstruction, and (ii) the deviation of the factors from orthonormality. This deviation is a measure of the feasibility of the obtained solution and has not been analyzed in the works of Ding [11] and Mirzal [28]. Accuracy is measured by the so-called root-square error (RSE), defined as
    \[ \mathrm{RSE} := \frac{\|R - GH\|_F}{1 + \|R\|_F}. \]
    Please note that we added 1 to the denominator in the formula above to prevent numerical difficulties when the data matrix R has a very small Frobenius norm.
    Deviations from orthonormality are computed using Formulas (17) and (18) from Section 4. Our numerical results show that our algorithm is very competitive and almost always outperforms the MU-based algorithms; a small sketch of the RSE computation is given right after this list.
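For concreteness, the RSE of a computed pair (G, H) can be evaluated in MATLAB with a single line (a sketch; the matrices R, G, H are assumed to be available in the workspace):

    rse = norm(R - G*H, 'fro') / (1 + norm(R, 'fro'));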

1.5. Notations

Some notations used throughout our work are described here. We denote scalars and indices by lower-case Latin letters, vectors by lower-case boldface Latin letters, and matrices by capital Latin letters. $\mathbb{R}^{m \times n}$ denotes the set of m by n real matrices, and I symbolizes the identity matrix. We use $\nabla$ to denote the gradient of a real-valued function and define $\nabla^+$ and $\nabla^-$ as the positive and (unsigned) negative parts of $\nabla$, respectively, i.e., $\nabla = \nabla^+ - \nabla^-$. ⊙ and ⊘ denote the element-wise multiplication and the element-wise division, respectively.

1.6. Structure of the Paper

The rest of our work is organized as follows. In Section 2, we review the well-known MU method and the rules used for updating the factors in each iteration of our computations; we also outline the globally convergent MU version of Mirzal [28]. In Section 3, we present our PG method and discuss its stopping criteria. Section 4 presents the synthetic data and the results obtained by the three decomposition methods, both for the problem (ONMF) and for (bi-ONMF). Some concluding remarks are presented in Section 5.

2. Existing Methods to Solve (NMF)

2.1. The Method of Ding

Several popular approaches to solve (NMF) are based on so-called MU algorithms, which are simple to implement and often yield good results. The MU algorithms originate from the work of Lee and Seung [16]. Various MU variants were later proposed by several researchers, for an overview see [38]. At each iteration of these methods, the elements of G and H are multiplied by certain updating factors.
As already mentioned, (ONMF) was proposed by Ding et al. [11] as a tool to improve the clustering capability of the associated optimization approaches. To adapt the MU algorithm for this problem, they employed standard Lagrangian techniques: they introduced the Lagrangian multiplier Λ (a symmetric matrix of size p × p) for the orthogonality constraint and minimized the Lagrangian function, where the orthogonality constraint is moved to the objective function as the penalty term $\mathrm{Trace}(\Lambda(G^TG - I))$. The complementarity conditions from the related KKT conditions can be rewritten as a fixed point relation, which finally leads to the following MU rule for (ONMF):
\[ G_{ij} \leftarrow G_{ij}\,\frac{(RH^T)_{ij}}{(GG^TRH^T)_{ij}}, \quad i = 1,\dots,m,\ j = 1,\dots,p, \qquad H_{st} \leftarrow H_{st}\,\frac{(G^TR)_{st}}{(G^TGH)_{st}}, \quad s = 1,\dots,p,\ t = 1,\dots,n. \tag{3} \]
They extended this approach to a non-negative three-factor factorization with the demand that two factors satisfy orthogonality conditions, which is a generalization of (bi-ONMF). The MU rules (28)–(30) from [11], adapted to (bi-ONMF), are the main ingredients of Algorithm 1, which we will call Ding’s algorithm.
Algorithm 1: Ding’s MU algorithm for (bi-ONMF).
INPUT: $R \in \mathbb{R}_+^{m \times n}$, $p \in \mathbb{N}$
 1.  Initialize: generate $G^0$ as an m × p random matrix and $H^0$ as a p × n random matrix.
 2.  Repeat
\[ G_{ij} \leftarrow G_{ij}\,\frac{(RH^T)_{ij}}{(GG^TRH^T)_{ij} + \delta}, \quad i = 1,\dots,m,\ j = 1,\dots,p, \qquad H_{st} \leftarrow H_{st}\,\frac{(G^TR)_{st}}{(G^TRH^TH)_{st} + \delta}, \quad s = 1,\dots,p,\ t = 1,\dots,n. \tag{4} \]
 3.  Until convergence or a maximum number of iterations or maximum time is reached.
OUTPUT: G , H .
Algorithm 1 converges in the sense that the solution pairs G and H generated by this algorithm yield a sequence of decreasing RSEs, see [11], Theorems 5 and 7.
If R has zero columns or rows, a division by zero may occur; moreover, denominators close to zero may still cause numerical problems. To escape this situation, we follow [39] and add a small positive number δ to the denominators of the MU terms (4). Please note that Algorithm 1 can be easily adapted to solve (ONMF) by replacing the second MU rule from (4) with the second MU rule of (3).
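The following MATLAB fragment is a minimal sketch of one pass of the MU rules of Algorithm 1 (our re-implementation under the reconstruction in (4), assuming R, positive factors G and H, and a small delta are given); .* and ./ denote the element-wise operations:

    G = G .* (R * H') ./ (G * (G' * R) * H' + delta);     % first rule of (4): G G' R H' in the denominator
    H = H .* (G' * R) ./ ((G' * R) * (H' * H) + delta);   % second rule of (4): G' R H' H in the denominator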

2.2. The Method of Mirzal

In [28], Mirzal proposed an algorithm for (ONMF) which is designed by generalizing the work of Lin [29]. Mirzal used the so-called modified additive update rule (the MAU rule), where the updated term is added to the current value for each of the factors. This additive rule has been used by Lin in [29] in the context of a standard NMF. He also provided in his paper a convergence proof, stating that the iterates generated by his algorithm converge in the sense that RSE is decreasing and the limit point is a stationary point. In [28], Mirzal discussed the orthogonality constraint on the rows of H, while in [40] the same results are developed for the case of (bi-ONMF).
Here we review Mirzal’s algorithm for (bi-ONMF), presented in the unpublished paper [40]. This algorithm actually solves the equivalent problem (pen-ONMF), where the orthogonality constraints are moved into the objective function (the so-called penalty approach) and their importance is controlled by the penalty parameters α, β:
\[ \min_{G,H}\ F(G,H) = \tfrac12\|R - GH\|_F^2 + \tfrac{\alpha}{2}\|HH^T - I\|_F^2 + \tfrac{\beta}{2}\|G^TG - I\|_F^2, \qquad \text{s.t. } G \ge 0,\ H \ge 0. \tag{pen-ONMF} \]
The gradients of the objective function with respect to G and H are:
\[ \nabla_G f(G,H) = GHH^T - RH^T + \beta\,GG^TG - \beta G, \qquad \nabla_H f(G,H) = G^TGH - G^TR + \alpha\,HH^TH - \alpha H. \]
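In MATLAB, these two gradients can be computed directly from the formulas above (a sketch assuming R, G, H and the penalty parameters alpha, beta are in the workspace):

    gradG = G*(H*H') - R*H' + beta*G*(G'*G) - beta*G;    % gradient with respect to G (m x p)
    gradH = (G'*G)*H - G'*R + alpha*(H*H')*H - alpha*H;  % gradient with respect to H (p x n)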
For the objective function in (pen-ONMF), Mirzal proposed the MAU rules along with the use of $\bar G = (\bar g_{ij})$ and $\bar H = (\bar h_{st})$, instead of G and H, to avoid the zero locking phenomenon ([28], Section 2):
\[ \bar g_{ij} = \begin{cases} g_{ij}, & \text{if } \nabla_G f(G,H)_{ij} \ge 0, \\ \max\{g_{ij}, \nu\}, & \text{if } \nabla_G f(G,H)_{ij} < 0, \end{cases} \qquad \bar h_{st} = \begin{cases} h_{st}, & \text{if } \nabla_H f(G,H)_{st} \ge 0, \\ \max\{h_{st}, \nu\}, & \text{if } \nabla_H f(G,H)_{st} < 0, \end{cases} \]
where ν is a small positive number.
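A small MATLAB sketch of this modification (using gradG and gradH from the previous sketch and a small positive nu, all assumed to be in the workspace):

    Gbar = G;  mask = gradG < 0;  Gbar(mask) = max(G(mask), nu);   % lift entries with negative gradient away from zero
    Hbar = H;  mask = gradH < 0;  Hbar(mask) = max(H(mask), nu);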
Please note that the algorithms working with the MU rules for (pen-ONMF) must be initialized with positive matrices to avoid zero locking from the start, but non-negative matrices can be used to initialize the algorithm working with the MAU rules (see [40]).
Mirzal [40] used the MAU rules with some modifications: the factors $\bar G$ and $\bar H$ are used in order to guarantee the non-increasing property, and the values $\delta_G$ and $\delta_H$, which are added to the denominators of the MAU update terms for G and H, respectively, are grown by a constant factor step until this property is satisfied. The algorithm proposed by Mirzal [40] is summarised as Algorithm 2 below.
Algorithm 2: Mirzal’s algorithm for bi-ONMF [40]
INPUT: inner dimension p, maximum number of iterations maxit, small positive δ, small positive factor step used to increase δ.
 1.  Compute initial $G^{(0)} \ge 0$ and $H^{(0)} \ge 0$.
 2.  For k = 0 : maxit
      $\delta_G = \delta$;
     Repeat
        \[ g_{ij}^{(k+1)} = g_{ij}^{(k)} - \bar g_{ij}^{(k)}\,\frac{\nabla_G f(G^{(k)},H^{(k)})_{ij}}{\big(\bar G^{(k)} H^{(k)} H^{(k)T} + \beta\,\bar G^{(k)} \bar G^{(k)T} \bar G^{(k)}\big)_{ij} + \delta_G}, \quad i = 1,\dots,m,\ j = 1,\dots,p; \]
        $\delta_G = \delta_G \cdot \mathrm{step}$;
     Until $f(G^{(k+1)}, H^{(k)}) \le f(G^{(k)}, H^{(k)})$
      $\delta_H = \delta$;
     Repeat
        \[ h_{st}^{(k+1)} = h_{st}^{(k)} - \bar h_{st}^{(k)}\,\frac{\nabla_H f(G^{(k+1)},H^{(k)})_{st}}{\big(G^{(k+1)T} G^{(k+1)} \bar H^{(k)} + \alpha\,\bar H^{(k)} \bar H^{(k)T} \bar H^{(k)}\big)_{st} + \delta_H}, \quad s = 1,\dots,p,\ t = 1,\dots,n; \]
        $\delta_H = \delta_H \cdot \mathrm{step}$;
     Until $f(G^{(k+1)}, H^{(k+1)}) \le f(G^{(k+1)}, H^{(k)})$
      $\delta_H = \delta$;
OUTPUT: G , H .
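The inner Repeat loop of Algorithm 2 that updates G can be sketched in MATLAB as follows (our reading of the pseudocode above, not the authors' code; R, H, G, Gbar, gradG, delta, step, alpha, beta and p are assumed to be given):

    F = @(G, H) 0.5*norm(R - G*H, 'fro')^2 + 0.5*alpha*norm(H*H' - eye(p), 'fro')^2 ...
              + 0.5*beta*norm(G'*G - eye(p), 'fro')^2;      % penalized objective of (pen-ONMF)
    deltaG = delta;
    while true
        den  = Gbar*(H*H') + beta*Gbar*(Gbar'*Gbar) + deltaG;  % denominator of the MAU term
        Gnew = G - Gbar .* gradG ./ den;                       % additive (MAU) update of G
        if F(Gnew, H) <= F(G, H), break; end                   % objective did not increase: accept
        deltaG = deltaG * step;                                % otherwise enlarge delta_G and retry
    end
    G = Gnew;

The loop for H is analogous, with the roles of the factors and of α, β exchanged.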

3. PG Method for (ONMF) and (bi-ONMF)

3.1. Main Steps of PG Method

In this subsection we adapt the PG method proposed by Lin [19] to solve both (ONMF) and (bi-ONMF). Lin applied PG to (NMF) in two ways. The first approach is actually a BCD method: it consecutively fixes one block of variables (G or H) and minimizes the simplified problem in the other variable. The second approach by Lin directly minimizes (NMF). Lin’s main focus was on the first approach, and we follow it. We thus try to solve the penalized version of the problem, (pen-ONMF), by the block coordinate descent method, which is summarised in Algorithm 3.
Algorithm 3: BCD method for (pen-ONMF)
INPUT: inner dimension p, initial matrices $G^0$, $H^0$.
 1.  Set k = 0 .
 2.  Repeat
     Fix H : = H k and compute new G as follows:
\[ G^{k+1} := \operatorname*{argmin}_{G \ge 0}\ \tfrac12\|R - GH^k\|_F^2 + \tfrac{\alpha}{2}\|H^kH^{kT} - I\|_F^2 + \tfrac{\beta}{2}\|G^TG - I\|_F^2 \tag{8} \]
     Fix G : = G k + 1 and compute new H as follows:
\[ H^{k+1} := \operatorname*{argmin}_{H \ge 0}\ \tfrac12\|R - G^{k+1}H\|_F^2 + \tfrac{\alpha}{2}\|HH^T - I\|_F^2 + \tfrac{\beta}{2}\|G^{(k+1)T}G^{k+1} - I\|_F^2 \tag{9} \]
      k : = k + 1
 3.  Until some stopping criterion is satisfied
OUTPUT: G , H .
The objective function in (pen-ONMF) is no longer quadratic, so we lose the nice properties of Armijo’s rule that Lin exploited. Nevertheless, we used the Armijo rule directly and still obtained good numerical results, see Section 4. Armijo [41] was the first to establish convergence to stationary points of smooth functions using an inexact line search with a simple “sufficient decrease” condition. The Armijo condition ensures that the line search step is not too large.
We refer to (8) or (9) as sub-problems. Obviously, solving these sub-problems in every iteration could be more costly than Algorithms 1 and 2. Therefore, we must find effective methods for solving these sub-problems. Similarly to Lin, we apply the PG method to solve the sub-problems (8) and (9). Algorithm 4 contains the main steps of the PG method for solving the latter and can be straightforwardly adapted for the former.
For the sake of simplicity, we denote by $F_H$ the function that we optimize in (8), which is actually a simplified version (pure H terms removed) of the objective function from (pen-ONMF) for fixed H:
\[ F_H(G) := \tfrac12\|R - GH\|_F^2 + \tfrac{\beta}{2}\|G^TG - I\|_F^2. \]
Similarly, for fixed G, the objective function from (9) will be denoted by:
\[ F_G(H) := \tfrac12\|R - GH\|_F^2 + \tfrac{\alpha}{2}\|HH^T - I\|_F^2. \]
In Algorithm 4, P is the projection operator which projects the new point (matrix) onto the cone of non-negative matrices (we simply set negative entries to 0).
Inequality (10) shows the Armijo rule to find a suitable step-size guaranteeing a sufficient decrease. Searching for λ k is a time-consuming operation, therefore we strive to do only a small number of trials for new λ in Step 3.1.
Similarly to Lin [19], we allow λ to take any positive value. More precisely, we start with λ = 1 and, if the Armijo rule (10) is satisfied, we increase the value of λ by dividing it by γ < 1. We repeat this until (10) is no longer satisfied or the same matrix $H_\lambda$ as in the previous trial is obtained. If the starting λ = 1 does not yield an $H_\lambda$ satisfying the Armijo rule (10), then we decrease λ by the factor γ and repeat this until (10) is satisfied. The numerical results obtained using different values of the parameters γ (the updating factor for λ) and σ (the parameter used to check (10)) are reported in the following subsections.
Algorithm 4: PG method using Armijo rule to solve sub-problem (9)
INPUT: 0 < σ < 1 , γ < 1 , and initial H 0 .
 1.  Set k = 0
 2.  Repeat
     Find a λ (using the updating factor γ) such that for $H_\lambda := P[\,H^k - \lambda \nabla F_G(H^k)\,]$
     we have
\[ F_G(H_\lambda) - F_G(H^k) \le \sigma\,\big\langle \nabla F_G(H^k),\, H_\lambda - H^k \big\rangle; \tag{10} \]
     Set H k + 1 : = H λ
     Set k = k + 1 ;
 3.  Until some stopping criterion is satisfied.
OUTPUT: H = H k + 1 .
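The step-size search described above can be sketched in MATLAB as follows (an illustrative re-implementation of one pass of Algorithm 4, not the authors' code; R, G, H, alpha, p, sigma and gamma are assumed to be given):

    P      = @(X) max(X, 0);                                  % projection onto the non-negative matrices
    FG     = @(H) 0.5*norm(R - G*H, 'fro')^2 + 0.5*alpha*norm(H*H' - eye(p), 'fro')^2;
    gradFG = @(H) (G'*G)*H - G'*R + alpha*(H*H')*H - alpha*H;
    armijo = @(Hl, H) FG(Hl) - FG(H) <= sigma * sum(sum(gradFG(H) .* (Hl - H)));   % condition (10)

    lambda = 1;  Hl = P(H - lambda * gradFG(H));
    if armijo(Hl, H)
        while true                                            % try larger steps: divide lambda by gamma < 1
            Hl2 = P(H - (lambda/gamma) * gradFG(H));
            if ~armijo(Hl2, H) || isequal(Hl2, Hl), break; end
            lambda = lambda / gamma;  Hl = Hl2;
        end
    else
        while ~armijo(Hl, H)                                  % decrease lambda until (10) holds
            lambda = lambda * gamma;
            Hl = P(H - lambda * gradFG(H));
        end
    end
    H = Hl;                                                   % accepted new iterate H^{k+1}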

3.2. Stopping Criteria for Algorithms 3 and 4

As practiced in the literature (e.g., see [42]), in a constrained optimization problem with the non-negativity constraint on the variable x, a common condition to check whether a point x k is close to a stationary point is
\[ \|\nabla^P f(x^k)\| \le \varepsilon\,\|\nabla f(x^0)\|, \tag{11} \]
where f is the differentiable function that we try to optimize and $\nabla^P f(x^k)$ is the projected gradient, defined as
\[ \nabla^P f(x)_i = \begin{cases} \nabla f(x)_i, & \text{if } x_i > 0, \\ \min\{0, \nabla f(x)_i\}, & \text{if } x_i = 0, \end{cases} \tag{12} \]
and ε is a small positive tolerance. For Algorithm 3, (11) becomes
\[ \|\nabla^P F(G^k, H^k)\|_F \le \varepsilon\,\|\nabla F(G^0, H^0)\|_F. \tag{13} \]
We impose a time limit in seconds and a maximum number of iterations for Algorithm 4 as well. Following [19], we also define stopping conditions for the sub-problems. The matrices $G^{k+1}$ and $H^{k+1}$ returned by Algorithm 4 must satisfy, respectively,
\[ \|\nabla_G^P F(G^{k+1}, H^k)\|_F \le \bar\varepsilon_G, \qquad \|\nabla_H^P F(G^{k+1}, H^{k+1})\|_F \le \bar\varepsilon_H, \tag{14} \]
where
\[ \bar\varepsilon_G = \bar\varepsilon_H = \max\{10^{-7}, \varepsilon\}\,\|\nabla F(G^0, H^0)\|_F, \tag{15} \]
and ε is the same tolerance used in (13). If the PG method for solving the sub-problem (8) or (9) stops after the first iteration, then we decrease the stopping tolerance as follows:
\[ \bar\varepsilon_G \leftarrow \tau\,\bar\varepsilon_G, \qquad \bar\varepsilon_H \leftarrow \tau\,\bar\varepsilon_H, \tag{16} \]
where τ is a constant smaller than 1.
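A sketch of the projected-gradient test in MATLAB (assuming gradG and gradH hold the gradients of the penalized objective at the current pair (G, H), epsilon is the tolerance ε, and normF0 is the Frobenius norm of the full gradient at the starting point (G0, H0)):

    pgG = gradG;  pgG(G == 0) = min(0, gradG(G == 0));   % definition (12), applied entry-wise
    pgH = gradH;  pgH(H == 0) = min(0, gradH(H == 0));
    stop = norm([pgG(:); pgH(:)]) <= epsilon * normF0;   % stopping test (13)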

4. Numerical Results

In this section we demonstrate how the PG method described in Section 3 performs compared to the MU-based algorithms of Ding and Mirzal, described in Section 2.1 and Section 2.2, respectively.

4.1. Artificial Data

We created three sets of synthetic data using MATLAB [37]. The first set we call the bi-orthonormal set (BION). It consists of instances of matrices $R \in \mathbb{R}_+^{n \times n}$, which were created as products of G and H, where $G \in \mathbb{R}_+^{n \times k}$ has orthonormal columns while $H \in \mathbb{R}_+^{k \times n}$ has orthonormal rows. We created five instances of R for each pair $(n, k_1)$ and $(n, k_2)$ from Table 1.
Matrices G were created in two phases: first, we randomly (uniform distribution) selected a position in each row; second, we selected a random number from (0, 1) (uniform distribution) for the selected position in each row. If it happened that after this procedure some column of G was zero or had a norm below $10^{-8}$, we found the first non-zero element in the largest column of G (according to the Euclidean norm) and moved it into the zero column. Then we normalized the columns of G. We created H similarly.
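Our reading of this construction can be sketched in MATLAB as follows (the zero-column repair step is omitted for brevity; H is built analogously and R = G*H):

    G = zeros(n, k);
    for i = 1:n
        G(i, randi(k)) = rand;            % one uniform(0,1) entry at a random position in row i
    end
    % the columns have disjoint supports, so G'*G is diagonal; normalizing the columns makes it the identity
    G = G ./ sqrt(sum(G.^2, 1));          % column normalization via implicit expansion (MATLAB R2016b or later)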
Each triple (R, G, H) was saved as a triple of txt files. For example, NMF_BIOG_data_R_n=200_k=80_id=5.txt contains the 200 × 200 matrix R obtained by multiplying matrices $G \in \mathbb{R}^{200 \times 80}$ and $H \in \mathbb{R}^{80 \times 200}$, which were generated as explained above. With id=5, we denote that this is the 5th matrix corresponding to this pair (n, k).
The second set contains similar data to BION, but only one factor (G) is orthonormal, while the other (H) is non-negative but not necessarily orthonormal. We call this dataset uni-orthonormal (UNION).
The third data set is a noisy variant of the first data set. For each triple (R, G, H) from the BION data, we computed a new noisy matrix $R_n = R + E$, where E is a random matrix of the same size as R, with entries uniformly distributed on $[0, \bar\mu]$. The parameter $\bar\mu$ is defined such that the expected value of RSE satisfies:
\[ \mathbb{E}\left(\frac{\|R_n - GH\|_F}{1 + \|R_n\|_F}\right) = \mathbb{E}\left(\frac{\|E\|_F}{1 + \|R_n\|_F}\right) \approx \frac{\mathbb{E}(\|E\|_F)}{1 + \|GH\|_F} \approx \mu, \]
where μ is a parameter chosen by us and was set to $10^{-2}$, $10^{-4}$, $10^{-6}$. Using basic properties of the uniform distribution, we can easily derive that $\bar\mu \approx \mu\,(1 + \|GH\|_F)\,\sqrt{3}/n$, where n is the order of the square matrix R. We indeed used the right-hand side of this relation to generate the noise matrices E.
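A short MATLAB sketch of this noise construction (mu, G, H and the order n are assumed to be given):

    mubar = mu * (1 + norm(G*H, 'fro')) * sqrt(3) / n;   % derived above from the properties of the uniform distribution
    E     = mubar * rand(n);                             % n x n matrix, entries uniform on [0, mubar]
    Rn    = G*H + E;                                     % noisy data matrix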
All computations were done using MATLAB [37] on a high-performance computer available at the Faculty of Mechanical Engineering of the University of Ljubljana: an Intel Xeon X5670 (1536 hyper-cores) HPC cluster and an E5-2680 V3 (1008 hyper-cores) DP cluster, with an IB QDR interconnection, 164 TB of LUSTRE storage, 4.6 TB of RAM and 24 TFlop/s of performance.

4.2. Numerical Results for UNION

In this subsection, we present numerical results, obtained by Ding’s, Mirzal’s, and our algorithm for a uni-orthogonal problem (ONMF), using the UNION data, introduced in the previous subsection. We adapted the last two algorithms (Algorithms 2 and 3) for UNION data by setting α = 0 in the problem formulation (bi-ONMF) and in all formulas underlying these two algorithms.
The maximum number of outer iterations for all three algorithms was set to 1000. In practice, we stop Algorithms 1 and 3 only when the maximum number of iterations is reached, while for Algorithm 2 the stopping condition also involves checking the progress of RSE: if it is too small (below $10^{-5}$), we also stop.
Recall that for the UNION data we have, for each pair (n, k) from Table 1, five symmetric matrices R for which we try to solve (ONMF) by Algorithms 1–3. Please note that all these algorithms demand as input the internal dimension k, i.e., the number of columns of the factor G, which is in general not known in advance. Even though we know this dimension by construction for the UNION data, we tested the algorithms using internal dimensions p equal to 20%, 40%, …, 100% of k. For p = k, we know that the optimum of the problem is 0, so for this case we can also estimate how good the tested algorithms are at finding the global optimum.
The first question we had to answer was which value of β to use in Mirzal’s and our PG algorithm. It is obvious that larger values of β move the focus from optimizing the RSE to guaranteeing the orthonormality, i.e., the feasibility for the original problem. We decided not to fix the value of β but to run both algorithms for β ∈ {1, 10, 100, 1000} and report the results.
For each solution pair (G, H) returned by the algorithms, the non-negativity constraints hold by the construction of the algorithms, so we only need to consider the deviation of G from orthonormality, which we call infeasibility and define as
\[ \mathrm{infeas}_G := \frac{\|G^TG - I\|_F}{1 + \|I\|_F}. \tag{17} \]
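In MATLAB, this measure is a one-liner (a sketch; p is the number of columns of G):

    infeasG = norm(G'*G - eye(p), 'fro') / (1 + norm(eye(p), 'fro'));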
The computational results that follow in the rest of this subsection were obtained by setting the tolerance in the stopping criterion to ε = $10^{-10}$, and the maximum number of iterations to 1000 in Algorithm 3 and to 20 in Algorithm 4.
We also set a time limit of 3600 s. Additionally, for σ and γ (the updating parameter for λ in Algorithm 4) we chose 0.001 and 0.1, respectively. Finally, for τ from (16) we set a value of 0.1.
In general, Algorithm 3 converges to a solution in early iterations and the norm of the projected gradient falls below the tolerance shortly after running the algorithm.
Results in Table 2 and Table 3 and their visualizations in Figure 1 and Figure 2 confirm our expectations. More precisely, we can see that the smaller the value of β, the better the RSE. Likewise, the larger the value of β, the smaller the infeasibility infeas_G. In practice we want to meet both criteria, small RSE and small infeasibility, so some compromise must be made: if RSE is more important than infeasibility, we choose a smaller value of β, and vice versa. We can also observe that regarding RSE the three compared algorithms do not differ a lot. However, when the input dimension p approaches the real inner dimension k, Algorithm 3 comes closest to the global optimum RSE = 0. The situation with infeasibility is a bit different. While Algorithm 1 performs very well on all instances, Algorithm 2 reaches better feasibility for smaller values of n, and Algorithm 3 outperforms the others for β = 1000.
Results from Table 3, corresponding to n = 100 , 500 , 1000 are depicted in Figure 2.

4.3. Numerical Results for Bi-Orthonormal Data (BION)

In this subsection, we provide the same type of results as in the previous subsection, but for the BION dataset. We used almost the same settings as for the UNION dataset: ε = $10^{-10}$, maxit = 1000, σ = 0.001 and a time limit of 3600 s. The parameters γ and τ were slightly changed (based on experimental observations): γ = 0.75 and τ = 0.5. Additionally, we decided to take the same values for α and β in Algorithms 2 and 3, since the matrices R in the BION dataset are symmetric and both orthogonality constraints are equally important. We computed the results for values of α = β from {1, 10, 100, 1000}. In Table 4 and Table 5 we report the average RSE and the average infeasibility, respectively, of the solutions obtained by Algorithms 1–3. Since for this dataset we need to monitor how orthonormal both matrices G and H are, we adapt the measure of infeasibility as follows:
\[ \mathrm{infeas}_{G,H} := \frac{\|G^TG - I\|_F + \|HH^T - I\|_F}{1 + \|I\|_F}. \tag{18} \]
Figure 3 and Figure 4 depict the RSE and infeasibility reached by the three compared algorithms for n = 100, 500, 1000. We can see that all three algorithms behave well; however, Algorithm 3 is more stable and less dependent on the choice of β. It is interesting to see that β does not have a big impact on RSE and infeasibility for Algorithm 3; a significant difference can be observed only when the internal dimension is equal to the real internal dimension, i.e., when p = 100%. Based on these numerical results, we can conclude that smaller β achieves better RSE and almost the same infeasibility, so it would make sense to use β = 1.
For Algorithm 2 these differences are bigger and it is less obvious which β is appropriate. Again, if RSE is more important, then smaller values of β should be taken, otherwise larger values.

4.4. Numerical Results on the Noisy BION Dataset

In this subsection, we report the RSE and infeasibility computed on the noisy BION dataset with dimension n = 200. We decided to skip the other dimensions, since this n is already representative of the whole noisy dataset and including all of them would imply a large new Table 6 and six new plots in Figure 5. For Algorithms 2 and 3, we included results only for β = 1, following the conclusions from Section 4.3. We can see that with increasing noise the computed RSE also increases. However, all three algorithms are robust to noise, i.e., the resulting RSE for the noisy and the original BION data are very close. The same holds for the infeasibility, depicted in Figure 5.
On the noisy dataset, we also demonstrate what happens if the internal dimension is larger than the true internal dimension (this is demonstrated by p = 120%, 140%). Algorithm 1 finds a solution that is slightly closer to the optimum compared to the non-noisy data. Algorithm 2 does not improve RSE; in fact, RSE slightly increases with p. Algorithm 3 has the best performance: its RSE comes very close to 0 and stays there as p increases.
Regarding infeasibility, the situation from Figure 4 can also be observed on the noisy dataset. Figure 5 shows that for p > 100% the infeasibility increases. This is not surprising: the higher the internal dimension, the more difficult it is to achieve orthonormality. However, the resulting values of infeas_{G,H} are still surprisingly small.
We also analyzed how close the matrices $\tilde G$ and $\tilde H$, computed by all three algorithms, are to the matrices G and H that were used to generate the data matrices R. This comparison is possible only when the inner dimension is equal to the real inner dimension (p = 100%). We found that the Frobenius norms between these pairs of matrices, i.e., $\|\tilde G - G\|_F$ and $\|\tilde H - H\|_F$, are quite large (of order $\sqrt{k}$), which means that at first glance the solutions are quite different. However, since for every pair (G, H) and every permutation matrix Π we have $GH = G\Pi\Pi^TH$, the differences between the computed pairs of matrices are mainly due to the fact that they have permuted columns (G) or rows (H).
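The following MATLAB sketch illustrates this check for the factor G (a greedy matching used here only for illustration; it is not guaranteed to produce a true permutation, e.g., in the case of ties):

    [~, perm]    = max(Gtilde' * G, [], 2);            % perm(i): column of G most similar to column i of Gtilde
    raw_diff     = norm(Gtilde - G, 'fro');            % typically of order sqrt(k)
    matched_diff = norm(Gtilde - G(:, perm), 'fro');   % small if the columns merely appear in a different order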

4.5. Time Complexity of All Algorithms

Based on the previous subsections, we can observe that Algorithm 3 performs best regarding RSE and infeasibility. In this section we analyze the time complexity of all three algorithms. From their descriptions we can see that Algorithm 1 has the simplest description and also the simplest implementation, only a few lines of code. Algorithms 2 and 3 are more involved in both their theoretical description and their implementation, since both require computing gradients.
In practice, we stop all three algorithms after 1000 iterations of the outer loop. For Algorithms 1 and 3, this is the only stopping criterion, while for Algorithm 2 the stopping condition also involves checking the progress of RSE: if it is too small (below $10^{-5}$), we also stop.
We first demonstrate how RSE decreases with the number of iterations. The left plot in Figure 6 shows that Algorithm 3 has the fastest decrease and needs only a few dozen iterations to reach the optimum RSE. The other two algorithms need many more iterations. However, in each iteration Algorithm 3 involves solving the two sub-problems (8) and (9), which results in a much higher time per iteration. The right plot of Figure 6 depicts how RSE decreases with time. We can see that Algorithms 1 and 2 are much faster. We could reduce this difference by employing more advanced stopping criteria for Algorithm 3, which will be addressed in our future research.

5. Discussion and Conclusions

We presented a projected gradient method to solve the orthogonal non-negative matrix factorization problem. We penalized the deviation from orthonormality with positive parameters and added the resulting terms to the objective function of the standard non-negative matrix factorization problem. We then minimized the resulting objective function under the non-negativity conditions only, in a block coordinate descent approach. The method was tested on three sets of synthetic data: the first containing uni-orthonormal matrices, the second containing bi-orthonormal matrices, and the third containing noisy variants of the bi-orthonormal matrices. Different values of the penalty parameters were used in the implementation to derive recommendations on which values should be used in practice.
The performance of our algorithm was compared with two algorithms based on multiplicative update rules. The algorithms were compared with respect to the quality of the factorization (RSE) and how much the resulting factors deviate from orthonormality. We provided an extensive list of numerical results which demonstrate that our method is very competitive and outperforms the others in terms of quality of the solution, measured by RSE, and feasibility of the solution, measured by infeas_G or infeas_{G,H}. If we also take the computing time into account, Ding’s Algorithm 1 is very competitive, since it computes solutions with slightly worse RSE and infeas_{G,H}, but in much shorter time.
We expect that the difference in time complexity between Algorithms 1 and 3 can be reduced if we implement more advanced stopping criteria for the latter algorithm. This will be addressed in our future research.

Author Contributions

Methodology, S.A.; resources, J.P.; software, S.A. and J.P.; supervision, J.P. All authors have read and agreed to the published version of the manuscript.

Funding

The work of the first author is supported by the Swiss Government Excellence Scholarships grant number ESKAS-2019.0147. The work of the second author was partially funded by Slovenian Research Agency under research program P2-0162 and research projects J1-2453, N1-0071, J5-2552, J2-2512 and J1-1691.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is available on git portal: https://github.com/Soodi1/ONMFdata (accessed on 8 November 2020).

Acknowledgments

The work of the first author is supported by the Swiss Government Excellence Scholarships grant number ESKAS-2019.0147. This author also thanks the University of Applied Sciences and Arts Northwestern Switzerland for supporting the work. The work of the second author was partially funded by the Slovenian Research Agency under research program P2-0162 and research projects J1-2453, N1-0071, J5-2552, J2-2512 and J1-1691. The authors would also like to thank Andri Mirzal (Faculty of Computing, Universiti Teknologi Malaysia) for providing the code for his algorithm (Algorithm 2) to solve (ONMF). This code was also adapted by the authors to solve (bi-ONMF).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Berry, M.W.; Browne, M.; Langville, A.N.; Pauca, V.P.; Plemmons, R.J. Algorithms and applications for approximate nonnegative matrix factorization. Comput. Stat. Data Anal. 2007, 52, 155–173. [Google Scholar] [CrossRef] [Green Version]
  2. Pauca, V.P.; Shahnaz, F.; Berry, M.W.; Plemmons, R.J. Text mining using non-negative matrix factorizations. In Proceedings of the 2004 SIAM International Conference on Data Mining, Lake Buena Vista, FL, USA, 22–24 April 2004; pp. 452–456. [Google Scholar]
  3. Shahnaz, F.; Berry, M.W.; Pauca, V.P.; Plemmons, R.J. Document clustering using nonnegative matrix factorization. Inf. Process. Manag. 2006, 42, 373–386. [Google Scholar] [CrossRef]
  4. Berry, M.W.; Gillis, N.; Glineur, F. Document classification using nonnegative matrix factorization and underapproximation. In Proceedings of the 2009 IEEE International Symposium on Circuits and Systems, Taipei, Taiwan, 24–27 May 2009; pp. 2782–2785. [Google Scholar]
  5. Li, T.; Ding, C. The relationships among various nonnegative matrix factorization methods for clustering. In Proceedings of the IEEE Sixth International Conference on Data Mining (ICDM’06), Hong Kong, China, 18–22 December 2006; pp. 362–371. [Google Scholar]
  6. Xu, W.; Liu, X.; Gong, Y. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, Toronto, ON, Canada, 28 July–1 August 2003; pp. 267–273. [Google Scholar]
  7. Kaarna, A. Non-negative matrix factorization features from spectral signatures of AVIRIS images. In Proceedings of the 2006 IEEE International Symposium on Geoscience and Remote Sensing, Denver, CO, USA, 31 July–4 August 2006; pp. 549–552. [Google Scholar]
  8. Zafeiriou, S.; Tefas, A.; Buciu, I.; Pitas, I. Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification. IEEE Trans. Neural Netw. 2006, 17, 683–695. [Google Scholar] [CrossRef] [Green Version]
  9. Golub, G.H.; Reinsch, C. Singular value decomposition and least squares solutions. In Linear Algebra; Springer: Berlin, Germany, 1971; pp. 134–151. [Google Scholar]
  10. Jolliffe, I. Principal Component Analysis; Wiley Online Library: Hoboken, NJ, USA, 2005. [Google Scholar]
  11. Ding, C.; Li, T.; Peng, W.; Park, H. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 126–135. [Google Scholar]
  12. Gillis, N. Nonnegative Matrix Factorization; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2020. [Google Scholar]
  13. Paatero, P.; Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 1994, 5, 111–126. [Google Scholar] [CrossRef]
  14. Anttila, P.; Paatero, P.; Tapper, U.; Järvinen, O. Source identification of bulk wet deposition in Finland by positive matrix factorization. Atmos. Environ. 1995, 29, 1705–1718. [Google Scholar] [CrossRef]
  15. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788. [Google Scholar] [CrossRef] [PubMed]
  16. Lee, D.D.; Seung, H.S. Algorithms for non-negative matrix factorization. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27 November–2 December 2000; pp. 556–562. [Google Scholar]
  17. Chu, M.; Diele, F.; Plemmons, R.; Ragni, S. Optimality, computation, and interpretation of nonnegative matrix factorizations. SIAM J. Matrix Anal. 2004, 4, 8030. [Google Scholar]
  18. Kim, H.; Park, H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 2008, 30, 713–730. [Google Scholar] [CrossRef]
  19. Lin, C. Projected gradient methods for nonnegative matrix factorization. Neural Comput. 2007, 19, 2756–2779. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  20. Cichocki, A.; Zdunek, R.; Amari, S.i. Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In International Conference on Independent Component Analysis and Signal Separation; Springer: Berlin, Germany, 2007; pp. 169–176. [Google Scholar]
  21. Halko, N.; Martinsson, P.G.; Tropp, J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 2011, 53, 217–288. [Google Scholar] [CrossRef]
  22. Yoo, J.; Choi, S. Orthogonal nonnegative matrix factorization: Multiplicative updates on Stiefel manifolds. In International Conference on Intelligent Data Engineering and Automated Learning; Springer: Berlin, Germany, 2008; pp. 140–147. [Google Scholar]
  23. Yoo, J.; Choi, S. Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on stiefel manifolds. Inf. Process. Manag. 2010, 46, 559–570. [Google Scholar] [CrossRef]
  24. Choi, S. Algorithms for orthogonal nonnegative matrix factorization. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1828–1832. [Google Scholar]
  25. Kim, D.; Sra, S.; Dhillon, I.S. Fast Projection-Based Methods for the Least Squares Nonnegative Matrix Approximation Problem. Stat. Anal. Data Mining 2008, 1, 38–51. [Google Scholar] [CrossRef] [Green Version]
  26. Kim, D.; Sra, S.; Dhillon, I.S. Fast Newton-type methods for the least squares nonnegative matrix approximation problem. In Proceedings of the 2007 SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 343–354. [Google Scholar]
  27. Kim, J.; Park, H. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 353–362. [Google Scholar]
  28. Mirzal, A. A convergent algorithm for orthogonal nonnegative matrix factorization. J. Comput. Appl. Math. 2014, 260, 149–166. [Google Scholar] [CrossRef]
  29. Lin, C.J. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans. Neural Netw. 2007, 18, 1589–1596. [Google Scholar]
  30. Esposito, F.; Boccarelli, A.; Del Buono, N. An NMF-Based methodology for selecting biomarkers in the landscape of genes of heterogeneous cancer-associated fibroblast Populations. Bioinform. Biol. Insights 2020, 14, 1–13. [Google Scholar] [CrossRef] [PubMed]
  31. Peng, S.; Ser, W.; Chen, B.; Lin, Z. Robust orthogonal nonnegative matrix tri-factorization for data representation. Knowl.-Based Syst. 2020, 201, 106054. [Google Scholar] [CrossRef]
  32. Leplat, N.V.; Gillis, A.A. Blind audio source separation with minimum-volume beta-divergence NMF. IEEE Trans. Signal Process. 2020, 68, 3400–3410. [Google Scholar] [CrossRef]
  33. Casalino, G.; Coluccia, M.; Pati, M.L.; Pannunzio, A.; Vacca, A.; Scilimati, A.; Perrone, M.G. Intelligent microarray data analysis through non-negative matrix factorization to study human multiple myeloma cell lines. Appl. Sci. 2019, 9, 5552. [Google Scholar] [CrossRef] [Green Version]
  34. Ge, S.; Luo, L.; Li, H. Orthogonal incremental non-negative matrix factorization algorithm and its application in image classification. Comput. Appl. Math. 2020, 39, 1–16. [Google Scholar] [CrossRef]
  35. Bertsekas, D. Nonlinear Programming; Athena Scientific optimization and Computation Series; Athena Scientific: Nashua, NH, USA, 2016. [Google Scholar]
  36. Richtárik, P.; Takác, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 2014, 144, 1–38. [Google Scholar] [CrossRef] [Green Version]
  37. The MathWorks. MATLAB Version R2019a; The MathWorks: Natick, MA, USA, 2019. [Google Scholar]
  38. Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S.i. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
  39. Piper, J.; Pauca, V.P.; Plemmons, R.J.; Giffin, M. Object Characterization from Spectral Data Using Nonnegative Factorization and Information theory. In Proceedings of the AMOS Technical Conference; 2004. Available online: http://users.wfu.edu/plemmons/papers/Amos2004_2.pdf (accessed on 8 November 2020).
  40. Mirzal, A. A Convergent Algorithm for Bi-orthogonal Nonnegative Matrix Tri-Factorization. arXiv 2017, arXiv:1710.11478. [Google Scholar]
  41. Armijo, L. Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 1966, 16, 1–3. [Google Scholar] [CrossRef] [Green Version]
  42. Lin, C.J.; Moré, J.J. Newton’s method for large bound-constrained optimization problems. SIAM J. Optim. 1999, 9, 1100–1127. [Google Scholar] [CrossRef] [Green Version]
Figure 1. This figure depicts data from Table 2. It contains six plots which illustrate the quality of Algorithms 1–3 regarding RSE on UNION instances with n = 100, 500, 1000, for β ∊ {1, 10, 100, 1000}. We can see that regarding RSE the performance of these algorithms on this dataset does not differ a lot. As expected, larger values of β yield larger values of RSE, but the differences are rather small. However, when p approaches 100% of k, Algorithm 3 comes closest to the global optimum RSE = 0.
Figure 2. This figure depicts data from Table 3. It contains six plots which illustrate the quality of Algorithms 1–3 regarding infeasibility on UNION instances with n = 100, 500, 1000, for β ∊ {1, 10, 100, 1000}. We can see that regarding infeasibility the performance of these algorithms on this dataset does not differ a lot. As expected, larger values of β yield smaller values of infeasG, but the differences are rather small.
Figure 3. This figure contains six plots which illustrate the quality of Algorithms 1–3 regarding RSE on BION instances with n = 100, 500, 1000 and k = 0.2n, 0.4n, for β ∊ {1, 10, 100, 1000}. We can observe that Algorithm 3 is more stable, less dependent on the choice of β, and computes better values of RSE.
Figure 4. This figure contains six plots which illustrate the quality of Algorithms 1–3 regarding the infeasibility on BION instances with n = 100, 500, 1000 and k = 0.2n, 0.4n, for β ∊ {1, 10, 100, 1000}. We can observe that Algorithm 3 computes solutions with infeasibility (18) slightly smaller compared to solutions computed by Algorithm 2.
Figure 5. On the left, we depict how RSE is changing with increasing the inner dimension p from 20% of real inner dimension k to 140% of k. For each algorithm, we depict RSE on the original BION data and on the noisy BION data with µ ∊ {10−2, 10−4, 10−6}. On the right plots, we demonstrate how (in)feasible are the optimum solutions obtained by each algorithm, for different relative inner dimensions p, for the original and the noisy BION data.
Figure 6. This figure depicts how RSE is changing with the number of outer iterations (left) and with time (right), for all three algorithms. Computations are done on the noisy BION data set with µ = 10−2, for n = 200 and the inner dimension was equal to the true inner dimension (p = 100 %).
Table 1. Pairs (n, k) for which we created the UNION and BION datasets.
n      50     100    200    500    1000
k_1    10     20     40     100    200
k_2    20     40     80     200    400
Table 2. In this table we demonstrate how good an RSE is achieved by Algorithms 1–3 on the UNION dataset. For each n ∈ {50, 100, 200, 500, 1000} we take all 10 matrices R (five of them corresponding to k = 0.2n and five to k = 0.4n). We run all three algorithms on these matrices with inner dimensions p ∈ {0.2k, 0.4k, …, 1.0k} and with all possible values of β ∈ {1, 10, 100, 1000}. Each row represents the average (arithmetic mean) RSE obtained on instances corresponding to the given n. For example, the last row shows the average value of RSE over the 10 instances of dimension 1000 (five of them corresponding to k = 200 and five to k = 400) obtained by all three algorithms for all four values of β, which were run with the input dimension p = k. The bold number is the smallest one in each line.
n    | p (% of k) | RSE, Alg. 1 | RSE, Alg. 2 (β=1) | Alg. 2 (β=10) | Alg. 2 (β=100) | Alg. 2 (β=1000) | RSE, Alg. 3 (β=1) | Alg. 3 (β=10) | Alg. 3 (β=100) | Alg. 3 (β=1000)
50   | 40  | 0.3143 | 0.2965 | 0.3070 | 0.3329 | 0.3898 | 0.2963 | 0.3081 | 0.3425 | 0.3508
50   | 60  | 0.2348 | 0.2227 | 0.2356 | 0.2676 | 0.3459 | 0.2201 | 0.2382 | 0.2733 | 0.2765
50   | 80  | 0.1738 | 0.1492 | 0.1634 | 0.1894 | 0.3277 | 0.1468 | 0.1620 | 0.1953 | 0.2053
50   | 100 | 0.0002 | 0.0133 | 0.0004 | 0.0932 | 0.2973 | 0.0000 | 0.0000 | 0.0000 | 0.0000
100  | 20  | 0.4063 | 0.3914 | 0.3955 | 0.4063 | 0.4254 | 0.3906 | 0.3959 | 0.4083 | 0.4210
100  | 40  | 0.3384 | 0.3139 | 0.3210 | 0.3415 | 0.3677 | 0.3116 | 0.3210 | 0.3488 | 0.3625
100  | 60  | 0.2674 | 0.2462 | 0.2541 | 0.2730 | 0.2978 | 0.2403 | 0.2528 | 0.2801 | 0.2974
100  | 80  | 0.1847 | 0.1737 | 0.1581 | 0.1909 | 0.2263 | 0.1629 | 0.1744 | 0.1959 | 0.2090
100  | 100 | 0.0126 | 0.0532 | 0.0427 | 0.0089 | 0.1515 | 0.0000 | 0.0000 | 0.0000 | 0.0075
200  | 20  | 0.4213 | 0.4024 | 0.4077 | 0.4080 | 0.4257 | 0.4005 | 0.4032 | 0.4162 | 0.4337
200  | 40  | 0.3562 | 0.3315 | 0.3398 | 0.3401 | 0.3647 | 0.3270 | 0.3313 | 0.3497 | 0.3738
200  | 60  | 0.2845 | 0.2675 | 0.2746 | 0.2748 | 0.2955 | 0.2573 | 0.2617 | 0.2812 | 0.3061
200  | 80  | 0.1959 | 0.1958 | 0.2013 | 0.1996 | 0.2085 | 0.1773 | 0.1819 | 0.1960 | 0.2133
200  | 100 | 0.0191 | 0.0753 | 0.0632 | 0.0622 | 0.0415 | 0.0000 | 0.0000 | 0.0069 | 0.0181
500  | 20  | 0.4332 | 0.4120 | 0.4119 | 0.4120 | 0.4121 | 0.4092 | 0.4096 | 0.4197 | 0.4346
500  | 40  | 0.3711 | 0.3506 | 0.3509 | 0.3507 | 0.3505 | 0.3430 | 0.3440 | 0.3537 | 0.3753
500  | 60  | 0.3003 | 0.2919 | 0.2923 | 0.2916 | 0.2909 | 0.2756 | 0.2766 | 0.2845 | 0.3031
500  | 80  | 0.2098 | 0.2186 | 0.2192 | 0.2207 | 0.2151 | 0.1931 | 0.1941 | 0.1999 | 0.2122
500  | 100 | 0.0273 | 0.0822 | 0.0864 | 0.0853 | 0.0713 | 0.0002 | 0.0003 | 0.0002 | 0.0097
1000 | 20  | 0.4386 | 0.4195 | 0.4194 | 0.4193 | 0.4195 | 0.4156 | 0.4160 | 0.4216 | 0.4324
1000 | 40  | 0.3777 | 0.3641 | 0.3640 | 0.3638 | 0.3637 | 0.3545 | 0.3548 | 0.3588 | 0.3707
1000 | 60  | 0.3070 | 0.3047 | 0.3055 | 0.3051 | 0.3036 | 0.2881 | 0.2880 | 0.2906 | 0.3006
1000 | 80  | 0.2164 | 0.2265 | 0.2248 | 0.2254 | 0.2236 | 0.2024 | 0.2029 | 0.2050 | 0.2106
1000 | 100 | 0.0329 | 0.0725 | 0.0772 | 0.0761 | 0.0709 | 0.0173 | 0.0030 | 0.0035 | 0.0035
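As a rough guide to how the averages in Table 2 can be assembled, the sketch below computes a relative error of the form ||R − G H||_F / ||R||_F (a common choice; the paper's exact definition of RSE is given in the methods section) and averages it over a list of instances. The function run_algorithm is a hypothetical placeholder for any of Algorithms 1–3 and is assumed to return the two factors.

    import numpy as np

    def rse(R, G, H):
        # Relative Frobenius error of the factorization R ≈ G @ H
        # (assumed form; see the definition of RSE earlier in the paper).
        return np.linalg.norm(R - G @ H) / np.linalg.norm(R)

    def average_rse(instances, p, beta, run_algorithm):
        # `instances` is a list of data matrices R sharing the same n;
        # `run_algorithm(R, p, beta)` stands in for Algorithm 1, 2 or 3
        # and is expected to return the factors (G, H).
        values = [rse(R, *run_algorithm(R, p, beta)) for R in instances]
        return float(np.mean(values))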
Table 3. Feasibility (orthonormality) of the solutions G computed by Algorithms 1–3 on the UNION dataset, i.e., the average infeasibility of the solutions underlying Table 2. The smallest value in each row indicates the best-performing method.
n    | p (% of k) | Infeas., Alg. 1 | Infeas., Alg. 2 (β=1) | Alg. 2 (β=10) | Alg. 2 (β=100) | Alg. 2 (β=1000) | Infeas., Alg. 3 (β=1) | Alg. 3 (β=10) | Alg. 3 (β=100) | Alg. 3 (β=1000)
50   | 20  | 0.0964 | 0.2490 | 0.0924 | 0.0155 | 0.0038 | 0.2298 | 0.0909 | 0.0154 | 0.0022
50   | 40  | 0.0740 | 0.1886 | 0.0676 | 0.0131 | 0.0040 | 0.1845 | 0.0670 | 0.0135 | 0.0023
50   | 60  | 0.0553 | 0.1324 | 0.0465 | 0.0068 | 0.0040 | 0.1245 | 0.0440 | 0.0091 | 0.0015
50   | 80  | 0.0324 | 0.0964 | 0.0241 | 0.0053 | 0.0034 | 0.0789 | 0.0250 | 0.0069 | 0.0020
50   | 100 | 0.0023 | 0.0257 | 0.0022 | 0.0023 | 0.0039 | 0.0000 | 0.0000 | 0.0000 | 0.0000
100  | 20  | 0.0774 | 0.2624 | 0.1441 | 0.0258 | 0.0064 | 0.2588 | 0.1308 | 0.0258 | 0.0036
100  | 40  | 0.0539 | 0.1754 | 0.0928 | 0.0168 | 0.0036 | 0.1654 | 0.0819 | 0.0182 | 0.0035
100  | 60  | 0.0400 | 0.1205 | 0.0545 | 0.0102 | 0.0024 | 0.1109 | 0.0487 | 0.0138 | 0.0033
100  | 80  | 0.0239 | 0.0890 | 0.0324 | 0.0062 | 0.0022 | 0.0623 | 0.0258 | 0.0083 | 0.0018
100  | 100 | 0.0062 | 0.0452 | 0.0153 | 0.0009 | 0.0016 | 0.0002 | 0.0000 | 0.0000 | 0.0000
200  | 20  | 0.0584 | 0.2157 | 0.1437 | 0.1433 | 0.0054 | 0.2087 | 0.1512 | 0.0348 | 0.0074
200  | 40  | 0.0356 | 0.1379 | 0.1004 | 0.1000 | 0.0036 | 0.1240 | 0.0806 | 0.0207 | 0.0053
200  | 60  | 0.0260 | 0.0955 | 0.0791 | 0.0793 | 0.0031 | 0.0754 | 0.0434 | 0.0143 | 0.0047
200  | 80  | 0.0154 | 0.0657 | 0.0634 | 0.0629 | 0.0017 | 0.0416 | 0.0218 | 0.0080 | 0.0026
200  | 100 | 0.0059 | 0.0412 | 0.0517 | 0.0512 | 0.0016 | 0.0002 | 0.0001 | 0.0002 | 0.0001
500  | 20  | 0.0332 | 0.1587 | 0.1894 | 0.1908 | 0.1908 | 0.1475 | 0.1268 | 0.0436 | 0.0087
500  | 40  | 0.0189 | 0.1155 | 0.1343 | 0.1349 | 0.1347 | 0.0770 | 0.0621 | 0.0227 | 0.0069
500  | 60  | 0.0134 | 0.0889 | 0.1095 | 0.1102 | 0.1055 | 0.0412 | 0.0312 | 0.0123 | 0.0038
500  | 80  | 0.0084 | 0.0656 | 0.0946 | 0.0954 | 0.0826 | 0.0300 | 0.0154 | 0.0061 | 0.0021
500  | 100 | 0.0050 | 0.0499 | 0.0847 | 0.0853 | 0.0693 | 0.0249 | 0.0003 | 0.0001 | 0.0001
1000 | 20  | 0.0211 | 0.1200 | 0.1344 | 0.1349 | 0.1350 | 0.1043 | 0.0970 | 0.0471 | 0.0097
1000 | 40  | 0.0122 | 0.0863 | 0.0951 | 0.0954 | 0.0954 | 0.0542 | 0.0422 | 0.0199 | 0.0059
1000 | 60  | 0.0073 | 0.0662 | 0.0776 | 0.0779 | 0.0779 | 0.0414 | 0.0205 | 0.0098 | 0.0037
1000 | 80  | 0.0045 | 0.0539 | 0.0671 | 0.0675 | 0.0675 | 0.0336 | 0.0103 | 0.0047 | 0.0018
1000 | 100 | 0.0040 | 0.0475 | 0.0600 | 0.0603 | 0.0604 | 0.0296 | 0.0066 | 0.0005 | 0.0003
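The infeasibility values above quantify how far the computed G is from having orthonormal columns. Since Equation (18) is not reproduced in this excerpt, the sketch below uses the natural surrogate ||GᵀG − I||_F / ||I||_F; the paper's exact normalization may differ.

    import numpy as np

    def infeasibility(G):
        # Deviation of G from column-orthonormality, i.e. from G.T @ G = I.
        # A plausible stand-in for Equation (18); the exact formula may differ.
        p = G.shape[1]
        return np.linalg.norm(G.T @ G - np.eye(p)) / np.linalg.norm(np.eye(p))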
Table 4. RSE obtained by Algorithms 1–3 on the BION data. For the latter two algorithms we used α = β ∈ {1, 10, 100, 1000}. For each n ∈ {50, 100, 200, 500, 1000} we take all ten matrices R (five corresponding to k = 0.2 n and five to k = 0.4 n) and run all three algorithms with inner dimensions p ∈ {0.2 k, 0.4 k, …, 1.0 k} and all values of α = β. As before, each row reports the average (arithmetic mean) RSE over the instances corresponding to the given n and the given p as a percentage of k. We can see that the larger the β, the worse the RSE, which is consistent with expectations. The smallest value in each row indicates the best-performing method.
n    | p (% of k) | RSE, Alg. 1 | RSE, Alg. 2 (β=1) | Alg. 2 (β=10) | Alg. 2 (β=100) | Alg. 2 (β=1000) | RSE, Alg. 3 (β=1) | Alg. 3 (β=10) | Alg. 3 (β=100) | Alg. 3 (β=1000)
50   | 20  | 0.7053 | 0.7053 | 0.7053 | 0.7053 | 0.8283 | 0.7053 | 0.7053 | 0.7055 | 0.8259
50   | 40  | 0.6108 | 0.6108 | 0.6108 | 0.6108 | 0.9066 | 0.6108 | 0.6108 | 0.6108 | 0.6631
50   | 60  | 0.4987 | 0.4987 | 0.4987 | 0.5442 | 0.9665 | 0.4987 | 0.4987 | 0.4987 | 0.5000
50   | 80  | 0.3526 | 0.3671 | 0.3742 | 0.4497 | 1.0282 | 0.3526 | 0.3796 | 0.3527 | 0.4374
50   | 100 | 0.0607 | 0.1712 | 0.2786 | 0.5198 | 1.0781 | 0.1145 | 0.1820 | 0.2604 | 0.3689
100  | 20  | 0.7516 | 0.7516 | 0.7516 | 0.7517 | 0.9070 | 0.7516 | 0.7516 | 0.7517 | 0.8224
100  | 40  | 0.6509 | 0.6509 | 0.6509 | 0.7174 | 0.9779 | 0.6509 | 0.6509 | 0.6509 | 0.6514
100  | 60  | 0.5315 | 0.5315 | 0.5315 | 0.5504 | 1.0401 | 0.5315 | 0.5315 | 0.5315 | 0.5352
100  | 80  | 0.3758 | 0.3787 | 0.4106 | 0.4542 | 1.1082 | 0.3801 | 0.3888 | 0.3917 | 0.3898
100  | 100 | 0.1377 | 0.1993 | 0.3311 | 0.4898 | 1.1734 | 0.0457 | 0.1016 | 0.2758 | 0.3757
200  | 20  | 0.7884 | 0.7884 | 0.7884 | 0.7884 | 0.9499 | 0.7884 | 0.7884 | 0.7884 | 0.7888
200  | 40  | 0.6828 | 0.6828 | 0.6828 | 0.6828 | 1.0325 | 0.6828 | 0.6828 | 0.6828 | 0.6828
200  | 60  | 0.5575 | 0.5575 | 0.5575 | 0.5647 | 1.0938 | 0.5575 | 0.5575 | 0.5575 | 0.5610
200  | 80  | 0.3942 | 0.3942 | 0.3965 | 0.5019 | 1.1618 | 0.3942 | 0.3942 | 0.3942 | 0.4373
200  | 100 | 0.1447 | 0.1851 | 0.3014 | 0.5400 | 1.2297 | 0.0202 | 0.1429 | 0.2964 | 0.3315
500  | 20  | 0.8242 | 0.8242 | 0.8242 | 0.8242 | 0.9956 | 0.8242 | 0.8242 | 0.8242 | 0.8243
500  | 40  | 0.7138 | 0.7138 | 0.7138 | 0.7138 | 1.0679 | 0.7138 | 0.7138 | 0.7138 | 0.7138
500  | 60  | 0.5828 | 0.5828 | 0.5828 | 0.6045 | 1.1534 | 0.5828 | 0.5828 | 0.5828 | 0.5828
500  | 80  | 0.4121 | 0.4121 | 0.4203 | 0.5285 | 1.2160 | 0.4121 | 0.4121 | 0.4121 | 0.4334
500  | 100 | 0.1405 | 0.1814 | 0.3401 | 0.5854 | 1.2822 | 0.0067 | 0.1059 | 0.2044 | 0.3378
1000 | 20  | 0.8436 | 0.8436 | 0.8436 | 0.8436 | 1.0261 | 0.8436 | 0.8436 | 0.8436 | 0.8436
1000 | 40  | 0.7306 | 0.7306 | 0.7306 | 0.7309 | 1.0916 | 0.7306 | 0.7306 | 0.7306 | 0.7306
1000 | 60  | 0.5965 | 0.5965 | 0.5965 | 0.6121 | 1.1669 | 0.5965 | 0.5965 | 0.5965 | 0.5968
1000 | 80  | 0.4218 | 0.4218 | 0.4256 | 0.5338 | 1.2389 | 0.4218 | 0.4218 | 0.4218 | 0.4397
1000 | 100 | 0.1346 | 0.1635 | 0.3324 | 0.5755 | 1.3080 | 0.0096 | 0.0697 | 0.1661 | 0.2188
Table 5. Feasibility (orthonormality) of the solutions G and H computed by Algorithms 1–3 on the BION dataset, i.e., the average infeasibility (18) of the solutions underlying Table 4. We can observe that with these settings all algorithms very often bring the infeasibility down to the order of 10^−3, for all values of β. The smallest value in each row indicates the best-performing method.
n    | p (% of k) | Infeas., Alg. 1 | Infeas., Alg. 2 (β=1) | Alg. 2 (β=10) | Alg. 2 (β=100) | Alg. 2 (β=1000) | Infeas., Alg. 3 (β=1) | Alg. 3 (β=10) | Alg. 3 (β=100) | Alg. 3 (β=1000)
50   | 20  | 0.0001 | 0.0070 | 0.0036 | 0.0010 | 0.0068 | 0.0017 | 0.0021 | 0.0021 | 0.0026
50   | 40  | 0.0000 | 0.0041 | 0.0021 | 0.0004 | 0.0056 | 0.0008 | 0.0012 | 0.0012 | 0.0014
50   | 60  | 0.0000 | 0.0030 | 0.0009 | 0.0032 | 0.0038 | 0.0005 | 0.0008 | 0.0009 | 0.0009
50   | 80  | 0.0000 | 0.0183 | 0.0030 | 0.0021 | 0.0028 | 0.0004 | 0.0202 | 0.0006 | 0.0013
50   | 100 | 0.0355 | 0.0533 | 0.0127 | 0.0045 | 0.0027 | 0.0418 | 0.0478 | 0.0123 | 0.0021
100  | 20  | 0.0001 | 0.0051 | 0.0024 | 0.0006 | 0.0063 | 0.0010 | 0.0012 | 0.0013 | 0.0016
100  | 40  | 0.0000 | 0.0029 | 0.0017 | 0.0066 | 0.0040 | 0.0004 | 0.0006 | 0.0007 | 0.0007
100  | 60  | 0.0000 | 0.0019 | 0.0008 | 0.0009 | 0.0027 | 0.0003 | 0.0004 | 0.0005 | 0.0005
100  | 80  | 0.0000 | 0.0039 | 0.0048 | 0.0015 | 0.0021 | 0.0062 | 0.0149 | 0.0037 | 0.0006
100  | 100 | 0.0606 | 0.0454 | 0.0105 | 0.0022 | 0.0018 | 0.0106 | 0.0228 | 0.0173 | 0.0028
200  | 20  | 0.0002 | 0.0033 | 0.0019 | 0.0005 | 0.0043 | 0.0005 | 0.0007 | 0.0007 | 0.0007
200  | 40  | 0.0001 | 0.0017 | 0.0010 | 0.0002 | 0.0027 | 0.0002 | 0.0003 | 0.0004 | 0.0003
200  | 60  | 0.0001 | 0.0010 | 0.0005 | 0.0004 | 0.0019 | 0.0001 | 0.0002 | 0.0002 | 0.0004
200  | 80  | 0.0000 | 0.0006 | 0.0006 | 0.0015 | 0.0014 | 0.0001 | 0.0001 | 0.0002 | 0.0013
200  | 100 | 0.0425 | 0.0280 | 0.0064 | 0.0019 | 0.0015 | 0.0046 | 0.0224 | 0.0240 | 0.0034
500  | 20  | 0.0001 | 0.0017 | 0.0011 | 0.0003 | 0.0025 | 0.0002 | 0.0003 | 0.0003 | 0.0003
500  | 40  | 0.0001 | 0.0008 | 0.0005 | 0.0001 | 0.0016 | 0.0001 | 0.0001 | 0.0002 | 0.0002
500  | 60  | 0.0000 | 0.0005 | 0.0003 | 0.0006 | 0.0013 | 0.0001 | 0.0001 | 0.0001 | 0.0002
500  | 80  | 0.0000 | 0.0003 | 0.0009 | 0.0009 | 0.0008 | 0.0000 | 0.0001 | 0.0001 | 0.0016
500  | 100 | 0.0258 | 0.0184 | 0.0045 | 0.0013 | 0.0007 | 0.0017 | 0.0101 | 0.0175 | 0.0053
1000 | 20  | 0.0001 | 0.0010 | 0.0006 | 0.0002 | 0.0024 | 0.0001 | 0.0002 | 0.0002 | 0.0002
1000 | 40  | 0.0000 | 0.0005 | 0.0003 | 0.0001 | 0.0009 | 0.0001 | 0.0002 | 0.0003 | 0.0002
1000 | 60  | 0.0000 | 0.0003 | 0.0002 | 0.0004 | 0.0009 | 0.0003 | 0.0002 | 0.0003 | 0.0003
1000 | 80  | 0.0000 | 0.0002 | 0.0005 | 0.0007 | 0.0006 | 0.0040 | 0.0001 | 0.0002 | 0.0020
1000 | 100 | 0.0173 | 0.0117 | 0.0031 | 0.0009 | 0.0005 | 0.0043 | 0.0050 | 0.0121 | 0.0060
Table 6. Numerical results obtained by running all three algorithms on the third dataset—the BION data with three levels of noise, μ ∈ {10^−2, 10^−4, 10^−6}. The smallest value in each row indicates the best-performing method.
μ      | p (% of k) | n   | Alg. 1: RSE | infeas. G | infeas. H | Alg. 2: RSE | infeas. G | infeas. H | Alg. 3: RSE | infeas. G | infeas. H
10^−2  | 20  | 200 | 0.7877 | 0.0019 | 0.0019 | 0.7877 | 0.0041 | 0.0041 | 0.7877 | 0.0042 | 0.0042
10^−2  | 40  | 200 | 0.6821 | 0.0014 | 0.0014 | 0.6823 | 0.0024 | 0.0024 | 0.6821 | 0.0027 | 0.0028
10^−2  | 60  | 200 | 0.5570 | 0.0010 | 0.0010 | 0.5571 | 0.0015 | 0.0015 | 0.5569 | 0.0020 | 0.0020
10^−2  | 80  | 200 | 0.3939 | 0.0007 | 0.0007 | 0.3939 | 0.0010 | 0.0010 | 0.3938 | 0.0015 | 0.0015
10^−2  | 100 | 200 | 0.0072 | 0.0003 | 0.0003 | 0.1327 | 0.0097 | 0.0098 | 0.0039 | 0.0010 | 0.0010
10^−2  | 120 | 200 | 0.0278 | 0.0483 | 0.0483 | 0.1325 | 0.0283 | 0.0326 | 0.0036 | 0.0465 | 0.0141
10^−2  | 140 | 200 | 0.0344 | 0.0589 | 0.0590 | 0.1909 | 0.0366 | 0.0363 | 0.0034 | 0.0565 | 0.0188
10^−4  | 20  | 200 | 0.7884 | 0.0002 | 0.0002 | 0.7884 | 0.0018 | 0.0018 | 0.7884 | 0.0003 | 0.0003
10^−4  | 40  | 200 | 0.6828 | 0.0001 | 0.0001 | 0.6828 | 0.0009 | 0.0009 | 0.6828 | 0.0001 | 0.0001
10^−4  | 60  | 200 | 0.5575 | 0.0001 | 0.0001 | 0.5575 | 0.0006 | 0.0006 | 0.5575 | 0.0001 | 0.0001
10^−4  | 80  | 200 | 0.3942 | 0.0000 | 0.0000 | 0.3942 | 0.0004 | 0.0003 | 0.3942 | 0.0001 | 0.0001
10^−4  | 100 | 200 | 0.0575 | 0.0086 | 0.0086 | 0.1717 | 0.0089 | 0.0192 | 0.0001 | 0.0004 | 0.0004
10^−4  | 120 | 200 | 0.0043 | 0.0490 | 0.0489 | 0.1407 | 0.0275 | 0.0321 | 0.0003 | 0.0468 | 0.0159
10^−4  | 140 | 200 | 0.0049 | 0.0596 | 0.0596 | 0.1743 | 0.0363 | 0.0390 | 0.0003 | 0.0558 | 0.0221
10^−6  | 20  | 200 | 0.7884 | 0.0001 | 0.0001 | 0.7884 | 0.0017 | 0.0017 | 0.7884 | 0.0002 | 0.0002
10^−6  | 40  | 200 | 0.6828 | 0.0000 | 0.0000 | 0.6828 | 0.0009 | 0.0008 | 0.6828 | 0.0001 | 0.0001
10^−6  | 60  | 200 | 0.5575 | 0.0000 | 0.0000 | 0.5575 | 0.0005 | 0.0005 | 0.5575 | 0.0001 | 0.0001
10^−6  | 80  | 200 | 0.3942 | 0.0000 | 0.0000 | 0.3942 | 0.0003 | 0.0003 | 0.3942 | 0.0001 | 0.0001
10^−6  | 100 | 200 | 0.0624 | 0.0092 | 0.0091 | 0.1966 | 0.0159 | 0.0167 | 0.0137 | 0.0029 | 0.0006
10^−6  | 120 | 200 | 0.0031 | 0.0490 | 0.0492 | 0.1250 | 0.0301 | 0.0309 | 0.0003 | 0.0478 | 0.0179
10^−6  | 140 | 200 | 0.0051 | 0.0595 | 0.0597 | 0.1692 | 0.0367 | 0.0381 | 0.0003 | 0.0562 | 0.0268
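The sweep behind Table 6 (and Figure 5) varies the relative inner dimension p from 20% to 140% of the true k for each noise level μ, on the n = 200 BION instances. Below is a minimal orchestration sketch of such a sweep; load_noisy_bion and run_algorithm are hypothetical placeholders for the data loader and for Algorithms 1–3, and k = 40 is just one of the two values paired with n = 200 in Table 1.

    import numpy as np

    def sweep(load_noisy_bion, run_algorithm, n=200, k=40,
              noise_levels=(1e-2, 1e-4, 1e-6),
              fractions=(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4)):
        """Collect (mu, p as % of k, relative error) triples; both callables are user-supplied stubs."""
        results = []
        for mu in noise_levels:
            R = load_noisy_bion(n, k, mu)           # noisy BION matrix (placeholder loader)
            for frac in fractions:
                p = max(1, int(round(frac * k)))    # inner dimension from 20% to 140% of k
                G, H = run_algorithm(R, p)          # placeholder for Algorithm 1, 2 or 3
                err = np.linalg.norm(R - G @ H) / np.linalg.norm(R)
                results.append((mu, int(round(frac * 100)), err))
        return results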
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
