The Generalization of Non-Negative Matrix Factorization Based on Algorithmic Stability

Sun, Haichao; Yang, Jie

doi:10.3390/electronics12051147

Open AccessArticle

The Generalization of Non-Negative Matrix Factorization Based on Algorithmic Stability

by

Haichao Sun

¹ and

Jie Yang

^2,*

¹

Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

²

School of Physics and Electronic Science, Zunyi Normal University, Zunyi 563002, China

^*

Author to whom correspondence should be addressed.

Electronics 2023, 12(5), 1147; https://doi.org/10.3390/electronics12051147

Submission received: 5 February 2023 / Revised: 23 February 2023 / Accepted: 23 February 2023 / Published: 27 February 2023

(This article belongs to the Special Issue Advances in Fuzzy and Intelligent Systems)

Download

Browse Figures

Versions Notes

Abstract

:

The Non-negative Matrix Factorization (NMF) is a popular technique for intelligent systems, which can be widely used to decompose a nonnegative matrix into two factor matrices: a basis matrix and a coefficient one, respectively. The main objective of NMF is to ensure that the operation results of the two matrices are as close to the original matrix as possible. Meanwhile, the stability and generalization ability of the algorithm should be ensured. Therefore, the generalization performance of NMF algorithms is analyzed from the perspective of algorithm stability and the generalization error bounds are given, which is named AS-NMF. Firstly, a general NMF prediction algorithm is proposed, which can predict the labels for new samples, and then the corresponding loss function is defined further. Secondly, the stability of the NMF algorithm is defined according to the loss function, and two generalization error bounds can be obtained by employing uniform stability in the case where U is fixed and it is not fixed under the multiplicative update rule. The bounds numerically show that its stability parameter depends on the upper bound on the module length of the input data, dimension of hidden matrix and Frobenius norm of the basis matrix. Finally, a general and stable framework is established, which can analyze and measure generalization error bounds for the NMF algorithm. The experimental results demonstrate the advantages of new methods on three widely used benchmark datasets, which indicate that our AS-NMF can not only achieve efficient performance, but also outperform the state-of-the-art of recommending tasks in terms of model stability.

Keywords:

stability analysis; multiplicative rule; uniformly stability; generalization error bound

1. Introduction

Currently, Non-negative Matrix Factorization (NMF) [1,2,3,4] is gaining more attention in the machine learning method community, and it has become one of the most effective feature selection methods implemented in various fields, such as documents of clustering and classification [5,6], face recognition [7,8], source separation [9,10], digital image processing [11,12] and video semantic recognition [13]. Unfortunately, due to the non-negative constraints on the two factor matrices (basic matrix and conefficient matrix), only additive operations are allowed, resulting in only parts-based representation for NMF.

How to minimize the reconstruction error of NMF on the training and test datasets is a hot issue in NMF research. Most recent studies aim to use different regularization terms and loss functions, which are used to optimize the generalization performance of the NMF algorithm. Furthermore, in order to decompose non-negative matrices robustly, many loss functions have been proposed and used to improve their robustness, such as the

l_{1}

loss [14,15] and the

l_{2}

loss [16]. Sandler and Lindenbaum [12] proposed NMF algorithms that minimize error between the data and matrix using earth mover’s distance (EMD). Lee and Seung [17] provided a valid multiplication update rule (MUR) for NMF. In addition, better generalization performance can be achieved by using different regularization, which is accomplished by strengthening the weight on many attributes in the NMF. Incorporating sparseness [18] and the orthogonality penalty in NMF [19,20] can achieve sparse parts-based representation. Ye Zhang et al. [21] explored the blind separation problem for uncorrelated signals and proposed NMF methods with the least correlated component constraints for uncorrelated source signals. Nevertheless, for the electroencephalography (EEG) signals, the greater challenge is the nature of noise and capturing the inconspicuous relations between EEG signals and specific brain activities; Chen et al. [22] introduced a novel spatio-temporal, preserving representations of raw EEG streams to precisely identify human intentions. Zhang et al. [23] proposed an incremental NMF based on correlation and graph regularizer, which considers the geometric structure in the data and correlation among the data. Soon after, Liu et al. [24] proposed a large-cone NMF (LCNMF) algorithm for the problem of obtaining an attractive local solution for NMF, which can obtain attractive local solutions from a large-cone penalty framework for NMF, and they prove that LCNMF can obtain a smaller generalization upper bound under the Vapnik Chervonenkis theory. Therefore, it is interesting to characterize the NMF generalization error upper bound through some theoretical framework for the perspective of NMF generalization performance.

The capacity of an intelligent system is commonly evaluated by its generalization ability; the stronger the generalization ability, the more accurate a system’s prediction of new events. Intelligent systems are an essential approximation of the real-model of a problem, which is called a hypothesis. The risk is defined as an error between the true solution and hypothesis or the accumulation of errors. However, if the real model is unknown, the risk cannot be calculated. Meanwhile, empirical risk is used to approximate the true solution. However, the difference between the predicted results from the hypothesis overfit on the sample data and the true solution of the samples can lead to inaccuracies.

Recently, intelligent systems have increasingly adopted the strategy of empirical risk minimization to find the optimal solution. However, it has been shown that this approach can lead to overfitting of the classification function, resulting in poor generalization performance. The reason for this is that the empirical risk, which is used to approximate the true risk, is not a reliable indicator of the algorithm’s performance due to the limited number of training samples compared to the real-world data to be classified.

To address this issue, the concept of generalization error bounds has been established in statistics, which provides a measure of the bias and convergence rate between the empirical risk and expected risk of a learning algorithm. In order to accurately assess the performance of a learning algorithm, both the empirical risk and confidence risk must be taken into account. The empirical risk measures the error of the classifier on a given sample, while the confidence risk reflects the degree of trust we can place in the classifier’s predictions on new, unknown data.

Hence, the key point is to improve algorithm efficiency and performance, especially for real-world problems [25].

Obviously, there is no way to calculate the confidence risk accurately; therefore, only an estimated interval can be given, which also makes it possible to calculate only an upper bound on the overall error with the accurate value. For this reason it is called the generalization error bound. There is a number of methods [26,27,28,29] that have been proposed to solve this problem. One of the different approaches, based on sensitivity analysis, was introduced and its purpose is to determine how many changes in the input data affect the output of the algorithm. The motivation for such analysis is to design robust systems that will not be affected by noise-corrupted inputs. The main motivation of this analysis is to design a robust system in which the inputs of this model are not affected by noise-corrupted pollution. Bousquet et al. [26] defined the stability concept of a learning algorithm based on sensitivity analysis, which showed how to derive generalized error boundaries based on empirical and omission errors.

The generalization error bound typically has the following properties: it is a function of increasing sample size, where the generalization error bound tends to zero; it is a function of hypothesis space capacity, where the larger the capacity of the hypothesis space, the harder the model is to learn, and the larger the generalization error bound.

Comparing the upper limits of generalization error is a common way to evaluate the generalization ability of learning algorithms. In this work, we analyze the generalization error bound of NMF algorithms with graph regularization using the stability properties of the algorithms.

Analyzing the generalization and stability of machine learning algorithms is a fundamental research topic in intelligent systems and data mining tasks. Our main contribution in this work is to study the generalization performance of the NMF algorithm through stability analysis. We use stability theory to prove the generalization bound of NMF and provide a qualitative description of its generalization performance.

The structure of this paper is as follows: Section 2 introduces necessary notions and definitions relevant to algorithmic stability; In Section 3, we characterize the stability of NMF algorithms through uniform stability and give the generalization of the upper bound of NMF algorithms when U is fixed, and when U is not fixed under the multiplicative rule; In Section 4, we describe the stability of non-negative matrix decomposition algorithms; In Section 5, we perform numerical experiments to demonstrate the conclusion proved in previous sections. In Section 6, we conclude the paper with a summary and outlook for further research.

2. Related Work

In traditional intelligent systems theory, methods for the bounding of generalization error can be classified into two categories: The first one is to constrain the generalization error bound by controlling the complexity of hypothesis spaces, as in the Vapnik-Chervonenkis (VC) theory [30] and Rademacher complexity [31]; The second category is based on algorithmic stability [26] and robustness-based analysis [32]. Based on the stability of the learning algorithm, this paper analyzes the stability and generalization of the NMF algorithm, qualitatively characterizes the generalization performance of NMF, and proves the generalization bound of NMF by using the stability theory.

The robustness of the machine learning method is considered by most real-world applications, especially human activity recognition (HAR) [33], and the stability of an algorithm is also an important theory to analyze the generalization in the machine learning method. The pioneering work can be tracked to [34], who introduced the idea that the expected sensitivity of the k-Nearest Neighbors algorithm to changes of individual examples could be used to obtain a variance bound of a leave-one-out estimator. Besides, the authors obtained a generalization bound for the k-nearest neighbor algorithm from this result. Kutin and Niyogi [35] introduced many weaker variants for stability, which could get a good and stable algorithm generalization boundary. Further, other notions of stability have been proposed, such as uniform argument stability [36], hypothesis stability [26], and hypothesis set stability [37]. Shalev-Shwartz et al. [38] proved that the stability of the algorithm also implies learnability for certain tasks. Hardt et al. [39] argued that models trained by the stochastic gradient method (SGM) with few iterations have fading away generalization error, which is algorithmically stable in the sense of Bousquet and Elisseeff in convex and non-convex optimization.

Recently, a generalization bound based on Bayesian Reinforcement Learning was obtained through the concept of algorithmic stability [40]. Additionally, Wu at al. [41] proposed an innovative unsupervised feature selection algorithm and obtained its generalization error bound by analyzing the stability of the algorithm. Our stability-based approach to NMF is based on the seminal work of Bousquet and Elisseeff [26]. To our knowledge, we are the first to provide a stability result for NMF.

3. Preliminaries

In this section, we begin with introduction of the notation and definitions associated with algorithmic stability analysis. Given a training set

S = {(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})}

drawn i.i.d. from a joint probability distribution

D (X, Y)

, we denote a labeled sample

(x, y)

by z, then

S = {z_{1}, z_{2}, \dots, z_{n}}

,

X \in R^{n}

, and

Y \in {- 1, + 1}

. Hypothesis space is given by the finite family of functions

H = {h_{1}, h_{2}, \dots, h_{d}}

, d is a function and h is called a hypothesis. Let h be the hypothesis chosen from

H

. Our loss function L is defined as the following:

L : Y \times Y \to R_{+},

(1)

then the loss

L_{z} (h)

of hypothesis h at point z can be defined as:

L_{z} (h) = L (h (x), y) .

(2)

The empirical risk (or loss)

\hat{R} (h)

of hypothesis h on the training set

S = {z_{1}, z_{2}, \dots, z_{n}}

is

\hat{R} (h) = \sum_{i = 1}^{n} L_{z_{i}} (h),

(3)

and its generalization error

R (h)

are defined by

R (h) = \underset{\begin{matrix} z \sim D \end{matrix}}{E} L_{z} (h) .

(4)

Given an algorithm

A

, let

h_{S} \in H

be the hypothesis returned by training algorithm

A

on the training set S, L is the loss function defined on S, if

L_{z} (h) \leq M (M \geq 0)

for any

h \in H

and

z \in X \times Y

, the loss function L is said to be bounded and

L \leq M

. This paper assumes that all hypotheses

h_{S}

returned by algorithm

A

on the training set S are bounded, i.e., for any

h_{S} \in H

,

L_{z} (h_{S}) \leq M .

A stable learning algorithm has the ability to remove an element of its train datasets without changing its results. If the differences between empirical and generalization error is considered as a variable, then it should have a small variance. And the smaller its expectation, the closer an empirical error for a stable algorithm is to its generalization error. In the following, we give the definition of consistent stability as followed.

Definition 1 ([26]).

Let S and

S^{^{'}}

be any two group of training samples that are divided into two parts by one single point,

h_{S}

and

h_{S^{'}}

are two hypotheses returned by algorithm

A

on the training sets S and

S^{^{'}}

, respectively, if for any

z \in S

and

z \in S^{^{'}}

, we have

| L_{z} (h_{S}) - L_{z} (h_{S^{^{'}}}) | \leq β,

(5)

where

S_{i}^{^{'}} = {(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{i}^{^{'}}, y_{i}^{^{'}}), \dots, (x_{n}, y_{n})},

then the learning algorithm

A

is said to be uniformly β-stable, and the smallest β, which satisfies the inequality, which is called the stability coefficient of

A .

The definition of uniform stability of the algorithm also states that if the training sets S and

S^{^{'}}

are similar, the loss incurred by the corresponding hypotheses

h_{S}

and

h_{S^{^{'}}}

is returned by

A

, which cannot split by more than β. A uniformly β-stable learning algorithm is frequently referred to as β-stable, or, if no specific is given, the algorithm is considered to be stable. In general, the coefficient β is determined by the size of samples n.

Theorem 1 ([26]).

Let the loss function

L \leq M (M > 0)

, the learning algorithm

A

be β-stable and the training set

S = {(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})}

drawn i.i.d. from a distribution

D

. Then it is at least possible that this inequality is true

1 - δ (0 < δ < 1)

for the hypothesis

h_{S}

obtained by learning the algorithm

A

on the training set S:

R (h_{S}) \leq \hat{R} (h_{S}) + β + (2 n β + M) \sqrt{\frac{l o g \frac{1}{δ}}{2 n}} .

(6)

The generalization bound given in Theorem 1 converges to

n β / \sqrt{n} = o (1)

, i.e.,

β = o (1 / \sqrt{n})

. For instance, when the coefficient

β = O (1 / n)

, Theorem 1 guarantees that

R (h_{S}) - \hat{R} (h_{S}) = O (1 / n)

with high probability.

4. Characterizing the Stability of Non-Negative Matrix Decomposition Algorithms

Numerous studies on NMF have concentrated on improving the generalization ability of NMF models by incorporating regularization terms or other methods. Additionally, generalization analysis has resulted in tighter generalization error bounds. However, no research has explored the generalization error bound for NMF algorithms using algorithmic stability analysis. Generally, analyzing the stability and generalization of learning algorithms is a fundamental and critical research task in intelligent systems. As such, this paper focuses on analyzing the generalization error bound of NMF algorithms through stability-based analysis.

The overview of our algorithm AS-NMF is described in Figure 1. There are three main parts: Algorithmic Stability, Definition of NMF based Algorithmic Stability and Generalization bound of NMF Algorithmic Stability. It defines the concept of algorithmic stability, which can analyze and measure generalization error bounds for NMF algorithm.

4.1. Stability of NMF

The above definition of algorithmic stability for binary classification problems can also be extended to the NMF algorithm. In this section, the NMF algorithm is primarily used to address the collaborative filtering problem, which involves item ratings as the dataset. These item ratings can be considered equivalent to labels in a binary classification problem. However, the datasets encountered an extended binary classification problem and a multi-categorical problem, i.e.,

{- 1, + 1}

to the item rating

{1, 2, \dots, R}

, where

R > 0

is the highest rating given to the item by a user. Let us assume that m is the number of users and n is the labels; each user’s rating for each item can be considered as an input and its prediction as an output; thus, the output of the NMF algorithm in the collaborative filtering algorithm is

Y = {1, 2, \dots, R}^{m \times n}

. In the collaborative filtering problem, each item is rated by m users, so each input datum

x \in R^{m}

on the training datasets

S = {(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})} \in X \times Y

, let

X = {x_{1}, x_{2}, \dots, x_{n}}

, then

X \in R^{m \times n}

.

In fact, the scoring matrix X is a sparse matrix, i.e., there are some missing values in the matrix. Let

Ω = {(i_{1}, j_{1}), (i_{2}, j_{2}), \dots, (i_{m}, j_{m})}

denotes the index of a known value in X, the transpose of

Ω

is denoted by

Ω^{T} = {(j_{1}, i_{1}), (j_{2}, i_{2}), \dots, (j_{m}, i_{m})}

. Thus, the training set can be denoted as

X = {X_{i, j} : (i, j) \in Ω}

, and

{(P_{Ω} (X))}_{i j} = \{\begin{matrix} X_{i j}, (i, j) \in Ω, \\ 0, otherwise . \end{matrix}

The matrix

X_{m \times n}

processed by NMF is non-negative, and the matrix X is decomposed with

X \approx P_{Ω} (U V^{T}), U, V \geq 0

(7)

where

U = {u_{1}, u_{2}, u_{3}, \dots, u_{m}} \in R_{+}^{m \times r}

is the basis matrix,

V = {v_{1}, v_{2}, v_{3}, \dots, v_{n}} \in R_{+}^{n \times r}

is the coefficient matrix,

r ≪ min {m, n}

.

The result of NMF is only an approximate solution and does not require X to be strictly equal to the inner product of U and V. It is necessary to establish a loss function to evaluate the degree of approximation before and after decomposition. Thus, the objective of the NMF problem becomes to find two non-negative matrices U and V that can minimize our loss function

L (X, U V^{T})

. There are many loss functions available; here, we have chosen the Euclidean distance:

L_{X} (U V^{T}) = {∥ P_{Ω} (X - U V^{T}) ∥}_{F}^{2},

(8)

The most important goal of NMF is to minimize the error of the loss function

L_{X} (U V^{T})

by optimizing

∥ P_{Ω} (X - U V^{T}) ∥_{F}^{2}

. In the presence of both U and V,

L_{X} (U V^{T})

can only be obtained as a local minimum since

L_{X} (U V^{T})

is non-convex. However, if the basis matrix U is fixed, L is a convex function, thus, V is chosen to minimize

L_{X} (U V^{T})

. The family of hypothesis functions

H_{U}

, induced by U on X for NMF is defined as follows:

H_{U} (X) = \{h_{U} (x) = min_{v \in R_{+}^{r}} {∥ P_{Ω} (x - U v^{T}) ∥}^{2} : x \in R_{+}^{m}, U \in R_{+}^{m \times r}\} .

(9)

According to the definition of the loss function based on NMF, let

h_{U}

be the hypothesis chosen from

H_{U}

; then the loss

L_{x} (h_{U})

of hypothesis

h_{U}

at point

x

is defined as:

L_{x} (h_{U}) = h_{U} (x) = min_{v \in R_{+}^{r}} {∥ P_{Ω} (x - U v^{T}) ∥}^{2} .

(10)

The loss or empirical error

\hat{R} (h)

of

h_{U}

on the samples

{x_{1}, x_{2}, \dots, x_{n}}

are defined, respectively, by

{\hat{R}}_{n} (h_{U}) = \frac{1}{n} \sum_{i = 1}^{n} L_{x_{i}} (h_{U}),

(11)

and their generalization error

R (h_{U})

can be defined as

R (h_{U}) = \underset{\begin{matrix} x \sim D \end{matrix}}{E} L_{x} (h_{U}) .

(12)

$H_{U} (X)$ for two different training sets X and $X^{^{'}}$ with U fixed, only the error of the last training sample differ, for any $x \in X \ {x_{n}}$ and $x^{^{'}} \in X^{^{'}} \ {x_{n}^{^{'}}}$ , we have $h_{U} (x) - h_{U} (x^{^{'}}) = 0 .$ Thus, $| h_{U} (x) - h_{U^{^{'}}} (x^{^{'}}) | = | h_{U} (x_{n}) - h_{U} (x_{n}^{^{'}}) |$ . When U is fixed, we only need to define $| h_{U} (x_{n}) - h_{U} (x_{n}^{^{'}}) | \leq β$ ;
If U is not fixed, different training sets X and $X^{^{'}}$ will update different basis matrices U and $U^{^{'}}$ ; so, for any $x \in X, z \in X \ {x_{n}}$ and $x^{^{'}} \in X^{^{'}}$ , $β$ needs to be defined as follows:

$| h_{U} (x) - h_{U^{^{'}}} (x^{^{'}}) | = max {| h_{U} (z) - h_{U^{^{'}}} (z) |, | h_{U} (x_{n}) - h_{U^{^{'}}} (x_{n}^{^{'}}) |} \leq β .$

(13)

Uncertainty U is easy to obtain, by the nature of the absolute inequality,

\begin{matrix} | h_{U} (x_{n}) - h_{U^{^{'}}} (x_{n}^{^{'}}) | & = | h_{U} (x_{n}) - h_{U} (x_{n}^{^{'}}) + h_{U} (x_{n}^{^{'}}) - h_{U^{^{'}}} (x_{n}^{^{'}}) | \\ \leq | h_{U} (x_{n}) - h_{U} (x_{n}^{^{'}}) | + | h_{U} (x_{n}^{^{'}}) - h_{U^{^{'}}} (x_{n}^{^{'}}) | . \end{matrix}

(14)

Therefore, when U is fixed, the stability bound is smaller than the parameter bound when U is not fixed.

Based on the above two analyses, the following two classes of definitions of uniform stability of NMF are provided.

4.2. The Generalization upper Bound of NMF

Definition 2.

Let

S = (X, Y)

and

S^{^{'}} = (X^{^{'}}, Y)

, where

X = {x_{1}, x_{2}, \dots, x_{n}}

and

X^{^{'}} = {x_{1}, x_{2}, \dots, x_{n}^{^{'}}}

differ only by the last sample,

h_{U} (X) = \{h_{U} (x) = min_{v \in R_{+}^{r}} {∥ P_{Ω} (x - U v^{T}) ∥}^{2} : x \in R_{+}^{m}, U \in R_{+}^{m \times r}\}

is a hypothesis returned by NMF to the training set X, U is the given basis matrix. For any

x \in X

and

x^{^{'}} \in X^{^{'}}

, we have

| L_{x} (h_{U}) - L_{x^{^{'}}} (h_{U}) | \leq β,

(15)

then the NMF is uniformly β-stable when U is not fixed, and minimum β satisfies the inequality that can be called the stability coefficient of the NMF.

Definition 2 characterizes the definition of stability of NMF when U is fixed. The definition of stability of NMF in the case of multiplicative iterations (when the basis matrix U is uncertain in advance) is presented below.

Definition 3.

Let

S = (X, Y)

and

S^{^{'}} = (X^{^{'}}, Y)

, where

X = {x_{1}, x_{2}, \dots, x_{n}}

and

X^{^{'}} = {x_{1}, x_{2}, \dots, x_{n}^{^{'}}}

differ only by the last sample,

h_{U} (X) = \{h_{U} (x) = min_{v \in R_{+}^{r}} {∥ P_{Ω} (x - U v^{T}) ∥}^{2} : x \in R_{+}^{m}, U \in R_{+}^{m \times r}\}

is the assumption that it is returned by the NMF to the training sample data X, U and

U^{^{'}}

are the basis matrices generated by NMF on the training sets X and

X^{^{'}}

. For any

x \in X

and

x^{^{'}} \in X^{^{'}}

, we have

| L_{x} (h_{U}) - L_{x^{^{'}}} (h_{U^{^{'}}}) | \leq β,

(16)

then NMF is said to be uniformly β-stable when U is not fixed, and the minimum β, which satisfies this inequality, is called the stability coefficient of the NMF.

Theorem 2.

Let training set

S = (X, Y), X = {x_{1}, x_{2}, \dots, x_{n}}

, and

X \approx U V^{T} (X \in R_{+}^{m \times n}, U \in R_{+}^{m \times r}, V \in R_{+}^{n \times r}),

if for any

x \in X (x \in R_{+}^{m})

, there is

∥ x ∥ \leq R (R > 0)

, then every row of V has an upper bound R, i.e.,

∥ v ∥ \leq R (v \in V) .

In Theorem 2, we give some basic theory of NMF, where U is the basis matrix and V is the coefficient matrix, and every row of V has an upper bound R.

Theorem 3.

Let

S = (X, Y)

and

S^{^{'}} = (X^{^{'}}, Y)

, where

X = {x_{1}, x_{2}, \dots, x_{n}}

and

X^{^{'}} = {x_{1}, x_{2}, \dots, x_{n}^{^{'}}}

differ only by the last sample,

h_{U} (X) = \{h_{U} (x) = min_{v \in R_{+}^{r}} {∥ P_{Ω} (x - U v^{T}) ∥}^{2} : x \in R_{+}^{m}, U \in R_{+}^{m \times r}\}

is a hypothesis returned by NMF to the training set X, U is the given basis matrix, and U is bounded, for example, there exists a constant

C \geq 0,

∥ U ∥ \leq C

. Then the NMF is uniformly β-stable, representing the following β when the basis matrix U is fixed,

| L_{x} (h_{U}) - L_{x^{^{'}}} (h_{U}) | \leq β,

(17)

where

β = R (\sqrt{m r} + r) ∥ U ∥ .

Proof.

For any

x = {x_{1}, x_{2}, \dots, x_{m}} \in X

and

x^{^{'}} = {x_{1}^{^{'}}, x_{2}^{^{'}}, \dots, x_{m}^{^{'}}} \in X^{^{'}}

,

\begin{matrix} | L_{x} (h_{U}) - L_{x^{^{'}}} (h_{U}) | = | L_{x_{n}} (h_{U}) - L_{x_{n}^{^{'}}} (h_{U}) | \\ = & | h_{U} (x_{n}) - h_{U} (x_{n}^{^{'}}) | \\ = & | min_{v \in R_{+}^{r}} ∥ P_{Ω} (x_{n} - U v^{T}) ∥^{2} - min_{v \in R_{+}^{r}} {∥ P_{Ω} (x_{n}^{^{'}} - U {(v^{^{'}})}^{T}) ∥}^{2} | \\ = & | min_{v \in R_{+}^{r}} ∥ P_{Ω} (x_{n} - U v^{T}) ∥^{2} + max_{v^{^{'}} \in R_{+}^{r}} (- ∥ P_{Ω} (x_{n}^{^{'}} - U {(v^{^{'}})}^{T}) ∥^{2}) | \\ \leq & | max_{v, v^{^{'}} \in R_{+}^{r}} (∥ P_{Ω} (x_{n} - U v^{T}) ∥^{2} - ∥ P_{Ω} (x_{n}^{^{'}} - U {(v^{^{'}})}^{T}) ∥^{2}) | \\ \leq & | max_{v, v^{^{'}} \in R_{+}^{r}} (2 P_{Ω^{T}} {(x_{n}^{^{'}})}^{T} P_{Ω} (U {(v^{^{'}})}^{T}) - 2 P_{Ω^{T}} (x_{n}^{T}) P_{Ω} (U v^{T})) | + \\ | max_{v, v^{^{'}} \in R_{+}^{r}} (∥ P_{Ω} (U v^{T}) ∥^{2} - {∥ P_{Ω} (U {(v^{^{'}})}^{T}) ∥}^{2}) | \\ = & | max_{v, v^{^{'}} \in R_{+}^{r}} \sum_{i = 1}^{m} x_{i}^{^{'}} 〈 e_{i}, P_{Ω} (U {(v^{^{'}})}^{T}) 〉 - \sum_{i = 1}^{m} x_{i} 〈 e_{i}, P_{Ω} (U v^{T}) 〉 | + \\ | max_{v, v^{^{'}} \in R_{+}^{r}} 〈 P_{Ω} (U v^{T}) + P_{Ω} (U {(v^{^{'}})}^{T}), P_{Ω} (U v^{T}) - P_{Ω} (U {(v^{^{'}})}^{T}) 〉 | \\ \leq & R | max_{v, v^{^{'}} \in R_{+}^{r}} \sum_{i = 1}^{m} 〈 e_{i}, P_{Ω} (U {(v^{^{'}})}^{T}) 〉 - \sum_{i = 1}^{m} 〈 e_{i}, P_{Ω} (U v^{T}) 〉 | + \\ | max_{v, v^{^{'}} \in R_{+}^{r}} 〈 P_{Ω} (U {(v + v^{^{'}})}^{T}), P_{Ω} (U {(v - v^{^{'}})}^{T}) 〉 | \\ = & R | max_{v, v^{^{'}} \in R_{+}^{r}} \sum_{i = 1}^{m} 〈 e_{i}, P_{Ω} (U {(v^{^{'}} - v)}^{T}) 〉 | + \\ | max_{v, v^{^{'}} \in R_{+}^{r}} 〈 P_{Ω} (U {(v + v^{^{'}})}^{T}), P_{Ω} (U {(v - v^{^{'}})}^{T}) 〉 | \\ = & R | max_{v, v^{^{'}} \in R_{+}^{r}} \sum_{i = 1}^{m} 〈 e_{i}, \sum_{j = 1}^{r} (v_{j} - v_{j}^{^{'}}) P_{Ω} (U e_{j}) 〉 | + \\ | max_{v, v^{^{'}} \in R_{+}^{r}} 〈 \sum_{j = 1}^{r} (v_{j} + v_{j}^{^{'}}) P_{Ω} (U e_{j}), \sum_{k = 1}^{r} (v_{k} - v_{k}^{^{'}}) P_{Ω} (U e_{k}) 〉 | \\ = & R | max_{v, v^{^{'}} \in R_{+}^{r}} \sum_{j = 1}^{r} (v_{j} - v_{j}^{^{'}}) \sum_{i = 1}^{m} 〈 e_{i}, P_{Ω} (U e_{j}) 〉 | + \\ | max_{v, v^{^{'}} \in R_{+}^{r}} \sum_{j, k = 1}^{r} (v_{j} + v_{j}^{^{'}}) (v_{k} - v_{k}^{^{'}}) 〈P_{Ω} (U e_{j}), P_{Ω} (U e_{k})〉 | \\ \leq & R \sqrt{m} \sum_{j = 1}^{r} (v_{j} - v_{j}^{^{'}}) ∥ P_{Ω} (U e_{j}) ∥ + \sum_{j, k = 1}^{r} (v_{j} + v_{j}^{^{'}}) (v_{k} - v_{k}^{^{'}}) ∥ P_{Ω} (U e_{j}) ∥ ∥ P_{Ω} (U e_{k}) ∥ \\ \leq & R \sqrt{m r} ∥ U ∥ + R r ∥ U ∥ = β . □ \end{matrix}

The third inequality holds based on Holder inequality, and the fourth and fifth inequalities hold based on the Cauchy-Schwarz inequality.

In Theorem 3 we give the stability parameter

β

of the NMF algorithm, and show that when the basis matrix U is fixed, the stability coefficient of NMF depends on the upper bound of the module length R for input data, the dimension r of the hidden matrix, the number of dimensions m is feature space and the Frobenius norm

∥ U ∥

of the basis matrix.

Theorem 4.

Let

S = (X, Y)

, where

X = {x_{1}, x_{2}, \dots, x_{n}}

, and

x \in X

,

∥ x ∥ \leq R

.

H_{U} (X) = \{h_{U} (x) = min_{v \in R_{+}^{r}} {∥ P_{Ω} (x - U v^{T}) ∥}^{2} : x \in R_{+}^{m}, U \in R_{+}^{m \times r}\}

is a hypothesis returned by NMF to the training set X, U are the basis matrices generated by NMF on the training sets X, where U is bounded, i.e., there exists a constant

C \geq 0

,

∥ U ∥ \leq C

. When U is fixed, the following inequalities can be established at least

1 - δ

:

R (h_{U}) \leq \hat{R} (h_{U}) + R \sqrt{m r} ∥ U ∥ + R r ∥ U ∥ + (2 n R \sqrt{m r} ∥ U ∥ + 2 n R r ∥ U ∥ + M) \sqrt{\frac{l o g \frac{1}{δ}}{2 n}} .

(18)

In Theorem 3 we have obtained that NMF is

β

-stable, and by plugging the stability coefficients

β

obtained in Theorem 3 into the bounds of Theory 1, we obtain an upper bound of this generalization for the NMF algorithm when U is fixed.

The results of Theorem 4 present the upper bound of the generalization for the NMF algorithm, which are controlled by the upper bound of the module length R for the input datasets, the dimension r and m is the hidden matrix and the feature space, the Frobenius norm

∥ U ∥

of basis matrix as well as the input data n.

4.3. The Generalization upper Bound of NMF under Multiplicative Rule

Theorem 5.

Let

S = (X, Y)

,

S^{^{'}} = (X^{^{'}}, Y)

, where

X = {x_{1}, x_{2}, \dots, x_{n}}

,

X^{^{'}} = {x_{1}, x_{2}, \dots, x_{n}^{^{'}}}

, for any

x \in X

and

x^{^{'}} \in X^{^{'}}

,

∥ x ∥ \leq R

,

∥ x^{^{'}} ∥ \leq R (R > 0)

.

H_{U} (X) = \{h_{U} (x) = min_{v \in R_{+}^{r}} {∥ P_{Ω} (x - U v^{T}) ∥}^{2} : x \in R_{+}^{m}, U \in R_{+}^{m \times r}\}

is a hypothesis returned by NMF to the training set X, U and

U^{^{'}}

are the basis matrices generated by NMF on the training sets X and

X^{^{'}}

, where U and

U^{^{'}}

is the bound, for example, the existing constant

C \geq 0,

∥ U ∥ \leq C, ∥ U^{^{'}} ∥ \leq C

. Then NMF is consistently β-stable with respect to the following β,

| L_{x} (h_{U}) - L_{x^{^{'}}} (h_{U^{^{'}}}) | \leq β,

(19)

where

β = R (\sqrt{m r} + r + R \sqrt{m} + r \sqrt{m}) ∥ U ∥ .

Proof.

For any

x = {x_{1}, x_{2}, \dots, x_{m}} \in X

,

x^{^{'}} = {x_{1}^{^{'}}, x_{2}^{^{'}}, \dots, x_{m}^{^{'}}} \in X^{^{'}}

and

z = {x_{1}, x_{2}, \dots, x_{m}} \in X \ {x_{n}}

, when U is not fixed in advance, to different training datasets X and

X^{^{'}}

, the NMF will update different basis matrices U and

U^{^{'}}

, based on the above analysis, in this case,

\begin{matrix} L_{x} (h_{U}) - L_{x^{^{'}}} (h_{U^{^{'}}}) & = | h_{U} (x) - h_{U^{^{'}}} (x^{^{'}}) | \\ = max {| h_{U} (z) - h_{U^{^{'}}} (z) |, | h_{U} (x_{n}) - h_{U^{^{'}}} (x_{n}^{^{'}}) |} . □ \end{matrix}

(20)

Meanwhile, according to the nature of the absolute inequality, we know

\begin{matrix} | h_{U} (x_{n}) - h_{U^{^{'}}} (x_{n}^{^{'}}) | & = | h_{U} (x_{n}) - h_{U} (x_{n}^{^{'}}) + h_{U} (x_{n}^{^{'}}) - h_{U^{^{'}}} (x_{n}^{^{'}}) | \\ \leq | h_{U} (x_{n}) - h_{U} (x_{n}^{^{'}}) | + | h_{U} (x_{n}^{^{'}}) - h_{U^{^{'}}} (x_{n}^{^{'}}) | . \end{matrix}

(21)

Theorem 3 has proved that

| h_{U} (x_{n}) - h_{U} (x_{n}^{^{'}}) | \leq R \sqrt{m r} ∥ U ∥ + R r ∥ U ∥ .

Therefore, only the upper bound of

| h_{U} (x^{^{'}}) - h_{U^{^{'}}} (x^{^{'}}) |

is proved below.

\begin{matrix} | L_{x^{^{'}}} (h_{U}) - L_{x^{^{'}}} (h_{U^{^{'}}}) | \\ = & | min_{v \in R_{+}^{r}} ∥ P_{Ω} (x_{n}^{^{'}} - U v^{T}) ∥^{2} - min_{v^{^{'}} \in R_{+}^{r}} {∥ P_{Ω} (x_{n}^{^{'}} - U^{^{'}} {(v^{^{'}})}^{T}) ∥}^{2} | \\ \leq & | ∥ P_{Ω} (x_{n}^{^{'}} - U {(v^{^{'}})}^{T}) ∥^{2} + max_{v^{^{'}} \in R_{+}^{r}} (- ∥ P_{Ω} (x_{n}^{^{'}} - U^{^{'}} {(v^{^{'}})}^{T}) ∥^{2}) | \\ \leq & | max_{v^{^{'}} \in R_{+}^{r}} (∥ P_{Ω} (x_{n}^{^{'}} - U {(v^{^{'}})}^{T}) ∥^{2} - ∥ P_{Ω} (x_{n}^{^{'}} - U^{^{'}} {(v^{^{'}})}^{T}) ∥^{2}) | \\ \leq & | max_{v^{^{'}} \in R_{+}^{r}} (2 P_{Ω^{T}} {(x_{n}^{^{'}})}^{T} P_{Ω} (U^{^{'}} {(v^{^{'}})}^{T}) - 2 P_{Ω^{T}} {(x_{n}^{^{'}})}^{T} P_{Ω} (U {(v^{^{'}})}^{T})) | + \\ | max_{v^{^{'}} \in R_{+}^{r}} (∥ P_{Ω} (U {(v^{^{'}})}^{T}) ∥^{2} - {∥ P_{Ω} (U^{^{'}} {(v^{^{'}})}^{T}) ∥}^{2}) | \\ = & | max_{v^{^{'}} \in R_{+}^{r}} \sum_{i = 1}^{m} x_{i}^{^{'}} 〈 e_{i}, P_{Ω} (U^{^{'}} {(v^{^{'}})}^{T}) 〉 - \sum_{i = 1}^{m} x_{i}^{^{'}} 〈 e_{i}, P_{Ω} (U {(v^{^{'}})}^{T}) 〉 | + \\ | max_{v^{^{'}} \in R_{+}^{r}} 〈 P_{Ω} (U {(v^{^{'}})}^{T}) + P_{Ω} (U^{^{'}} {(v^{^{'}})}^{T}), P_{Ω} (U {(v^{^{'}})}^{T}) - P_{Ω} (U^{^{'}} {(v^{^{'}})}^{T}) 〉 | \\ \leq & R | max_{v^{^{'}} \in R_{+}^{r}} \sum_{i = 1}^{m} 〈 e_{i}, P_{Ω} (U^{^{'}} {(v^{^{'}})}^{T}) 〉 - \sum_{i = 1}^{m} 〈 e_{i}, P_{Ω} (U {(v^{^{'}})}^{T}) 〉 | + \\ | max_{v^{^{'}} \in R_{+}^{r}} 〈 P_{Ω} (U {(v^{^{'}})}^{T}) + P_{Ω} (U^{^{'}} {(v^{^{'}})}^{T}), P_{Ω} (U {(v^{^{'}})}^{T}) - P_{Ω} (U^{^{'}} {(v^{^{'}})}^{T}) 〉 | \\ = & R | max_{v^{^{'}} \in R_{+}^{r}} \sum_{i = 1}^{m} 〈e_{i}, P_{Ω} ((U^{^{'}} - U) {(v^{^{'}})}^{T})〉 | + \\ | max_{v^{^{'}} \in R_{+}^{r}} 〈 P_{Ω} ((U + U^{^{'}}) {(v^{^{'}})}^{T}), P_{Ω} ((U - U^{^{'}}) {(v^{^{'}})}^{T}) 〉 | \\ = & R | max_{v^{^{'}} \in R_{+}^{r}} \sum_{i = 1}^{m} \sum_{j = 1}^{r} v_{j}^{^{'}} 〈 e_{i}, P_{Ω} ((U^{^{'}} - U) e_{j}) 〉 | + \\ | max_{v^{^{'}} \in R_{+}^{r}} \sum_{j, k = 1}^{r} v_{j}^{^{'}} v_{k}^{^{'}} 〈 P_{Ω} ((U + U^{^{'}}) e_{j}), P_{Ω} ((U - U^{^{'}}) e_{k}) 〉 | \\ \leq & R \sqrt{m} \sum_{j = 1}^{r} v_{j}^{^{'}} ∥ P_{Ω} ((U - U^{^{'}}) e_{j}) ∥ + \sum_{j, k = 1}^{r} v_{j}^{^{'}} v_{k}^{^{'}} ∥ P_{Ω} ((U + U^{^{'}}) e_{j}) ∥ ∥ P_{Ω} ((U - U^{^{'}}) e_{k}) ∥ \\ \leq & R^{2} \sqrt{m} ∥ U ∥ + R r \sqrt{m} ∥ U ∥ \\ \leq & R \sqrt{m} (R + r) ∥ U ∥, \end{matrix}

The fifth inequality holds based on the Cauchy-Schwarz inequality. Therefore,

\begin{matrix} | L_{x_{n}} (h_{U}) - L_{x_{n}^{^{'}}} (h_{U}) | + | L_{x^{^{'}}} (h_{U}) - L_{x^{^{'}}} (h_{U^{^{'}}}) | \\ \leq & R \sqrt{m r} ∥ U ∥ + R r ∥ U ∥ + R \sqrt{m} (R + r) ∥ U ∥ \\ \leq & R (\sqrt{m r} + r + R \sqrt{m} + r \sqrt{m}) ∥ U ∥, \end{matrix}

(22)

thus,

| L_{x} (h_{U}) - L_{x^{^{'}}} (h_{U^{^{'}}}) | \leq R (\sqrt{m r} + r + R \sqrt{m} + r \sqrt{m}) ∥ U ∥ = β .

(23)

In Theorem 4, we give the stability parameter

β

of the NMF algorithm when the basis matrix U is not fixed. And the conclusion of NMF shows that its stability parameter depends on the upper bound for the module length R of input data, the dimension r of the hidden matrix and Frobenius norm

∥ U ∥

of the basic matrix.

Theorem 6.

Let

S = (X, Y)

, where

X = {x_{1}, x_{2}, \dots, x_{n}}

, and

x \in X

,

∥ x ∥ \leq R

.

H_{U} (X) = \{h_{U} (x) = min_{v \in R_{+}^{r}} {∥ P_{Ω} (x - U v^{T}) ∥}^{2} : x \in R_{+}^{m}, U \in R_{+}^{m \times r}\}

is a hypothesis returned by NMF to the training set X, U are the basis matrices generated by NMF on the training sets X, where U is the bound, for example, there exists a constant

C \geq 0

,

∥ U ∥ \leq C

. When U is not fixed, the following inequalities hold with a probability of at least

1 - δ

:

R (h_{U}) \leq \hat{R} (h_{U}) + R (\sqrt{m r} + r + R \sqrt{m} + r \sqrt{m}) ∥ U ∥ + (2 n (R (\sqrt{m r} + r + R \sqrt{m} + r \sqrt{m}) ∥ U ∥) + M) \sqrt{\frac{l o g \frac{1}{δ}}{2 n}} .

(24)

In Theorem 5, we have obtained that NMF is

β

-stable, and, by plugging the stability coefficients

β

obtained in Theorem 5 into the bounds of Theory 1, we obtain an upper bound for the generalization of the NMF algorithm when U is not fixed.

The results of Theorem 6 present that the generalization for the upper bound of the NMF is controlled via the upper bound of the module length R for the input datasets, the dimension r of the hidden matrix and the Frobenius norm

∥ U ∥

for the basic matrix as well as the input data n.

5. Experiments

5.1. Experimental Datasets

Our experiments are performed on three common and real-world datasets; these datasets are widely used to evaluate the recommendation systems: Movielens-100k [42], Movielens-1m [43] and Yelp2018 [44]. We are primarily verifying the performance of recommendation tasks and in terms of algorithmic stability analysis. We use some existing preprocessing methods to convert these datasets into implicit feedback. Specifically, a user’s rating is converted into a binary representation to indicate whether the user has commented on an item. The statistics of these three datasets are described in the Table 1.

5.2. Compared Models

We compare our method with some state-of-the-art machine learning model on different recommendation tasks. The compared models include:

BPR [45]: BPR is a commonly used recommendation model that optimizes by matrix factorization with the Bayesian, which can get the personalized ranking via objective function.
WMF [46]: WMF is a non-sampling recommendation model that can deal all missing interactions with negative instances and weight them uniformly; it is trained via the efficient non-sampling method.
SISA [47]: SISA is realized by a machine unlearning method that can randomly split the training samples into parts, and then aggregates the results from all models through averaging in order to make the prediction.

5.3. Evaluation Methodology and Settings

In our experiments, we select the 60%, 70%, 80%, 90% samples to construct the training datasets and take the remaining data as test sets. Then, the 10% samples are randomly selected from the training datasets as the validation set. NDCG@K are used to evaluate the recommendation performance. They are widely used methods for measuring the ground truth which is ranked among the top K items, and they also offer a position-ranking metric that is used to measure whether to assign higher scores for higher positions. We use the grid search to adjust our AS-NMF and achieve the best hype-parameters of the optimal performances.

5.4. Performance and Algorithmic Stability Efficiency

In this section, the proposed AS-NMF method is evaluated with different methods of recommendation results, and we set the split rate N = 60, 70, 80 and 90 in our experiments. The experimental results are shown in Table 2.

Our proposed AS-NMF algorithm outperformed state-of-the-art baseline methods on three datasets, namely Yelp2018, Movielens-100k, and Movielens-1m. Specifically, we observed maximum improvements of 2.41%, 7.13%, and 6.60% on Yelp2018, Movielens-100k, and Movielens-1m datasets, respectively, compared to the base models BPR, WMF, and SISA.

In comparison to the other three methods, our AS-NMF achieved accuracy improvements of 7.13%, 5.04%, 0.27%, and 5.91% on Movielens-100k, respectively. Since learning user preferences from sparse user-item interactions is challenging, the sparsity parameter of AS-NMF was carefully set to balance efficiency and performance.

Despite the relative density of Movielens-100k and Movielens-1m datasets, our AS-NMF algorithm produced remarkable results. Our proposed method effectively preserves collaborative information with different partition rates, which is reflected in the enhanced performance observed in the results.

Otherwise, we observe the performances of different methods on the top-ranking 10, 20, 30 and 50 with NDCG@K in order to evaluate performance on each dataset, respectively. The NDCG@K results of different methods are shown in Table 3 on datasets of Yelp2018, Movielens-100k and Movielens-1m. The best performance results are improved by 1.83%, 6.73% and 5.92% compared with the other top-ranking algorithms. For example, the performances of AS-NMF on Yelps2018 and Movielens-100k are better than Movielens-1m on NDCG@50. Our proposed stability analysis method of NMF can further improve the performance of generalization. Experimental results show that our AS-NMF algorithm can effectively optimize the effectiveness of recommendation.

Furthermore, Figure 2 shows the performance and stability of models based on different numbers of samples, which is achieved by randomly selected training and the top-N item ranks. Figure 2a displays the accuracy and stability performance of four methods on Movielens-100k, with different training rates. The bar-plot depicts the accuracy for various training sample rates, while the line-plot shows the upper bound for stability performance. The x-axis denotes the percent of training set, and the y-axis represents the accuracy and stability performance of stable analysis.

Figure 2b shows the NDCG@K performance and stability of four models on Yelp2018, with varying top-N item ranks. The bar-plot describes NDCG@K for different top-N item ranks, and the line-plot shows the upper bound for stability performance. The x-axis denotes the percent of training set, while the y-axis represents the NDCG score and stability performance of our stable analysis of NMF.

Combining the bar and line plots, we observe that the accuracy and performance improve as the training samples increase. However, the stability of BPR, WMF, and SISA changes in the opposite direction, indicating that the stable performance of our AS-NMF is superior to those three methods.

The numerical results demonstrate that the stability parameter of AS-NMF depends on the upper bound of the module length of the input data, dimension of the hidden matrix, and Frobenius norm of the basis matrix. The bounds provide additional evidence of the superior stability of our proposed method.

6. Conclusions

In this paper, we propose a general NMF-based recommendation algorithm to predict and recommend the labels of new samples. The model uses NMF and enhanced stability to recommend items to users. Experimental results show that our proposed stability analysis method of NMF can further improve the performance of generalization and can effectively optimize the effectiveness of the recommendation. Then, based on the stability analysis of the algorithm, we studied the general generalization error bounds corresponding to the NMF algorithm, and obtained two general generalization bounds.

In the case of fixed U, the results show that the NMF algorithm is consistently

β

-stable, where

β = R (\sqrt{m r} + r) ∥ U ∥

, and it can be obtained that the generalization upper bound of the NMF prediction algorithm depends on the upper bound on the module length of the input data, the dimension of the hidden matrix, the dimension of the eigenspace and the Frobenius norm of the basis matrix, as well as the input data. In the case where U is not fixed under the multiplicative update rule, the results show that the NMF algorithm is consistently

β

-stable, where

β = R (\sqrt{m r} + r + R \sqrt{m} + r \sqrt{m}) ∥ U ∥

, and it can be obtained that the generalization upper bound of the NMF prediction algorithm depends on its upper bound of the module length for the input data, the dimension of hidden matrix, the dimensions of feature space, the Frobenius norm of the basic matrix, as well as the input data. In the experimental comparison on Yelp2018 and Movielens-1m datasets, our method shows the stability of the recommendation and performs better than others.

In the future, we will try to approach the upper bounds of the real error more closely, because the smaller the upper bound for an algorithm, the better the performance of the model. One can define different loss functions according to actual needs by this method alone. Each of the considered applications is just a straightforward demonstration of the advantages of enhanced stability. In future research, we will apply it to face recognition and other areas.

Author Contributions

Conceptualization, H.S.; methodology, H.S. and J.Y.; software, H.S.; validation, H.S.; formal analysis, J.Y.; investigation, H.S.; resources, J.Y.; data curation, H.S.; writing—original draft preparation, H.S. and J.Y.; writing—review and editing, H.S. and J.Y.; visualization, H.S.; supervision, J.Y.; project administration, H.S.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is jointly supported by the National Key Research and Development Program of China (2021YFF0704100), the National Natural Science Foundation of China (61936001, 61772096 and 62066049), the Chongqing Natural Science Foundation (cstc2019jcyj-cxttX0002, cstc2021ycjh-bgzxm0013, cstc2020jcyj-msxmX0737), the Key Cooperation Project of Chongqing Municipal Education Commission (HZ2021008), the Doctoral Innovation Talent Program of Chongqing University of Posts and Telecommunications (BYJS201913, BYJS202108), the Science and Technology Research Program of Chongqing Education Commission of China (KJQN201900638, KJQN202100629), and the Chongqing Postgraduate Research Innovation Project (CYB20174).

Data Availability Statement

The datasets are collected from the websites in [42,43,44]. However, no new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Lee, D.; Seung, H. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
Li, Z.; Tang, J.; He, X. Robust structured nonnegative matrix factorization for image representation. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1947–1960. [Google Scholar] [CrossRef] [PubMed]
Yang, Z.; Xiang, Y.; Xie, K. Adaptive method for nonsmooth nonnegative matrix factorization. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 948–960. [Google Scholar] [CrossRef] [PubMed]
Zafeiriou, S.; Tefas, A.; Buciu, I. Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification. IEEE Trans. Neural Netw. 2006, 17, 948–960. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Shahnaz, F.; Berry, M.; Pauca, V.; Plemmons, R. Document clustering using nonnegative matrix factorization. Inf. Process. Manag. 2006, 42, 373–386. [Google Scholar] [CrossRef]
Chen, Z.; Li, L.; Peng, H.; Liu, Y.; Yang, Y. Attributed community mining using joint general non-negative matrix factorization with graph laplacian. Phys. A Stat. Mech. Appl. 2018, 495, 324–335. [Google Scholar] [CrossRef]
Zhi, R.; Flierl, M.; Ruan, Q.; Kleijn, K.B. Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition. IEEE Trans. Syst. Man Cybern. B Cybern. 2011, 20, 1973–1986. [Google Scholar]
Chen, Z.; Li, L.; Ruan, Q.; Peng, H. Sparse general non-negative matrix factorization based on left semi-tensor product. IEEE Access 2019, 7, 81599–81611. [Google Scholar] [CrossRef]
Zhou, G.; Yang, Z.; Xie, S.; Yang, J. Online blind source separation using incremental nonnegative matrix factorization with volume constraint. IEEE Trans. Neural Netw. 2011, 22, 550–560. [Google Scholar] [CrossRef]
Gao, B.; Woo, W.; Ling, B. Machine learning source separation using maximum a posteriori nonnegative matrix factorization. IEEE Trans. Cybern. 2014, 44, 1169–1179. [Google Scholar]
Yang, J.; Yang, S.; Fu, Y.; Li, X.; Huang, T. Non-Negative Graph Embedding. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
Sandler, R.; Lindenbaum, M. Nonnegative Matrix Factorization with Earth Mover’s Distance Metric for Image Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1590–1602. [Google Scholar] [CrossRef] [PubMed]
Luo, M.; Chang, X.; Nie, L.; Yang, Y.; Hauptmann, A.G.; Zheng, Q. An Adaptive Semisupervised Feature Analysis for Video Semantic Recognition. IEEE Trans. Cybern. 2018, 48, 648–660. [Google Scholar] [CrossRef] [PubMed]
MahNMF: Manhattan Non-Negative Matrix Factorization. Available online: https://arxiv.org/abs/1207.3438 (accessed on 14 July 2012).
Liu, T.; Tao, D. On the Performance of Manhattan Nonnegative Matrix Factorization. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 1851–1863. [Google Scholar] [CrossRef] [PubMed]
Kong, D.; Ding, C.; Huang, H. Robust Nonnegative Matrix Factorization Using L21-Norm. In Proceedings of the CIKM ’11: Proceedings of the 20th ACM international conference on Information and Knowledge Management, Toronto, ON, Canada, 17–21 October 2011; pp. 673–682. [Google Scholar]
Lee, D.D.; Seung, H.S. Algorithms for nonnegative matrix factorization. In Advances in Neural Information Processing Systems 13-Proceedings of the 2000 Conference, NIPS 2000; Neural Information Processing Systems Foundation Inc.: Montréal, QC, Canada, 2000. [Google Scholar]
Hoyer, P.O. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 2004, 5, 1457–1469. [Google Scholar]
Li, S.Z.; Hou, X.W.; Zhang, H.J.; Cheng, Q.S. Learning Spatially Localized, Parts-based Representation. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; pp. 207–212. [Google Scholar]
Zheng, W.S.; Li, S.Z.; Lai, J.H.; Liao, S. On Constrained Sparse Matrix Factorization. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar]
Zhang, Y.; Fang, Y. A NMF algorithm for blind separation of uncorrelated signals. In Proceedings of the 2007 International Conference on Wavelet Analysis and Pattern Recognition, Beijing, China, 2–4 November 2007; pp. 999–1003. [Google Scholar]
Chen, K.; Yao, L.; Zhang, D.; Wang, X.; Chang, X.; Nie, F. A Semisupervised Recurrent Convolutional Attention Model for Human Activity Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 1747–1756. [Google Scholar] [CrossRef] [PubMed]
Zhang, X.; Chen, D.; Wu, K. Incremental nonnegative matrix factorization based on correlation and graph regularization for matrix completion. Int. J. Mach. Learn. Cybern. 2019, 10, 1259–1268. [Google Scholar] [CrossRef]
Liu, T.; Gong, M.; Tao, D. Large-Cone Nonnegative Matrix Factorization. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2129–2142. [Google Scholar] [CrossRef]
Wang, J.; Lai, S.; Li, M. Improved image fusion method based on NSCT and accelerated NMF. Sensors 2012, 12, 5872–5887. [Google Scholar] [CrossRef] [Green Version]
Bousquet, O.; Elisseeff, A. Stability and generalization. J. Mach. Learn. Res. 2002, 2, 499–526. [Google Scholar]
Wu, J.; Yu, X.; Zhu, L. Leave-two-out stability of ontology learning algorithm. Chaos Solitons Fractals 2016, 89, 322–327. [Google Scholar] [CrossRef]
Devroye, L.; Wagner, T. Distribution-free performance bounds for potential function rules. IEEE Trans. Inf. Theory 1979, 25, 601–604. [Google Scholar] [CrossRef] [Green Version]
Bonnans, J.F.; Shapiro, A. Optimization Problems with Perturbation: A Guided Tour. SIAM Rev. 1998, 40, 228–264. [Google Scholar] [CrossRef] [Green Version]
Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1997; p. 1564. [Google Scholar]
Bartlett, P.L.; Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 2002, 3, 463–482. [Google Scholar]
Xu, H.; Mannor, S. Robustness and generalization. Mach. Learn. 2012, 86, 391–423. [Google Scholar] [CrossRef] [Green Version]
Zhang, D.; Yao, L.; Chen, K.; Wang, S.; Chang, X.; Liu, Y. Making Sense of Spatio-Temporal Preserving Representations for EEG-Based Human Intention Recognition. IEEE Trans. Cybern. 2020, 50, 3033–3044. [Google Scholar] [CrossRef] [PubMed]
Rogers, W.H.; Wagner, T. Finite sample distribution-free performance bound for local discrimination rules. Ann. Stat. 1978, 6, 506–514. [Google Scholar] [CrossRef]
Kutin, S.; Niyogi, P. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the UAI’02: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Edmonton, AB, Canada, 1–4 August 2002; pp. 275–282. [Google Scholar]
Liu, T.L.; Gábor, L.; Niyogi, P.; Tao, D.C. Algorithmic Stability and Hypothesis Complexity. In Proceedings of the 34th International Conference on Machine Learning, San Francisco, CA, USA, 8–12 July 2002; Volume 70, pp. 2159–2167. [Google Scholar]
Foster, D.J.; Greenberg, S.; Kale, S.; Luo, H.P.; Mohri, M.; Sridharan, K. Hypothesis Set Stability and Generalization; Curran Associates, Inc.: Montreal, QC, Canada, 2019. [Google Scholar]
Shalev-Shwartz, S.; Shamir, O.; Srebro, N.; Sridharan, K. Learnability, stability and uniform convergence. J. Mach. Learn. Res. 2010, 11, 2635–2670. [Google Scholar]
Hardt, M.; Recht, B.; Singer, Y. Train Faster, Generalize Better: Stability of Stochastic Gradient Descent. In Proceedings of the 33rd 501 International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 1225–1234. [Google Scholar]
Tamar, A.; Soudry, D.; Zisselman, E. Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability. Proc. AAAI Conf. Artif. Intell. 2022, 36, 8423–8431. [Google Scholar] [CrossRef]
Wu, X.; Cheng, Q. Algorithmic Stability and Generalization of an Unsupervised Feature Selection Algorithm. Adv. Neural. Inf. Process. Syst. 2021, 34, 19860–19875. [Google Scholar]
Available online: https://grouplens.org/datasets/movielens/100k/. (accessed on 28 October 2022).
Available online: https://grouplens.org/datasets/movielens/1m/ (accessed on 28 October 2022).
Available online: https://www.yelp.com/dataset/challenge (accessed on 28 October 2022).
Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–21 July 2009; AUAI Press: Arlington, VA, USA, 2009; pp. 452–461. [Google Scholar]
Chen, C.; Zhang, M.; Zhang, Y.; Liu, Y.; Ma, S. Efficient Neural Matrix Factorization without Sampling for Recommendation. ACM Trans. Inf. Syst. 2020, 38, 2. [Google Scholar] [CrossRef] [Green Version]
Bourtoule, L.; Chandrasekaran, V.; Choquette-Choo, C.A.; Jia, H.; Travers, A.; Zhang, B.; Lie, D.; Papernot, N. Machine unlearning. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 24–27 May 2021; pp. 141–159. [Google Scholar]

Figure 1. Overview of our AS-NMF algorithm, which consist of three parts: Algorithmic Stability, Definition of NMF based Algorithmic Stability and Generalization bound of NMF Algorithmic Stability. It defines the concept of algorithmic stability, which can analyze and measure generalization error bounds for NMF algorithm.

Figure 2. The performance comparison on different training set partitions and top-K on Yelp2018 and Movielens-1m datasets. Impact of the shard number on the recommendation efficiency and performance on Yelp2018 and Movielens-100k datasets. The bar chart shows the recommendation performance and the line chart shows the stability of different methods.

Table 1. Statistical detail of the evaluation datasets.

Datasets	#Users	#Items	#Ratings	Density
Movielens-100k	943	1628	100,000	6.30%
Movielens-1m	6040	3706	1,000,209	4.47%
Yelps2018	31,668	38,048	1,561,406	0.13%

Table 2. Comparison of different methods for recommendation accuracy. The proposed AS-NMF method is tested with a different training set rate in this Table. The best results and improvement are highlighted in bold.

Datasets	Methods	Accuracy (Training Set Rate)
Datasets	Methods	60%	70%	80%	90%
Yelps2018	BPR	0.6135	0.6215	0.6367	0.6881
	WMF	0.6233	0.6381	0.6529	0.7190
	SISA	0.6201	0.7227	0.7371	0.7537
	AS-NMF	0.6289	0.7329	0.7419	0.7719
	Increase (%)	0.90	1.41	0.65	2.4
Movielens-100k	BPR	0.5881	0.6081	0.6638	0.6397
	WMF	0.5988	0.6228	0.6913	0.6981
	SISA	0.5976	0.7059	0.7286	0.7310
	AS-NMF	0.6415	0.7415	0.7306	0.7742
	Increase (%)	7.13	5.04	0.27	5.91
Movielens-1m	BPR	0.5203	0.6466	0.6071	0.6849
	WMF	0.5975	0.6095	0.6689	0.6736
	SISA	0.6148	0.7232	0.7092	0.7625
	AS-NMF	0.6384	0.7384	0.7521	0.7735
	Increase (%)	3.84	2.10	6.05	1.44

Table 3. Comparison of NDCG@K for different recommendation methods. The proposed AS-NMF method is tested with a different data partition in this Table. The best results and improvements are highlighted in bold.

Datasets	Methods	NDCG@K
Datasets	Methods	NDCG@10	NDCG@20	NDCG@30	NDCG@50
Yelps2018	BPR	0.4168	0.4936	0.5001	0.5021
	WMF	0.4229	0.5122	0.5128	0.5220
	SISA	0.4182	0.5294	0.5295	0.5384
	AS-NMF	0.4238	0.5391	0.5365	0.5473
	Increase (%)	0.21	1.83	1.32	1.65
Movielens-100k	BPR	0.4534	0.4614	0.4908	0.4939
	WMF	0.4295	0.4435	0.4924	0.4997
	SISA	0.4019	0.5138	0.4986	0.5026
	AS-NMF	0.4839	0.5146	0.5104	0.5169
	Increase(%)	6.73	0.16	2.37	2.85
Movielens-1m	BPR	0.3079	0.4137	0.4516	0.4633
	WMF	0.3198	0.4301	0.4543	0.4443
	SISA	0.3218	0.4409	0.4611	0.4593
	AS-NMF	0.3373	0.4670	0.4681	0.4636
	Increase (%)	4.82	5.92	1.52	0.06

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sun, H.; Yang, J. The Generalization of Non-Negative Matrix Factorization Based on Algorithmic Stability. Electronics 2023, 12, 1147. https://doi.org/10.3390/electronics12051147

AMA Style

Sun H, Yang J. The Generalization of Non-Negative Matrix Factorization Based on Algorithmic Stability. Electronics. 2023; 12(5):1147. https://doi.org/10.3390/electronics12051147

Chicago/Turabian Style

Sun, Haichao, and Jie Yang. 2023. "The Generalization of Non-Negative Matrix Factorization Based on Algorithmic Stability" Electronics 12, no. 5: 1147. https://doi.org/10.3390/electronics12051147

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

The Generalization of Non-Negative Matrix Factorization Based on Algorithmic Stability

Abstract

1. Introduction

2. Related Work

3. Preliminaries

4. Characterizing the Stability of Non-Negative Matrix Decomposition Algorithms

4.1. Stability of NMF

4.2. The Generalization upper Bound of NMF

4.3. The Generalization upper Bound of NMF under Multiplicative Rule

5. Experiments

5.1. Experimental Datasets

5.2. Compared Models

5.3. Evaluation Methodology and Settings

5.4. Performance and Algorithmic Stability Efficiency

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI