2. Related Work
In most practical studies on NTF, the rank is decided either by trial and error or by specialists' insight (see, e.g., [6,14]). In addition, AIC and BIC have been used for model selection in NTF [15].
Unlike NTF, there are a number of rank selection methods for NMF. As a special case of NTF, NMF aims to factorize a non-negative data matrix $X$ into factor matrices $W$ and $H$ as $X = WH + E$, where $E$ is the approximation error matrix. In addition to general methods such as AIC, BIC, and cross-validation [16], more involved criteria have been developed to select the rank for NMF, such as the MDL criterion with latent variable completion [17] and the non-parametric Bayes method [18]. The MDL code-length under the assumption of model regularity is studied in [19].
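For concreteness, the following is a minimal sketch of such a factorization in Python using scikit-learn's NMF; the matrix size and rank below are arbitrary illustrative choices, not values from this paper.

```python
import numpy as np
from sklearn.decomposition import NMF

# A non-negative data matrix X (random, for illustration only).
X = np.abs(np.random.default_rng(0).normal(size=(50, 40)))

# Factorize X ~= W @ H with rank R = 10.
model = NMF(n_components=10, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)   # shape (50, 10), non-negative
H = model.components_        # shape (10, 40), non-negative
E = X - W @ H                # approximation error matrix
```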
Squires et al. [20] proposed two rank selection methods for NMF based on the MDL principle. In their study, the data matrix $X$ is regarded as a message from a transmitter to a receiver: the transmitter sends $W$, $H$, and $E$, and the receiver reconstructs $X$ from them. When the rank $R$ is low, encoding $W$ and $H$ requires short code-lengths while encoding $E$ requires a long one; conversely, when $R$ is high, encoding $W$ and $H$ requires long code-lengths while encoding $E$ requires a short one. The MDL principle is used to find the best trade-off between accuracy and complexity. Squires et al. proposed two methods for calculating code-lengths with the Shannon information
$L(x) = -\log p(x),$
where the probability $p$ is known in advance. Note that the NML code-length we use can be thought of as an extension of the Shannon information to the situation where the parameter $\theta$ of the probability distribution $p(\,\cdot\,;\theta)$ is unknown in advance [10].
3. Proposed Method
This section proposes an MDL-based rank estimation method for NTF. To do so, we extend the study on NMF by [20] to NTF in a non-trivial way. As noted in Section 1.2, in the case of tensors we may suffer from the imbalance problem, that is, the number of elements in the factor matrices is far smaller than the number of elements in the data tensor. For instance, for a $50 \times 50 \times 50$ tensor of rank $R$, the error tensor has $50^3 = 125{,}000$ elements, while the total number of elements in the three factor matrices is only $3 \times 50 \times R = 150R$. In such cases, compared with the code-length for the error tensor, the code-lengths of the factor matrices are too small to influence the total. Consequently, NTF-based rank selection methods tend to choose the model that best fits the data while effectively ignoring the model complexity, so the trade-off between complexity and error cannot be well formalized.
Our key idea is to take a tensor-slice-based approach. The overall flow of our method is as follows: we first produce a number of tensor slices from a non-negative data tensor, then treat those slices as non-negative data matrices and employ NMF to factorize them. Next, we select a rank for each tensor slice so that its total code-length is minimized, and finally take the largest of the selected slice ranks as the rank of the original tensor. Note that we calculate code-lengths with the NML code-length rather than the Shannon information used in [20].
First of all, for a third-order non-negative tensor $\mathcal{X} \in \mathbb{R}_{\geq 0}^{I \times J \times K}$, the three kinds of two-dimensional slices are: horizontal slices $\mathcal{X}_{i::}$, lateral slices $\mathcal{X}_{:j:}$, and frontal slices $\mathcal{X}_{::k}$. Each tensor slice can be treated as a matrix to be factorized as
$Y = WH + E,$
where $Y$ represents a tensor slice of the non-negative data tensor $\mathcal{X}$, $W$ and $H$ are the two non-negative factor matrices, both of which consist of $R$ factors, and $E$ denotes the error matrix. When the data tensor has true rank $R$, any slice $Y$ has rank at most $R$. Therefore, we can select an appropriate rank for every slice $Y$ and take the maximum of these ranks as the rank of the tensor.
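As an illustration, the three kinds of slices can be extracted by simple NumPy indexing; this is a sketch with arbitrary dimensions.

```python
import numpy as np

I, J, K = 30, 40, 50
# A non-negative third-order tensor (random, for illustration only).
X = np.abs(np.random.default_rng(0).normal(size=(I, J, K)))

horizontal = [X[i, :, :] for i in range(I)]  # I slices of shape (J, K)
lateral    = [X[:, j, :] for j in range(J)]  # J slices of shape (I, K)
frontal    = [X[:, :, k] for k in range(K)]  # K slices of shape (I, J)
```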
Next, we select the rank of $Y$ so that the total code-length is minimized. The total code-length of $Y$ with rank $R$ is given by
$L(Y; R) = L(W) + L(H) + L(E),$
where $L(x)$ is the code-length required for encoding $x$ under the prefix condition. In order to calculate the code-lengths of the elements in $W$, $H$, and $E$, we need to discretize them with an appropriate precision $\varepsilon$, since they are real-valued numbers with unlimited precision, which cannot be encoded. For a given precision $\varepsilon$, the elements of each matrix are first vectorized, then discretized into bins of width $\varepsilon$ to create a histogram. Letting the minimum and maximum of the elements be $x_{\min}$ and $x_{\max}$, respectively, the histogram has bins
$[\,x_{\min} + (i-1)\varepsilon,\; x_{\min} + i\varepsilon\,), \quad i = 1, \ldots, s,$
where $s = \lceil (x_{\max} - x_{\min}) / \varepsilon \rceil$ is the number of bins.
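A minimal sketch of this discretization step in Python; the helper name and the default precision are our own illustrative choices.

```python
import numpy as np

def to_histogram(M, eps=0.01):
    """Vectorize a matrix and bin its elements with precision eps.

    Returns the bin counts h_1, ..., h_s and the number of bins s.
    """
    x = np.ravel(M)
    s = max(int(np.ceil((x.max() - x.min()) / eps)), 1)  # number of bins
    edges = x.min() + eps * np.arange(s + 1)             # bin boundaries
    counts, _ = np.histogram(x, bins=edges)
    return counts, s
```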
Then we employ the NML code-length to compute the code-length of the binned data, where each datum denotes an element assigned to a bin. The NML code-length is a theoretically reasonable coding scheme when the parameter value is unknown. Let $p(\,\cdot\,;\theta)$ be the probability distribution with parameter $\theta$ under the model $\mathcal{M}$. According to [12], given a data sequence $x^n = (x_1, \ldots, x_n)$, the NML code-length $L_{\mathrm{NML}}(x^n)$ for $x^n$ under the model $\mathcal{M}$ is given by
$L_{\mathrm{NML}}(x^n) = -\log p(x^n; \hat{\theta}(x^n)) + \log \sum_{y^n} p(y^n; \hat{\theta}(y^n)), \qquad (3)$
where $\hat{\theta}(x^n)$ denotes the maximum likelihood estimator of $\theta$ from $x^n$. The second term in Equation (3) is generally difficult to calculate. Rissanen [12] derived its asymptotic approximation formula
$\log \sum_{y^n} p(y^n; \hat{\theta}(y^n)) = \frac{k}{2} \log \frac{n}{2\pi} + \log \int \sqrt{|I(\theta)|}\, d\theta + o(1), \qquad (4)$
where $k$ is the number of parameters in the model $\mathcal{M}$, $|I(\theta)|$ denotes the determinant of the Fisher information matrix, and the $o(1)$ term converges to zero uniformly over the parameter space as $n \to \infty$.
For $W$ and $H$, the zero terms in the factor matrices are generally much more numerous than the non-zero terms. Thus we compute the code-lengths of the zero terms, namely the first bin of the histogram, and of the non-zero terms separately:
$L(Z) = L(Z_0) + L(Z_{+}),$
where $Z$ is either $W$ or $H$, $Z_0$ represents the zero terms in $Z$, and $Z_{+}$ denotes the non-zero terms in $Z$.
For the zero terms, i.e., the first bin of the histogram, applying Equations (3) and (4) to the Bernoulli model gives the NML code-length
$L_{\mathrm{NML}}(Z_0) = -n_0 \log \frac{n_0}{n} - (n - n_0) \log \frac{n - n_0}{n} + \frac{1}{2} \log \frac{n}{2\pi} + \log \pi, \qquad (5)$
where $n$ is the total number of elements in $W$ or $H$, and $n_0$ denotes the number of zero values in this matrix.
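A minimal Python sketch of Equation (5); natural logarithms are used, and the boundary cases $n_0 \in \{0, n\}$ are handled by dropping the vanishing likelihood terms.

```python
import numpy as np

def bernoulli_nml(n, n0):
    """NML code-length for the zero/non-zero indicator, Equation (5).

    n: total number of elements; n0: number of zeros.
    """
    ll = 0.0
    if 0 < n0 < n:
        ll = -(n0 * np.log(n0 / n) + (n - n0) * np.log(1 - n0 / n))
    # Asymptotic parametric complexity of the Bernoulli model (k = 1);
    # the Fisher-information integral equals pi for this model.
    return ll + 0.5 * np.log(n / (2 * np.pi)) + np.log(np.pi)
```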
For the binned data in $E$, $W_{+}$, and $H_{+}$, applying Equations (3) and (4) to histogram densities with $s$ bins gives the NML code-length
$L_{\mathrm{NML}}(x^n) = -\sum_{i=1}^{s} h_i \log \frac{h_i}{n} + \frac{s-1}{2} \log \frac{n}{2\pi} + \log \frac{\pi^{s/2}}{\Gamma(s/2)} + L_{\mathbb{N}}(s), \qquad (6)$
where $h_i$ is the number of elements in the $i$-th bin and $\Gamma(\cdot)$ is the gamma function. $L_{\mathbb{N}}(s)$ is the code-length of an integer [10], which can be computed as
$L_{\mathbb{N}}(s) = \log c_0 + \log s + \log \log s + \cdots, \qquad (7)$
where the summation is taken over all the positive iterates and $c_0 \approx 2.865$.
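A minimal Python sketch of Equations (6) and (7); natural logarithms are used, and $c_0 \approx 2.865$ is the normalizing constant of the integer code.

```python
import numpy as np
from scipy.special import gammaln

def integer_code_length(s, c0=2.865064):
    """Code-length of a positive integer, Equation (7)."""
    length = np.log(c0)
    t = np.log(s)
    while t > 0:            # sum over all positive iterated logarithms
        length += t
        t = np.log(t)
    return length

def histogram_nml(counts):
    """NML code-length of binned data under the histogram model, Equation (6)."""
    counts = np.asarray(counts, dtype=float)
    n, s = counts.sum(), len(counts)
    nz = counts[counts > 0]
    ll = -(nz * np.log(nz / n)).sum()   # negative maximized log-likelihood
    # Asymptotic parametric complexity: k = s - 1 free parameters and
    # Fisher-information integral pi^(s/2) / Gamma(s/2) for the multinomial.
    comp = 0.5 * (s - 1) * np.log(n / (2 * np.pi)) \
           + 0.5 * s * np.log(np.pi) - gammaln(s / 2)
    return ll + comp + integer_code_length(s)
```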
Using Equations (5) and (6), the total description length of $Y$ can be calculated as
$L(Y; R) = L_{\mathrm{NML}}(E) + L_{\mathrm{NML}}(W_0) + L_{\mathrm{NML}}(W_{+}) + L_{\mathrm{NML}}(H_0) + L_{\mathrm{NML}}(H_{+}). \qquad (8)$
After applying the MDL principle to select, for each tensor slice, the rank with the shortest total code-length, we take the largest of all the slices' ranks as the rank of the tensor. This estimate can be seen as a lower bound on the rank of the tensor, since the rank of each slice is no larger than that of the data tensor: if the rank of the tensor $\mathcal{X}$ is $R$ and its decomposition is represented as $\mathcal{X} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r$, each slice can be represented as $Y = WH$ with rank-$R$ matrices $W$ and $H$. For example, each element of the frontal slice $\mathcal{X}_{::k}$ can be represented as $x_{ijk} = \sum_{r=1}^{R} w_{ir} h_{rj}$, where $w_{ir} = a_{ir} c_{kr}$ and $h_{rj} = b_{jr}$.
We show the entire procedure in Algorithm 1.
Algorithm 1 Rank Selection with Tensor Slices

Input: a non-negative third-order data tensor $\mathcal{X}$; the maximum candidate rank $R_{\max}$
  Slice $\mathcal{X}$ into tensor slices $Y_1, \ldots, Y_M$
  for $m = 1, \ldots, M$ do
    for $R = 1, \ldots, R_{\max}$ do
      Perform NMF on $Y_m$ with rank $R$ to obtain $W$ and $H$
      Calculate $E = Y_m - WH$
      Compute the total code-length $L(Y_m; R)$ using Equation (8)
    end for
    Select the rank of the tensor slice: $R_m = \arg\min_R L(Y_m; R)$
  end for
  Select the rank of the tensor: $\hat{R} = \max_m R_m$
  return $\hat{R}$
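A compact Python sketch of Algorithm 1, reusing the helpers `to_histogram`, `bernoulli_nml`, and `histogram_nml` sketched above together with scikit-learn's NMF; the restriction to frontal slices, the candidate rank range, and the precision are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import NMF

def slice_rank(Y, r_max, eps=0.01):
    """Select the rank of one tensor slice by minimizing Equation (8)."""
    best_rank, best_len = 1, np.inf
    for r in range(1, r_max + 1):
        model = NMF(n_components=r, init="random", random_state=0, max_iter=500)
        W = model.fit_transform(Y)
        H = model.components_
        E = Y - W @ H
        total = histogram_nml(to_histogram(E, eps)[0])      # L_NML(E)
        for Z in (W, H):
            n = Z.size
            n0 = int((Z == 0).sum())  # exact zeros; a small threshold
            total += bernoulli_nml(n, n0)                   # may be preferable
            nz = Z[Z > 0]
            if nz.size:
                total += histogram_nml(to_histogram(nz, eps)[0])
        if total < best_len:
            best_rank, best_len = r, total
    return best_rank

def tensor_rank(X, r_max):
    """Select the tensor rank as the maximum over the K frontal slices."""
    return max(slice_rank(X[:, :, k], r_max) for k in range(X.shape[2]))
```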
4. Comparison Method: MDL2stage
We further developed a novel algorithm, MDL2stage, as a comparison method. MDL2stage is based on tensor slices, like the proposed method, but it encodes the factorized results of NMF with the two-stage code-length. All of its calculations are exactly the same as in our proposed method except for the way of encoding the error matrix and the non-zero terms in the factor matrices.
In MDL2stage, we fit a parametric probability distribution to the histogram to estimate the probability of each bin. We assume that the elements in the histogram generated by the error matrix $E$ follow a normal distribution, and that the non-zero elements in the histograms of the two factor matrices $W$ and $H$ are gamma-distributed. Then we use the two-stage code-length to calculate the description lengths of $E$ and of the non-zero terms in $W$ and $H$ as
$L_{\mathrm{2stage}}(Z) = -\sum_{i=1}^{s} h_i \log \hat{p}_i + \frac{k}{2} \log n,$
where $Z$ is $E$ or the non-zero terms in $W$ or $H$, $\hat{p}_i$ denotes the estimated probability of an element falling in the $i$-th bin, $k$ is the number of parameters of the fitted distribution, and $n$ is the number of elements in $Z$.
The total code-length for a tensor slice $Y$ with rank $R$ is then
$L_{\mathrm{2stage}}(Y; R) = L_{\mathrm{2stage}}(E) + L_{\mathrm{NML}}(W_0) + L_{\mathrm{2stage}}(W_{+}) + L_{\mathrm{NML}}(H_0) + L_{\mathrm{2stage}}(H_{+}).$
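A minimal Python sketch of this two-stage encoding, using SciPy's distribution-fitting routines; fixing the gamma location at zero and taking $k = 2$ estimated parameters per distribution are our assumptions.

```python
import numpy as np
from scipy import stats

def two_stage_code_length(M, eps=0.01, dist="norm"):
    """Two-stage code-length of binned matrix elements.

    dist: "norm" for the error matrix E, "gamma" for non-zero factor terms.
    """
    x = np.ravel(M)
    s = max(int(np.ceil((x.max() - x.min()) / eps)), 1)
    edges = x.min() + eps * np.arange(s + 1)
    counts, _ = np.histogram(x, bins=edges)

    if dist == "norm":
        params = stats.norm.fit(x)           # mean and std: k = 2 parameters
        cdf = stats.norm(*params).cdf
    else:
        params = stats.gamma.fit(x, floc=0)  # shape and scale: k = 2 parameters
        cdf = stats.gamma(*params).cdf
    # Probability mass of each bin under the fitted distribution.
    p = np.maximum(cdf(edges[1:]) - cdf(edges[:-1]), 1e-12)

    n, k = x.size, 2
    return -(counts * np.log(p)).sum() + 0.5 * k * np.log(n)
```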
Again, we apply the MDL principle to choose, for each slice, the rank with the shortest total code-length, and select the greatest of all the slices' ranks as the rank of the tensor.
As for the computational complexity of our proposed method and MDL2stage, the cost of factorizing a tensor slice of size $I \times J$ is $O(IJR)$ per iteration. Since there are $K$ such tensor slices, the total computational complexity of the two slice-based methods is $O(IJKR)$ per iteration. By a similar argument, performing NMF on all tensor slices costs $O(IJKRT)$, where $T$ denotes the number of iterations in NMF.
Although in theory we have to perform NMF on all the slices, numerical experiments have shown that in practice we only need to factorize the $K$ tensor slices of the largest size $I \times J$, where we assume that $K$ is the smallest of $I$, $J$, and $K$. This is because smaller tensor slices usually yield comparatively low ranks. Therefore, when we take the largest rank over all slices, whether or not we use the tensor slices of smaller sizes does not influence the final result.