1. Introduction
During the past decades, the landscape of computational mathematics has been dramatically changed by the data revolution in the era of Big Data, which features the 4V's [1] (i.e., high volume, high velocity, high veracity and high variety) and stimulates the demand for new linear algebra tools to cope with massive amounts of complex data. A major challenge is how to model high-dimensional data without losing their inherent multi-linear structure. In other words, a provably optimal representation of high-dimensional data is crucial to provide both theoretical and practical foundations for downstream tasks [2]. In this sense, canonical data representation in the form of vectors or matrices is no longer adequate, because vectorization or matricization (i.e., unfolding of multi-dimensional data into vectors or matrices [3]) may lead to sub-optimal performance due to a loss of relationships across the various dimensions [4,5]. In contrast, in areas flooded with high-dimensional data, such as fluid mechanics, electrodynamics and general relativity, concise mathematical frameworks can be successfully constructed for formulating and solving problems by means of tensors.
A tensor is an algebraic object that describes a multi-linear relationship between sets of algebraic objects. Mathematically, any tensor of type $(r, s)$ is represented by a multi-dimensional array with definite numbers of upper indices $r$ and lower indices $s$ to indicate an $r$-times contravariance and an $s$-times covariance, respectively. The term has a clear geometrical significance and can be formally defined based on the following general transformation formula [6,7,8]:
$$T = T^{i_1 \cdots i_r}_{j_1 \cdots j_s}\, \mathbf{g}_{i_1} \otimes \cdots \otimes \mathbf{g}_{i_r} \otimes \mathbf{g}^{j_1} \otimes \cdots \otimes \mathbf{g}^{j_s} = \bar{T}^{k_1 \cdots k_r}_{l_1 \cdots l_s}\, \bar{\mathbf{g}}_{k_1} \otimes \cdots \otimes \bar{\mathbf{g}}_{k_r} \otimes \bar{\mathbf{g}}^{l_1} \otimes \cdots \otimes \bar{\mathbf{g}}^{l_s},$$
where $\mathbf{g}_i$ ($\bar{\mathbf{g}}_k$) and $\mathbf{g}^j$ ($\bar{\mathbf{g}}^l$) represent covariant and contravariant base vectors in the coordinate systems $x$ and $\bar{x}$, respectively. To make this notoriously impenetrable concept easier to understand, Ballard et al. presented the Chicago crime data (available at www.cityofchicago.org) as an example represented by a tensor whose modes correspond to 365 days, 24 h, 77 communities, and 11 crime types. Entry $x_{ijkl}$ is the number of times that crime $l$ happened in neighborhood $k$ during hour $j$ on day $i$, as shown in Figure 1 [9].
Tensor-based approaches have attracted increasing attention due to their capacity for exploiting multi-linear relationships in multi-way data. Among them, tensor decomposition (TD) is the art of disassembling multi-dimensional arrays into smaller parts and finds ubiquitous applications in machine learning, neuroscience, quantum computing, signal processing, etc. [9]. Recently, TD has been employed to compress deep convolutional neural networks (DCNNs), i.e., reducing the number of parameters and the time needed to train a model from scratch, with the purpose of accelerating DCNNs [10,11].
1.1. A Brief Introduction to Tucker Decomposition
In 1963, Tucker [12] proposed the well-known Tucker decomposition, which represents any $d$-dimensional tensor $\mathcal{X}$ as a contraction between a $d$-dimensional core tensor $\mathcal{G}$ and $d$ projection matrices $\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(d)}$:
$$\mathcal{X} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_d \mathbf{U}^{(d)}.$$
The projection matrices are typically orthogonal, and each serves as the principal components of the corresponding mode.
Figure 2 presents a comparison between the principle of the CP decomposition and that of the Tucker decomposition, revealing the correlation between them.
For a hyperspectral image, each projection matrix represents a different feature. For example, $\mathbf{A}$ and $\mathbf{B}$ correspond to spatial features, and $\mathbf{C}$ represents the spectral feature. The corresponding third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ can be decomposed as follows:
$$\mathcal{X} = \mathcal{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C},$$
where $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ is the core tensor, and $\mathbf{A} \in \mathbb{R}^{I \times P}$, $\mathbf{B} \in \mathbb{R}^{J \times Q}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$ are orthogonal projection matrices. The mode-$k$ product ($\times_k$) of $\mathcal{G}$ by $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ is calculated, for mode 1, by
$$(\mathcal{G} \times_1 \mathbf{A})_{iqr} = \sum_{p=1}^{P} g_{pqr}\, a_{ip},$$
and analogously for $\times_2$ and $\times_3$. The decomposition may also be described more directly, element by element, as
$$x_{ijk} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr}.$$
Here, $P$, $Q$ and $R$ are the numbers of columns in the projection matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$, respectively. If $P$, $Q$, $R$ are smaller than $I$, $J$, $K$, the core tensor $\mathcal{G}$ can be considered a compressed version of $\mathcal{X}$.
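To make the element-wise formula concrete, the following minimal Scala sketch reconstructs a single tensor entry from a core tensor and three projection matrices. The toy dimensions and names are our own illustration of the formula, not the implementation evaluated later in this paper:

```scala
object TuckerReconstruction {
  // x_ijk = sum_{p,q,r} g_pqr * a_ip * b_jq * c_kr
  def entry(g: Array[Array[Array[Double]]],
            a: Array[Array[Double]],
            b: Array[Array[Double]],
            c: Array[Array[Double]],
            i: Int, j: Int, k: Int): Double = {
    val (np, nq, nr) = (g.length, g(0).length, g(0)(0).length)
    var sum = 0.0
    for (p <- 0 until np; q <- 0 until nq; r <- 0 until nr)
      sum += g(p)(q)(r) * a(i)(p) * b(j)(q) * c(k)(r)
    sum
  }

  def main(args: Array[String]): Unit = {
    // Toy sizes: a 2x2x2 tensor reconstructed from a 1x1x1 core (rank-1 Tucker).
    val g = Array(Array(Array(2.0)))
    val a = Array(Array(1.0), Array(0.5))
    val b = Array(Array(1.0), Array(1.0))
    val c = Array(Array(3.0), Array(0.0))
    println(entry(g, a, b, c, 1, 0, 0)) // 2.0 * 0.5 * 1.0 * 3.0 = 3.0
  }
}
```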
1.2. Challenges Incurred by the Streaming Data
With the advent of streaming data, i.e., data flowing continuously from a source to a sink, the problem becomes extremely challenging because data are usually generated simultaneously and at high speed by different sources such as IoT sensors. Worse still, data in real-world applications may continuously arrive at every mode, i.e., data evolve over time at all modes [13]. Take a movie recommendation system as an example: a user–movie–date tensor is constructed, with an element $x_{ijk}$ denoting the rating given by user $i$ to movie $j$ on day $k$. Evidently, each mode of this tensor is subject to a never-ending temporal evolution. Figure 3, cited from the website of the well-known data service vendor Qlik, illustrates the technical complexity of this setting.
Due to the wide variety of sources, as well as the scale and velocity at which the data are generated, traditional data pipelines cannot keep up with near-real-time or real-time processing because they have to extract, transform, and load data before the data can be manipulated. Moreover, in real-world applications (e.g., movie recommendation), data continuously arrive at every mode, i.e., data evolve over time at all modes [13]. Since the existing online TD methods cannot cope with this problem because of high computation and storage costs [14], a desirable online tensor decomposition should be able to dynamically update the decomposition of real-time, large-scale tensors while preserving their low-rank structure. In this sense, the Tucker decomposition is attractive because of its unique property that it does not need to decompose along all axes (modes) [15]. Xiao et al. [13] proposed an efficient online Tucker decomposition (eOTD) approach to track the tensor decomposition of dynamic large-scale tensors on the fly. In other words, a given $K$-th-order tensor $\mathcal{X}_t \in \mathbb{R}^{I_1^{(t)} \times I_2^{(t)} \times \cdots \times I_K^{(t)}}$ at time $t$ evolves at all modes, namely $I_k^{(t)} \leq I_k^{(t+1)}$ for $k = 1, \ldots, K$, to generate a tensor stream $\{\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_t, \ldots\}$. The Tucker decomposition is defined by
$$\mathcal{X}_t \approx \mathcal{G}_t \times_1 \mathbf{U}_t^{(1)} \times_2 \mathbf{U}_t^{(2)} \cdots \times_K \mathbf{U}_t^{(K)},$$
where $\mathcal{G}_t$ and $\mathbf{U}_t^{(k)}$ denote the core tensor and projection matrices, respectively, at timestamp $t$. The snapshot tensor at timestamp $t+1$ is $\mathcal{X}_{t+1} \in \mathbb{R}^{I_1^{(t+1)} \times \cdots \times I_K^{(t+1)}}$, where $I_k^{(t)} \leq I_k^{(t+1)}$; i.e., $\mathcal{X}_t$ is a sub-tensor of $\mathcal{X}_{t+1}$, and the remainder of $\mathcal{X}_{t+1}$ consists of newly arriving sub-tensors. $\mathcal{X}_{t+1}$ can be decomposed by eOTD. More specifically, eOTD obtains the projection matrices $\mathbf{U}_{t+1}^{(k)}$ by updating $\mathbf{U}_t^{(k)}$ using $\mathcal{X}_t$ and the new sub-tensors, together with some auxiliary matrices obtained at timestamp $t$. In a similar vein, the core tensor is updated by a sum of tensors that are calculated by multiplying smaller tensors with matrices.
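For instance (sizes ours, purely illustrative), a third-order ($K = 3$) user–movie–date tensor might evolve as
$$\mathcal{X}_t \in \mathbb{R}^{1000 \times 500 \times 30} \quad \longrightarrow \quad \mathcal{X}_{t+1} \in \mathbb{R}^{1080 \times 530 \times 31},$$
i.e., 80 new users, 30 new movies and 1 new day arrive during $(t, t+1]$, so $I_k^{(t)} \leq I_k^{(t+1)}$ holds at every mode and $\mathcal{X}_t$ occupies the leading corner of $\mathcal{X}_{t+1}$.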
However, most research on tensor decomposition has been confined to standalone (i.e., single-machine) computational environments, which are not powerful enough to handle large-scale data; e.g., making solar energy affordable requires petaFLOPS of computing resources. This challenge has spurred the involvement of parallel computation [16]. Unlike serial computing, a parallel architecture breaks a complex or formidable task into a set of sub-tasks, which are allocated to different computing equipment so that multiple operations are carried out simultaneously. Although parallel computing has taken a huge leap forward, its engagement with tensor decomposition is not yet mature due to several factors, including nonuniform task assignment and the synchronization among different computing nodes.
In this paper, we take eOTD as the foundation and starting point and propose a two-level parallel incremental tensor Tucker decomposition method with multi-mode growth (TPITTD-MG) to reduce computational complexity and improve computing efficiency by parallelizing the incremental tensor Tucker decomposition, particularly for tensors evolving over time at multiple modes. Specifically, the main contributions can be summarized as follows:
- (1)
Dynamic programming is utilized to achieve efficient and uniform partitioning of sub-tensors, with a tensor partitioning algorithm applied to assign sub-tensors across different task nodes, which is crucial for scaling to tens of millions of data elements. This approach not only accelerates execution by counting non-zero elements in parallel but also ensures a more uniform task assignment.
- (2)
A two-level parallel update approach for projection matrices and core tensor is designed. The first level conducts updates via a parallel MTTKRP strategy, while the second level independently updates projection matrices based on categorized sub-tensors. This structured update mechanism substantially improves update efficiency.
- (3)
The experimental results demonstrate that our method outperforms existing algorithms, with a nearly 400% improvement in execution efficiency and a 20% improvement in partition uniformity at a parallelism level of 4 on large-scale datasets. Specifically, for third-order tensors, our approach shows a nearly 300% efficiency improvement compared with traditional single-level update algorithms.
The remainder of this paper is organized as follows. Section 2 reviews the related works. Section 3 introduces the important concepts involved in this paper and describes the proposed TPITTD-MG framework in detail. Section 4 evaluates its performance experimentally. Finally, we conclude the paper in Section 5.
2. Related Works
The Tucker decomposition can be considered a generalization of the matrix singular value decomposition (SVD), which has led to the development of the higher-order singular value decomposition (HOSVD) [17], a popular approach capable of yielding an orthogonal core tensor [18]. Since then, numerous methods have emerged in the literature to compute the representation of a tensor. Since the Tucker decomposition is essentially a best rank-$(R_1, R_2, \ldots, R_N)$ approximation of a tensor [19], research in this field focuses on topics such as computation and storage complexity, parallel computing, optimization methods, as well as how to exploit latent features of tensors, including sparsity, orthogonality, etc.
Kapteyn and Neudecker [20] extended the capacity of the Tucker decomposition from third-order tensors to higher-order ones by employing orthogonal projection matrices. De Lathauwer et al. [14] proposed the higher-order orthogonal iteration (HOOI) method, which is proven to be more efficient than HOSVD in computing projection matrices. Furthermore, Elden and Savas [21] introduced the Newton–Grassmann algorithm for the Tucker decomposition of third-order tensors, offering a more efficient approach requiring fewer iterations by constraining the projection matrices to the Grassmann manifold. Fang et al. [22] developed a Bayesian streaming sparse Tucker decomposition method (BASS-Tucker) capable of preventing overfitting and improving interpretability by automatically selecting meaningful projection interactions.
Thanh et al. [23] provided a contemporary and comprehensive survey on streaming tensor decomposition, in which streaming Tucker decomposition algorithms are broadly grouped into three classes, i.e., online tensor dictionary learning, tensor subspace tracking, and multi-aspect streaming Tucker decomposition. The first two classes are dedicated to two specific scenarios of single-aspect streaming Tucker decomposition; the last class addresses multi-aspect streaming tensors.
On the other hand, some researchers have pushed the enhancement of the Tucker decomposition in direct, randomized, and iterative ways. Vannieuwenhoven et al. [24] presented an alternative strategy, called the sequentially truncated HOSVD (ST-HOSVD), to the truncated higher-order singular value decomposition (T-HOSVD) proposed by De Lathauwer et al., with the purpose of reducing computational complexity and improving the approximation error. Kressner et al. [25] attained a fast algorithm by combining fast matrix–vector products that exploit the structure of Hadamard products with iterative methods, such as the Lanczos method and randomized algorithms. Che et al. [26] took randomized algorithms as powerful tools for scientific computing because such algorithms are usually faster and more robust than standard deterministic algorithms; they designed an adaptive randomized algorithm to compute a low multilinear rank approximation of tensors with unknown multilinear rank and analyzed its probabilistic error bound under certain assumptions. Chachlakis et al. [27] designed L1-Tucker, i.e., a reformulation of the standard Tucker decomposition that substitutes the sturdier $L_1$-norm for the outlier-responsive $L_2$-norm, and proposed the $L_1$-norm higher-order orthogonal iterations (L1-HOOI) algorithm for the approximate solution of L1-Tucker.
As for parallel computing, research focuses on sparse tensors, parallel platforms with multicore nodes, data dependencies, and the combination of randomization and parallelization for efficient communication schemes, etc. [16,28]. Oh et al. [29] developed the parallel Tucker decomposition algorithm P-TUCKER for sparse tensors, emphasizing memory savings and enhanced computational efficiency with scalability. Ballard, Zhang et al. [30] explored parallel computing frameworks for tensor Tucker decomposition. The advent of GigaTensor [31] marked a significant advance in tensor decomposition algorithms for large-scale tensors, followed by Park et al. [32], who integrated these developments into the BigTensor algorithm library. Acer et al. [33] focused on tensor partitioning algorithms to further enhance decomposition efficiency. Minster et al. [16] proposed two randomized algorithms based on HOSVD and ST-HOSVD, respectively, and offered a new parallel implementation for the structured randomized sketch; their key idea is to perform randomized sketches with Kronecker-structured random matrices in order to reduce computational complexity, and a probabilistic error analysis of the proposed algorithms was also provided. As for distributed and parallel algorithms for the Tucker decomposition, Yao et al. [34] and Ning et al. [35] proposed implementations based on Apache Spark, which are especially effective in processing high-dimensional tensors, with excellent speedups and high robustness.
The aforementioned tensor decomposition algorithms are executed in parallel, which can significantly improve computational efficiency and save memory space. However, they are all full-scale parallel algorithms: when streaming data are dealt with, the tensor must be continuously restored and re-decomposed, resulting in a large amount of redundant computation that deteriorates computational efficiency. Researchers have attempted to address this issue in different ways.
Some researchers focus on incremental tensor decomposition. Nion et al. [36] proposed an adaptive CP decomposition algorithm for third-order tensors. Phan et al. [37] studied incremental tensor decomposition using the idea of block computation. Sarwar et al. [38] investigated the processing of dynamically growing data streams. However, these algorithms can only cope with tensors evolving at a single mode, whereas many real-world applications require decomposition methods for tensors evolving at multiple modes, i.e., multi-aspect streaming tensors. Song et al. [39] proposed the multi-aspect streaming tensor (MAST) algorithm to achieve incremental CP decomposition. Yang et al. [40] proposed the DisMASTD method to realize incremental CP decomposition in a parallel manner. As for the Tucker decomposition of multi-aspect streaming tensors, Xiao et al. [13] proposed the eOTD algorithm, which is designed to run in a standalone mode.
Other researchers adopt online mechanisms. Zhou et al. [41] proposed an online CP algorithm for incremental tensor decomposition. Yu et al. [42] studied incremental tensor Tucker decomposition and proposed an online low-rank tensor learning algorithm. In 2023, Rontogiannis et al. [43] proposed the online block-term decomposition reweighted least squares (O-BTD-RLS) algorithm, which employs a sliding (truncated) window whose duration is either chosen according to the dynamics of the system under study or adapted online.
In summary, although incremental tensor decomposition algorithms for processing streaming data have become a research hotspot, the relevant literature remains rather limited, and studies on the parallel decomposition of tensors with multi-mode growth are particularly insufficient. Therefore, this article focuses on parallel Tucker decomposition of multi-mode growth incremental tensors as its primary research object.
3. Two-Level Parallel Incremental Tensor Tucker Decomposition Method with Multi-Mode Growth (TPITTD-MG)
In order to make the description clear, we denote tensors, matrices, vectors and scalars with calligraphic letters (e.g., $\mathcal{A}$), uppercase bold letters (e.g., $\mathbf{A}$), lowercase bold letters (e.g., $\mathbf{a}$) and lowercase normal font (e.g., $a$), respectively. Important notations involved in this paper are listed in Table 1.
3.1. Technical Details of TPITTD-MG
This subsection deals with incremental Tucker decomposition for tensors evolving over time at all modes. It starts with a brief introduction to an off-the-shelf incremental Tucker decomposition method, i.e., the efficient online Tucker decomposition (eOTD) proposed by Xiao et al. [13], followed by our work on improving computational efficiency by means of parallel computing on the basis of eOTD. When parallel incremental Tucker decomposition is implemented, it is a prerequisite to allocate the partitioned sub-tensors across different workhorses. However, current tensor partitioning algorithms often suffer from problems such as low execution efficiency and nonuniform partitions. To address these issues, we propose a parallel sub-tensor allocation method based on dynamic programming, which significantly improves computational efficiency and achieves a more uniform sub-tensor allocation, facilitating an optimal balance of the computation load among different workhorses. Furthermore, to address the low parallelism in updating the projection matrices and the core tensor in current incremental Tucker decomposition methods, a two-level parallel update method is proposed.
3.1.1. A Brief Introduction to the Efficient Online Tucker Decomposition (eOTD) Approach
Given a tensor $\mathcal{X}_t$ and its Tucker decomposition snapshot at timestamp $t$, when the tensor has evolved to $\mathcal{X}_{t+1}$ at timestamp $t+1$, the eOTD approach starts with a partition of $\mathcal{X}_{t+1}$ into $2^K$ sub-tensors such that $\mathcal{X}_{t+1}^{(1 \cdots 1)} = \mathcal{X}_t$, which will be employed to efficiently implement the Tucker decomposition by updating the core tensor $\mathcal{G}_{t+1}$ and the factor matrices (also called projection matrices) $\mathbf{U}_{t+1}^{(k)}$.
The $2^K$ sub-tensors obtained are classified into $K$ categories according to their geometric positions relative to $\mathcal{X}_t$, denoted as $C_1, C_2, \ldots, C_K$. Take a third-order tensor $\mathcal{X}_t$ as an example: it evolves to $\mathcal{X}_{t+1}$ at timestamp $t+1$, which is split into $2^3 = 8$ sub-tensors $\mathcal{X}_{t+1}^{(i_1 i_2 i_3)}$ with $i_1, i_2, i_3 \in \{1, 2\}$, as shown in Figure 4. Among the 8 sub-tensors, $\mathcal{X}_{t+1}^{(111)}$ refers to $\mathcal{X}_t$, and the other 7 sub-tensors correspond to the newly arriving data, i.e., the incremental part of $\mathcal{X}_{t+1}$ during the time range $(t, t+1]$, which can be categorized into three groups according to their geometric position relative to $\mathcal{X}_t$:
$$C_1 = \{\mathcal{X}_{t+1}^{(211)}, \mathcal{X}_{t+1}^{(121)}, \mathcal{X}_{t+1}^{(112)}\}, \quad C_2 = \{\mathcal{X}_{t+1}^{(221)}, \mathcal{X}_{t+1}^{(212)}, \mathcal{X}_{t+1}^{(122)}\}, \quad C_3 = \{\mathcal{X}_{t+1}^{(222)}\}.$$
Elements in $C_1$ can be considered the rear, right and downstairs neighbors of $\mathcal{X}_t$, respectively; namely, each of them shares a face with $\mathcal{X}_t$. Similarly, elements in $C_2$ and the only element in $C_3$ share an edge or a vertex with $\mathcal{X}_t$, respectively. This geometric understanding generalizes to higher-order scenarios, which is more useful for tensor analysis, despite the weaker intuitiveness.
From the perspective of contributions to one or more auxiliary matrices, the three groups defined by Equation (26) can also be understood according to the number of sub-indices equal to 2, which indicates the number of auxiliary matrices that can be updated. For example, $\mathcal{X}_{t+1}^{(211)}$ can be used for updating $\mathbf{B}^{(1)}$, $\mathcal{X}_{t+1}^{(221)}$ can be used for updating both $\mathbf{B}^{(1)}$ and $\mathbf{B}^{(2)}$, while $\mathcal{X}_{t+1}^{(222)}$ contributes to $\mathbf{B}^{(1)}$, $\mathbf{B}^{(2)}$ and $\mathbf{B}^{(3)}$.
3.1.2. Update Methods for Projection Matrices
Every sub-tensor $\mathcal{X}_{t+1}^{(i_1 \cdots i_K)}$ will be used $m$ times for the update of auxiliary matrices, where $m$ is the number of its sub-indices equal to 2. According to Corollary II (i.e., Block Tensor and Matrix Multiplication II defined in Section 3), each sub-tensor can be expressed as
$$\mathcal{X}_{t+1}^{(i_1 \cdots i_K)} = \mathcal{G}_t \times_1 \mathbf{V}^{(1)} \times_2 \mathbf{V}^{(2)} \cdots \times_K \mathbf{V}^{(K)}, \quad \text{where } \mathbf{V}^{(k)} = \begin{cases} \mathbf{U}_t^{(k)}, & i_k = 1, \\ \mathbf{B}^{(k)}, & i_k = 2. \end{cases}$$
Let all the sub-indices to be updated form a set $S = \{k \mid i_k = 2\}$. For each sub-index $k \in S$, the update rule for the auxiliary matrix $\mathbf{B}^{(k)}$ is defined as
$$\mathbf{B}^{(k)} \leftarrow \mathbf{X}^{(i_1 \cdots i_K)}_{(k)} \big( \mathbf{W}^{(k)} \big)^{\dagger},$$
where $\mathbf{X}^{(i_1 \cdots i_K)}_{(k)}$ denotes the mode-$k$ matricization of the sub-tensor, $\mathbf{W}^{(k)}$ is the mode-$k$ matricization of the product of $\mathcal{G}_t$ with the remaining matrices $\mathbf{V}^{(j)}$, $j \neq k$, and $\dagger$ denotes the pseudo-inverse of a matrix. Subsequently, each $\mathbf{U}_t^{(k)}$ is augmented to $[\mathbf{U}_t^{(k)}, \mathbf{B}^{(k)}]$, which is orthogonalized and normalized by means of the modified Gram–Schmidt (MGS) procedure to calculate $\mathbf{U}_{t+1}^{(k)}$ at timestamp $t+1$ by updating $\mathbf{U}_t^{(k)}$:
$$\mathbf{U}_{t+1}^{(k)} = \mathrm{MGS}\big( [\mathbf{U}_t^{(k)}, \mathbf{B}^{(k)}] \big).$$
Namely, $\mathbf{U}_{t+1}^{(k)}$ is the (orthonormalized) concatenation of two matrices: $\mathbf{U}_t^{(k)}$ and $\mathbf{B}^{(k)}$.
Specifically, as for the scenarios of third-order tensors, each of the $2^3 = 8$ split sub-tensors $\mathcal{X}_{t+1}^{(i_1 i_2 i_3)}$ of $\mathcal{X}_{t+1}$ can be expressed as $\mathcal{G}_t \times_1 \mathbf{V}^{(1)} \times_2 \mathbf{V}^{(2)} \times_3 \mathbf{V}^{(3)}$ according to Corollary II, where $\mathbf{V}^{(k)}$ has the same definition as in Equation (27). Letting $\mathbf{V}^{(k)} = \mathbf{U}_t^{(k)}$ in case a sub-index is missing, a set of auxiliary tensors can be defined as the multiplications of $\mathcal{G}_t$ with the remaining projection matrices, i.e.,
$$\mathcal{Y}^{(k)} = \mathcal{G}_t \times_{j \neq k} \mathbf{V}^{(j)}.$$
Equations (31)–(34) lay the theoretical underpinning for defining the update rules of the auxiliary matrices using every sub-tensor, category by category, in $C_1$, $C_2$ and $C_3$.
A. Update Projection Matrices with Sub-Tensors in Classification $C_1$
As described previously, each sub-tensor in $C_1$, i.e., $\mathcal{X}_{t+1}^{(211)}$, $\mathcal{X}_{t+1}^{(121)}$ and $\mathcal{X}_{t+1}^{(112)}$, is used only once, for updating the corresponding auxiliary matrix $\mathbf{B}^{(1)}$, $\mathbf{B}^{(2)}$ and $\mathbf{B}^{(3)}$, respectively, where the required auxiliary tensors are defined by Equation (31).
When the update of the projection matrices has been implemented with sub-tensors in $C_1$, they should be further refined by successive updates with sub-tensors in the $C_2$ and $C_3$ classifications, because sub-tensors in different classifications make different contributions to one or more auxiliary matrices.
B. Update Projection Matrices with Sub-Tensors in Classification $C_2$
As defined by Equation (26), each sub-tensor in $C_2$ has two sub-indices equal to 2, implying that it will be used twice, to update two corresponding auxiliary matrices. As for $\mathcal{X}_{t+1}^{(221)}$, it is employed to update both $\mathbf{B}^{(1)}$ and $\mathbf{B}^{(2)}$. The update of $\mathbf{B}^{(1)}$ requires the auxiliary matrix $\mathbf{B}^{(1)}_{(211)}$ calculated based on sub-tensors in the $C_1$ classification, which can be expressed by
$$\mathbf{B}^{(1)}_{(221)} = \mu\, \mathbf{B}^{(1)}_{(211)} + (1 - \mu)\, \mathbf{X}^{(221)}_{(1)} \big( \mathbf{W}^{(1)} \big)^{\dagger}.$$
Here, $\mu$ is a forgetting factor, indicating how much information should be inherited from the previous step, and $\mathbf{B}^{(1)}_{(221)}$ means the auxiliary matrix $\mathbf{B}^{(1)}$ updated using the sub-tensor $\mathcal{X}_{t+1}^{(221)}$. The same principle goes for $\mathbf{B}^{(2)}$ and $\mathbf{B}^{(3)}$.
C. Update Projection Matrices with Sub-Tensors in Classification $C_3$
Since the $C_3$ classification contains only one sub-tensor, $\mathcal{X}_{t+1}^{(222)}$, with three sub-indices equal to 2, it is associated with all auxiliary matrices, i.e., it is used to update $\mathbf{B}^{(1)}$, $\mathbf{B}^{(2)}$ and $\mathbf{B}^{(3)}$ in the same manner.
D. Update Projection Matrices
Given the projection matrices $\mathbf{U}_t^{(k)}$ and the auxiliary matrices $\mathbf{B}^{(k)}$ obtained at time $t$, the projection matrices $\mathbf{U}_{t+1}^{(k)}$ at timestamp $t+1$ are updated according to the rule
$$\widetilde{\mathbf{U}}_{t+1}^{(k)} = \big[ \mathbf{U}_t^{(k)},\ \mathbf{B}^{(k)} \big],$$
which involves a concatenation of $\mathbf{U}_t^{(k)}$ and $\mathbf{B}^{(k)}$ at the $k$-th mode. Since Equation (47) cannot guarantee that the generated $\widetilde{\mathbf{U}}_{t+1}^{(k)}$ will be unitary, the modified Gram–Schmidt (MGS) algorithm is performed on $\widetilde{\mathbf{U}}_{t+1}^{(k)}$ to produce the orthonormal projection matrix $\mathbf{U}_{t+1}^{(k)}$.
3.1.3. Update Method for Core Tensor
Based on the projection matrices $\mathbf{U}_{t+1}^{(k)}$, the core tensor $\mathcal{G}_t$ at timestamp $t$, as well as the newly arriving data at timestamp $t+1$, and since the $\mathbf{U}_{t+1}^{(k)}$ are unitary matrices, the core tensor $\mathcal{G}_{t+1}$ at timestamp $t+1$ can be updated by
$$\mathcal{G}_{t+1} = \sum_{(i_1 \cdots i_K)} \mathcal{X}_{t+1}^{(i_1 \cdots i_K)} \times_1 \mathbf{P}^{(1)\top} \times_2 \mathbf{P}^{(2)\top} \cdots \times_K \mathbf{P}^{(K)\top}$$
according to the Second Corollary. The update procedure incorporates a splitting of each $\mathbf{U}_{t+1}^{(k)}$ into two blocks, $\mathbf{U}_{t+1}^{(k)} = [\mathbf{P}_1^{(k)}, \mathbf{P}_2^{(k)}]$, where $\mathbf{P}^{(k)} = \mathbf{P}_1^{(k)}$ if $i_k = 1$ and $\mathbf{P}^{(k)} = \mathbf{P}_2^{(k)}$ if $i_k = 2$.
In summary, eOTD is efficient in dealing with the tensor stream defined in Section 1.2 and can be applied to large-scale applications because it merely involves tensor–matrix multiplications and matrix pseudo-inverse operations, which are cheap compared with the computationally expensive SVD. Meanwhile, eOTD is established on a solid theoretical foundation, i.e., the two corollaries proposed by Xiao et al. [13], which makes it a powerful tool to track the Tucker decomposition of dynamic tensors with an arbitrary number of modes.
3.2. Parallel Sub-Tensor Partitioning Algorithm
As seen in Section 3.1, splitting a tensor into sub-tensors is an indispensable task: the sub-tensors act as the fundamental elements for updating the core tensor and the projection matrices. Despite the relatively satisfactory efficiency provided by the eOTD approach, its standalone operation paradigm becomes an obstacle on the path to successful application in larger-scale scenarios, motivating the involvement of parallel decomposition methods. Generally, for parallel computing, the strategy for allocating the obtained sub-tensors uniformly among different computing resources is a major concern. For ease of presentation, we figuratively refer to each computing resource in a distributed environment as a workhorse.
In this subsection, we start from the sub-tensors obtained by eOTD and exploit their latent structure to enhance the computing efficiency of parallel decomposition and reduce the bottleneck cost incurred by fundamental tensor-related operations, e.g., the matricized tensor times Khatri–Rao product (MTTKRP [44]) in the Tucker decomposition. Since only non-zero elements of tensors contribute to the MTTKRP cost, an appropriate partition criterion for sub-tensors is that each partitioned group contains an equal number of non-zero tensor elements.
For example, given a sub-tensor $\mathcal{X}$, we attempt to partition $\mathcal{X}$ at mode-2, i.e., the second dimension of $\mathcal{X}$, producing five slices in total, as shown by Figure 5, and the non-zero elements in each slice are counted. Suppose the task in this scenario is to allocate the five slices among three workhorses (i.e., Workhorse I, Workhorse II and Workhorse III); the optimal strategy can be summarized as follows:
To allocate the first slice to Workhorse I; the corresponding six non-zero elements form Group I;
To allocate the second and fourth slices to Workhorse II; the corresponding six non-zero elements form Group II; and
To allocate the third and fifth slices to Workhorse III; the corresponding six non-zero elements form Group III.
As a result, all the non-zero elements are uniformly partitioned into three groups, and each workhorse is allocated six non-zero elements, i.e., a perfectly uniform allocation is realized. However, such an ideal situation as occurs in this simple scenario may not always arise in practice, because splitting a sub-tensor into slices at a mode is analogous to the task of uniformly splitting a set of positive integers, a well-known NP-hard problem due to its complexity and the impracticality of finding an optimal solution [45]. To address this issue, Yang et al. [40] introduced a heuristic algorithm, DisMASTD, which counts the number of non-zero elements (NNZ) in each slice at mode-$n$ of the sub-tensor and sums all these counts to yield a total number, $nnz_{total}$. Let $P_n$ represent the desired number of partitions at mode-$n$; the optimal number of non-zero elements in each partition, named $nnz_{opt}$, is calculated by $nnz_{opt} = nnz_{total} / P_n$. The workflow of DisMASTD is described as follows; it is also given in pseudo code in Algorithm 1.
(1) Initially, DisMASTD traverses the slices at mode-$n$ of the sub-tensor $\mathcal{X}$ (i.e., $\mathcal{X}$ has $N$ modes in total, and mode-$n$ has $I_n$ slices, $n = 1, \ldots, N$), counting the non-zero elements of each slice, denoted as $nnz_i$.
(2) DisMASTD traverses mode-$n$ of $\mathcal{X}$ again, slice by slice, greedily assigning the traversed slices to the current partition $P$ until its total number of non-zero elements (denoted as $nnz_P$) reaches $nnz_{opt}$.
(3) When the $i$-th slice is being traversed and the $nnz_P$ of the current partition reaches $nnz_{opt}$, DisMASTD compares the deviations from $nnz_{opt}$ incurred by two different strategies:
i. assigning slice $i$ to the current partition, which incurs the deviation $d_1 = |nnz_P + nnz_i - nnz_{opt}|$; and
ii. not assigning slice $i$ to the current partition, which incurs the deviation $d_2 = |nnz_P - nnz_{opt}|$.
The strategy with the smaller deviation is accepted, i.e., if $d_1 \leq d_2$, slice $i$ is assigned to the current partition. Otherwise, go to step (4).
(4) Generate the next partition for mode-$n$, and assign slice $i$ to it.
(5) DisMASTD goes on to traverse the $(i+1)$-th slice of mode-$n$ and decides whether this slice should be assigned to the current partition. The loop iterates until the $nnz_P$ of the second-to-last partition reaches $nnz_{opt}$.
(6) The residual slices at mode-$n$ are assigned to the last partition.
(7) Once the partitions at all modes are completed, DisMASTD outputs the partition result $R$.
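For instance (our own illustrative numbers), suppose $nnz_{opt} = 6$, the current partition already holds $nnz_P = 5$ non-zero elements, and slice $i$ contains $nnz_i = 3$. Then
$$d_1 = |5 + 3 - 6| = 2 > d_2 = |5 - 6| = 1,$$
so slice $i$ is not added to the current partition; instead, a new partition is opened and slice $i$ becomes its first member.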
Algorithm 1 DisMASTD—Heuristic Tensor Partitioning Algorithm
Input: (1) $\mathcal{X}$, the sub-tensor to be partitioned; (2) $P_n$, the number of partitions at mode-$n$ ($n = 1, \ldots, N$).
Output: $R$, the mode-wise allocation of sub-tensor $\mathcal{X}$.
1: for $n = 1, \ldots, N$ do
2:  $nnz_{opt} \leftarrow nnz_{total} / P_n$ // optimal number of non-zero elements in each partition
3:  count the number of non-zero elements $nnz_i$ of each slice $i$ at mode-$n$ of $\mathcal{X}$
4:  initialize the partition results container $R_n \leftarrow \emptyset$, the current partition $P \leftarrow \emptyset$, and the auxiliary variables $nnz_P \leftarrow 0$ (non-zero elements allocated to $P$) and $p \leftarrow 1$ (index of the current partition)
5:  for each slice $i$ at mode-$n$ do
6:   if $nnz_P + nnz_i < nnz_{opt}$ then
7:    assign slice $i$ to $P$; $nnz_P \leftarrow nnz_P + nnz_i$
8:   else
9:    if $|nnz_P + nnz_i - nnz_{opt}| \leq |nnz_P - nnz_{opt}|$ then
10:     assign slice $i$ to $P$
11:    end if
12:    if $p < P_n$ then
13:     append $P$ to $R_n$; $P \leftarrow \emptyset$; $nnz_P \leftarrow 0$; $p \leftarrow p + 1$ // generate a new partition
14:     if slice $i$ was not assigned above, then assign slice $i$ to the new $P$ and set $nnz_P \leftarrow nnz_i$
15:    else
16:     assign the remaining slices to the last partition; break
17:    end if
18:   end if
19:  end for
20:  append $P$ to $R_n$
21: end for
22: Output the partitioned results $R = \{R_1, \ldots, R_N\}$.
Nevertheless, DisMASTD exhibits low execution efficiency because the partitioning of sub-tensors requires traversing all slices of a sub-tensor at each mode multiple times, which executes in a serial manner and incurs unacceptable time consumption. Table 2 enumerates the algorithm's overall time consumption for sub-tensors of various ranks, as well as the specific time spent counting non-zero elements. For smaller tensors, counting non-zero elements accounts for over 50% of the total partitioning time, while for larger ones, the ratio may reach 95% or even higher. In case the non-zero elements are nonuniformly distributed, the performance of DisMASTD goes from bad to worse.
To address these issues, we propose a parallel sub-tensor partitioning algorithm based on dynamic programming (PSTPA-DP), which realizes parallel implementation at both the level of a single sub-tensor and the level of a sub-tensor classification:
- (1)
At the level of a single sub-tensor
If all slices of $\mathcal{X}$ at mode-$n$ are divided into parts, the numbers of non-zero elements of all parts can be counted in a parallel manner. When all counting tasks are completed, the results are aggregated to obtain the output $nnz_i$, i.e., the number of non-zero elements in each slice of $\mathcal{X}$.
- (2)
At the level of a sub-tensor classification
Based on the update rules for projection matrices defined in Section 3.1.2, a tensor with $K$ modes is partitioned into $2^K$ sub-tensors, which are further divided into $K$ classifications $C_1, \ldots, C_K$. Because the update processes of the auxiliary matrices, which involve the sub-tensors in a classification $C_m$, are independent of each other, it is desirable to partition the sub-tensors in $C_m$ in a parallel manner. For example, there are three sub-tensors in $C_1$; we wish to partition $\mathcal{X}_{t+1}^{(211)}$, $\mathcal{X}_{t+1}^{(121)}$ and $\mathcal{X}_{t+1}^{(112)}$ simultaneously.
PSTPA-DP introduces the dynamic programming mechanism to address the problem of the nonuniform distribution of non-zero elements. It first clarifies the partitioning task by determining the modes that need to be expanded when updating the auxiliary matrices using sub-tensors in the different classifications $C_m$. Subsequently, the numbers of non-zero elements, denoted as $nnz_i$, in each slice of the selected modes are counted in a parallel manner. Finally, the average number of non-zero elements over the remaining slices (i.e., the slices that have not yet been assigned) is iteratively refreshed based on the idea of dynamic programming. The algorithm is described in greater detail by the pseudo code in Algorithm 2.
Compared with DisMASTD, PSTPA-DP is advantageous in terms of time complexity: given a sub-tensor, the serial traversal cost of the heuristic tensor partitioning algorithm is reduced in proportion to the degree of parallelism. This significant improvement in performance can be attributed to the division of the sub-tensor partitioning task into sub-tasks, as well as the allocation of each sub-task to a different workhorse, scheduled by dynamic programming.
Algorithm 2 Parallel Sub-tensor Partitioning Algorithm Based on Dynamic Programming (PSTPA-DP)
Input: (1) $C_m$: the sub-tensors in a classification; (2) $|C_m|$: the number of sub-tensors in the classification; (3) $P_n$: the number of partitions at each mode of a sub-tensor.
Output: $R$: the mode-wise partition results of each sub-tensor in $C_m$.
1: for $\mathcal{X} \in C_m$ do // partition the sub-tensors in $C_m$ in parallel
2:  divide all slices of $\mathcal{X}$ at mode-$n$ into parts;
3:  count the non-zero elements of each slice at mode-$n$ of $\mathcal{X}$ in a parallel manner, and output $nnz$;
4:  sort $nnz$ in descending order;
5:  calculate the average number of non-zero elements over the $P_n$ parts at mode-$n$: $avg \leftarrow nnz_{total} / P_n$;
6:  initialize the partition results container $R$;
7:  for each mode to be partitioned do // mode-wise partition in a parallel manner
8:   for each unassigned slice $i$, in descending order of $nnz_i$ do
9:    if $nnz_i \geq avg$ then
10:     assign slice $i$ to its own partition, remove it from the candidate list, and refresh the average $avg$ over the remaining slices;
11:    else
12:     assign slice $i$ to the current partition and remove it from the candidate list;
13:     traverse the remaining members of $nnz$:
14:      if there exists a member equal to the residual capacity of the current partition, then assign the corresponding slice to the current partition and exit the traverse;
15:      else if a pair of members together best fits the residual capacity, then assign the corresponding slices to the current partition and exit the traverse;
16:     close the current partition and refresh $avg$;
17:    end if
18:   end for
19:  end for
20: end for
21: Output the partitioned results $R$.
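As a minimal sketch of the counting step (lines 2–3 of Algorithm 2), the snippet below counts non-zero elements per mode-2 slice of a sparse third-order tensor in parallel using Scala's parallel collections; the COO representation and all names are our own illustration, not the paper's Spark implementation:

```scala
// Note: on Scala 2.13+ the .par method requires the scala-parallel-collections
// module; on Scala 2.12 (as used in this paper's experiments) it is built in.
object ParallelSliceNnz {
  // A sparse tensor in COO form: (i, j, k) -> value, only non-zeros stored.
  type Coo = Seq[((Int, Int, Int), Double)]

  // Count non-zero elements per slice along mode 2 (index j), in parallel.
  def nnzPerSlice(t: Coo, dimJ: Int): Array[Int] = {
    val counts = (0 until dimJ).par.map { j =>
      j -> t.count { case ((_, jj, _), v) => jj == j && v != 0.0 }
    }
    val out = Array.fill(dimJ)(0)
    counts.seq.foreach { case (j, c) => out(j) = c }
    out
  }

  def main(args: Array[String]): Unit = {
    val t: Coo = Seq(((0, 0, 0), 1.0), ((1, 0, 2), 2.0), ((0, 1, 1), 3.0))
    println(nnzPerSlice(t, dimJ = 3).mkString(", ")) // 2, 1, 0
  }
}
```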
3.3. Parallel Computing Method for the Incremental Tensor Tucker Decomposition
This subsection deals with the decomposition of the tensor $\mathcal{X}_{t+1}$ at timestamp $t+1$ by incremental tensor Tucker decomposition, on the basis of the decomposition of $\mathcal{X}_t$, in a parallel manner, provided the sub-tensors have been partitioned by the proposed PSTPA-DP mechanism. We propose a two-level parallel update method for the projection matrices and the core tensor: the first level updates the projection matrices or core tensor based on a parallel MTTKRP calculation strategy, and the second level independently updates different projection matrices or tensor contributions using sub-tensors in different classifications in a parallel manner.
3.3.1. Two-Level Parallel Update Method for Projection Matrices
In Section 3.1.2, the update Formulas (13)–(30) for the projection matrices in a stand-alone setting have been provided, utilizing sub-tensors of the different classifications $C_m$ to update the projection matrices. In this subsection, the update method is extended to a parallel manner. Take third-order tensors as an example. Suppose that at time $t$, a third-order tensor $\mathcal{X}_t$ has a Tucker decomposition with core tensor $\mathcal{G}_t$ and projection matrices $\mathbf{U}_t^{(1)}$, $\mathbf{U}_t^{(2)}$, $\mathbf{U}_t^{(3)}$. At time $t+1$, the projection matrices are updated using the sub-tensor $\mathcal{X}_{t+1}^{(211)}$ in the $C_1$ classification by Equation (22), where the core tensor is expanded at mode-1 to generate its matricization. The second dimension of this matricization, being the product of the remaining mode sizes, can grow unimaginably large, implying that the update of the projection matrices on a single computing device is computationally infeasible. Therefore, we implement the update process in a parallel manner.
A. The First Level of Parallel Update—Parallel MTTKRP Calculation Strategy
According to Equation (34), the projection matrices for high-order tensors can be updated by
$$\mathbf{B}^{(k)} = \mathbf{X}_{(k)} \big( \mathbf{W}^{(k)} \big)^{\dagger}.$$
Based on the properties of the pseudo-inverse, the right part of Equation (36) can be expressed as
$$\big( \mathbf{W}^{(k)} \big)^{\dagger} = \mathbf{W}^{(k)\top} \big( \mathbf{W}^{(k)} \mathbf{W}^{(k)\top} \big)^{\dagger}.$$
Considering that the projection matrices are column-orthonormal, we obtain a simplification of the Gram term; consequently, Equation (36) can be expressed as
$$\mathbf{B}^{(k)} = \mathbf{X}_{(k)}\, \mathbf{W}^{(k)\top} \big( \mathbf{W}^{(k)} \mathbf{W}^{(k)\top} \big)^{\dagger}.$$
For the third-order tensor scenario, Equation (39) degenerates to the following formula:
$$\mathbf{B}^{(1)} = \mathbf{X}^{(211)}_{(1)} \big( \mathbf{U}_t^{(3)} \otimes \mathbf{U}_t^{(2)} \big)\, \mathbf{G}_{(1)}^{\top} \big( \mathbf{G}_{(1)} \mathbf{G}_{(1)}^{\top} \big)^{\dagger}.$$
Here, $\mathbf{G}_{(1)}$ represents the expansion (matricization) of $\mathcal{G}_t$ at mode-1; the dimensionality of $\mathbf{G}_{(1)}$ is significantly smaller than that of the tensor $\mathcal{X}_{t+1}^{(211)}$. $\mathbf{U}_t^{(2)}$ and $\mathbf{U}_t^{(3)}$ represent the projection matrices at time $t$, which are transformed by Equation (40) in order to reduce the risk of the explosive computational cost incurred by direct multiplications between the core tensor and the projection matrices. Additionally, the projection matrices are broadcast to each workhorse, facilitating parallel updates of the projection matrices.
In Section 3.2, we proposed a mechanism to allocate sub-tensors uniformly among workhorses to realize parallel processing. Taking Equation (40) as an example, the update operation for the projection matrices can be divided into two independent tasks:
- (1)
$\mathbf{M} = \mathbf{X}_{(1)} (\mathbf{C} \otimes \mathbf{B})$, where $\mathbf{X}_{(1)}$ is the matricized tensor and $\mathbf{C} \otimes \mathbf{B}$ is the structured product of the projection matrices. This task is essentially the multiplication of a matricized tensor by the Khatri–Rao (Kronecker-structured) product of the projection matrices, known as the matricized tensor times Khatri–Rao product (MTTKRP) [44]. Given $\mathbf{B} \in \mathbb{R}^{J \times Q}$ and $\mathbf{C} \in \mathbb{R}^{K \times R}$, the product $\mathbf{C} \otimes \mathbf{B}$ yields a matrix of size $JK \times QR$.
Considering that the slices of the matricized tensor $\mathbf{X}_{(1)}$ have been allocated among different workhorses, the overall computation task $\mathbf{M} = \mathbf{X}_{(1)} (\mathbf{C} \otimes \mathbf{B})$ can be divided into several sub-tasks according to the allocation results of the sub-tensors, which can be computed in parallel. The row-wise computation can be described as follows:
$$\mathbf{M}(i, :) = \sum_{j=1}^{J} \sum_{k=1}^{K} x_{ijk} \left( \mathbf{C}(k, :) \otimes \mathbf{B}(j, :) \right),$$
which is derived by partitioning the column index of the matricized tensor $\mathbf{X}_{(1)}$ into two parts, with $J$ and $K$ components, respectively. It can be seen that the involved rows of $\mathbf{B}$ and $\mathbf{C}$ are determined by the indices of the non-zero elements. Therefore, each workhorse can determine the indices $j$ and $k$ according to the non-zero elements in the slices assigned to it and select the required rows of $\mathbf{B}$ and $\mathbf{C}$ for computation. At the same time, the selected rows are broadcast to all workhorses and participate in the relevant multiplication operations.
- (2)
The pseudo-inverse of $\mathbf{G}_{(1)} \mathbf{G}_{(1)}^{\top}$, which can be calculated directly, since its dimensionality is much smaller than that of $\mathbf{X}_{(1)}$.
When each workhorse finishes its tasks, the results are combined to produce the final output. Additionally, by using the proposed parallel dynamic programming-based sub-tensor partitioning mechanism (PSTPA-DP), the uniformity of the non-zero element distribution across partitions is effectively ensured, leading to a relatively balanced computation assignment among workhorses.
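The following Scala sketch illustrates the row-wise rule above for a sparse third-order tensor: each non-zero $x_{ijk}$ contributes $x_{ijk} \cdot (\mathbf{C}(k,:) \otimes \mathbf{B}(j,:))$ to row $i$ of $\mathbf{M}$. It is a single-machine illustration of the kernel each workhorse runs on its own slices, under our own naming:

```scala
object SparseMttkrp {
  type Coo = Seq[((Int, Int, Int), Double)] // (i, j, k) -> value

  // M(i,:) += x_ijk * (C(k,:) kron B(j,:)) for every stored non-zero.
  def mttkrp(t: Coo, b: Array[Array[Double]], c: Array[Array[Double]],
             dimI: Int): Array[Array[Double]] = {
    val q = b(0).length; val r = c(0).length
    val m = Array.fill(dimI, q * r)(0.0)
    for (((i, j, k), v) <- t; rr <- 0 until r; qq <- 0 until q)
      m(i)(rr * q + qq) += v * c(k)(rr) * b(j)(qq) // Kronecker column (rr, qq)
    m
  }

  def main(args: Array[String]): Unit = {
    val t: Coo = Seq(((0, 0, 1), 2.0), ((1, 1, 0), 1.0))
    val b = Array(Array(1.0, 0.0), Array(0.0, 1.0)) // J = 2, Q = 2
    val c = Array(Array(1.0), Array(3.0))           // K = 2, R = 1
    mttkrp(t, b, c, dimI = 2).foreach(row => println(row.mkString(" ")))
    // Row 0: 2 * C(1,:) kron B(0,:) = (6.0, 0.0); Row 1: 1 * C(0,:) kron B(1,:) = (0.0, 1.0)
  }
}
```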
B. The Second Level of Parallel Update—Using Sub-tensors In A Parallel Manner
Now that the parallel computation scheme has been established to update the projection matrices using a sub-tensor in the $C_1$ classification, we can derive the updating rules for the projection matrices using the other sub-tensors. First, the three auxiliary matrices should be updated in a parallel manner using the sub-tensors in the three classifications (i.e., $C_1$, $C_2$, $C_3$) in turn.
- (1)
Parallel Update of Auxiliary Matrices Using $C_1$ Classification Sub-Tensors
In addition to the sub-tensor $\mathcal{X}_{t+1}^{(211)}$, the $C_1$ classification contains the sub-tensors $\mathcal{X}_{t+1}^{(121)}$ and $\mathcal{X}_{t+1}^{(112)}$, which are leveraged to update the auxiliary matrices $\mathbf{B}^{(2)}$ and $\mathbf{B}^{(3)}$ in the same way. The update of each auxiliary matrix using sub-tensors in the $C_1$ classification involves only elements generated at timestamp $t+1$, which are independent of each other; hence, as shown by Equations (40), (43) and (44), $\mathbf{B}^{(1)}$, $\mathbf{B}^{(2)}$ and $\mathbf{B}^{(3)}$ can be updated in a parallel manner.
- (2)
Parallel Update of Auxiliary Matrices Using $C_2$ Classification Sub-Tensors
The $C_2$ classification consists of the sub-tensors $\mathcal{X}_{t+1}^{(221)}$, $\mathcal{X}_{t+1}^{(212)}$ and $\mathcal{X}_{t+1}^{(122)}$, which are used to update the three auxiliary matrices. Here, $\mathbf{B}^{(k)}_{(i_1 i_2 i_3)}$ means the auxiliary matrix $\mathbf{B}^{(k)}$ updated using the sub-tensor $\mathcal{X}_{t+1}^{(i_1 i_2 i_3)}$ in the $C_2$ classification. The situation is slightly different from that using sub-tensors in the $C_1$ classification: each auxiliary matrix has to be updated twice using different sub-tensors, and there is a dependency between these two updates, e.g., the output of Equation (45) provides an element for Equation (47). However, the elements required for the updates of different auxiliary matrices are independent of each other. Therefore, Equations (45), (46) and (48) can be calculated in a parallel manner, and so can Equations (47), (49) and (50).
- (3)
Parallel Update of Auxiliary Matrices Using $C_3$ Classification Sub-Tensors
The $C_3$ classification contains only one sub-tensor, i.e., $\mathcal{X}_{t+1}^{(222)}$, which is used to update all three auxiliary matrices. The sub-tensor in the $C_3$ classification further updates the auxiliary matrices obtained using the sub-tensors in $C_1$ and $C_2$, and the elements involved in this update process are independent as well. Therefore, Equations (51)–(53) can be calculated in a parallel manner.
So far, the update of the auxiliary matrices comes to an end. Based on the updated $\mathbf{B}^{(k)}$, the projection matrices at timestamp $t+1$ can be obtained by concatenating $\mathbf{B}^{(k)}$ with the projection matrices $\mathbf{U}_t^{(k)}$, followed by an orthogonalization of the concatenated output, as defined by Equation (30).
In order to avoid computing the pseudo-inverse of the matricized core tensor multiple times during the update process, once it has been computed for the first time (i.e., during the update using sub-tensors in the $C_1$ classification), it is stored on the Master node, where it is accessible for subsequent updates. This approach avoids redundant computations and saves memory space.
In summary, since the tensors involved in this subsection are third-order, the degree of parallelism is 3. If this approach is extended to tensors with $K$ modes, the degree of parallelism increases to $K$, which makes more efficient use of the workhorses and improves computational efficiency.
3.3.2. Two-Level Parallel Update Method for Core Tensor
From Equation (31) in Section 3.1.3, which provides the update rule for the core tensor, it can be seen that the update process consists of two computational tasks: the contribution of the old sub-tensor $\mathcal{X}_{t+1}^{(11\cdots1)} = \mathcal{X}_t$ and the contributions of the newly arriving sub-tensors. The first task involves only the core tensor $\mathcal{G}_t$ and the projection matrices, which are both small-scale, making it feasible to perform directly. The second task multiplies the sub-tensors (except for $\mathcal{X}_{t+1}^{(11\cdots1)}$) by a series of matrices and then sums all the products, which can be considered an MTTKRP computation as well. By applying the parallel MTTKRP computation method described in Section 3.3.1, we can efficiently update the core tensor in parallel. Take the sub-tensor $\mathcal{X}_{t+1}^{(211)}$ in the $C_1$ classification as an example: considering the properties of the Kronecker product and the mode multiplication of tensors, its contribution in Equation (50) can be expressed in matricized form as
$$\mathcal{X}_{t+1}^{(211)} \times_1 \mathbf{P}^{(1)\top} \times_2 \mathbf{P}^{(2)\top} \times_3 \mathbf{P}^{(3)\top} \;\;\Longleftrightarrow\;\; \mathbf{P}^{(1)\top}\, \mathbf{X}^{(211)}_{(1)} \big( \mathbf{P}^{(3)} \otimes \mathbf{P}^{(2)} \big).$$
Here, $\mathbf{X}^{(211)}_{(1)} ( \mathbf{P}^{(3)} \otimes \mathbf{P}^{(2)} )$ corresponds to the MTTKRP computation. Furthermore, the computation of the core tensor involves summing the products of the sub-tensor multiplications, and because there is no dependency between operations on different sub-tensors, such operations can be processed in a parallel manner. In this subsection, we realize the two-level parallel update method for the core tensor through two steps:
- (1)
The computations of the contributions related to different sub-tensors are performed in a parallel manner; and
- (2)
The MTTKRP parallel computation method described in Section 3.3.1 is applied within each sub-tensor.
Assuming each sub-tensor is partitioned into $P_n$ parts at mode-$n$ for processing, the degree of parallelism of the traditional single-level parallel update method is $P_n$. For a tensor with $K$ modes, the increment can be partitioned into $2^K$ sub-tensors, and the number of sub-tensors involved in the cumulative computation can reach $2^K - 1$. As a result, the degree of parallelism of this two-level parallel update method can be increased to $(2^K - 1) P_n$.
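Plugging in illustrative numbers of our own: for $K = 3$ and $P_n = 4$ partitions per mode, the single-level method runs $P_n = 4$ parallel tasks, whereas the two-level method runs up to
$$(2^K - 1) \cdot P_n = 7 \times 4 = 28$$
tasks concurrently, one MTTKRP-partition task per new sub-tensor.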
3.3.3. Parallel Update Method for High-Order Tensors
The update methods for the third-order projection matrices and core tensor described in the previous subsections can be extended to higher-order tensors. Specifically, we propose a complete parallel incremental Tucker decomposition method for high-order tensors with multi-mode growth, shown as pseudo code in Algorithm 3.
Algorithm 3 Incremental Tucker Decomposition Method for High-Order Tensors with Multi-Mode Growth
Input: (1) the Tucker decomposition result of $\mathcal{X}_t$ at timestamp $t$, including the core tensor $\mathcal{G}_t$ and the projection matrices $\mathbf{U}_t^{(k)}$; (2) the incremental sub-tensors at time $t+1$; (3) $P_n$: the number of partitions for mode-$n$ of a sub-tensor.
Output: the Tucker decomposition result of $\mathcal{X}_{t+1}$ at time $t+1$, including the updated core tensor $\mathcal{G}_{t+1}$ and projection matrices $\mathbf{U}_{t+1}^{(k)}$.
1: partition the incremental sub-tensors into $K$ classifications $C_1, \ldots, C_K$;
2: for $m = 1, \ldots, K$ do
3:  process the sub-tensors within $C_m$ in parallel;
4:  partition the sub-tensors according to Algorithm 2;
5:  update the auxiliary matrices in parallel based on Equation (52);
6: end for
7: for $k = 1, \ldots, K$ do
8:  concatenate the auxiliary matrix and the projection matrix to form a temporary projection matrix according to Equation (30);
9:  perform orthogonalization on the temporary projection matrix;
10: end for
11: update the core tensor in parallel according to Equation (53);
12: output the core tensor $\mathcal{G}_{t+1}$ and projection matrices $\mathbf{U}_{t+1}^{(k)}$ at time $t+1$.
For a tensor with $K$ modes, the incremental tensor is first partitioned into $2^K$ sub-tensors based on their positions relative to the original tensor $\mathcal{X}_t$. According to the number of digit-2 entries in the sub-index of each sub-tensor in Step (1), these sub-tensors are grouped into $K$ classifications $C_1, \ldots, C_K$. The sub-tensors in each classification are further partitioned by the proposed PSTPA-DP, as described in Step (4). Subsequently, the projection matrices are updated using the sub-tensors in $C_1, \ldots, C_K$ in turn. Using different sub-tensors within each classification, the projection matrices can be updated in a parallel manner, corresponding to the first level of parallel updating. Since the sub-tensors have already been partitioned and the MTTKRP parallel computation method is involved, the update of each projection matrix can also be performed in a parallel manner, i.e., the second level of parallel processing. The update formula for the high-order auxiliary matrices is described as follows:
$$\mathbf{B}^{(k)} = \mathbf{X}^{(i_1 \cdots i_K)}_{(k)} \big( \mathbf{V}^{(K)} \otimes \cdots \otimes \mathbf{V}^{(k+1)} \otimes \mathbf{V}^{(k-1)} \otimes \cdots \otimes \mathbf{V}^{(1)} \big)\, \mathbf{G}_{(k)}^{\top} \big( \mathbf{G}_{(k)} \mathbf{G}_{(k)}^{\top} \big)^{\dagger},$$
where $\mathbf{V}^{(j)} = \mathbf{U}_t^{(j)}$ when $i_j = 1$, and $\mathbf{V}^{(j)} = \mathbf{B}^{(j)}$ when $i_j = 2$. This formula corresponds to Step (5), i.e., the updating of the auxiliary matrices. The updated auxiliary matrices are concatenated with the projection matrices, and the concatenated result is orthogonalized, corresponding to Steps (7)–(10). At this point, the projection matrices $\mathbf{U}_{t+1}^{(k)}$ at time $t+1$ are obtained. Finally, the update of the high-order core tensor is described as follows:
$$\mathcal{G}_{t+1} = \sum_{(i_1 \cdots i_K)} \mathcal{X}_{t+1}^{(i_1 \cdots i_K)} \times_1 \mathbf{P}^{(1)\top} \times_2 \mathbf{P}^{(2)\top} \cdots \times_K \mathbf{P}^{(K)\top}.$$
Equation (53) corresponds to Step (11) of Algorithm 3. So far, the incremental Tucker decomposition method for high-order tensors with multi-mode growth is ready to yield the core tensor $\mathcal{G}_{t+1}$ and projection matrices $\mathbf{U}_{t+1}^{(k)}$ at time $t+1$.
4. Experimental Analysis
Experiments in this section are conducted to evaluate the performance of the proposed algorithm from the perspective of three tasks: tensor decomposition and reconstruction, a comparative analysis of the computational efficiency of sub-tensor partitioning algorithms, and an assessment of the parallel update efficiency of the projection matrices and core tensor. The experimental environment comprises the following hardware and software settings:
- (1)
Five DELL workstations, each equipped with an Intel® Core™ i7-9700 CPU, 16 GB of RAM, and a 2 TB hard drive. One workstation serves as the Master node and the other four as Slave nodes.
- (2)
The key algorithms are implemented using Scala 2.12.10 and executed on a Spark 3.0.3 parallel computing cluster.
- (3)
The overall system for parallel tensor computation is developed in Java 1.8.0_281.
- (4)
The simple build tool (sbt) is employed to package the project, which is deployed to the Spark cluster; the sbt-assembly plugin is introduced to facilitate the application build process.
- (5)
External tools such as scopt and netlib are incorporated for project initialization, because sbt does not inherently package them into the application.
For a comprehensive performance evaluation of the TPITTD-MG algorithm, datasets with uniformly distributed non-zero elements are generated using NumPy and used as the foundation to construct third-order tensors.
4.1. Experiments on Tensor Decomposition and Reconstruction
In this subsection, pairs of random matrices are generated and standardized. By performing operations such as translation and matrix multiplication, these matrices are assembled into three original tensors at time $t$, which are decomposed by the Tucker decomposition to obtain a core tensor and a series of projection matrices. At time $t+1$, new increments are added to each mode of each tensor, generating the evolved tensors. Subsequently, the proposed TPITTD-MG algorithm is applied to yield the core tensor and projection matrices at time $t+1$ by updating their counterparts at time $t$ in a parallel manner, using the newly arriving data.
The tensors are then reconstructed on the basis of the decomposition results at time $t+1$. An element-wise comparison between the original and reconstructed tensors is shown in Figure 6.
In Figure 6, the first row demonstrates the element-wise comparison between the first original tensor and its reconstruction at each mode, represented by blue and red curves, respectively; the second and third rows provide similar comparisons for the second and third tensors in turn. It is evident that each reconstructed tensor closely fits the original tensor, implying that the core tensor and the projection matrices obtained through TPITTD-MG preserve the information of the original tensor well, validating the accuracy of the algorithm.
4.2. Execution Efficiency Comparison of Sub-Tensor Partitioning Algorithms
This subsection verifies the time efficiency of the parallel sub-tensor partitioning algorithm based on dynamic programming (PSTPA-DP) proposed in Section 3.2 and compares it with DisMASTD, the heuristic sub-tensor partitioning algorithm shown in Algorithm 1.
Setting the parallelism degree to 4, we compare the execution efficiency of the two algorithms on synthetic datasets with uniformly distributed non-zero elements for different tensor sizes. The experimental results are depicted as blue and orange plots, respectively, in Figure 7a.
It can be seen that, given a parallelism degree of 4, there is no distinguishable difference between the performance of the two algorithms for smaller tensors. However, as the tensor size increases, the advantage of PSTPA-DP over DisMASTD becomes more and more obvious. The same phenomenon holds when the parallelism degree is set to 8, as shown in Figure 7b. Moreover, when the parallelism degree increases from 4 to 8, the efficiency of PSTPA-DP is further enhanced.
Furthermore, the efficiency of PSTPA-DP is empirically validated in scenarios characterized by a nonuniform distribution of non-zero elements, which might lead to imbalanced partitions by traditional heuristic tensor partitioning algorithms. To assess the uniformity of a partition, we introduce the coefficient of variation, defined as the ratio of the standard deviation to the mean, $c_v = \sigma / \mu$, as a metric to quantify the uniformity of the partition results. For that purpose, tensor datasets of identical volume but varying $c_v$ are synthesized, and the $c_v$'s of the partitions generated by both algorithms over datasets with different initial $c_v$'s are computed. Figure 8 illustrates the performance of the two tensor partitioning algorithms over datasets with varied initial $c_v$'s. It reveals that for datasets with low $c_v$, the two partitioning algorithms exhibit negligible differences. However, as the $c_v$ of the dataset increases, the partitions generated by PSTPA-DP demonstrate significantly higher uniformity compared with those produced by DisMASTD.
4.3. Efficiency Comparison of Parallel Projection Matrix and Core Tensor Updates
This subsection addresses the execution efficiency of updating the projection matrices and core tensor, evaluated on synthetic datasets of different scales, employing both the single-level and the two-level parallel update methods, as shown in Table 3. In the single-level case, the heuristic sub-tensor partitioning algorithm is employed, while the two-level method involves the proposed PSTPA-DP.
Table 3 enumerates the execution time of each step of TPITTD-MG on different datasets with different partition methods. The execution time of the two-level parallel update method is almost the same as that of the single-level version on small-scale datasets. For medium-scale datasets, the former performs slightly better than the latter, while for large-scale datasets, the former significantly reduces the execution time compared with the latter. These observations are consistent with the theoretical conclusions, verifying the effectiveness of the two-level parallel update algorithm.
Meanwhile, the effectiveness of TPITTD-MG is further verified by a comparison with two reference algorithms, i.e., PCSTF and DPHTD, proposed by Yao et al. [34] and Ning et al. [35], respectively, as shown in Table 4. Although TPITTD-MG operated over 4 workers provides a slightly lower speedup than DPHTD operated over 16 workers, it outperforms PCSTF significantly under the same conditions.