Article

A Two-Level Parallel Incremental Tensor Tucker Decomposition Method with Multi-Mode Growth (TPITTD-MG)

School of Cyberspace Security, Beijing University of Posts and Telecommunications, 1 Nanfeng Rd., Changping District, Beijing 102206, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(7), 1211; https://doi.org/10.3390/math13071211
Submission received: 18 October 2024 / Revised: 19 March 2025 / Accepted: 25 March 2025 / Published: 7 April 2025

Abstract

With the rapid growth of streaming data, traditional tensor decomposition methods can hardly handle massive, high-dimensional data in real time. In this paper, a two-level parallel incremental tensor Tucker decomposition method with multi-mode growth (TPITTD-MG) is proposed to address the low parallelism of existing Tucker decomposition methods on large-scale, high-dimensional, dynamically growing data. TPITTD-MG involves two mechanisms, i.e., a parallel sub-tensor partitioning mechanism based on dynamic programming (PSTPA-DP) and a two-level parallel update method for the projection matrices and core tensor. The former counts the non-zero elements in a parallel manner and uses dynamic programming to partition the sub-tensors, which ensures more uniform task allocation. The latter updates the projection matrices or the core tensor by first performing parallel updates based on the parallel MTTKRP calculation strategy, followed by a second level of parallel updates in which different projection matrices or tensors are updated independently according to the different classifications of sub-tensors. The experimental results show that, when the data scale reaches the order of tens of millions with a parallelism degree of 4, execution efficiency is improved by nearly 400% and the uniformity of the partition results is improved by more than 20% compared with existing algorithms. For third-order tensors, compared with the single-layer update algorithm, execution efficiency is improved by nearly 300%.

1. Introduction

During the past decades, the landscape of computational mathematics has been dramatically changed by the data revolution in the era of Big Data, which features the 4V's [1] (i.e., high volume, high velocity, high veracity and high variety) and stimulates the demand for new linear algebra tools to cope with massive amounts of complex data. A major challenge among them is how to model high-dimensional data without losing their inherent multi-linear structure. In other words, a provably optimal representation of high-dimensional data is crucial to offer both theoretical and practical foundations for other tasks [2]. In this sense, canonical data representation in the form of vectors or matrices is no longer feasible, because vectorization or matricization (i.e., unfolding of multi-dimensional data into vectors or matrices [3]) for processing may lead to sub-optimal performance due to a loss of relationships across various dimensions [4,5]. However, in areas flooded with high-dimensional data, such as fluid mechanics, electrodynamics and general relativity, concise mathematical frameworks can be successfully constructed for formulating and solving problems by means of tensors.
A tensor refers to an algebraic object that describes a multi-linear correlation between sets of algebraic objects. Mathematically, any tensor of the type ( r , s ) is represented by a multi-dimensional array with definite numbers of upper indices r and lower indices s to indicate an r-times contravariance and an s-times covariance, respectively. This term has a clear geometrical significance and can be formally defined based on the following general transformation formulas [6,7,8]:
$\mathcal{T} = T^{i_1 \cdots i_r}_{\;j_1 \cdots j_s}\, \mathbf{g}_{i_1} \cdots \mathbf{g}_{i_r}\, \mathbf{g}^{j_1} \cdots \mathbf{g}^{j_s} = T'^{\,i_1 \cdots i_r}_{\;j_1 \cdots j_s}\, \mathbf{g}'_{i_1} \cdots \mathbf{g}'_{i_r}\, \mathbf{g}'^{\,j_1} \cdots \mathbf{g}'^{\,j_s}$
where $\mathbf{g}_i$ ($\mathbf{g}^j$) and $\mathbf{g}'_i$ ($\mathbf{g}'^j$) represent the covariant (contravariant) base vectors in the coordinate systems $x^i$ and $x'^i$, respectively. In order to make this notoriously impenetrable concept easier to understand, Ballard et al. presented the Chicago crime data (available at www.cityofchicago.org) as an example to be represented by a tensor, whose modes correspond to 365 days, 24 h, 77 communities, and 11 crime types. Entry $\chi(i, j, k, l)$ is the number of times that crime $l$ happened in neighborhood $k$ during hour $j$ on day $i$, as shown in Figure 1 [9].
Tensor-based approaches have attracted more and more attention due to their capacity for exploiting multi-linear relationships in multi-way data, among which tensor decomposition (TD) is the art of disassembling multi-dimensional arrays into smaller parts; it finds ubiquitous applications in machine learning, neuroscience, quantum computing, signal processing, etc. [9]. Recently, TD has been employed to compress deep convolutional neural networks (DCNNs), i.e., reducing the number of parameters and the time of training a model from scratch, with the purpose of accelerating DCNNs [10,11].

1.1. A Brief Introduction to Tucker Decomposition

In 1963, Tucker [12] proposed the well-known Tucker decomposition, which represents any $d$-dimensional tensor $\chi \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ as a contraction between a $d$-dimensional core tensor $\mathcal{G} \in \mathbb{R}^{r_1 \times \cdots \times r_d}$ and $d$ projection matrices $\mathbf{U}^{(i)} \in \mathbb{R}^{n_i \times r_i}$ $(i = 1, 2, \ldots, d)$:
$\chi \approx \left\llbracket \mathcal{G}; \mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(d)} \right\rrbracket = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times \cdots \times_d \mathbf{U}^{(d)} = \mathcal{G} \times \{\mathbf{U}^{(n)}\}.$
Projection matrices are typically orthogonal and serve as the principal components of the corresponding modes. Figure 2 presents a comparison between the principle of the CP decomposition and that of the Tucker decomposition, revealing the correlation between them.
For a hyperspectral image, each projection matrix represents a different feature. For example, $\mathbf{U}^{(1)}$ and $\mathbf{U}^{(2)}$ correspond to spatial features, and $\mathbf{U}^{(3)}$ represents the spectral feature. The corresponding third-order tensor $\chi \in \mathbb{R}^{I \times J \times K}$ can be decomposed as follows:
$\chi \approx \Lambda \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C},$
where $\Lambda \in \mathbb{R}^{R \times R \times R}$ is the core tensor and $\mathbf{A} \in \mathbb{R}^{I \times R}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$ are orthogonal projection matrices. The mode-$k$ product $(k = 1, 2, 3)$ of $\Lambda$ by $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ is calculated by
$(\Lambda \times_1 \mathbf{A})(i_1, j_2, j_3) = \sum_{j_1=1}^{d_1} \Lambda(j_1, j_2, j_3)\,\mathbf{A}(i_1, j_1),$
$(\Lambda \times_2 \mathbf{B})(j_1, i_2, j_3) = \sum_{j_2=1}^{d_2} \Lambda(j_1, j_2, j_3)\,\mathbf{B}(i_2, j_2),$
$(\Lambda \times_3 \mathbf{C})(j_1, j_2, i_3) = \sum_{j_3=1}^{d_3} \Lambda(j_1, j_2, j_3)\,\mathbf{C}(i_3, j_3).$
The decomposition may also be described more directly as
$\chi(i_1, i_2, i_3) = \sum_{j_1=1}^{d_1} \sum_{j_2=1}^{d_2} \sum_{j_3=1}^{d_3} \Lambda(j_1, j_2, j_3)\,\mathbf{A}(i_1, j_1)\,\mathbf{B}(i_2, j_2)\,\mathbf{C}(i_3, j_3).$
Here, $d_i$ $(i = 1, 2, 3)$ denotes the number of columns of the projection matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$, respectively. If the $d_i$ are smaller than $I$, $J$ and $K$, the core tensor $\Lambda$ can be considered a compressed version of $\chi$.
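To make the mode-$k$ products and the element-wise formula above concrete, the following minimal NumPy sketch builds a small random core tensor and orthogonal projection matrices and reconstructs $\chi$ both ways; the dimensions and variable names are illustrative choices, not values taken from the paper.

```python
# Minimal NumPy sketch of the mode-k product and the Tucker reconstruction above.
# Dimensions (I, J, K) and the core size R are illustrative, not values from the paper.
import numpy as np

def mode_product(core, mat, mode):
    """Multiply tensor `core` by matrix `mat` along `mode`:
    contracts the mode-th axis of the core with the columns of `mat`."""
    moved = np.moveaxis(core, mode, 0)              # bring the contracted axis to the front
    unfolded = moved.reshape(core.shape[mode], -1)  # matricize
    result = mat @ unfolded                         # multiply
    new_shape = (mat.shape[0],) + moved.shape[1:]
    return np.moveaxis(result.reshape(new_shape), 0, mode)  # fold back

I, J, K, R = 6, 5, 4, 3
core = np.random.rand(R, R, R)                      # Lambda in the text
A = np.linalg.qr(np.random.rand(I, R))[0]           # orthogonal projection matrices
B = np.linalg.qr(np.random.rand(J, R))[0]
C = np.linalg.qr(np.random.rand(K, R))[0]

# chi ~= Lambda x_1 A x_2 B x_3 C
chi = mode_product(mode_product(mode_product(core, A, 0), B, 1), C, 2)

# The same entry written out element-wise, as in the explicit summation formula.
i1, i2, i3 = 2, 1, 3
entry = sum(core[j1, j2, j3] * A[i1, j1] * B[i2, j2] * C[i3, j3]
            for j1 in range(R) for j2 in range(R) for j3 in range(R))
assert np.isclose(chi[i1, i2, i3], entry)
print(chi.shape)   # (6, 5, 4)
```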

1.2. Challenges Incurred by the Streaming Data

With the advent of streaming data, i.e., data flowing continuously from a source to a sink, the problem becomes extremely challenging because data are usually generated simultaneously and at high speed by different sources such as IoT sensors. What is worse, data in real-world applications may continuously arrive at every mode, i.e., data evolve over time at all modes [13]. Take a movie recommendation system as an example: a user–movie–date tensor is constructed, with an element $(i, j, k)$ denoting the rating given by user $i$ to movie $j$ on day $k$. It is evident that each mode of this tensor is subject to a never-ending temporal evolution. Figure 3, cited from the website of the data service vendor Qlik, illustrates the technical complexity of this setting.
Due to the wide variety of sources, as well as the scale and velocity at which the data are generated, traditional data pipelines cannot keep up with near-real-time or real-time processing because they have to extract, transform, and load data before the data can be manipulated. However, in real-world applications (e.g., movie recommendation), data continuously arrive at every mode, i.e., data evolve over time at all modes [13]. Since the existing online TD methods cannot cope with this problem because of high computation and storage costs [14], a desired online tensor decomposition should be able to dynamically update the decomposition of real-time large-scale tensors while preserving its low-rank structure. In this sense, the Tucker decomposition is desirable because of its unique property, i.e., it does not need to decompose at all axes (modes) [15]. Xiao et al. [13] proposed an efficient online Tucker decomposition (eOTD) approach to track the tensor decomposition for dynamic large-scale tensors on the fly. In other words, for a given $K$-th-order tensor $\chi^{(t)} \in \mathbb{R}^{N_1^{(t)} \times \cdots \times N_K^{(t)}}$ at time $t$, it evolves at all modes, namely,
$N_k^{(t+1)} \geq N_k^{(t)}, \quad \forall k \in [K],$
to generate a tensor stream $\left\{\chi^{(t+m)} \in \mathbb{R}^{N_1^{(t+m)} \times \cdots \times N_K^{(t+m)}} : m \in M \right\}$. The Tucker decomposition is defined by
$\chi^{(t)} = \mathcal{G}^{(t)} \times \left\{\mathbf{U}_k^{(t)}\right\},$
where $\mathcal{G}^{(t)}$ and $\mathbf{U}_k^{(t)}$ denote the core tensor and projection matrices, respectively, at timestamp $t$. The snapshot tensor at timestamp $t+1$ is $\chi^{(t+1)} = \left\{\chi_{i_1 \cdots i_K}^{(t+1)}\right\}_{i_k \in [2]} \in \mathbb{R}^{N_1^{(t+1)} \times \cdots \times N_K^{(t+1)}}$, where $\chi_{1\cdots 1} = \chi^{(t)}$ (i.e., $\chi^{(t)}$ is a sub-tensor of $\chi^{(t+1)}$, denoted as $\chi^{(t)} \subseteq \chi^{(t+1)}$), and $\left\{\chi_{i_1\cdots i_K}^{(t+1)}\right\}_{(i_1,\ldots,i_K) \neq (1,\ldots,1)}$ denotes the newly arriving sub-tensors. $\chi^{(t+1)}$ can be decomposed by eOTD. More specifically, eOTD obtains the projection matrices $\mathbf{U}_k^{(t+1)}$ by updating $\mathbf{U}_k^{(t)}$ using $\mathcal{G}^{(t)}$ and $\left\{\chi_{i_1\cdots i_K}^{(t+1)}\right\}_{(i_1,\ldots,i_K)\neq(1,\ldots,1)}$, together with some auxiliary matrices obtained at timestamp $t+1$. In a similar vein, the core tensor is updated by a sum of tensors that are calculated by multiplying smaller tensors with matrices.
However, most research on tensor decomposition has been confined to standalone (i.e., single-machine) computational environments, which are not powerful enough for handling large-scale data; e.g., making solar energy affordable requires petaFLOPS of computing resources. This challenge has boosted the involvement of parallel computation [16]. Unlike serial computing, a parallel architecture breaks a complex or formidable task into a set of sub-tasks, which are allocated to different computing devices so that multiple operations are carried out simultaneously. Although parallel computing has taken a huge leap forward, its engagement with tensor decomposition is not yet mature due to several factors, including nonuniform task assignment and synchronization among different computing nodes.
In this paper, we take eOTD as the foundation and a starting point and propose a two-level parallel incremental tensor Tucker decomposition method with multi-mode growth (TPITTD-MG) to reduce computational complexity and improve computing efficiency by advocating parallelization for the incremental tensor Tucker decomposition method, particularly for decomposition of tensors evolving over time at multiple modes. Specifically, the main contributions can be summarized as follows:
(1)
Utilizing dynamic programming to achieve efficient and uniform partitioning of sub-tensors, with a tensor partitioning algorithm applied to assign sub-tensors across different task nodes, which is crucial for scaling to tens of millions of data elements. This approach not only accelerates execution by exploiting parallel computation for counting non-zero elements but also ensures more uniform task assignment.
(2)
A two-level parallel update approach for projection matrices and core tensor is designed. The first level conducts updates via a parallel MTTKRP strategy, while the second level independently updates projection matrices based on categorized sub-tensors. This structured update mechanism substantially improves update efficiency.
(3)
The experimental results demonstrate that our method outperforms existing algorithms in terms of a nearly 400% improvement in execution efficiency and a 20% enhancement in uniformity of partition at a parallelism level of 4 for large-scale datasets. Specifically, for third-order tensors, our approach shows a nearly 300% efficiency improvement compared with traditional single-layer update algorithms.
The remainder of this paper is organized as follows. Section 2 reviews the related works, and Section 3 describes the important concepts involved in this paper. In Section 4, we describe the proposed TPITTD-MG framework in detail and evaluate its performance. Finally, we conclude the paper in Section 5.

2. Related Works

    The Tucker decomposition can be considered a generalization of the matrix singular value decomposition (SVD), which has led to the development of the higher-order singular value decomposition (HOSVD) [17], a popular approach capable of yielding an orthogonal core tensor [18]. Thenceforth, numerous methods have emerged in the literature to compute the representation of a tensor. Since Tucker decomposition is essentially a best rank- ( R 1 , R 2 , , R N ) approximation of tensor χ [19],
$\underset{\mathcal{G};\,\mathbf{U}^{(1)},\ldots,\mathbf{U}^{(N)}}{\text{minimize}} \quad \left\| \chi - \left\llbracket \mathcal{G}; \mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(N)} \right\rrbracket \right\|_F^2,$
the research in this field focuses on such topics as computation and storage complexity, parallel computing, optimization methods, as well as how to exploit latent features of tensors including sparsity, orthogonality, etc.
Kapteyn and Neudecker [20] extended the capacity of the Tucker decomposition from dealing with third-order tensors to higher-order ones by employing orthogonal projection matrices. De Lathauwer et al. [14] proposed the high-order orthogonal iteration (HOOI) method, which is proven to be more efficient in computing projection matrices, compared with HOSVD. Furthermore, Elden and Savas [21] introduced the Newton–Grassmann algorithm for the Tucker decomposition of third-order tensors, offering a more efficient and less iterative approach by constraining the projection matrix to the Grassmann manifold space. Fang S et al. [22] developed a Bayesian streaming sparse Tucker decomposition method (BASS-Tucker) capable of preventing overfitting and improving the interpretability by automatically selecting meaningful projection interactions.
Thanh et al. [23] provided a contemporary and comprehensive survey on streaming tensor decomposition, in which streaming Tucker decomposition algorithms are broadly grouped into three main classifications, i.e., online tensor dictionary learning, tensor subspace tracking, and multi-aspect streaming Tucker decomposition. Among them, the first two classifications are dedicated to two specific scenarios of single-aspect streaming Tucker decomposition; the last classification is for multi-aspect streaming tensors.
On the other hand, some researchers push the enhancement of Tucker decomposition in direct, randomized, and iterative ways. Vannieuwenhoven et al. [24] presented an alternative strategy, called the sequentially truncated HOSVD (ST-HOSVD), to the truncated higher-order singular value decomposition (T-HOSVD) proposed by De Lathauwer et al., with the purpose of reducing computational complexity and improving the approximation error. Kressner et al. [25] attained a fast algorithm by combining fast matrix–vector products that exploit the structure of Hadamard products with iterative methods, such as the Lanczos method and randomized algorithms. Che et al. [26] took randomized algorithms as powerful tools for scientific computing because such algorithms are usually faster and more robust compared with standard deterministic algorithms. They designed an adaptive randomized algorithm to compute a low multilinear-rank approximation of the Tucker decomposition for tensors with unknown multilinear rank and analyzed its probabilistic error bound under certain assumptions. Chachlakis et al. [27] designed L1-Tucker, i.e., a reformulation of the standard Tucker decomposition in which the outlier-responsive L2-norm is substituted by the sturdier L1-norm, followed by the proposal of the L1-norm higher-order orthogonal iterations (L1-HOOI) algorithm for the approximate solution of L1-Tucker.
As for parallel computing, the research focuses on sparse tensors, parallel platforms with multicore nodes, data dependencies, and the combination of randomization and parallelization for efficient communication schemes, etc. [16,28]. Oh et al. [29] developed the parallel Tucker decomposition algorithm P-TUCKER for sparse tensors, emphasizing memory savings and enhanced computational efficiency with scalability. Ballard G and Zhang J et al. [30] explored parallel computing frameworks for tensor Tucker decomposition. The advent of GigaTensor [31] marked a significant advance in tensor decomposition algorithms for large-scale tensors, followed by Park N et al. [32], who integrated these developments into the Bigtensor algorithm library. Acer S et al. [33] focused on tensor partitioning algorithms to further enhance decomposition efficiency. Minster et al. [16] proposed two randomized algorithms based on HOSVD and ST-HOSVD, respectively, and offered a new parallel implementation for the structured randomized sketch. Their key idea is to perform randomized sketches with Kronecker-structured random matrices in order to reduce computational complexity. Meanwhile, a probabilistic error analysis of the proposed algorithms was provided. As for distributed and parallel algorithms for Tucker decomposition, Yao et al. [34] and Ning et al. [35] proposed implementations based on Apache Spark, which are especially effective in processing high-dimensional tensors with excellent speedups and high robustness.
The aforementioned tensor decomposition algorithms are executed in parallel, which can significantly improve computational efficiency and save memory space. However, these algorithms are all full-scale parallel computation algorithms for tensor decomposition. When streaming data are dealt with, the full tensor must be continuously rebuilt and decomposed, resulting in a large amount of redundant computation, which deteriorates computational efficiency. Researchers attempt to address this issue in different ways.
Some researchers focus on incremental tensor decomposition. Nion et al. [36] proposed an adaptive CP decomposition algorithm for third-order tensors. Phan et al. [37] studied incremental tensor decomposition using the idea of block computation. Sarwar et al. [38] investigated the processing of dynamically growing data streams. However, these algorithms can only cope with tensors evolving at a single mode, whereas many real-world applications require tensor decomposition methods for tensors evolving at multiple modes, i.e., multi-aspect streaming tensors. Song et al. [39] proposed the multi-aspect streaming tensor (MAST) algorithm to achieve incremental tensor CP decomposition. Yang et al. [40] proposed the DisMASTD method to realize incremental tensor CP decomposition in a parallel manner. As for Tucker decomposition of the multi-aspect streaming tensors, Xiao et al. [13] proposed the eOTD algorithm, which is designed for running in a standalone mode.
Other researchers adopt the online mechanism. Zhou et al. [41] proposed the online CP algorithm for incremental tensor decomposition. Yu et al. [42] studied the incremental tensor Tucker decomposition and proposed an online low-rank tensor learning algorithm. In 2023, Rontogiannis et al. [43] proposed the online block-term decomposition reweighted least squares (O-BTD-RLS) method, which employs a sliding (truncated) windowing scheme whose window duration is either chosen according to the dynamics of the system under study or adapted online.
In summary, although incremental tensor decomposition algorithms for processing streaming data have become a research hotspot, the research in the literature is rather limited, and work on parallel decomposition of tensors with multi-mode growth is particularly insufficient. Therefore, this article takes parallel Tucker decomposition of incremental tensors with multi-mode growth as its primary research object.

3. Two-Level Parallel Incremental Tensor Tucker Decomposition Method with Multi-Mode Growth (TPITTD-MG)

In order to make the descriptions clear, we denote tensors, matrices, vectors and scalars by calligraphic letters (e.g., $\chi$), uppercase bold letters (e.g., $\mathbf{U}$), lowercase bold letters (e.g., $\mathbf{x}$) and lowercase normal font (e.g., $a$), respectively. Important notations involved in this paper are listed in Table 1.

3.1. Technical Details of TPITTD-MG

This subsection deals with the incremental Tucker decomposition of tensors evolving over time at all modes. It starts with a brief introduction to an off-the-shelf incremental Tucker decomposition method, i.e., the efficient online Tucker decomposition (eOTD) proposed by Xiao et al. [13], followed by our work on improving its computational efficiency by means of parallel computing. When the parallel incremental Tucker decomposition is implemented, it is a prerequisite to allocate the partitioned sub-tensors across different workhorses. However, current tensor partitioning algorithms often encounter such problems as low execution efficiency and nonuniform partitions. To address these issues, we propose a parallel sub-tensor allocation method based on dynamic programming, which significantly improves computational efficiency and achieves a more uniform sub-tensor allocation, facilitating an optimal balance of the computation load among different workhorses. Furthermore, to address the issue of low parallelism in updating the projection matrices and the core tensor in current incremental Tucker decomposition methods, a two-level parallel update method is proposed.

3.1.1. A Brief Introduction to the Efficient Online Tucker Decomposition (eOTD) Approach

Given a tensor $\chi^{(t)} \in \mathbb{R}^{N_1^{(t)} \times \cdots \times N_K^{(t)}}$ and its Tucker decomposition snapshot $\chi^{(t)} = \mathcal{G}^{(t)} \times \{\mathbf{U}_k^{(t)}\}$ at timestamp $t$, when the tensor has evolved to $\chi^{(t+1)} \in \mathbb{R}^{N_1^{(t+1)} \times \cdots \times N_K^{(t+1)}}$ at timestamp $t+1$, the eOTD approach starts with a partition of $\chi^{(t+1)}$ into $2^K$ sub-tensors $\left\{\chi_{i_1\cdots i_K}^{(t+1)}\right\}_{i_k \in [2]}$ such that $\chi_{1\cdots 1}^{(t+1)} = \chi^{(t)}$. These sub-tensors are then employed to efficiently implement the Tucker decomposition $\chi^{(t+1)} = \mathcal{G}^{(t+1)} \times \{\mathbf{U}_k^{(t+1)}\}$ by updating the core tensor $\mathcal{G}^{(t)}$ and the projection matrices $\mathbf{U}_k^{(t)}$.
The $2^K$ sub-tensors obtained are classified into $K$ categories according to their geometric positions relative to $\chi_{1\cdots 1}^{(t+1)}$, denoted as $\{C_m\}_{m=1}^{K}$. Take a third-order tensor $\chi^{(t)} \in \mathbb{R}^{N_1^{(t)} \times N_2^{(t)} \times N_3^{(t)}}$ as an example: it evolves to $\chi^{(t+1)}$ at timestamp $t+1$, which is split into $2^3$ sub-tensors $\chi_{i_1 i_2 i_3}^{(t+1)}$ $(i_1, i_2, i_3 \in [2])$, as shown in Figure 4. Among the 8 sub-tensors, $\chi_{111}^{(t+1)}$ refers to $\chi^{(t)}$, and the other 7 sub-tensors, i.e., $\left\{\chi_{i_1 i_2 i_3}^{(t+1)}\right\}_{(i_1 i_2 i_3) \neq (111)}$, correspond to the newly arriving data, i.e., the incremental part of $\chi^{(t+1)}$ during the time range $[t, t+1]$, which can be categorized into three groups:
$C_1 = \left\{\chi_{211}^{(t+1)}, \chi_{121}^{(t+1)}, \chi_{112}^{(t+1)}\right\}, \quad C_2 = \left\{\chi_{221}^{(t+1)}, \chi_{212}^{(t+1)}, \chi_{122}^{(t+1)}\right\}, \quad C_3 = \left\{\chi_{222}^{(t+1)}\right\},$
according to their geometric positions relative to $\chi_{111}^{(t+1)}$: elements in $C_1$ can be considered the rear, right and downstairs neighbors of $\chi_{111}^{(t+1)}$, respectively; namely, each of them shares a face with $\chi_{111}^{(t+1)}$. Similarly, elements in $C_2$ share an edge and the only element in $C_3$ shares a vertex with $\chi_{111}^{(t+1)}$. This geometric understanding generalizes to higher-order scenarios, where it remains useful for tensor analysis despite being less intuitive.
From the perspective of contribution to one or more auxiliary matrices, the three groups defined by Equation (26) can also be understood according to the number of sub-indices equal to 2, which indicates the number of auxiliary matrices that can be updated. For example, $\chi_{112}^{(t+1)}$ can be used for updating $\mathbf{U}_3$, $\chi_{122}^{(t+1)}$ can be used for updating both $\mathbf{U}_2$ and $\mathbf{U}_3$, while $\chi_{222}^{(t+1)}$ contributes to $\mathbf{U}_1$, $\mathbf{U}_2$ and $\mathbf{U}_3$. A small sketch of this index-based grouping is given below.
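As an illustration of this index-based grouping (and not code from the paper), the following short Python sketch enumerates the $2^K$ sub-tensor indices and collects them into the classifications $C_m$ by counting the sub-indices equal to 2; the function name is ours.

```python
# Illustrative sketch: group the 2^K sub-tensor indices into classifications C_m
# by the number of sub-indices equal to 2.
from itertools import product
from collections import defaultdict

def classify_subtensors(K):
    """Return {m: [index tuples]} where m = number of indices equal to 2 (1 <= m <= K).
    The all-ones index (1,...,1) is the old tensor chi^(t) and is excluded."""
    classes = defaultdict(list)
    for idx in product((1, 2), repeat=K):
        m = idx.count(2)
        if m > 0:                      # skip chi_{1...1}^{(t+1)} = chi^{(t)}
            classes[m].append(idx)
    return dict(classes)

print(classify_subtensors(3))
# {1: [(1,1,2), (1,2,1), (2,1,1)], 2: [(1,2,2), (2,1,2), (2,2,1)], 3: [(2,2,2)]}
```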

3.1.2. Update Methods for Projection Matrices

Every sub-tensor $\chi_{i_1\cdots i_K}^{(t+1)} \in C_m$ will be used $m$ times for the update of the auxiliary matrices $\mathbf{U}_k$. According to Corollary II (i.e., Block Tensor and Matrix Multiplication II defined in Section 3), $\chi_{i_1\cdots i_K}^{(t+1)} = \mathcal{G}^{(t)} \times \{\mathbf{U}_k\}$, where
$\mathbf{U}_k = \begin{cases} \mathbf{U}_k^{(t)}, & \text{if } i_k = 1, \\ \mathbf{U}_k \text{ (the auxiliary matrix)}, & \text{if } i_k = 2. \end{cases}$
Let all the sub-indices to be updated form a set $S = \{i_{k_1}, \ldots, i_{k_m}\}$. For each sub-index $i_{k_j} \in S$, the update rule for the auxiliary matrix $\mathbf{U}_{k_j}$ is defined as
$\mathbf{U}_{k_j}^{\text{new}} \leftarrow \alpha\, \mathbf{U}_{k_j}^{\text{old}} + (1-\alpha)\left(\chi_{i_1\cdots i_K}^{(t+1)}\right)_{(k_j)} \left(\left(\mathcal{G}_{i_{k_j}}\right)_{(k_j)}\right)^{\dagger},$
where $\mathcal{G}_{i_{k_j}} \equiv \mathcal{G}_{i_1,\ldots,\cdot,\ldots,i_K}$ is $\mathcal{G}^{(t)}$ multiplied by the corresponding matrices $\mathbf{U}_k$ at all modes except mode $k_j$, and † denotes the pseudo-inverse of a matrix.
Subsequently, $\mathbf{U}_k$ is used to augment $\mathbf{U}_k^{(t)}$ to $\left[\mathbf{U}_k^{(t)\,T}\ \ \mathbf{U}_k^{T}\right]^{T} \in \mathbb{R}^{N_k^{(t+1)} \times I_k}$, which is orthogonalized and normalized by means of the modified Gram–Schmidt (MGS) procedure to calculate $\mathbf{U}_k^{(t+1)}$ at timestamp $t+1$ by updating $\mathbf{U}_k^{(t)}$:
$\mathbf{U}_k^{(t+1)} = \left[\mathbf{U}_{k,1}^{(t+1)\,T}\ \ \mathbf{U}_{k,2}^{(t+1)\,T}\right]^{T}.$
Namely, $\mathbf{U}_k^{(t+1)}$ is also the concatenation of two matrices: $\mathbf{U}_{k,1}^{(t+1)} \in \mathbb{R}^{N_k^{(t)} \times I_k}$ and $\mathbf{U}_{k,2}^{(t+1)} \in \mathbb{R}^{(N_k^{(t+1)} - N_k^{(t)}) \times I_k}$.
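The following NumPy sketch illustrates the generic update rule and the subsequent augmentation step under the assumption of dense arrays; a QR factorization stands in for the modified Gram–Schmidt pass, and all function names are illustrative rather than part of eOTD.

```python
# Sketch of the auxiliary-matrix update and the augmentation/orthonormalization step,
# assuming dense NumPy arrays with compatible shapes; QR stands in for modified Gram-Schmidt.
import numpy as np

def unfold(tensor, mode):
    """Mode-k matricization of a tensor."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def update_auxiliary(U_old, X_sub, G_aux, mode, alpha=0.5):
    """U_new <- alpha * U_old + (1 - alpha) * X_(mode) @ pinv(G_aux_(mode)).
    `G_aux` is the auxiliary core tensor (e.g. G_{.,1,1}) matching this sub-tensor."""
    return alpha * U_old + (1.0 - alpha) * unfold(X_sub, mode) @ np.linalg.pinv(unfold(G_aux, mode))

def augment_and_orthonormalize(U_t, U_aux):
    """Stack U_k^(t) on top of the auxiliary U_k and re-orthonormalize the columns,
    yielding U_k^(t+1) split into its old-row and new-row blocks."""
    V = np.vstack([U_t, U_aux])          # N_k^(t+1) x I_k
    Q, _ = np.linalg.qr(V)               # orthonormal columns
    return Q[:U_t.shape[0], :], Q[U_t.shape[0]:, :]   # U_{k,1}^(t+1), U_{k,2}^(t+1)
```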
Specifically, for the scenario of third-order tensors, each of the $2^3$ split sub-tensors $\chi_{i_1 i_2 i_3}^{(t+1)}$ $(i_1, i_2, i_3 \in [2])$ of $\chi^{(t+1)}$ can be expressed as $\chi_{i_1 i_2 i_3}^{(t+1)} = \mathcal{G}^{(t)} \times \{\mathbf{U}_k\}$ according to Corollary II, where $\mathbf{U}_k$ has the same definition as in Equation (27). Let
$\mathcal{G}_{i_1, i_2, i_3} = \mathcal{G}^{(t)} \times \{\mathbf{U}_k\};$
in case a sub-index is missing (marked by a dot), a set of auxiliary tensors can be defined by
$\mathcal{G}_{\cdot,1,1} = \mathcal{G}^{(t)} \times_2 \mathbf{U}_2^{(t)} \times_3 \mathbf{U}_3^{(t)}, \quad \mathcal{G}_{1,\cdot,1} = \mathcal{G}^{(t)} \times_1 \mathbf{U}_1^{(t)} \times_3 \mathbf{U}_3^{(t)}, \quad \mathcal{G}_{1,1,\cdot} = \mathcal{G}^{(t)} \times_1 \mathbf{U}_1^{(t)} \times_2 \mathbf{U}_2^{(t)},$
$\mathcal{G}_{\cdot,1,2} = \mathcal{G}^{(t)} \times_2 \mathbf{U}_2^{(t)} \times_3 \mathbf{U}_3, \quad \mathcal{G}_{1,\cdot,2} = \mathcal{G}^{(t)} \times_1 \mathbf{U}_1^{(t)} \times_3 \mathbf{U}_3, \quad \mathcal{G}_{1,2,\cdot} = \mathcal{G}^{(t)} \times_1 \mathbf{U}_1^{(t)} \times_2 \mathbf{U}_2,$
$\mathcal{G}_{\cdot,2,1} = \mathcal{G}^{(t)} \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3^{(t)}, \quad \mathcal{G}_{2,\cdot,1} = \mathcal{G}^{(t)} \times_1 \mathbf{U}_1 \times_3 \mathbf{U}_3^{(t)}, \quad \mathcal{G}_{2,1,\cdot} = \mathcal{G}^{(t)} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2^{(t)},$
$\mathcal{G}_{\cdot,2,2} = \mathcal{G}^{(t)} \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3, \quad \mathcal{G}_{2,\cdot,2} = \mathcal{G}^{(t)} \times_1 \mathbf{U}_1 \times_3 \mathbf{U}_3, \quad \mathcal{G}_{2,2,\cdot} = \mathcal{G}^{(t)} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2,$
i.e., they are multiplications of $\mathcal{G}^{(t)}$ with the remaining matrices, where a sub-index equal to 1 selects the old projection matrix $\mathbf{U}_k^{(t)}$ and a sub-index equal to 2 selects the auxiliary matrix $\mathbf{U}_k$. Equations (31)–(34) lay the theoretical underpinning for defining the update rules of the auxiliary matrices using every sub-tensor $\chi_{i_1 i_2 i_3}^{(t+1)}$ $(i_1, i_2, i_3 \in [2])$, category by category, in $\{C_m\}_{m=1}^{3}$.
A. Update Projection Matrices $\mathbf{U}_k$ with Sub-Tensors in the $C_1$ Classification
As described previously, each sub-tensor in $C_1$, i.e., $\chi_{211}^{(t+1)}$, $\chi_{121}^{(t+1)}$ and $\chi_{112}^{(t+1)}$, is used only once, for updating the corresponding auxiliary matrix $\mathbf{U}_1$, $\mathbf{U}_2$ and $\mathbf{U}_3$, respectively:
$\mathbf{U}_1 \leftarrow \left(\chi_{211}^{(t+1)}\right)_{(1)} \left(\left(\mathcal{G}_{\cdot,1,1}\right)_{(1)}\right)^{\dagger},$
$\mathbf{U}_2 \leftarrow \left(\chi_{121}^{(t+1)}\right)_{(2)} \left(\left(\mathcal{G}_{1,\cdot,1}\right)_{(2)}\right)^{\dagger},$
$\mathbf{U}_3 \leftarrow \left(\chi_{112}^{(t+1)}\right)_{(3)} \left(\left(\mathcal{G}_{1,1,\cdot}\right)_{(3)}\right)^{\dagger},$
where $\mathcal{G}_{\cdot,1,1}$, $\mathcal{G}_{1,\cdot,1}$ and $\mathcal{G}_{1,1,\cdot}$ are defined by Equation (31).
After the auxiliary matrices have been updated with the sub-tensors in $C_1$, they should be further refined by successive updates with the sub-tensors in the $C_2$ and $C_3$ classifications, because sub-tensors in different classifications make different contributions to one or more auxiliary matrices.
B. Update Projection Matrices $\mathbf{U}_k$ with Sub-Tensors in the $C_2$ Classification
As defined by Equation (26), each sub-tensor in $C_2 = \left\{\chi_{221}^{(t+1)}, \chi_{212}^{(t+1)}, \chi_{122}^{(t+1)}\right\}$ has two sub-indices equal to 2, implying it will be used twice, to update the two corresponding auxiliary matrices. As for $\chi_{221}^{(t+1)}$, it is employed to update both $\mathbf{U}_1$ and $\mathbf{U}_2$. The update of $\mathbf{U}_1$ requires the auxiliary matrix $\mathbf{U}_2$ calculated based on the sub-tensors in the $C_1$ classification, i.e., $\mathcal{G}_{\cdot,2,1} = \mathcal{G}^{(t)} \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3^{(t)}$, which can be expressed by
$\mathbf{U}_1^{(221)} \leftarrow \alpha \mathbf{U}_1 + (1-\alpha)\left(\chi_{221}^{(t+1)}\right)_{(1)}\left(\left(\mathcal{G}_{\cdot,2,1}\right)_{(1)}\right)^{\dagger},$
$\mathbf{U}_2^{(221)} \leftarrow \alpha \mathbf{U}_2 + (1-\alpha)\left(\chi_{221}^{(t+1)}\right)_{(2)}\left(\left(\mathcal{G}_{2,\cdot,1}\right)_{(2)}\right)^{\dagger}.$
Here, $\alpha$ is a forgetting factor, indicating how much information should be inherited from the previous step, and $\mathbf{U}_k^{(i_1 i_2 i_3)}$ denotes the auxiliary matrix $\mathbf{U}_k$ updated using sub-tensor $\chi_{i_1 i_2 i_3}^{(t+1)}$. The same principle goes for $\chi_{212}^{(t+1)}$ and $\chi_{122}^{(t+1)}$:
$\mathbf{U}_1^{(212)} \leftarrow \alpha \mathbf{U}_1 + (1-\alpha)\left(\chi_{212}^{(t+1)}\right)_{(1)}\left(\left(\mathcal{G}_{\cdot,1,2}\right)_{(1)}\right)^{\dagger},$
$\mathbf{U}_3^{(212)} \leftarrow \alpha \mathbf{U}_3 + (1-\alpha)\left(\chi_{212}^{(t+1)}\right)_{(3)}\left(\left(\mathcal{G}_{2,1,\cdot}\right)_{(3)}\right)^{\dagger},$
$\mathbf{U}_2^{(122)} \leftarrow \alpha \mathbf{U}_2 + (1-\alpha)\left(\chi_{122}^{(t+1)}\right)_{(2)}\left(\left(\mathcal{G}_{1,\cdot,2}\right)_{(2)}\right)^{\dagger},$
$\mathbf{U}_3^{(122)} \leftarrow \alpha \mathbf{U}_3 + (1-\alpha)\left(\chi_{122}^{(t+1)}\right)_{(3)}\left(\left(\mathcal{G}_{1,2,\cdot}\right)_{(3)}\right)^{\dagger}.$
C. Update Projection Matrices $\mathbf{U}_k$ with the Sub-Tensor in the $C_3$ Classification
Since the $C_3$ classification contains only one sub-tensor, $\chi_{222}^{(t+1)} = \mathcal{G}^{(t)} \times \{\mathbf{U}_k\}$, with all three sub-indices equal to 2, $\chi_{222}^{(t+1)}$ is associated with all auxiliary matrices:
$\mathbf{U}_1^{(222)} \leftarrow \alpha \mathbf{U}_1 + (1-\alpha)\left(\chi_{222}^{(t+1)}\right)_{(1)}\left(\left(\mathcal{G}_{\cdot,2,2}\right)_{(1)}\right)^{\dagger},$
$\mathbf{U}_2^{(222)} \leftarrow \alpha \mathbf{U}_2 + (1-\alpha)\left(\chi_{222}^{(t+1)}\right)_{(2)}\left(\left(\mathcal{G}_{2,\cdot,2}\right)_{(2)}\right)^{\dagger},$
$\mathbf{U}_3^{(222)} \leftarrow \alpha \mathbf{U}_3 + (1-\alpha)\left(\chi_{222}^{(t+1)}\right)_{(3)}\left(\left(\mathcal{G}_{2,2,\cdot}\right)_{(3)}\right)^{\dagger}.$
D. Update Projection Matrices $\mathbf{U}_k^{(t+1)}$
Given the projection matrices $\mathbf{U}_k^{(t)}$ and the auxiliary matrices $\mathbf{U}_k$ obtained above, the projection matrices $\left\{\mathbf{U}_k^{(t+1)}\right\}_{k \in [K]}$ at timestamp $t+1$ are updated according to the rule
$\left(\mathbf{V}_k^{(t+1)}\right)^{T} = \left[\left(\mathbf{U}_k^{(t)}\right)^{T}\ \ \mathbf{U}_k^{T}\right], \quad \mathbf{V}_k^{(t+1)} \in \mathbb{R}^{N_k^{(t+1)} \times I_k}\ (k \in [K]),$
which involves a concatenation of $\mathbf{U}_k^{(t)}$ and $\mathbf{U}_k$ at the $k$-th mode. Since Equation (47) cannot guarantee that the generated $\mathbf{V}_k^{(t+1)}$ is unitary, the modified Gram–Schmidt (MGS) algorithm is performed on $\mathbf{V}_k^{(t+1)}$ to produce the orthonormal projection matrix $\mathbf{U}_k^{(t+1)}$.

3.1.3. Update Method for Core Tensor

Based on the projection matrices $\mathbf{U}_k^{(t+1)} \in \mathbb{R}^{N_k^{(t+1)} \times I_k}$, the core tensor $\mathcal{G}^{(t)}$ at timestamp $t$, as well as the newly arriving data $\left\{\chi_{i_1\cdots i_K}^{(t+1)}\right\}_{(i_1,\ldots,i_K)\neq(1,\ldots,1)}$ at timestamp $t+1$, and since the $\mathbf{U}_k^{(t+1)}$ are unitary matrices, the core tensor $\mathcal{G}^{(t+1)}$ at timestamp $t+1$ can be updated by
$\mathcal{G}^{(t+1)} = \chi^{(t+1)} \times \left\{\mathbf{U}_k^{(t+1)\,T}\right\} = \mathcal{G}^{(t)} \times \left\{\mathbf{U}_{k,1}^{(t+1)\,T}\mathbf{U}_k^{(t)}\right\} + \sum_{(i_1,\ldots,i_K)\neq(1,\ldots,1)} \chi_{i_1,\ldots,i_K}^{(t+1)} \times \left\{\mathbf{U}_{k,i_k}^{(t+1)\,T}\right\},$
according to the Second Corollary. The update procedure incorporates the splitting of $\mathbf{U}_k^{(t+1)}$ into two matrices: $\mathbf{U}_{k,1}^{(t+1)} \in \mathbb{R}^{N_k^{(t)} \times I_k}$ and $\mathbf{U}_{k,2}^{(t+1)} \in \mathbb{R}^{(N_k^{(t+1)} - N_k^{(t)}) \times I_k}$.
In summary, eOTD is efficient in dealing with the tensor stream $\left\{\chi^{(t+m)} \in \mathbb{R}^{N_1^{(t+m)} \times \cdots \times N_K^{(t+m)}} : m \in M\right\}$ defined in Section 1.2 and can be applied to large-scale applications because it merely involves tensor–matrix multiplications and matrix pseudo-inverse operations, which are cheap compared with computationally expensive SVD operations. Meanwhile, eOTD is established on a solid theoretical foundation, i.e., two corollaries proposed by Xiao et al. [13], which makes it a powerful tool to track the Tucker decomposition of dynamic tensors $\chi^{(t+m)}$ with an arbitrary number of modes.

3.2. Parallel Sub-Tensor Partitioning Algorithm

As seen in Section 3.1, it is an indispensable task to split a tensor into sub-tensors, which act as the fundamental elements for updating the core tensor and the projection matrices. Despite the relatively satisfactory efficiency provided by the eOTD approach, its standalone operation paradigm becomes an obstacle on the path to successful application in larger-scale scenarios, motivating the involvement of parallel decomposition methods. Generally, for parallel computing, the strategy of allocating the obtained sub-tensors uniformly among different computing resources is a major concern. For ease of presentation, we refer to each computing resource in a distributed environment as a workhorse.
In this subsection, we start from the sub-tensors obtained by eOTD and exploit their latent structure to enhance the computing efficiency of parallel decomposition and reduce the bottleneck cost incurred by fundamental tensor-related operations, e.g., the matricized tensor times Khatri–Rao product (MTTKRP [44]) in the Tucker decomposition. Since only the non-zero elements of tensors contribute to the MTTKRP cost, an appropriate partition criterion for sub-tensors is that each partitioned group contains an equal number of non-zero tensor elements.
For example, given a sub-tensor $\chi_{sub} \in \mathbb{R}^{3 \times 5 \times 2}$, we attempt to partition $\chi_{sub}$ at mode-2, i.e., the second dimension of $3 \times 5 \times 2$, producing five slices in total, as shown by Figure 5. The non-zero elements in the five slices are $\{4, 2, 3, 4, 8, 9\}$, $\{5, 6, 6\}$, $\{71, 12, 3, 43\}$, $\{15, 3, 8\}$ and $\{7, 1\}$, respectively. Suppose the task in this scenario is to allocate the five slices among three workhorses (i.e., Workhorse I, Workhorse II and Workhorse III); the optimal strategy can be summarized as follows:
  • To allocate the first slice to Workhorse I, the corresponding six non-zero elements form Group I;
  • To allocate the second and fourth slices to Workhorse II, the corresponding six non-zero elements form Group II; and
  • To allocate the third and fifth slices to Workhorse III, the corresponding six non-zero elements form Group III.
As a result, all the non-zero elements are uniformly partitioned into three groups and each workhorse is allocated six non-zero elements, i.e., an extremely uniform allocation is realized. However, such an ideal situation as has occurred in this simple scenario may not always arise in practice, because splitting a sub-tensor into slices at a mode is analogous to the task of uniformly splitting a set of positive integers, a well-known NP-hard problem due to its complexity and the impracticality of finding an optimal solution [45]. In order to address this issue, Yang et al. [40] introduced a heuristic algorithm, DisMASTD, which counts the number of non-zero elements (NNZ) in each slice at mode-$n$ of the sub-tensor and sums all these counts to yield a total number, $TotalNNZ$. Let $p_n$ represent the desired number of partitions at mode-$n$; the optimal number of non-zero elements in each partition, named $Target$, is calculated as $TotalNNZ / p_n$. The workflow of DisMASTD is described as follows and is also given in pseudocode in Algorithm 1; a compact Python sketch of the heuristic follows Algorithm 1.
  • Initially, DisMASTD traverses the $I_n$ slices at mode-$n$ of a sub-tensor $\chi_{sub} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ (i.e., $\chi_{sub}$ has $N$ modes in total, and mode-$n$ has $I_n$ slices, $1 \leq n \leq N$), counting the non-zero elements in each slice, denoted as $\{res_i^{(n)}\}_{i=1}^{I_n}$.
  • DisMASTD traverses mode-$n$ of $\chi_{sub}$ again, slice by slice, greedily assigning the traversed slices to the current partition $P_{num}^{(n)}$ $(num = 1, \ldots, p_n)$ until its total number of non-zero elements (denoted as $sum$) reaches $Target$.
  • When the $i$-th slice is being traversed and the $sum$ of the current partition reaches $Target$, DisMASTD compares the deviations from $Target$ incurred by two different strategies:
    i. assigning slice $i$ to the current partition, which incurs a deviation of $|Target - sum|$, and
    ii. not assigning slice $i$ to the current partition, which incurs a deviation of $|Target - (sum - a_i^{(n)}.NNZ)|$.
    The strategy with the smaller deviation is accepted, i.e., if $|Target - sum| < |Target - (sum - a_i^{(n)}.NNZ)|$, slice $i$ is assigned to the current partition, i.e., the $num$-th partition $P_{num}^{(n)}$. Otherwise, go to the next step.
  • Generate the next partition $P_{num+1}^{(n)}$ for mode-$n$, and assign slice $i$ to it.
  • DisMASTD then traverses the $(i+1)$-th slice of mode-$n$ and decides whether this slice should be assigned to the partition $P_{num+1}^{(n)}$. The loop iterates until the $sum$ of partition $P_{p_n - 1}^{(n)}$ reaches $Target$.
  • The residual slices at mode-$n$ are assigned to the last partition $P_{p_n}^{(n)}$.
  • When the partitions at all modes are completed, DisMASTD outputs the partition result $\{P_p^{(n)}\}_{p=1}^{p_n}$, $n = 1, 2, \ldots, N$.
Algorithm 1 DisMASTD—Heuristic Tensor Partitioning Algorithm
Input: (1) $\chi_{sub} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the sub-tensor to be partitioned;
           (2) $\{p_n\}_{n=1}^{N}$, the number of partitions at mode-$n$.
Output: $\{P_p^{(n)}\}_{p=1}^{p_n}$, $n = 1, 2, \ldots, N$, the mode-wise allocation of sub-tensor $\chi_{sub}$.
  1: for $1 \leq n \leq N$ do
  2:     $Target = TotalNNZ / p_n$    // optimal number of non-zero elements in each partition
  3:     $\{res_i^{(n)}\}_{i=1}^{I_n}$    // count the non-zero elements in each slice at mode-$n$ of $\chi_{sub}$
  4:     $\{P_p^{(n)}\}_{n=1}^{N} \leftarrow \varnothing$    // initialize the partition results container
  5:     $P \leftarrow \varnothing$    // initialize a temporary container for intermediate partition results
  6:     $num \leftarrow 1$, $sum \leftarrow 0$    // $sum$ is the total number of non-zero elements allocated to the current partition; $num$ denotes the current ($num$-th) partition
  7:     for $1 \leq i \leq I_n$ do
  8:         $sum\ {+}{=}\ a_i^{(n)}.NNZ$
  9:         if $sum < Target$ then
 10:             assign slice $i$ to $P$
 11:         else
 12:             if $|Target - sum| < |Target - (sum - a_i^{(n)}.NNZ)|$ then
 13:                 assign slice $i$ to $P$
 14:             end if
 15:             if $num < p_n$ then
 16:                 $P_{num}^{(n)} = P$    // generate a new partition $P_{num}^{(n)}$ to store the content of $P$
 17:                 $P \leftarrow \varnothing$, $num\ {+}{=}\ 1$, $sum \leftarrow 0$
 18:             else
 19:                 assign the remaining slices to $P_{p_n}^{(n)}$, break
 20:             end if
 21:         end if
 22:     end for
 23: end for
 24: Output the partition results $\{P_p^{(n)}\}_{p=1}^{p_n}$, $n = 1, 2, \ldots, N$.
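The greedy pass of Algorithm 1 can be sketched in a few lines of Python operating directly on the per-slice non-zero counts; this is a simplified illustration (contiguous slice groups, absolute deviations for the tie-break), not the authors' implementation.

```python
# A compact Python sketch of the greedy heuristic in Algorithm 1, written against the
# per-slice non-zero counts res = [res_1, ..., res_{I_n}] for one mode (illustrative only).
def heuristic_partition(res, p_n):
    """Greedily group consecutive slice indices into p_n partitions whose
    non-zero counts are as close as possible to Target = TotalNNZ / p_n."""
    total = sum(res)
    target = total / p_n
    partitions, current, acc = [], [], 0
    for i, nnz in enumerate(res):
        if len(partitions) == p_n - 1:          # last partition takes the remaining slices
            current.append(i)
            continue
        acc += nnz
        if acc < target:
            current.append(i)
        else:
            # keep slice i only if that leaves a smaller deviation from Target
            if abs(target - acc) < abs(target - (acc - nnz)):
                current.append(i)
                partitions.append(current)
                current, acc = [], 0
            else:
                partitions.append(current)
                current, acc = [i], nnz
    partitions.append(current)
    return partitions

print(heuristic_partition([6, 3, 4, 3, 2], 3))   # -> [[0], [1, 2], [3, 4]]
```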
Nevertheless, DisMASTD exhibits low execution efficiency because the partitioning of sub-tensors requires traversing all slices of a sub-tensor at each mode multiple times, which is executed in a serial manner and incurs unacceptable time consumption. Table 2 enumerates the algorithm's overall time consumption for sub-tensors of various ranks, as well as the specific time spent counting non-zero elements. For smaller tensors, counting non-zero elements accounts for over 50% of the total partitioning time, while for larger ones the ratio may reach 95% or even higher. In case the non-zero elements are nonuniformly distributed, the performance of DisMASTD goes from bad to worse.
To address these issues, we propose a parallel sub-tensor partitioning mechanism based on dynamic programming (PSTPA-DP), which realizes a parallel implementation at both the level of an individual sub-tensor and the level of a sub-tensor classification:
(1)
From the level of a sub-tensor
If all slices of $\chi_{sub} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ at mode-$n$ are divided into $p_n$ parts, the numbers of non-zero elements of all $p_n$ parts can be counted in a parallel manner. When all $p_n$ counting tasks are completed, the results are aggregated to obtain the output $\{res_i^{(n)}\}_{i=1}^{I_n}$, i.e., the number of non-zero elements in each slice of $\chi_{sub}$.
(2)
From the level of a sub-tensor classification
Based on the update rules for projection matrices defined in Section 3.1.2, a tensor with $K$ modes is partitioned into $2^K$ sub-tensors, which are further divided into $K$ classifications $\{C_m\}_{m=1}^{K}$. Because the update processes of the auxiliary matrices, which involve the $R^{(m)}$ sub-tensors $\{\chi_{sub,i}^{(m)}\}_{1 \leq i \leq R^{(m)}}$ in classification $C_m$, are independent of each other, it is desirable to partition the $R^{(m)}$ sub-tensors in $C_m$ (i.e., $\{\chi_{sub,i}^{(m)}\}_{i=1}^{R^{(m)}}$) in a parallel manner. For example, there are three sub-tensors in $C_2 = \left\{\chi_{221}^{(t+1)}, \chi_{212}^{(t+1)}, \chi_{122}^{(t+1)}\right\}$; we wish to partition $\chi_{221}^{(t+1)}$, $\chi_{212}^{(t+1)}$ and $\chi_{122}^{(t+1)}$ simultaneously.
PSTPA-DP introduces a dynamic programming mechanism to address the problem of the nonuniform distribution of non-zero elements. It first clarifies the partitioning task by determining the modes that need to be expanded when updating the auxiliary matrices $\mathbf{U}$ using the sub-tensors $\chi_{sub}^{(m)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ in the different classifications $\{C_m\}_{m \in [K]}$. Subsequently, the numbers of non-zero elements, denoted as $res_i^{(n)}$, in each slice at the selected modes are counted in a parallel manner. Finally, the average value $\overline{AVG}$ of the number of non-zero elements in the remaining slices (i.e., the slices that have not yet been assigned) is iteratively refreshed based on the idea of dynamic programming. The algorithm is described in greater detail by the pseudocode in Algorithm 2; a simplified Python sketch of its two parallel ingredients is given after Algorithm 2.
Compared with DisMASTD, PSTPA-DP is advantageous in terms of time complexity. Given a sub-tensor $\chi_{sub} \in \mathbb{R}^{I \times J \times K}$, the heuristic tensor partitioning algorithm has a time complexity of $O(IJK)$, which is reduced to $O(IJK / p_n)$ by PSTPA-DP. The significant improvement in performance can be attributed to the division of the sub-tensor partitioning task into $p_n$ sub-tasks, as well as the allocation of each sub-task to a different workhorse, scheduled by dynamic programming.
Algorithm 2 Parallel Sub-tensor Partitioning Algorithm Based on Dynamic Programming (PSTPA-DP)
Input: (1) $\{\chi_{sub,i}^{(m)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}\}_{1 \leq i \leq R^{(m)}}$: sub-tensors in the $C_m$ $(m = 1, \ldots, K)$ classification;
           (2) $R^{(m)}$: number of sub-tensors in the $C_m$ classification;
           (3) $\{p_n\}_{n=1}^{N}$: number of partitions at each mode of $\chi_{sub}$.
Output: $\{P_p^{(n)}\}_{p=1}^{p_n}$ $(n = 1, 2, \ldots, N)$: mode-wise partition results of each $\chi_{sub}$ in $C_m$.
  1: for $\chi_{sub,i}^{(m)} \in C_m$ $(1 \leq i \leq R^{(m)})$ do    // partition the $R^{(m)}$ sub-tensors $\chi_{sub,i}^{(m)} \in C_m$ in parallel
  2:     divide all slices of $\chi_{sub,i}^{(m)}$ at mode-$n$ into $p_n$ parts;
  3:     count the non-zero elements in each slice at mode-$n$ of $\chi_{sub,i}^{(m)}$ in a parallel manner, and output $\{res_j^{(n)}\}_{j=1}^{I_n}$;
  4:     sort $\{res_j^{(n)}\}_{j=1}^{I_n}$ in descending order;
  5:     calculate the average number of non-zero elements over the $p_n$ parts at mode-$n$ by $\overline{AVG} = \sum_{j=1}^{I_n} res_j^{(n)} / p_n$;
  6:     $\{P_p^{(n)}\}_{p=1}^{p_n} \leftarrow \varnothing$ $(n = 1, 2, \ldots, N)$;    // initialization of $\{P_p^{(n)}\}_{p=1}^{p_n}$
  7:     for $n \leftarrow 1 : N$ do    // mode-wise partition in a parallel manner
  8:         for $i \leftarrow 1 : I_n$ do    // slice-by-slice assignment at mode-$n$
  9:             if $res_i^{(n)} > \overline{AVG}$ then
 10:                 $P_p^{(n)} \leftarrow res_i^{(n)}$;    // assign $res_i^{(n)}$ to $P_p^{(n)}$
 11:                 $I_n = I_n - 1$;    // remove $res_i^{(n)}$
 12:                 refresh the average $\overline{AVG} = \sum_{i=1}^{I_n} res_i^{(n)} / p_n$;
 13:             else
 14:                 $P_p^{(n)} \leftarrow res_i^{(n)}$;    // assign $res_i^{(n)}$ to $P_p^{(n)}$ and remove $res_i^{(n)}$
 15:                 $\delta = \overline{AVG} - res_i^{(n)}$;
 16:                 traverse the other members in $\{res_i^{(n)}\}_{i=1}^{I_n}$:
 17:                 if there exists a member $M_1 \in \{res_i^{(n)}\}_{i=1}^{I_n}$ equal to $\delta$ then
 18:                     $P_p^{(n)} \leftarrow M_1$;    // assign $M_1$ to the current partition $P_p^{(n)}$
 19:                     exit the traverse;
 20:                 else if there exist members satisfying $a > \delta > b$ $(a, b \in \{res_i^{(n)}\}_{i=1}^{I_n})$ then
 21:                     if $(\delta - b) > (a - \delta)$ then
 22:                         $P_p^{(n)} \leftarrow a$;    // assign $a$ to the current partition $P_p^{(n)}$
 23:                         exit the traverse;
 24:                     else
 25:                         $P_p^{(n)} \leftarrow b$;    // assign $b$ to the current partition $P_p^{(n)}$
 26:                         find $\min(\delta - a)$;
 27:                         $P_p^{(n)} \leftarrow a$;    // assign $a$ to the current partition $P_p^{(n)}$
 28:                         exit the traverse;
 29:                     end if
 30:                 end if
 31:             end if
 32:         end for
 33:     end for
 34: end for
 35: Output the partition results $\{P_p^{(n)}\}_{p=1}^{p_n}$, $n = 1, 2, \ldots, N$.
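The two parallel ingredients of PSTPA-DP, counting non-zero elements per slice in parallel and assigning slices against a refreshed average, can be sketched as follows for a sub-tensor stored in COO form; this is a simplified illustration of Algorithm 2, and the helper names are ours.

```python
# Illustrative sketch of two ingredients of PSTPA-DP for one mode of a COO sub-tensor:
# (i) counting non-zero elements per slice in parallel, (ii) assigning slices so that each
# partition stays close to the running average of the remaining work (simplified from Algorithm 2).
import numpy as np
from multiprocessing import Pool

def _count_chunk(args):
    mode_indices, I_n = args
    return np.bincount(mode_indices, minlength=I_n)           # NNZ per slice for this chunk

def count_nnz_per_slice(coo_indices, mode, I_n, p_n):
    """coo_indices: (nnz, N) integer array of non-zero coordinates."""
    chunks = np.array_split(coo_indices[:, mode], p_n)         # p_n counting tasks
    with Pool(processes=p_n) as pool:
        partial = pool.map(_count_chunk, [(c, I_n) for c in chunks])
    return np.sum(partial, axis=0)                             # aggregate res_1 .. res_{I_n}

def assign_slices(res, p_n):
    """Greedy assignment driven by a refreshed average of the remaining slice counts."""
    remaining = sorted(range(len(res)), key=lambda i: res[i], reverse=True)
    partitions = []
    for p in range(p_n):
        if not remaining:
            partitions.append([])
            continue
        avg = sum(res[i] for i in remaining) / (p_n - p)        # refreshed average
        bucket = [remaining.pop(0)]                             # start with the largest slice
        load = res[bucket[0]]
        while remaining and load < avg:
            # pick the remaining slice that brings the load closest to the average
            nxt = min(remaining, key=lambda i: abs(load + res[i] - avg))
            if abs(load + res[nxt] - avg) >= abs(load - avg):
                break                                           # adding anything only hurts
            bucket.append(nxt)
            remaining.remove(nxt)
            load += res[nxt]
        partitions.append(bucket)
    partitions[-1].extend(remaining)                            # leftovers go to the last partition
    return partitions

if __name__ == "__main__":
    # Per-slice counts of the Figure 5 example are [6, 3, 4, 3, 2]; with p_n = 3 the
    # assignment yields three groups of six non-zero elements each, as in Section 3.2.
    print(assign_slices(np.array([6, 3, 4, 3, 2]), 3))          # -> [[0], [2, 4], [1, 3]]
```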

3.3. Parallel Computing Method for the Incremental Tensor Tucker Decomposition

This subsection deals with the decomposition of the tensor at timestamp $t+1$, i.e., $\chi^{(t+1)}$, by the incremental tensor Tucker decomposition on the basis of $\chi^{(t)} = \mathcal{G}^{(t)} \times \{\mathbf{U}_k^{(t)}\}$ in a parallel manner, provided the sub-tensors $\left\{\chi_{i_1\cdots i_K}^{(t+1)}\right\}_{(i_1,\ldots,i_K)\neq(1,\ldots,1)}$ have been partitioned by the proposed PSTPA-DP mechanism. We propose a two-level parallel update method for the projection matrices and core tensor: the first level updates the projection matrices or core tensor based on the parallel MTTKRP calculation strategy, and the second level updates different projection matrices or tensors independently, using sub-tensors in different classifications, in a parallel manner.

3.3.1. Two-Level Parallel Update Method for Projection Matrices

In Section 3.1.2, the update Formulas (13)–(30) for the projection matrices in a stand-alone setting were provided, utilizing sub-tensors of the different classifications $\{C_m\}_{m=1}^{K}$ to update the projection matrices. In this subsection, the update method is extended to a parallel manner. Take third-order tensors as an example. Suppose that, at time $t$, a third-order tensor $\chi^{(t)} \in \mathbb{R}^{I \times J \times K}$ has a Tucker decomposition with the core tensor $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ and projection matrices $\mathbf{U}_1 \in \mathbb{R}^{I \times P}$, $\mathbf{U}_2 \in \mathbb{R}^{J \times Q}$, $\mathbf{U}_3 \in \mathbb{R}^{K \times R}$. At time $t+1$, the projection matrices are updated using the sub-tensor $\chi_{211}^{(t+1)}$ in the $C_1$ classification by Equation (22), i.e., $\mathbf{U}_1 \leftarrow \left(\chi_{211}^{(t+1)}\right)_{(1)}\left(\left(\mathcal{G}_{\cdot,1,1}\right)_{(1)}\right)^{\dagger}$, where the auxiliary core tensor $\mathcal{G}_{\cdot,1,1} \in \mathbb{R}^{P \times J \times K}$ is expanded at mode-1 to generate $\left(\mathcal{G}_{\cdot,1,1}\right)_{(1)} \in \mathbb{R}^{P \times JK}$. The second dimension can be enormous; e.g., in case $J = K = 10^6$, $J \times K$ may reach the magnitude of $10^{12}$, implying that updating the projection matrices on a single computing device is computationally infeasible. Therefore, we consider implementing the update process in a parallel manner.
A. The First Level of Parallel Update—Parallel MTTKRP Calculation Strategy
According to Equation (34), the projection matrices for high-order tensors can be updated by
$\mathbf{U}^{(k)} \leftarrow \left(\chi_{i_1\cdots i_K}^{(t+1)}\right)_{(k)} \left(\left(\mathcal{G}_{i_k}\right)_{(k)}\right)^{\dagger}.$
Based on the properties of the pseudo-inverse (in particular, $\mathbf{A}^{\dagger} = \mathbf{A}^{T}\left(\mathbf{A}\mathbf{A}^{T}\right)^{\dagger}$), the right-hand side of Equation (36) can be expressed as
$\left(\chi_{i_1\cdots i_K}^{(t+1)}\right)_{(k)} \left(\mathcal{G}_{i_k}\right)_{(k)}^{\dagger} = \left(\chi_{i_1\cdots i_K}^{(t+1)}\right)_{(k)} \left(\mathcal{G}_{i_k}\right)_{(k)}^{T} \left(\left(\mathcal{G}_{i_k}\right)_{(k)} \left(\mathcal{G}_{i_k}\right)_{(k)}^{T}\right)^{\dagger}.$
Considering $\left(\mathcal{G}_{i_k}\right)_{(k)} = \mathbf{G}_{(k)} \left(\mathbf{U}^{(K)} \otimes \cdots \otimes \mathbf{U}^{(k+1)} \otimes \mathbf{U}^{(k-1)} \otimes \cdots \otimes \mathbf{U}^{(1)}\right)^{T}$ and the orthonormality of the projection matrices, we obtain
$\left(\chi_{i_1\cdots i_K}^{(t+1)}\right)_{(k)} \left(\mathcal{G}_{i_k}\right)_{(k)}^{\dagger} = \left(\chi_{i_1\cdots i_K}^{(t+1)}\right)_{(k)} \left(\mathbf{U}^{(K)} \otimes \cdots \otimes \mathbf{U}^{(k+1)} \otimes \mathbf{U}^{(k-1)} \otimes \cdots \otimes \mathbf{U}^{(1)}\right) \mathbf{G}_{(k)}^{T} \left(\mathbf{G}_{(k)} \mathbf{G}_{(k)}^{T}\right)^{\dagger} = \left(\chi_{i_1\cdots i_K}^{(t+1)}\right)_{(k)} \left(\mathbf{U}^{(K)} \otimes \cdots \otimes \mathbf{U}^{(k+1)} \otimes \mathbf{U}^{(k-1)} \otimes \cdots \otimes \mathbf{U}^{(1)}\right) \mathbf{G}_{(k)}^{\dagger}.$
Consequently, Equation (36) can be expressed as
$\mathbf{U}_k \leftarrow \left(\chi_{i_1\cdots i_K}^{(t+1)}\right)_{(k)} \left(\mathbf{U}^{(K)} \otimes \cdots \otimes \mathbf{U}^{(k+1)} \otimes \mathbf{U}^{(k-1)} \otimes \cdots \otimes \mathbf{U}^{(1)}\right) \mathbf{G}_{(k)}^{\dagger}.$
For the third-order tensor scenario, Equation (39) reduces to the following formula:
$\mathbf{U}_1 \leftarrow \left(\chi_{211}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_3^{(t)} \otimes \mathbf{U}_2^{(t)}\right) \left(\mathcal{G}^{(t)}_{(1)}\right)^{\dagger}.$
Here, $\mathcal{G}^{(t)}_{(1)}$ represents the expansion of $\mathcal{G}^{(t)}$ at mode-1. The dimensionality of $\mathcal{G}^{(t)}$ is significantly smaller than that of the tensor $\chi^{(t)}$.
$\mathbf{U}_2^{(t)}$ and $\mathbf{U}_3^{(t)}$ represent the projection matrices at time $t$. The transformation in Equation (40) reduces the risk of explosive computational cost incurred by direct multiplications between the core tensor and the projection matrices. Additionally, the projection matrices are broadcast to each workhorse, facilitating parallel updates of the projection matrices.
In Section 3.2, we proposed a mechanism to allocate sub-tensors uniformly among workhorses to realize parallel processing. Taking Equation (40) as an example, the update operation for the projection matrices can be divided into two independent tasks:
(1)
$\left(\chi_{211}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_3^{(t)} \otimes \mathbf{U}_2^{(t)}\right)$, where $\left(\chi_{211}^{(t+1)}\right)_{(1)}$ is the matricized tensor and $\mathbf{U}_3^{(t)} \otimes \mathbf{U}_2^{(t)}$ is the Khatri–Rao product of the projection matrices. This task is essentially the multiplication of a matricized tensor by the Khatri–Rao product of the projection matrices, known as the matricized tensor times Khatri–Rao product (MTTKRP) [44]. Given $\mathbf{U}_2^{(t)} \in \mathbb{R}^{J \times Q}$ and $\mathbf{U}_3^{(t)} \in \mathbb{R}^{K \times R}$, the product $\mathbf{U}_3^{(t)} \otimes \mathbf{U}_2^{(t)}$ yields a matrix of size $\mathbb{R}^{JK \times QR}$.
Considering the situation in which the slices of the matricized tensor $\left(\chi_{211}^{(t+1)}\right)_{(1)}$ have been allocated among different workhorses, the overall computation task $\left(\chi_{211}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_3^{(t)} \otimes \mathbf{U}_2^{(t)}\right)$ can be divided into several sub-tasks according to the allocation results of the sub-tensors, which can be computed in parallel. Let $\mathbf{M} = \left(\chi_{211}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_3^{(t)} \otimes \mathbf{U}_2^{(t)}\right)$; the row-wise computation can be described as follows:
$\mathbf{M}(i,:) = \sum_{z=1}^{JK} \left(\chi_{211}^{(t+1)}\right)_{(1)}(i, z)\; \left[\mathbf{U}_3^{(t)}\!\left(\lceil z/J \rceil, :\right) \otimes \mathbf{U}_2^{(t)}\!\left(\left((z-1) \bmod J\right) + 1, :\right)\right] = \sum_{k=1}^{K} \sum_{j=1}^{J} \chi_{211}^{(t+1)}(i, j, k)\; \left[\mathbf{U}_3^{(t)}(k, :) \otimes \mathbf{U}_2^{(t)}(j, :)\right],$
which is obtained by re-indexing the $JK$ columns of the matricized tensor $\left(\chi_{211}^{(t+1)}\right)_{(1)}$ by the index pair $(j, k)$, with $J$ and $K$ components, respectively. It can be seen that the involved rows of $\mathbf{U}_2^{(t)}$ and $\mathbf{U}_3^{(t)}$ are determined by the indices of $\left(\chi_{211}^{(t+1)}\right)_{(1)}$. Therefore, each workhorse can determine the indices $j$ and $k$ according to the non-zero elements in the slices $\{P_i^{(n)}\}$ assigned to it and select the required rows of $\mathbf{U}_2^{(t)}$ and $\mathbf{U}_3^{(t)}$ for computation; a code sketch of this per-workhorse row-wise accumulation is given after this list. At the same time, the selected rows are broadcast to all workhorses and participate in the relevant multiplication operations.
(2)
The pseudo-inverse of the matricized core tensor $\mathcal{G}^{(t)}_{(1)}$, which can be calculated directly, since its dimensionality is much smaller than that of $\chi^{(t)}$.
When each workhorse finishes its tasks, the results are combined to produce the final output. Additionally, by using the proposed parallel dynamic programming-based sub-tensor partitioning mechanism (PSTPA-DP), the uniformity of non-zero element distribution is effectively ensured, leading to a relatively balanced computation assignment among workhorses.
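A per-workhorse version of the row-wise accumulation above might look as follows for locally held COO non-zeros; the function signature is illustrative, and the final combination with the pseudo-inverse of the matricized core tensor is indicated only in comments.

```python
# Sketch of the row-wise computation above on one workhorse, assuming the workhorse holds
# only its assigned non-zero entries of chi_211^(t+1) in COO form (i, j, k, value).
import numpy as np

def local_mttkrp(coo, vals, U2, U3, n_rows):
    """Accumulate M(i,:) += value * (U3(k,:) kron U2(j,:)) over the local non-zeros.
    coo: (nnz, 3) int array of (i, j, k); vals: (nnz,) array; returns an n_rows x (Q*R) block."""
    Q, R = U2.shape[1], U3.shape[1]
    M_local = np.zeros((n_rows, Q * R))
    for (i, j, k), v in zip(coo, vals):
        # only row j of U2 and row k of U3 are needed for this non-zero element
        M_local[i, :] += v * np.kron(U3[k, :], U2[j, :])
    return M_local

# The master then sums the per-workhorse blocks, M = sum of all M_local contributions,
# and finishes the update with the (small) pseudo-inverse of the matricized core tensor,
# e.g. U1_new = M @ np.linalg.pinv(G1) with G1 the mode-1 unfolding of G^(t).
```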
B. The Second Level of Parallel Update—Using Sub-tensors In A Parallel Manner
Now that the parallel computation has been established to update the projection matrices using the sub-tensor $\chi_{211}^{(t+1)}$ in the $C_1$ classification, we can derive the update rules for the projection matrices using the other sub-tensors. Firstly, the three auxiliary matrices are updated in a parallel manner using the sub-tensors in the three classifications (i.e., $C_1$, $C_2$, $C_3$) in turn.
(1)
Parallel Update of Auxiliary Matrices Using C 1 Classification Sub-Tensors
In addition to the sub-tensor $\chi_{211}^{(t+1)}$, the $C_1$ classification contains the sub-tensors $\chi_{121}^{(t+1)}$ and $\chi_{112}^{(t+1)}$, which are leveraged to update the auxiliary matrices:
$\mathbf{U}_2 \leftarrow \left(\chi_{121}^{(t+1)}\right)_{(2)} \left(\mathbf{U}_3^{(t)} \otimes \mathbf{U}_1^{(t)}\right) \left(\mathcal{G}^{(t)}_{(2)}\right)^{\dagger},$
$\mathbf{U}_3 \leftarrow \left(\chi_{112}^{(t+1)}\right)_{(3)} \left(\mathbf{U}_2^{(t)} \otimes \mathbf{U}_1^{(t)}\right) \left(\mathcal{G}^{(t)}_{(3)}\right)^{\dagger}.$
The update of each auxiliary matrix using sub-tensors in the $C_1$ classification involves only elements generated at timestamp $t$, which are independent of each other; as shown by Equations (40), (43) and (44), $\mathbf{U}_1$, $\mathbf{U}_2$ and $\mathbf{U}_3$ can therefore be updated in a parallel manner.
(2)
Parallel Update of Auxiliary Matrices Using C 2 Classification Sub-Tensors
The $C_2$ classification consists of the sub-tensors $\chi_{221}^{(t+1)}$, $\chi_{212}^{(t+1)}$ and $\chi_{122}^{(t+1)}$, which are used to update the three auxiliary matrices:
$\mathbf{U}_1^{(221)} \leftarrow \alpha \mathbf{U}_1 + (1-\alpha) \left(\chi_{221}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_3^{(t)} \otimes \mathbf{U}_2\right) \left(\mathcal{G}^{(t)}_{(1)}\right)^{\dagger},$
$\mathbf{U}_2^{(221)} \leftarrow \alpha \mathbf{U}_2 + (1-\alpha) \left(\chi_{221}^{(t+1)}\right)_{(2)} \left(\mathbf{U}_3^{(t)} \otimes \mathbf{U}_1\right) \left(\mathcal{G}^{(t)}_{(2)}\right)^{\dagger},$
$\mathbf{U}_1^{(212)} \leftarrow \alpha \mathbf{U}_1^{(221)} + (1-\alpha) \left(\chi_{212}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_3 \otimes \mathbf{U}_2^{(t)}\right) \left(\mathcal{G}^{(t)}_{(1)}\right)^{\dagger},$
$\mathbf{U}_3^{(212)} \leftarrow \alpha \mathbf{U}_3 + (1-\alpha) \left(\chi_{212}^{(t+1)}\right)_{(3)} \left(\mathbf{U}_2^{(t)} \otimes \mathbf{U}_1\right) \left(\mathcal{G}^{(t)}_{(3)}\right)^{\dagger},$
$\mathbf{U}_2^{(122)} \leftarrow \alpha \mathbf{U}_2^{(221)} + (1-\alpha) \left(\chi_{122}^{(t+1)}\right)_{(2)} \left(\mathbf{U}_3 \otimes \mathbf{U}_1^{(t)}\right) \left(\mathcal{G}^{(t)}_{(2)}\right)^{\dagger},$
$\mathbf{U}_3^{(122)} \leftarrow \alpha \mathbf{U}_3^{(212)} + (1-\alpha) \left(\chi_{122}^{(t+1)}\right)_{(3)} \left(\mathbf{U}_2 \otimes \mathbf{U}_1^{(t)}\right) \left(\mathcal{G}^{(t)}_{(3)}\right)^{\dagger}.$
Here, $\mathbf{U}_k^{(i_1 i_2 i_3)}$ denotes the auxiliary matrix $\mathbf{U}_k$ updated using sub-tensor $\chi_{i_1 i_2 i_3}^{(t+1)}$ in the $C_2$ classification.
The situation is slightly different from that using sub-tensors in the $C_1$ classification: each auxiliary matrix has to be updated twice, using different sub-tensors, and there is a dependency between these two updates; e.g., the output of Equation (45) provides an input for Equation (47). However, the elements required for the updates of different auxiliary matrices are independent of each other. Therefore, Equations (45), (46) and (48) can be calculated in a parallel manner, and so can Equations (47), (49) and (50).
(3)
Parallel Update of Auxiliary Matrices Using C 3 Classification Sub-Tensors
The $C_3$ classification contains only one sub-tensor, i.e., $\chi_{222}^{(t+1)}$, which is used to update the auxiliary matrices:
$\mathbf{U}_1^{(222)} \leftarrow \alpha \mathbf{U}_1^{(212)} + (1-\alpha) \left(\chi_{222}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_3 \otimes \mathbf{U}_2\right) \left(\mathcal{G}^{(t)}_{(1)}\right)^{\dagger},$
$\mathbf{U}_2^{(222)} \leftarrow \alpha \mathbf{U}_2^{(122)} + (1-\alpha) \left(\chi_{222}^{(t+1)}\right)_{(2)} \left(\mathbf{U}_3 \otimes \mathbf{U}_1\right) \left(\mathcal{G}^{(t)}_{(2)}\right)^{\dagger},$
$\mathbf{U}_3^{(222)} \leftarrow \alpha \mathbf{U}_3^{(122)} + (1-\alpha) \left(\chi_{222}^{(t+1)}\right)_{(3)} \left(\mathbf{U}_2 \otimes \mathbf{U}_1\right) \left(\mathcal{G}^{(t)}_{(3)}\right)^{\dagger}.$
The sub-tensor in C 3 classification further updates the auxiliary matrices obtained using sub-tensors in C 2 , and the elements involved in this update process are independent as well. Therefore, Equations (51)–(53) can be calculated in a parallel manner.
So far, the update of the auxiliary matrices $\mathbf{U}_k$ has been completed. Based on the updated $\mathbf{U}_k$, the projection matrices $\mathbf{U}_k^{(t+1)}$ at timestamp $t+1$ can be obtained by concatenating $\mathbf{U}_k$ and the projection matrices $\mathbf{U}_k^{(t)}$, followed by an orthogonalization of the concatenated output $\left(\mathbf{V}_k^{(t+1)}\right)^{T}$, as defined by Equation (30).
In order to avoid computing the pseudo-inverse $\left(\mathcal{G}^{(t)}_{(k)}\right)^{\dagger}$ of the matricized core tensor multiple times during the update process, once it has been computed for the first time (i.e., during the updates using sub-tensors in the $C_1$ classification), it is stored in the master node and is accessible for subsequent updates. This avoids redundant computation and saves memory space.
In summary, since the tensors involved in this subsection are third-order, the degree of parallelism is $3 p_n$. If this approach is extended to tensors with $K$ modes $(K > 3)$, the degree of parallelism is increased to $K p_n$, which makes more efficient use of the workhorses and improves computational efficiency.
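The dependency structure just described can be captured by a simple scheduling sketch: each callable below is a placeholder for one of the update Equations (40)–(53), and the waves reproduce the ordering stated above (the thread pool and dictionary layout are our own illustrative choices).

```python
# Scheduling sketch (placeholders only) of the second-level parallelism: each task stands for
# one of the update equations above; tasks inside a wave have no mutual dependencies.
from concurrent.futures import ThreadPoolExecutor

def run_wave(tasks, pool):
    """Submit one wave of independent updates and wait for all of them."""
    return [f.result() for f in [pool.submit(t) for t in tasks]]

def update_auxiliary_matrices(upd):
    """`upd[(k, sub)]` is a callable performing the update of U_k with sub-tensor `sub`,
    e.g. upd[(1, '221')] realizes Equation (45); all callables are assumed thread-safe."""
    with ThreadPoolExecutor() as pool:
        # C1: the three updates touch different auxiliary matrices -> one wave
        run_wave([upd[(1, '211')], upd[(2, '121')], upd[(3, '112')]], pool)
        # C2: Equations (45), (46), (48) first, then (47), (49), (50)
        run_wave([upd[(1, '221')], upd[(2, '221')], upd[(3, '212')]], pool)
        run_wave([upd[(1, '212')], upd[(2, '122')], upd[(3, '122')]], pool)
        # C3: chi_222 refines all three auxiliary matrices independently
        run_wave([upd[(1, '222')], upd[(2, '222')], upd[(3, '222')]], pool)
```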

3.3.2. Two-Level Parallel Update Method for Core Tensor

From Equation (31) in Section 3.1.3, which provides the update rule for the core tensor, it can be seen that the update process consists of two computational tasks, i.e., $\mathcal{G}^{(t)} \times \left\{\mathbf{U}_{k,1}^{(t+1)\,T} \mathbf{U}_k^{(t)}\right\}$ and $\sum_{(i_1, i_2, i_3) \neq (1,1,1)} \chi_{i_1 i_2 i_3}^{(t+1)} \times \left\{\mathbf{U}_{k,i_k}^{(t+1)\,T}\right\}$.
The first task involves only the core tensor $\mathcal{G}^{(t)}$ and the matrices $\mathbf{U}_{k,1}^{(t+1)\,T} \mathbf{U}_k^{(t)}$, which are both small scale, making it feasible to perform the task directly. The second task multiplies the sub-tensors (except for $\chi_{111}^{(t+1)}$) by a series of matrices $\mathbf{U}_{k,i_k}^{(t+1)\,T}$ and then sums all the products, which can be considered an MTTKRP computation as well. By applying the parallel MTTKRP computation method described in Section 3.3.1, we can efficiently update the core tensor in parallel. Take the sub-tensor $\chi_{211}^{(t+1)}$ in the $C_1$ classification as an example; its term in $\sum_{(i_1, i_2, i_3) \neq (1,1,1)} \chi_{i_1 i_2 i_3}^{(t+1)} \times \left\{\mathbf{U}_{k,i_k}^{(t+1)\,T}\right\}$ can be transformed as follows:
$\chi_{211}^{(t+1)} \times_1 \left(\mathbf{U}_{1,2}^{(t+1)}\right)^{T} \times_2 \left(\mathbf{U}_{2,1}^{(t+1)}\right)^{T} \times_3 \left(\mathbf{U}_{3,1}^{(t+1)}\right)^{T}.$
Considering the properties of the Kronecker product and mode multiplication of the tensors, Equation (50) can be expressed as
$\left(\mathbf{U}_{1,2}^{(t+1)}\right)^{T} \times_1 \left[\left(\chi_{211}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_{3,1}^{(t+1)} \otimes \mathbf{U}_{2,1}^{(t+1)}\right)\right].$
Here, $\left(\chi_{211}^{(t+1)}\right)_{(1)} \left(\mathbf{U}_{3,1}^{(t+1)} \otimes \mathbf{U}_{2,1}^{(t+1)}\right)$ corresponds to the MTTKRP computation. Furthermore, the computation of $\sum_{(i_1, i_2, i_3) \neq (1,1,1)} \chi_{i_1 i_2 i_3}^{(t+1)} \times \left\{\mathbf{U}_{k,i_k}^{(t+1)\,T}\right\}$ involves summing the products of the sub-tensor multiplications. Because there is no dependency between the operations on different sub-tensors, such operations can be processed in a parallel manner. In this subsection, we realize the two-level parallel update method for the core tensor through two steps:
(1)
The computation of $\chi_{i_1 i_2 i_3}^{(t+1)} \times \left\{\mathbf{U}_{k,i_k}^{(t+1)\,T}\right\}$ related to different sub-tensors can be performed in a parallel manner; and
(2)
Application of the MTTKRP parallel computation method, described in Section 3.3.1, to each sub-tensor.
Assuming a sub-tensor is partitioned into $p_n$ parts at mode-$n$ for processing, the degree of parallelism of the traditional single-level parallel update method is $p_n$. For a tensor with $K$ modes, the incremental data can be partitioned into $2^K$ sub-tensors, of which up to $2^K - 1$ are involved in the cumulative computation $\sum_{(i_1,i_2,i_3)\neq(1,1,1)} \chi_{i_1 i_2 i_3}^{(t+1)} \times_k ( U_{k,i_k}^{(t+1)} )^T$. As a result, the degree of parallelism of the two-level parallel update method can be increased to $(2^K - 1)\,p_n$. A sequential sketch of the underlying MTTKRP reformulation is given below.
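The following is a small sequential NumPy sketch (illustrative shapes only) of the identity exploited above: a chain of mode products with transposed factor matrices equals an MTTKRP-style evaluation through the mode-1 unfolding. The helper functions unfold, fold, and mode_product are assumptions written for this sketch; the paper's parallel implementation operates on sub-tensor slices in Scala/Spark.

# Sequential sketch: chi x_1 U1^T x_2 U2^T x_3 U3^T evaluated two ways.
import numpy as np

def unfold(X, n):
    """Mode-n unfolding (Kolda convention, 0-indexed modes)."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def fold(M, n, shape):
    """Inverse of unfold for a target tensor of the given shape."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(np.reshape(M, full, order="F"), 0, n)

def mode_product(X, M, n):
    """Mode-n product X x_n M (rows of M index the new mode)."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, n)), 0, n)

rng = np.random.default_rng(1)
I1, I2, I3, R1, R2, R3 = 6, 5, 4, 3, 3, 2
X = rng.standard_normal((I1, I2, I3))          # a sub-tensor such as chi_211^(t+1)
U1, U2, U3 = (rng.standard_normal((I, R)) for I, R in [(I1, R1), (I2, R2), (I3, R3)])

# (a) direct chain of mode products
direct = mode_product(mode_product(mode_product(X, U1.T, 0), U2.T, 1), U3.T, 2)

# (b) MTTKRP-style evaluation through the mode-1 unfolding
mttkrp = U1.T @ (unfold(X, 0) @ np.kron(U3, U2))     # shape (R1, R2*R3)
via_unfold = fold(mttkrp, 0, (R1, R2, R3))

print(np.allclose(direct, via_unfold))               # True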

3.3.3. Parallel Update Method for High-Order Tensors

   The update methods for the third-order projection matrices and core tensor described in the previous subsections can be extended to higher-order tensors. Specifically, we propose a complete parallel incremental Tucker decomposition method for high-order tensors with multi-mode growth, whose pseudo code is given in Algorithm 3.
Algorithm 3 Incremental Tucker Decomposition Method for High-Order Tensors with Multi-Mode Growth
Input: (1) χ ( t ) = G ( t ) × U k ( t ) : The Tucker decomposition result of χ ( t ) at timestamp t, including the core tensor G ( t ) and projection matrices { U k ( t ) } ;
             (2) χ i 1 i k ( t + 1 ) : The incremental sub-tensors at time t + 1 , where ( i 1 , i 2 , , i K ) ( 1 , 1 , , 1 ) ;
             (3) $p_n$: number of partitions for mode-$n$ of a sub-tensor.
Output:  χ ( t + 1 ) = G ( t + 1 ) × U k ( t + 1 ) : The Tucker decomposition result of the sub-tensor at time t + 1 , including the updated core tensor G ( t + 1 ) and projection matrices { U k ( t + 1 ) } .
1: group the incremental sub-tensors $\chi_{i_1 \cdots i_K}^{(t+1)}$ into $K$ classifications $\{C_m\}_{m=1}^{K}$;
2: for $1 \le m \le K$ do
3:   process the sub-tensors $\chi_{i_1 \cdots i_K}^{(t+1)}$ within $C_m$ in parallel;
4:   partition each sub-tensor $\chi_{i_1 \cdots i_K}^{(t+1)}$ according to Algorithm 2;
5:   update the auxiliary matrices $U_k$ in parallel based on Equation (52);
6: end for
7: for $1 \le m \le K$ do
8:   concatenate the auxiliary matrix and the projection matrix to form a temporary projection matrix $F_M$ according to Equation (30);
9:   perform orthogonalization on $F_M$;
10: end for
11: update the core tensor in parallel according to Equation (53);
12: output the core tensor $G^{(t+1)}$ and projection matrices $U_k^{(t+1)}$ at time $t+1$.
For a tensor with $K$ modes, the incremental tensor is first partitioned into $2^K - 1$ sub-tensors based on their position relative to the original tensor $\chi_{1 \cdots 1} = \chi^{(t)}$. According to the number of indices equal to 2 in the index $(i_1, \ldots, i_K)$ of each sub-tensor $\chi_{i_1 \cdots i_K}^{(t+1)}$ obtained in Step (1), these sub-tensors are grouped into $K$ classifications $\{C_m\}_{m=1}^{K}$. Sub-tensors in each classification are further partitioned by the proposed PSTPA-DP, as described in Step (4). Subsequently, the projection matrices are updated using the sub-tensors in $C_1, C_2, \ldots, C_K$ in turn. Within each classification, the updates driven by different sub-tensors can proceed in parallel, corresponding to the first level of parallel updating. Since each sub-tensor has already been partitioned and the parallel MTTKRP computation method is applied, the update of each projection matrix $U_k$ can also be performed in parallel, i.e., the second level of parallel processing. The update formula for the high-order projection matrices is as follows:
$U_{(k)}^{new} \leftarrow \alpha U_{(k)}^{old} + (1-\alpha)\, \chi^{(t+1)}_{i_1 \cdots i_K\,(k)} \left( U_{(K)} \otimes \cdots \otimes U_{(k+1)} \otimes U_{(k-1)} \otimes \cdots \otimes U_{(1)} \right) \left( G^{(t)}_{(k)} \right)^{\dagger} \quad (52)$
When $i_k = 1$, $U_{(k)} = U_{(k)}^{(t)}$; when $i_k = 2$, $U_{(k)}$ is the corresponding auxiliary matrix. This formula corresponds to Step (5), i.e., the update of the auxiliary matrices. The updated auxiliary matrices are then concatenated with the projection matrices, and the concatenated result is orthogonalized, corresponding to Steps (7)–(10). At this point, the projection matrices $\{U_k^{(t+1)}\}$ at time $t+1$ are obtained. Finally, the update of the high-order core tensor is given as follows:
$G^{(t+1)} = G^{(t)} \times_k \left\{ \left( U_{k,1}^{(t+1)} \right)^T U_k^{(t)} \right\} + \sum_{(i_1,\ldots,i_K) \neq (1,\ldots,1)} \chi^{(t+1)}_{i_1 \cdots i_K} \times_k \left\{ \left( U_{k,i_k}^{(t+1)} \right)^T \right\} \quad (53)$
Equation (53) corresponds to Step (11) of Algorithm 3. So far, the incremental Tucker decomposition method for high-order tensors with multi-mode growth is ready to yield the core tensor G ( t + 1 ) and projection matrices { U k ( t + 1 ) } at time t + 1 .
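To make the structure of Equation (53) and the classification step of Algorithm 3 more concrete, the following is a hedged NumPy sketch of the core-tensor update for $K = 3$. The shapes, the dictionary of incremental sub-tensors, and the row-wise splitting of the factor matrices into old and new blocks are assumptions for illustration; in the paper's implementation the independent terms of the sum are distributed over Spark workers rather than evaluated in a Python loop.

# Illustrative NumPy sketch of the high-order core-tensor update of Equation (53), K = 3:
# the first term projects the old core through U_{k,1}^(t+1)T U_k^(t); the second term
# accumulates every incremental sub-tensor chi_{i1 i2 i3}^(t+1) projected by the
# matching factor blocks U_{k,i_k}^(t+1)T.
from itertools import product
import numpy as np

def mode_product(X, M, n):
    return np.moveaxis(np.tensordot(M, X, axes=(1, n)), 0, n)

def multi_mode_product(X, mats):
    for k, M in enumerate(mats):
        X = mode_product(X, M, k)
    return X

rng = np.random.default_rng(2)
old_dims, new_dims, ranks = (4, 4, 4), (6, 5, 7), (2, 2, 2)
G_old = rng.standard_normal(ranks)

# U_k^(t+1) is split row-wise into the old block (index 1) and the new block (index 2)
U_new = [rng.standard_normal((n, r)) for n, r in zip(new_dims, ranks)]
blocks = [{1: U[:o, :], 2: U[o:, :]} for U, o in zip(U_new, old_dims)]
U_old = [rng.standard_normal((o, r)) for o, r in zip(old_dims, ranks)]

# incremental sub-tensors chi_{i1 i2 i3}^(t+1); classification C_m is the count of 2's in the index
sub = {}
for idx in product((1, 2), repeat=3):
    if idx == (1, 1, 1):
        continue
    shape = tuple(o if i == 1 else n - o for i, o, n in zip(idx, old_dims, new_dims))
    sub[idx] = rng.standard_normal(shape)

# first term: G^(t) x_k ( U_{k,1}^(t+1)T U_k^(t) )
G_new = multi_mode_product(G_old, [b[1].T @ U for b, U in zip(blocks, U_old)])

# second term: sum over all sub-tensors, each projected by U_{k,i_k}^(t+1)T
for idx, chi in sub.items():            # every term here is independent -> parallelizable
    G_new = G_new + multi_mode_product(chi, [blocks[k][i].T for k, i in enumerate(idx)])

print(G_new.shape)                       # equals the Tucker ranks (2, 2, 2)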

4. Experimental Analysis

Experiments in this section are conducted to evaluate the performance of the proposed algorithm on three tasks: tensor decomposition and reconstruction, a comparative analysis of the computational efficiency of sub-tensor partitioning algorithms, and an assessment of the parallel update efficiency of projection matrices and core tensors. The experimental environment comprises the following hardware and software settings:
(1)
Five DELL workstations, each equipped with an Intel® Core™ i7-9700 CPU, 16 GB of RAM, and a 2 TB hard drive. One workstation serves as the Master node and the other four as Slave nodes.
(2)
The key algorithms are implemented using Scala 2.12.10 and executed on a Spark 3.0.3 parallel computing cluster.
(3)
The overall parallel tensor computation system is developed in Java 1.8.0_281.
(4)
The simple build tool (sbt) is employed to package the project, which is then deployed to the Spark cluster; the sbt-assembly plugin is introduced to facilitate the application build process.
(5)
External dependencies such as scopt and netlib are incorporated during project initialization, because sbt does not inherently package them into the application.
For a comprehensive performance evaluation of the TPITTD-MG algorithm, datasets with uniformly distributed non-zero elements are generated using NumPy and used as the basis for constructing third-order tensors.
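As a point of reference, the following is a minimal sketch of how such a dataset could be generated with NumPy. The function name synth_sparse_tensor, the default shape, the density, and the CSV output format are illustrative assumptions; the paper does not prescribe the exact generator used.

# Sketch: scatter non-zero entries uniformly at random over a third-order tensor
# of the requested shape and density, and write them out as (i, j, k, value) records.
import numpy as np

def synth_sparse_tensor(shape=(200, 200, 40), density=0.01, seed=0):
    rng = np.random.default_rng(seed)
    nnz = int(density * np.prod(shape))
    # sample linear positions without replacement, then unravel to (i, j, k)
    flat = rng.choice(np.prod(shape), size=nnz, replace=False)
    coords = np.stack(np.unravel_index(flat, shape), axis=1)
    values = rng.uniform(0.0, 1.0, size=nnz)
    return np.column_stack([coords, values])     # one row per non-zero element

records = synth_sparse_tensor()
np.savetxt("synthetic_tensor_200x200x40.csv", records, delimiter=",")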

4.1. Experiments on Tensor Decomposition and Reconstruction

In this subsection, a set of random matrices, $X_1, X_2, \ldots, X_n$ ($n \geq 3$), are generated and standardized. Through operations such as translation and matrix multiplication, these matrices are assembled into three original tensors, $\chi_1^{(t)}$, $\chi_2^{(t)}$ and $\chi_3^{(t)}$, at time $t$, which are decomposed by the Tucker decomposition to obtain a core tensor and a series of projection matrices. At time $t+1$, new increments are added to each mode of the tensor $\chi_i^{(t)}$ ($i = 1, 2, 3$), generating the tensor $\chi_i^{(t+1)}$ ($i = 1, 2, 3$). Subsequently, the proposed TPITTD-MG algorithm is applied to the newly arrived data to update the core tensor and projection matrices at time $t$ in parallel, yielding their counterparts at time $t+1$.
The tensors $\chi_i^{(t+1)}$ ($i = 1, 2, 3$) are then reconstructed from the decomposition results at time $t+1$; the reconstructions are denoted by $\hat{\chi}_i^{(t+1)}$ ($i = 1, 2, 3$). An element-wise comparison between $\hat{\chi}_i^{(t+1)}$ and $\chi_i^{(t+1)}$ is shown in Figure 6.
In Figure 6, the first row shows the element-wise comparison between $\hat{\chi}_1^{(t+1)}$ and $\chi_1^{(t+1)}$ at each mode, represented by blue and red curves, respectively. The second and third rows provide the same comparison for the second and third tensors in turn. It is evident that the reconstructed tensors closely fit the original ones, implying that the core tensor and projection matrices obtained through TPITTD-MG preserve the information of the original tensors well, which validates the accuracy of the algorithm.
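For readers who wish to reproduce this kind of check, the following is a small, hedged NumPy sketch of the reconstruction comparison behind Figure 6: a tensor is rebuilt from a core tensor and factor matrices via mode products and compared element-wise with the original. The shapes, the random stand-in data, and the small perturbation added to mimic an approximate decomposition are assumptions for illustration only.

# Sketch: Tucker reconstruction and element-wise comparison with the original tensor.
import numpy as np

def mode_product(X, M, n):
    return np.moveaxis(np.tensordot(M, X, axes=(1, n)), 0, n)

def tucker_reconstruct(G, factors):
    X = G
    for k, U in enumerate(factors):
        X = mode_product(X, U, k)
    return X

rng = np.random.default_rng(3)
ranks, dims = (3, 3, 3), (20, 15, 10)
G = rng.standard_normal(ranks)
factors = [rng.standard_normal((I, R)) for I, R in zip(dims, ranks)]

X_true = tucker_reconstruct(G, factors)                  # plays the role of chi^(t+1)
# reconstruction from slightly perturbed factors, mimicking decomposition output
X_hat = tucker_reconstruct(G, [U + 1e-3 * rng.standard_normal(U.shape) for U in factors])
rel_err = np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true)
print(f"relative reconstruction error: {rel_err:.2e}")   # small for a close fit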

4.2. Execution Efficiency Comparison of Sub-Tensor Partitioning Algorithms

This subsection examines the time efficiency of the proposed parallel sub-tensor partitioning mechanism based on dynamic programming (PSTPA-DP) and compares it with DisMASTD, the heuristic sub-tensor partitioning algorithm shown in Algorithm 1.
With the parallel degree set to 4, we compare the execution efficiency of the two algorithms over synthetic datasets of different sizes with uniformly distributed non-zero elements. The experimental results are depicted as blue and orange plots, respectively, in Figure 7a.
It can be seen that, at a parallel degree of 4, there is no distinguishable difference between the two algorithms for smaller tensors. However, as the tensor size increases, the advantage of PSTPA-DP over DisMASTD becomes increasingly evident. The same trend holds when the parallel degree is set to 8, as shown in Figure 7b; moreover, when the parallel degree increases from 4 to 8, the efficiency of PSTPA-DP improves further.
Furthermore, the efficiency of PSTPA-DP is empirically validated in scenarios characterized by a nonuniform distribution of non-zero elements, which may lead to imbalanced partitions under traditional heuristic tensor partitioning algorithms. To assess partition uniformity, we adopt the coefficient of variation, defined as the ratio of the standard deviation to the mean, $c_v = \sigma / \mu$, as the metric. For this purpose, tensor datasets of identical volume but varying $c_v$ are synthesized, and the $c_v$ of the partitions generated by both algorithms is computed for datasets with different initial $c_v$ values. Figure 8 illustrates the performance of the two tensor partitioning algorithms over these datasets.
Figure 8 reveals that, for datasets with low $c_v$, the two partitioning algorithms exhibit negligible differences. However, as the $c_v$ of the datasets increases, the partitions generated by PSTPA-DP demonstrate significantly higher uniformity than those produced by DisMASTD.
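For clarity, the uniformity metric can be computed as follows; the partition counts below are made-up numbers used only to illustrate how $c_v$ separates a balanced split from a skewed one.

# Coefficient of variation c_v = sigma / mu of the number of non-zero elements
# assigned to each partition (lower is more uniform). Example counts are illustrative.
import numpy as np

def coefficient_of_variation(partition_nnz):
    counts = np.asarray(partition_nnz, dtype=float)
    return counts.std() / counts.mean()

balanced = [25_100, 24_870, 25_040, 24_990]     # e.g., a well-balanced split
skewed = [41_300, 18_200, 22_700, 17_800]       # e.g., a heuristic split on skewed data
print(coefficient_of_variation(balanced))       # ~0.003
print(coefficient_of_variation(skewed))         # ~0.38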

4.3. Efficiency Comparison of Parallel Projection Matrix and Core Tensor Updates

This subsection evaluates the execution efficiency of updating the projection matrices and core tensors over synthetic datasets of different scales, employing both the single-level and the two-level parallel update methods, as shown in Table 3. The single-level case employs the heuristic sub-tensor partitioning algorithm, while the two-level method uses the proposed PSTPA-DP.
Table 3 lists the execution time of each step of TPITTD-MG over different datasets with the two partitioning methods. The execution time of the two-level parallel update method is almost the same as that of the single-level version over small-scale datasets. For medium-scale datasets, the former performs slightly better than the latter, and for large-scale datasets it reduces the execution time significantly. These observations are consistent with the theoretical analysis and verify the effectiveness of the two-level parallel update algorithm.
The effectiveness of TPITTD-MG is further verified by a comparison with two reference algorithms, PCSTF and DPHTD, proposed by Yao et al. [34] and Ning et al. [35], respectively, as shown in Table 4. Although TPITTD-MG running on 4 workers provides a slightly lower speedup than DPHTD running on 16 workers, it significantly outperforms PCSTF under the same conditions.

5. Conclusions

In this paper, we attempt to reduce the computational complexity of incremental Tucker tensor decomposition by introducing two methods, i.e., a parallel sub-tensor partitioning algorithm based on dynamic programming (PSTPA-DP) and a two-level parallel update method. The first method addresses the inefficiency of counting non-zero elements during sub-tensor partitioning and leads to an even partition of sub-tensors. The second aims to accelerate computation by increasing the degree of parallelism and consists of two levels: the first level parallelizes the updates using a parallel MTTKRP computation strategy, while the second level exploits the independence of the update processes of projection matrices driven by different sub-tensors within the same classification. By integrating these two methods with the Tucker decomposition, we propose a two-level parallel incremental tensor Tucker decomposition method with multi-mode growth (TPITTD-MG) and conduct extensive experiments over synthesized datasets to validate the effectiveness and efficiency of both the mechanisms involved and the algorithm proposed. The experimental results show that, for a data scale in the tens of millions and a parallel degree of 4, TPITTD-MG improves execution efficiency by nearly 400% compared with existing algorithms, with partitioning uniformity improved by over 20%; for third-order tensors, its execution efficiency improves by nearly 300% compared with the single-level update algorithm. It can be concluded that transitioning the decomposition of the initial tensor to a parallel execution is a practical strategy to substantially improve computational efficiency.

Author Contributions

Conceptualization, Y.Z. and Z.Y.; methodology, Y.Z. and Z.Y.; software, Z.Y. and Z.C.; writing—original draft preparation, Z.Y.; writing—review and editing, Y.Z.; experiments, Z.C.; reference lookup, Z.C.; supervision, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 111 Project (Grant No. B21049).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to express our sincere gratitude to Chenpeng JIANG for his contributions to the original draft preparation and review of the manuscript. We also thank Lin YANG for her efforts for the reference lookup and editing of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cichocki, A. Era of Big Data Processing: A New Approach via Tensor Networks and Tensor Decompositions. arXiv 2014, arXiv:1403.2048.
2. Kilmer, M.; Horesh, L.; Avron, H.; Newman, E. Tensor-tensor Algebra for Optimal Representation and Compression of Multiway Data. Proc. Natl. Acad. Sci. USA 2021, 118, e2015851118.
3. Kolda, T.G.; Bader, B.W. Tensor Decompositions and Applications. SIAM Rev. 2009, 51, 455–500.
4. Liu, Y.; Liu, J.; Long, Z.; Zhu, C. Tensor Computation for Data Analysis; Springer: Berlin, Germany, 2021.
5. Kuang, L.; Hao, F.; Yang, L.T.; Lin, M.; Luo, C.; Min, G. A Tensor-Based Approach for Big Data Representation and Dimensionality Reduction. IEEE Trans. Emerg. Top. Comput. 2017, 2, 280–291.
6. Nguyen-Schäfer, H.; Schmidt, J.P. Tensor Analysis and Elementary Differential Geometry for Physicists and Engineers; Springer: Berlin, Germany, 2017.
7. Levi-Civita, T.; Persico, E.; Long, M. The Absolute Differential Calculus. Absol. Differ. Calc. 1926, 7, 140.
8. Schouten, J.A.; Corson, E.M. Tensor Analysis for Physicists. Phys. Today 1955, 5, 22.
9. Ballard, G.; Kolda, T.G. Tensor Decompositions for Data Science; Cambridge University Press: Cambridge, UK, 2024.
10. Ma, Y.; Chen, R.; Li, W.; Shang, F.; Yu, B. A Unified Approximation Framework for Compressing and Accelerating Deep Neural Networks. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019.
11. Liu, X.; Parhi, K.K. Tensor Decomposition for Model Reduction in Neural Networks: A Review. IEEE Circuits Syst. Mag. 2023, 23, 8–28.
12. Tucker, L.R. Implications of Factor Analysis of Three-Way Matrices for Measurement of Change. Probl. Meas. Chang. 1963, 15, 3.
13. Xiao, H.; Wang, F.; Ma, F.; Gao, J. eOTD: An Efficient Online Tucker Decomposition for Higher Order Tensors. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018.
14. De Lathauwer, L.; De Moor, B.; Vandewalle, J. On the Best Rank-1 and Rank-(R1, R2, …, RN) Approximation of Higher-Order Tensors. SIAM J. Matrix Anal. Appl. 2000, 21, 1324–1342.
15. Kim, Y.D.; Park, E.; Yoo, S.; Choi, T.; Shin, D. Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications. arXiv 2015, arXiv:1511.06530.
16. Minster, R.; Li, Z.; Ballard, G. Parallel Randomized Tucker Decomposition Algorithms. SIAM J. Sci. Comput. 2024, 46, A1186–A1213.
17. Tucker, L. Some Mathematical Notes on Three-mode Factor Analysis. Psychometrika 1966, 31, 279–311.
18. De Lathauwer, L.; De Moor, B.; Vandewalle, J. A Multilinear Singular Value Decomposition. SIAM J. Matrix Anal. Appl. 2000, 21, 1253–1278.
19. Heng, Q. Robust Low-Rank Tensor Decomposition with the L2 Criterion. Technometrics 2023, 65, 537–552.
20. Kapteyn, A.; Neudecker, H.; Wansbeek, T. An Approach to n-Mode Components Analysis. Psychometrika 1986, 51, 269–275.
21. Eldén, L.; Savas, B. A Newton-Grassmann Method for Computing the Best Multi-Linear Rank-(R1, R2, R3) Approximation of a Tensor. SIAM J. Matrix Anal. Appl. 2009, 31, 248–271.
22. Fang, S.; Kirby, R.M.; Zhe, S. Bayesian Streaming Sparse Tucker Decomposition. Uncertain. Artif. Intell. 2021, 161, 558–567.
23. Thanh, L.T.; Abed-Meraim, K.; Trung, N.L.; Hafiane, A. A Contemporary and Comprehensive Survey on Streaming Tensor Decomposition. IEEE Trans. Knowl. Data Eng. 2023, 35, 10897–10921.
24. Vannieuwenhoven, N.; Vandebril, R.; Meerbergen, K. A New Truncation Strategy for the Higher-Order Singular Value Decomposition. SIAM J. Sci. Comput. 2012, 34, 1027–1052.
25. Kressner, D.; Periša, L. Recompression of Hadamard Products of Tensors in Tucker Format. SIAM J. Sci. Comput. 2017, 39, A1879–A1902.
26. Che, M.; Wei, Y. Randomized Algorithms for the Approximations of Tucker and the Tensor Train Decompositions. Adv. Comput. Math. 2019, 45, 395–428.
27. Chachlakis, D.G.; Prater-Bennette, A.; Markopoulos, P.P. L1-Norm Higher-Order Orthogonal Iterations for Robust Tensor Analysis. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020.
28. Xiao, G.; Yin, C.; Zhou, T.; Li, X.; Chen, Y.; Li, K. A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Comput. Surv. 2024, 56, 38.
29. Oh, S.; Park, N.; Sael, L.; Kang, U. Scalable Tucker Factorization for Sparse Tensors—Algorithms and Discoveries. In Proceedings of the 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, France, 16–19 April 2018.
30. Ballard, G.; Klinvex, A.; Kolda, T.G. TuckerMPI: A Parallel C++/MPI Software Package for Large-Scale Data Compression via the Tucker Tensor Decomposition. ACM Trans. Math. Softw. (TOMS) 2020, 46, 1–31.
31. Kang, U.; Papalexakis, E.; Harpale, A.; Faloutsos, C. GigaTensor: Scaling Tensor Analysis Up by 100 Times—Algorithms and Discoveries. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 316–324.
32. Park, N.; Jeon, B.; Lee, J.; Kang, U. BIGtensor: Mining Billion-Scale Tensor Made Easy. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 2457–2460.
33. Acer, S.; Torun, T.; Aykanat, C. Improving Medium-Grain Partitioning for Scalable Sparse Tensor Decomposition. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 2814–2825.
34. Yao, J.; Zheng, P.; Wu, Z.; Sun, J.; Wei, Z. A Distributed and Parallel Method for Fusion of Remote Sensing Images Based on Tucker Decomposition. In Proceedings of the IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 1013–1016.
35. Ning, W.; Wu, Z.; Sun, J.; Yang, J.; Zhang, Y.; Zhu, Y.; Wei, Z.; Xia, L. A Distributed and Parallel Tensor Hierarchical Tucker Decomposition on Clouds. In Proceedings of the 2021 Ninth International Conference on Advanced Cloud and Big Data (CBD), Xi'an, China, 26–27 March 2022; pp. 31–36.
36. Phan, A.H.; Cichocki, A. PARAFAC Algorithms for Large-Scale Problems. Neurocomputing 2011, 74, 1970–1984.
37. Nion, D.; Sidiropoulos, N.D. Adaptive Algorithms to Track the PARAFAC Decomposition of a Third-Order Tensor. IEEE Trans. Signal Process. 2009, 57, 2299–2310.
38. Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Incremental Singular Value Decomposition Algorithms for Highly Scalable Recommender Systems. In Proceedings of the Fifth International Conference on Computer and Information Science, Seoul, Republic of Korea, 28–29 November 2002.
39. Song, Q.; Huang, X.; Ge, H.; Caverlee, J.A.; Hu, X. Multi-Aspect Streaming Tensor Completion. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 435–443.
40. Yang, K.; Gao, Y.; Shen, Y.; Zheng, B.; Chen, L. DisMASTD: An Efficient Distributed Multi-Aspect Streaming Tensor Decomposition. In Proceedings of the ACM Turing Award Celebration Conference-China, Wuhan, China, 28–30 July 2023; pp. 127–128.
41. Zhou, S.; Vinh, N.X.; Bailey, J.; Jia, Y.; Davidson, I. Accelerating Online CP Decompositions for Higher Order Tensors. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1375–1384.
42. Yu, R.; Cheng, D.; Liu, Y. Accelerated Online Low Rank Tensor Learning for Multivariate Spatiotemporal Streams. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 238–247.
43. Rontogiannis, A.A.; Kofidis, E.; Giampouras, P.V. Online Rank-Revealing Block-Term Tensor Decomposition. Signal Process. 2023, 212, 109126.
44. Abubaker, N.; Acer, S.; Aykanat, C. True Load Balancing for Matricized Tensor Times Khatri-Rao Product. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 1974–1986.
45. Korf, R.E. A Complete Anytime Algorithm for Number Partitioning. Artif. Intell. 1998, 106, 181–203.
Figure 1. Chicago crime tensor of size 365 × 24 × 77 × 12.
Figure 2. Comparison between CP and Tucker decomposition.
Figure 3. A schematic diagram of a streaming data ecosystem.
Figure 4. Tucker decomposition of multi-mode growth for a third-order tensor.
Figure 5. Illustration of the 5 slices of the sub-tensor χ_sub.
Figure 6. Element-wise comparison between the original and the reconstructed tensors at time t + 1.
Figure 7. A comparison between the dynamic scheduling-based parallel sub-tensor partitioning algorithm and DisMASTD of different parallelism. (a) Parallel degree of 4. (b) Parallel degree of 8.
Figure 8. Trend in coefficient of variation.
Table 1. Important notations in this paper.
Notation | Meaning
$\chi \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ | an N-dimensional tensor
$G$ | core tensor
$U_1, U_2, \ldots, U_n$ | projection matrices
$\chi_{(n)}$ | mode-n unfolding of $\chi$
$X^T$, $X^{-1}$, $X^{\dagger}$ | transpose, inverse and pseudo-inverse of matrix $X$
$X(i,:)$, $X(:,j)$ | i-th row and j-th column of $X$
$\chi \times_n U$ | mode-n product of $\chi$ with $U$
$\circ$, $\odot$, $\otimes$, $\circledast$ | outer, Khatri–Rao, Kronecker, and Hadamard product
$X \times_n^1 Y$ | mode-(n,1) contracted product of $X$ with $Y$
$\|\cdot\|_F$, $\|\cdot\|_p$, $\|\cdot\|_*$ | Euclidean norm, $L_p$ norm and nuclear norm
$[\![ U^{(n)} ]\!]_{n=1}^{N}$ | $\sum_{i=1}^{r} U^{(1)}(:,i) \circ U^{(2)}(:,i) \circ \cdots \circ U^{(N)}(:,i)$
$[\![ \chi; U^{(n)} ]\!]_{n=1}^{N}$ or $\chi \times \{U_k\}$ | $\chi \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \cdots \times_N U^{(N)}$, with $(\chi \times_k U)_{n_1 \cdots i_k \cdots n_K} = \sum_{n_k=1}^{N} \chi_{n_1 \cdots n_k \cdots n_N} U_{i_k n_k}$
$X \times_{(a,b)} Y$ | mode-(a,b) product (tensor contraction)
$Z \times_k U_k^Z = X \times_k U_k^X + Y \times_k U_k^Y$ | Block Tensor and Matrix Multiplication I (Corollary I proposed by Xiao et al. [13]): in case $k' \neq k$ and $U_{k'}^Z \in \mathbb{R}^{R_{k'} \times I_{k'}}$, $Z \times_{k'} U_{k'}^Z$ is equal to the concatenation of $X \times_{k'} U_{k'}^Z$ and $Y \times_{k'} U_{k'}^Z$ at mode-k
$\chi \times \{U_k\} = \sum_{(i_1, \ldots, i_K) \in [2]^K} \chi_{i_1 i_2 \cdots i_K} \times \{U_{k, i_k}\}$ | Block Tensor and Matrix Multiplication II (Corollary II proposed by Xiao et al. [13])
Table 2. Execution Timetable of the Heuristic Tensor Partitioning Algorithm.
Order Number | Data Size | Non-Zero Element Counting Time (s) | Partitioning Algorithm Execution Time (s) | Total Update Time (s)
1 | 20 × 20 × 40 | 0.52 | 0.91 | 2.42
2 | 200 × 20 × 40 | 1.09 | 1.28 | 2.82
3 | 200 × 200 × 40 | 2.01 | 2.17 | 4.56
4 | 200 × 200 × 400 | 8.02 | 8.18 | 16.09
The corresponding hardware/software configuration as well as datasets for experiments are described at length in Section 4.
Table 3. Execution Time of Different Steps for Different Algorithms.
Data Scale | Algorithm | Tensor Partition Execution Time (s) | Projection Matrix Update Execution Time (s) | Core Tensor Update Execution Time (s) | Total Update Execution Time (s)
10 × 10 × 10 | Single-Level Update | 1.01 | 1.21 | 0.34 | 2.58
10 × 10 × 10 | Two-Level Update | 1.19 | 1.32 | 0.31 | 2.87
100 × 10 × 10 | Single-Level Update | 1.19 | 1.11 | 0.33 | 2.80
100 × 10 × 10 | Two-Level Update | 1.18 | 1.13 | 0.42 | 2.76
100 × 100 × 10 | Single-Level Update | 2.49 | 2.03 | 0.33 | 5.03
100 × 100 × 10 | Two-Level Update | 1.82 | 1.54 | 0.28 | 4.01
100 × 100 × 100 | Single-Level Update | 9.69 | 7.11 | 0.38 | 17.80
100 × 100 × 100 | Two-Level Update | 3.47 | 3.02 | 0.35 | 7.78
1000 × 100 × 100 | Single-Level Update | 44.32 | 42.23 | 1.01 | 91.87
1000 × 100 × 100 | Two-Level Update | 12.53 | 14.61 | 0.99 | 30.94
Table 4. Comparison of Distributed Tucker Decomposition Methods.
Algorithm | Platform | Parallelism | Data Scale | Key Optimization | Speedup
Yao (PCSTF) [34] | Spark | 4 workers | 200 × 200 × 93 | Distributed core tensor update | 3.0
Ning (DPHTD) [35] | OpenStack | 4 × 4 workers | 128 × 129 × 320 | Coarse/fine-grained parallelism | 4.1
TPITTD-MG | Spark | 4 workers | 1000 × 100 × 100 | Dynamic sub-tensor allocation, MTTKRP | 4.0