Article

On Block g-Circulant Matrices with Discrete Cosine and Sine Transforms for Transformer-Based Translation Machine

by Euis Asriani 1,*, Intan Muchtadi-Alamsyah 2,3 and Ayu Purwarianti 3,4
1 Doctoral Program of Mathematics, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Bandung 40132, Indonesia
2 Algebra Research Group, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Bandung 40132, Indonesia
3 University Center of Excellence Artificial Intelligence on Vision, Natural Language Processing and Big Data Analytics (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung 40132, Indonesia
4 Informatics Research Group, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung 40132, Indonesia
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(11), 1697; https://doi.org/10.3390/math12111697
Submission received: 29 February 2024 / Revised: 3 May 2024 / Accepted: 24 May 2024 / Published: 29 May 2024
(This article belongs to the Special Issue Applications of Mathematics in Neural Networks and Machine Learning)

Abstract:
The transformer has emerged as one of the modern neural networks applied in numerous applications. However, the large and deep architecture of transformers makes them computationally and memory-intensive. In this paper, we propose block g-circulant matrices to replace the dense weight matrices in the feedforward layers of the transformer and leverage the DCT-DST algorithm to multiply these matrices with the input vector. Our tests using Portuguese-English datasets show that the suggested method improves model memory efficiency compared to the dense transformer, at the cost of a slight drop in accuracy. We found that the 128-dimensional Dense-Block 1-Circulant DCT-DST model achieved the highest model memory efficiency, at 22.14%. We further show that the same model achieved a BLEU score of 26.47%.

1. Introduction

Across a wide range of applications, including chatbots [1], machine translation [2], text summarization [3], and video and image processing [4,5,6], transformer neural networks have brought significant improvements. As demonstrated by [7,8,9,10,11], transformers and their modifications have proven superior, particularly in machine translation. The Transformer [12] is one of the more advanced neural networks, with a deep topology, and the computational power required for its training and inference keeps increasing. Conversely, the depth of the transformer architecture gives rise to several constraints and challenges, including high computational complexity [13], substantial demands on computational resources [14], and memory consumption [15] that is quadratic in the input sequence length. Therefore, methods are required to retain excellent performance at a lower cost, particularly when transformers are used as translation machines. Several approaches have been put forth to address this problem, including weight matrix decomposition, matrix-vector algorithm selection, and weight matrix replacement.
Utilizing a structured weight matrix is considered a crucial tactic among these strategies because of its benefits. According to [16], it may lessen the memory needed for training and storing the model (and optimizer). This approach can also mitigate computational complexity, leveraging the properties of the selected structured weight matrix [12]. This fact has encouraged the creation of a wide variety of structured matrices, including low-rank matrices [17], Toeplitz-like matrices [18], block-circulant matrices [19,20], fast-food transforms [21], low displacement rank matrices [22], and butterfly matrices [23], among others. One of the Low Displacement Rank (LDR) matrices is the block g-circulant matrix [24,25]. The block g-circulant matrix combines features from block-circulant and g-circulant matrices, and its adoption as a replacement for the transformer weight matrix is believed to enhance transformer performance, building on findings from prior research [26,27]. This matrix generalizes the block circulant matrix by shifting each block row g positions to the right. This property makes more intricate patterns and interactions within the matrix possible, expanding the range of potential uses. Among the benefits of structured matrices is the availability of efficient algorithms, such as algorithms for multiplying them by an arbitrary vector, which reduce computational complexity [28]. This is why structured matrices are used as transformer weight matrices.
The Fast Fourier Transform (FFT) algorithm has proven dependable when the structured matrix is a circulant matrix [29]. The FFT Transformer, which replaces the dense weight matrices of the feedforward layer with block circulant matrices, has been shown to compress model memory by up to 100 times. Liu et al. [28] introduced the Discrete Cosine Transform-Discrete Sine Transform (DCT-DST) algorithm for circulant matrix-vector products, which can save storage compared to the FFT. To our knowledge, the DCT-DST algorithm has not previously been employed in a text processing system; its more common use is in image and video processing [30,31]. In this paper, we present a novel method for executing a transformer whose weight matrices are real block g-circulant matrices, using a DCT-DST algorithm, for the translation machine. Our objective is to impose a block g-circulant structure on transformer model topologies through the elegant mathematical characteristics of the block g-circulant matrix. The contributions of this paper are summarized as follows:
  • We define the block g-circulant matrix and derive several lemmas and theorems regarding its characteristics and possible eigenvalues, and we define the matrices used in carrying out the DCT-DST algorithm for multiplication by block g-circulant matrices.
  • We propose a new approach to using structured matrices as a replacement for dense weight matrices, combined with a matrix multiplication algorithm; in this case, a combination of block g-circulant matrices and the DCT-DST algorithm.
  • Our research is the first study to apply the DCT-DST algorithm to weight matrix-vector multiplication in a transformer-based translation machine.
In this article, we present a structured exploration of key concepts. Firstly, we delve into the foundational motivation driving our exploration of block g-circulant matrices and the DCT-DST algorithm. Section 2 explains the theories and characteristics of block g-circulant matrices, including a comprehensive examination of the potential eigenvalues of real block g-circulant matrices. Section 3 describes the workings of the block g-circulant DCT-DST transformer, including a comprehensive explanation of the associated algorithm. Section 4 presents our experimental methodology, results, and a thorough discussion of the research findings. Finally, we summarize our insights in a concise conclusion in the closing section.

2. Block g-Circulant Matrix

Definition 1.
Let $C_i$ be an $m \times m$ matrix for each $i = 1, \dots, n$. Then an $nm \times nm$ block-circulant matrix $C_{nm}$ is generated from the ordered set $C_1, C_2, \dots, C_n$ and is of the form
$$C_{nm} = \begin{pmatrix} C_1 & C_2 & \cdots & C_n \\ C_n & C_1 & \cdots & C_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ C_2 & C_3 & \cdots & C_1 \end{pmatrix}.$$
The set of all such matrices of order $nm \times nm$ will be denoted by $\mathcal{BC}_{nm}$, whereas $\mathcal{B}_m$ represents the set of all circulant matrices of dimension $m$.
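To make Definition 1 concrete, the following NumPy sketch assembles a block-circulant matrix from an ordered list of blocks; the block count, block size, and data are illustrative only, not taken from the paper.

```python
import numpy as np

def block_circulant(blocks):
    """Assemble the nm x nm block-circulant matrix of Definition 1.

    blocks: list [C_1, ..., C_n] of m x m arrays. Block row i (0-based)
    contains C_{(j - i) mod n + 1} in block column j, so the first block
    row is [C_1 C_2 ... C_n] and the second is [C_n C_1 ... C_{n-1}].
    """
    n = len(blocks)
    return np.block([[blocks[(j - i) % n] for j in range(n)] for i in range(n)])

# Illustrative example: n = 3 blocks of size m = 2.
rng = np.random.default_rng(0)
C_blocks = [rng.standard_normal((2, 2)) for _ in range(3)]
C_nm = block_circulant(C_blocks)
print(C_nm.shape)  # (6, 6)
```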
Theorem 1
([32]). Let $C_{nm} \in \mathcal{BC}_{nm}$ have generating elements $C_1, C_2, \dots, C_n \in \mathcal{B}_m$. If $c_i^{(1)}, c_i^{(2)}, \dots, c_i^{(m)}$ are the generating elements of $C_i$, then
$$(F_n \otimes F_m)\, C_{nm}\, (F_n \otimes F_m)^{*} = D_{nm} = \operatorname{diag}_{i=1,\dots,n} \begin{pmatrix} \lambda_i^{(1)} & 0 & \cdots & 0 \\ 0 & \lambda_i^{(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_i^{(m)} \end{pmatrix} \qquad (1)$$
is a diagonal matrix of dimension $nm \times nm$ with
$$\lambda_i^{(p)} = \sum_{k=1}^{n} \sum_{l=1}^{m} c_k^{(l)}\, \omega^{(p-1)(l-1)}\, \omega^{(i-1)(k-1)}, \qquad (2)$$
for $i = 1, 2, \dots, n$ and $p = 1, 2, \dots, m$, where
$$F_n = \frac{1}{\sqrt{n}} \left[\, \omega^{jk} \,\right]_{j,k = 0, 1, \dots, n-1} \qquad (3)$$
with $\omega = e^{\frac{2\pi}{n} i}$ and $i = \sqrt{-1}$.
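As a numerical sanity check of Theorem 1, the sketch below builds a block-circulant matrix whose blocks are themselves circulant and verifies that conjugating it by $F_n \otimes F_m$ yields a matrix that is diagonal up to round-off; the sizes and data are arbitrary.

```python
import numpy as np
from scipy.linalg import circulant, dft

n, m = 4, 3
rng = np.random.default_rng(1)

# Circulant blocks C_1, ..., C_n generated from random first columns.
blocks = [circulant(rng.standard_normal(m)) for _ in range(n)]

# Block-circulant arrangement of Definition 1.
C_nm = np.block([[blocks[(j - i) % n] for j in range(n)] for i in range(n)])

# Unitary DFT matrices; scale='sqrtn' gives the 1/sqrt(n) normalization.
F = np.kron(dft(n, scale='sqrtn'), dft(m, scale='sqrtn'))

D = F @ C_nm @ F.conj().T                      # should be diagonal
off_diagonal = D - np.diag(np.diag(D))
print(np.max(np.abs(off_diagonal)) < 1e-12)    # True: F_n (x) F_m diagonalizes C_nm
```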
In case $C_{nm} \in \mathbb{R}^{nm \times nm}$, we can decompose the diagonalization of $C_{nm}$ [26] as follows:
$$(U_n \otimes U_m)^{T}\, C_{nm}\, (U_n \otimes U_m) = \Omega_{nm} \qquad (4)$$
with
$$U_n = \begin{cases} \left[\, t_0,\ \sqrt{2}\,t_1,\ \dots,\ \sqrt{2}\,t_{h-1},\ t_h,\ \sqrt{2}\,s_{h-1},\ \dots,\ \sqrt{2}\,s_1 \,\right], & \text{if } n = 2h, \\ \left[\, t_0,\ \sqrt{2}\,t_1,\ \dots,\ \sqrt{2}\,t_h,\ \sqrt{2}\,s_h,\ \dots,\ \sqrt{2}\,s_1 \,\right], & \text{if } n = 2h+1, \end{cases} \qquad (5)$$
where $t_k$ and $s_k$ are the real and imaginary parts of the columns of $F_n = [f_0, f_1, \dots, f_h, \dots]$, and
$$\Omega_{nm} = \begin{pmatrix} q_1 & & & & & & & \\ & q_2 & & & & & & s_2 \\ & & \ddots & & & & ⋰ & \\ & & & q_h & & s_h & & \\ & & & & q_{h+1} & & & \\ & & & -s_h & & q_h & & \\ & & ⋰ & & & & \ddots & \\ & -s_2 & & & & & & q_2 \end{pmatrix}$$
for $n = 2h$, while for $n = 2h+1$ it is
$$\Omega_{nm} = \begin{pmatrix} q_1 & & & & & & \\ & q_2 & & & & & s_2 \\ & & \ddots & & & ⋰ & \\ & & & q_{h+1} & s_{h+1} & & \\ & & & -s_{h+1} & q_{h+1} & & \\ & & ⋰ & & & \ddots & \\ & -s_2 & & & & & q_2 \end{pmatrix}$$
with $q_i$ the $m \times m$ matrix whose diagonal carries the entries $\sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(1)},\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(2)},\ \dots,\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(r)},\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(r+1)},\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(r)},\ \dots,\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(2)}$ and whose anti-diagonal carries $\sum_{k=1}^{n} a_k^{(i)} \beta_k^{(2)},\ \dots,\ \sum_{k=1}^{n} a_k^{(i)} \beta_k^{(r)},\ -\sum_{k=1}^{n} a_k^{(i)} \beta_k^{(r)},\ \dots,\ -\sum_{k=1}^{n} a_k^{(i)} \beta_k^{(2)}$ when $m = 2r$ (with the analogous arrangement, running up to the index $r+1$, when $m = 2r+1$), and $s_i$ the matrix of the same form with $b_k^{(i)}$ in place of $a_k^{(i)}$, where $\alpha, \beta$ and $a, b$ denote the real and imaginary parts of the eigenvalues of the decomposition factors of $C_{nm}$.
Definition 2
([28]). The Discrete Cosine Transform (DCT) I and V matrices are defined as follows:
$$C^{I}_{n+1} = \sqrt{\frac{2}{n}} \left[\, \tau_j \tau_k \cos\frac{jk\pi}{n} \,\right]_{j,k=0}^{n}$$
$$C^{V}_{n} = \sqrt{\frac{2}{2n-1}} \left[\, \tau_j \tau_k \cos\frac{2jk\pi}{2n-1} \,\right]_{j,k=0}^{n-1}$$
with
$$\tau_l\ (l = j, k) = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } l = 0 \text{ or } l = n, \\ 1, & \text{otherwise}, \end{cases} \qquad \iota_k = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } k = n-1, \\ 1, & \text{otherwise}. \end{cases}$$
Definition 3
([28]). The Discrete Sine Transform (DST) I and V matrices are defined as follows:
$$S^{I}_{n-1} = \sqrt{\frac{2}{n}} \left[\, \sin\frac{jk\pi}{n} \,\right]_{j,k=1}^{n-1}$$
$$S^{V}_{n-1} = \sqrt{\frac{2}{2n-1}} \left[\, \sin\frac{2jk\pi}{2n-1} \,\right]_{j,k=1}^{n-1}$$
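The sketch below transcribes the entry-wise formulas of Definitions 2 and 3 into NumPy for the type-I matrices and checks that they are orthogonal; it is a plain construction, not a fast transform.

```python
import numpy as np

def dct_I(n):
    """(n+1) x (n+1) DCT-I matrix C^I_{n+1} of Definition 2."""
    tau = lambda l: 1 / np.sqrt(2) if l in (0, n) else 1.0
    weights = np.array([[tau(j) * tau(k) for k in range(n + 1)] for j in range(n + 1)])
    j, k = np.meshgrid(np.arange(n + 1), np.arange(n + 1), indexing='ij')
    return np.sqrt(2 / n) * weights * np.cos(j * k * np.pi / n)

def dst_I(n):
    """(n-1) x (n-1) DST-I matrix S^I_{n-1} of Definition 3."""
    j, k = np.meshgrid(np.arange(1, n), np.arange(1, n), indexing='ij')
    return np.sqrt(2 / n) * np.sin(j * k * np.pi / n)

n = 8
C, S = dct_I(n), dst_I(n)
print(np.allclose(C @ C.T, np.eye(n + 1)))  # True: C^I_{n+1} is orthogonal
print(np.allclose(S @ S.T, np.eye(n - 1)))  # True: S^I_{n-1} is orthogonal
```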
In the following theorem, we will see that the matrix $U_n$, defined in (5), can be partitioned into a matrix generated by the DCT-DST matrices.
Theorem 2
([28]). Let $U_n$ be the matrix stated in (5). Then $U_n$ can be partitioned into the following form:
$$U_n = \begin{cases} \begin{pmatrix} \sigma_1 q_{h+1}^{T} & 0 \\ C & \frac{\sqrt{2}}{2} S^{I}_{h-1} J_{h-1} \\ \sigma_1 v_{h+1}^{T} & 0 \\ J_{h-1} C & \frac{\sqrt{2}}{2} J_{h-1} S^{I}_{h-1} J_{h-1} \end{pmatrix}, & \text{if } n = 2h, \\[2ex] \begin{pmatrix} \sigma_1 p_{h+1}^{T} & 0 \\ C & \frac{\sqrt{2}}{2} S^{V}_{h} J_{h} \\ J_{h} C & \frac{\sqrt{2}}{2} J_{h} S^{V}_{h} J_{h} \end{pmatrix}, & \text{if } n = 2h+1. \end{cases}$$
Definition 4.
Let $C_i^{(p)}$ be an $m \times m$ $g$-circulant matrix, i.e., a classical circulant matrix with each row shifted $g$ positions to the right. An $nm \times nm$ matrix $C_{nm}^{(g)}$ is a block $g$-circulant matrix if it is generated by $C_i^{(p)}$, $i = 1, 2, \dots, n$; $p = 1, 2, \dots, m$, with the block rows likewise shifted $g$ positions to the right.
Lemma 1.
Let $C_{nm}^{(g)} \in \mathbb{C}^{nm \times nm}$ be a block $g$-circulant matrix with $n$ blocks, each block of order $m$. Then $C_{nm}^{(g)}$ can be written as
$$C_{nm}^{(g)} = Z_{nm}^{(g)} C_{nm} \qquad (11)$$
where $C_{nm}$ is the block $g$-circulant matrix with $g = 1$ and
$$Z_{nm}^{(g)} = Z_{n,g} \otimes Z_{m,g} \qquad (12)$$
where
$$Z_{n,g} = \left[\, \delta_{(gr - s) \bmod n} \,\right]_{r,s=0}^{n-1} \qquad (13)$$
$$\delta_k = \begin{cases} 1, & \text{if } k \equiv 0 \pmod n, \\ 0, & \text{otherwise}, \end{cases} \qquad (14)$$
and by (1) we have
$$C_{nm}^{(g)} = Z_{nm}^{(g)} (F_n \otimes F_m)^{*} D_{nm} (F_n \otimes F_m). \qquad (15)$$
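To illustrate Lemma 1 in the simplest scalar setting (m = 1), the sketch below builds $Z_{n,g}$ from its definition and checks that $Z_{n,g} C$ equals the g-circulant matrix obtained by shifting each row of the ordinary circulant $C$ by g positions; the values of n, g, and the generating row are arbitrary.

```python
import numpy as np
from scipy.linalg import circulant

def Z(n, g):
    """Z_{n,g}[r, s] = 1 if s = g*r (mod n), else 0."""
    M = np.zeros((n, n))
    for r in range(n):
        M[r, (g * r) % n] = 1.0
    return M

def g_circulant(first_row, g):
    """g-circulant: row r is the first row cyclically shifted g*r positions to the right."""
    n = len(first_row)
    return np.array([np.roll(first_row, g * r) for r in range(n)])

n, g = 6, 2                          # (n, g) need not be coprime
c = np.arange(1.0, n + 1.0)
C1 = circulant(c).T                  # ordinary (1-)circulant with first row c
print(np.allclose(Z(n, g) @ C1, g_circulant(c, g)))   # True: C^(g) = Z_{n,g} C
```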
Lemma 2.
Let $n, m \in \mathbb{Z}$ with $n, m > 2$, and let $n_g = n/(n,g)$ and $m_g = m/(m,g)$, where $(n,g)$ and $(m,g)$ denote the greatest common divisors of $n$ and $m$, respectively, with $g$. Then we have
$$Z_{nm}^{(g)} = \left[\, \widetilde{Z}_{nm}^{(g)} \;\big|\; \widetilde{Z}_{nm}^{(g)} \;\big|\; \cdots \;\big|\; \widetilde{Z}_{nm}^{(g)} \,\right]^{T}, \qquad (n,g) \times (m,g) \text{ times},$$
where $Z_{nm}^{(g)}$ is the matrix defined in (12) and $\widetilde{Z}_{nm}^{(g)} \in \mathbb{C}^{(n_g m_g) \times nm}$ is the submatrix of $Z_{nm}^{(g)}$ obtained by considering only its first $n_g m_g$ rows.
Lemma 3
([33]). Let $g \geq n$. Then
(a) $Z_{n,g} = Z_{n,g_o}$,
(b) $C_{n,g} = C_{n,g_o}$,
(c) $M_{n,g} = M_{n,g_o}$,
(d) $M_{n,g} M_{n,h} = M_{n,gh}$,
where $g_o \equiv g \pmod n$.
Lemma 4.
Let $Z_{n,g}$ be the matrix defined in (13) and let $F_n$ be the Fourier matrix defined in (3). Then we have
$$Z_{n,g} = F_n^{*} M_{n,g} F_n \qquad (17)$$
where
$$M_{n,g} = \left[\, \delta_{r - gc} \,\right]_{r,c=0}^{n-1}$$
and $\delta_k$ is defined as in (14).
Lemma 5.
Let $g \geq n, m$. Then $Z_{nm}^{(g)} = Z_{nm}^{(g_o)}$, where $g_o = g \bmod (nm)$. As a consequence we have $C_{nm}^{(g)} = C_{nm}^{(g_o)}$.
Proof. 
From (12) and by Lemma 3, we have that
$$Z_{nm}^{(g)} = Z_{n,g} \otimes Z_{m,g} = Z_{n,g_o} \otimes Z_{m,g_o} = Z_{nm}^{(g_o)}$$
and so from (11) we have the equality $C_{nm}^{(g)} = Z_{nm}^{(g)} C_{nm} = Z_{nm}^{(g_o)} C_{nm} = C_{nm}^{(g_o)}$. □
Lemma 6.
Let $Z_{nm}^{(g)}$ be the matrix defined in (12) and let $F_n$ be as defined in (3). Then
$$Z_{nm}^{(g)} = (F_n \otimes F_m)^{*} M_{nm}^{(g)} (F_n \otimes F_m) \qquad (20)$$
where
$$M_{nm}^{(g)} = M_{n,g} \otimes M_{m,g} = \left[\, \delta_{r - gs} \,\right]_{r,s=0}^{n-1} \otimes \left[\, \delta_{r - gs} \,\right]_{r,s=0}^{m-1} \qquad (21)$$
and $\delta_k$ is defined as in (14).
Proof. 
It suffices to show that $(F_n \otimes F_m) Z_{nm}^{(g)} = M_{nm}^{(g)} (F_n \otimes F_m)$.
Using (12), Kronecker product properties, and (17), we have
$$(F_n \otimes F_m) Z_{nm}^{(g)} = (F_n \otimes F_m)(Z_{n,g} \otimes Z_{m,g}) = F_n Z_{n,g} \otimes F_m Z_{m,g} = M_{n,g} F_n \otimes M_{m,g} F_m = (M_{n,g} \otimes M_{m,g})(F_n \otimes F_m) = M_{nm}^{(g)} (F_n \otimes F_m).$$
In the following lemma, we will see that Lemma 5 also holds for the matrix $M_{nm}^{(g)}$.
Lemma 7.
If $g \geq n, m$, then $M_{nm}^{(g)} = M_{nm}^{(g_o)}$ with $g_o = g \bmod (nm)$.
Proof. 
By using (20) we have
$$Z_{nm}^{(g)} = (F_n \otimes F_m)^{*} M_{nm}^{(g)} (F_n \otimes F_m), \qquad Z_{nm}^{(g_o)} = (F_n \otimes F_m)^{*} M_{nm}^{(g_o)} (F_n \otimes F_m).$$
By Lemma 5, we deduce
$$(F_n \otimes F_m)^{*} M_{nm}^{(g)} (F_n \otimes F_m) = (F_n \otimes F_m)^{*} M_{nm}^{(g_o)} (F_n \otimes F_m)$$
and so $M_{nm}^{(g)} = M_{nm}^{(g_o)}$. □
Lemma 8.
Let $D_{nm} \in \mathbb{C}^{nm \times nm}$ be a diagonal matrix
$$D_{nm} = \operatorname{diag}_{\substack{j = 0, \dots, n-1 \\ p = 0, \dots, m-1}} \lambda_j^{(p)},$$
and let $M_{nm}^{(g)}$ be the matrix defined in (21). Then
$$D_{nm} M_{nm}^{(g)} = M_{nm}^{(g)} \widetilde{D}_{nm}^{(g)} \qquad (22)$$
where $\widetilde{D}_{nm}^{(g)} = \operatorname{diag}_{\substack{j = 0, \dots, n-1 \\ p = 0, \dots, m-1}} \lambda_{(gj) \bmod n}^{((gp) \bmod m)}$.
Lemma 9.
Let $C_{nm}^{(g)} \in \mathbb{C}^{nm \times nm}$ be a block $g$-circulant matrix and $C_{nm}^{(h)} \in \mathbb{C}^{nm \times nm}$ a block $h$-circulant matrix. Then $C_{nm}^{(g)} C_{nm}^{(h)} \in \mathbb{C}^{nm \times nm}$ is a block $gh$-circulant matrix.
Proof. 
Using (11) and (15) we have
$$C_{nm}^{(g)} = Z_{nm}^{(g)} C_{nm} = Z_{nm}^{(g)} (F_n \otimes F_m)^{*} D_{nm}^{(1)} (F_n \otimes F_m)$$
and
$$C_{nm}^{(h)} = Z_{nm}^{(h)} C_{nm} = Z_{nm}^{(h)} (F_n \otimes F_m)^{*} D_{nm}^{(2)} (F_n \otimes F_m).$$
By (20), it follows that
$$C_{nm}^{(g)} C_{nm}^{(h)} = (F_n \otimes F_m)^{*} M_{nm}^{(g)} D_{nm}^{(1)} M_{nm}^{(h)} D_{nm}^{(2)} (F_n \otimes F_m).$$
Using (22), the last equation becomes
$$C_{nm}^{(g)} C_{nm}^{(h)} = (F_n \otimes F_m)^{*} M_{nm}^{(g)} M_{nm}^{(h)} \widetilde{D}_{nm}^{(h)(1)} D_{nm}^{(2)} (F_n \otimes F_m),$$
where $\widetilde{D}_{nm}^{(h)(1)}$ denotes the diagonal matrix obtained from $D_{nm}^{(1)}$ by applying (22) with $M_{nm}^{(h)}$.
Furthermore, by (21) and Kronecker product properties, we have
$$M_{nm}^{(g)} M_{nm}^{(h)} = (M_{n,g} \otimes M_{m,g})(M_{n,h} \otimes M_{m,h}) = (M_{n,g} M_{n,h}) \otimes (M_{m,g} M_{m,h}) = M_{n,gh} \otimes M_{m,gh} = M_{nm}^{(gh)}.$$
So by (20) we obtain
$$C_{nm}^{(g)} C_{nm}^{(h)} = Z_{nm}^{(gh)} (F_n \otimes F_m)^{*} \widetilde{D}_{nm}^{(h)(1)} D_{nm}^{(2)} (F_n \otimes F_m).$$
Since $\widetilde{D}_{nm}^{(h)(1)} D_{nm}^{(2)}$ is a diagonal matrix, the last equation is the representation of a block $gh$-circulant matrix. □

2.1. Eigenvalues of Block g-Circulant Matrices

In this section, we explore the eigenvalues of a block $g$-circulant matrix in the cases $g = 0$, $g = 1$, $(nm, g) \neq 1$, and $(nm, g) = 1$.

2.1.1. Case g = 0

If $g = 0$, then $C_{nm}^{(g)}$ has constant rows within each block and identical block rows; therefore, it has rank 1. Then, remembering that the trace $\operatorname{tr}(\cdot)$ of a matrix is the sum of its eigenvalues, we can conclude that $C_{nm}^{(g)}$ has $nm - 1$ zero eigenvalues and one eigenvalue $\lambda$ different from zero, given by
$$\lambda(C_{nm}^{(g)}) = \sum_{i=1}^{n} \operatorname{tr}(C_i) = \sum_{i=1}^{n} \sum_{p=1}^{m} c_i^{(p)},$$
where $(c_i^{(1)}, c_i^{(2)}, \dots, c_i^{(m)})$ is the generating row of $C_i$ and $c_i^{(p)}$ is the entry of the $i$th block in the $p$th column.
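A quick numerical illustration of the g = 0 case: a block 0-circulant matrix consists of nm identical rows, hence has rank 1, and its single nonzero eigenvalue equals its trace. The sizes and values below are arbitrary.

```python
import numpy as np

n, m = 3, 4
rng = np.random.default_rng(2)

# With g = 0, every row repeats the concatenation of the generating rows of C_1, ..., C_n.
generating_row = rng.standard_normal(n * m)
C0 = np.tile(generating_row, (n * m, 1))

eigvals = np.linalg.eigvals(C0)
lam = eigvals[np.argmax(np.abs(eigvals))]          # the single nonzero eigenvalue
print(np.linalg.matrix_rank(C0))                   # 1
print(np.isclose(lam, np.trace(C0)))               # True: lambda = sum_i tr(C_i)
```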

2.1.2. Case g = 1

If $g = 1$, then $C_{nm}^{(g)}$ is a "classical" block circulant matrix as defined in Definition 1, and its eigenvalues are given by (2):
$$\lambda_i^{(p)} = \sum_{k=1}^{n} \sum_{l=1}^{m} c_k^{(l)}\, \omega^{(p-1)(l-1)}\, \omega^{(i-1)(k-1)}.$$

2.1.3. Case $(nm, g) \neq 1$

We will use the following theorem to calculate the eigenvalues of a singular matrix $A \in \mathbb{C}^{n \times n}$ with $\operatorname{rank}(A) = r \leq n$ by computing the eigenvalues of a smaller matrix in $\mathbb{C}^{k \times k}$ with $r \leq k \leq n$.
Theorem 3
([34]). Let $A$ be a matrix of dimension $n \times n$, $A \in \mathbb{C}^{n \times n}$, which can be written as $A = XY$, where $X \in \mathbb{C}^{n \times k}$ and $Y \in \mathbb{C}^{k \times n}$, with $k \leq n$. Then the eigenvalues of the matrix $A$ are given by the $k$ eigenvalues of the matrix $YX \in \mathbb{C}^{k \times k}$ together with $n - k$ zero eigenvalues:
$$\operatorname{Eig}(A) = \operatorname{Eig}(YX) \cup \{\, 0 \ \text{with geometric multiplicity } n - k \,\}.$$
We will apply Theorem 3 while considering that the matrix $C_{nm}^{(g)}$ is singular. By Lemma 2 we can write
$$Z_{nm}^{(g)} = \left[\, \widetilde{Z}_{nm}^{(g)} \,\big|\, \cdots \,\big|\, \widetilde{Z}_{nm}^{(g)} \,\right]^{T} = \left( \left[\, I_{n_g} \,|\, \cdots \,|\, I_{n_g} \,\right]^{T} \otimes \left[\, I_{m_g} \,|\, \cdots \,|\, I_{m_g} \,\right]^{T} \right) \left( \widetilde{Z}_{n,g} \otimes \widetilde{Z}_{m,g} \right) = I_{nm}^{(g)} \widetilde{Z}_{nm}^{(g)},$$
where $\widetilde{Z}_{nm}^{(g)} \in \mathbb{C}^{(n_g m_g) \times nm}$ and $I_{nm}^{(g)}$ is the $nm \times (n_g m_g)$ matrix formed by stacking copies of the identity matrix. Now we can rewrite (11) as below:
$$C_{nm}^{(g)} = Z_{nm}^{(g)} C_{nm} = I_{nm}^{(g)} \widetilde{Z}_{nm}^{(g)} C_{nm} = I_{nm}^{(g)} \widetilde{C}_{nm}^{(g)},$$
where $\widetilde{C}_{nm}^{(g)} = \widetilde{Z}_{nm}^{(g)} C_{nm} \in \mathbb{C}^{(n_g m_g) \times nm}$.
By Theorem 3 we find that the eigenvalues of $C_{nm}^{(g)}$ are equal to those of $\widetilde{C}_{nm}^{(g)} I_{nm}^{(g)} \in \mathbb{C}^{(n_g m_g) \times (n_g m_g)}$, plus $nm - n_g m_g$ null eigenvalues:
$$\operatorname{Eig}(C_{nm}^{(g)}) = \operatorname{Eig}(\widetilde{C}_{nm}^{(g)} I_{nm}^{(g)}) \cup \{\, 0 \ \text{with geometric multiplicity } nm - n_g m_g \,\}.$$
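Theorem 3 is the standard fact that XY and YX share their nonzero spectrum, which is exactly what reduces the eigenvalue problem for $C_{nm}^{(g)}$ to the smaller matrix $\widetilde{C}_{nm}^{(g)} I_{nm}^{(g)}$. The sketch below checks the statement numerically on random factors of arbitrary size by comparing eigenvalue moduli.

```python
import numpy as np

n, k = 8, 3
rng = np.random.default_rng(3)
X = rng.standard_normal((n, k))      # tall factor, playing the role of I_nm^(g)
Y = rng.standard_normal((k, n))      # wide factor, playing the role of C~_nm^(g)

mods_big = np.sort(np.abs(np.linalg.eigvals(X @ Y)))[::-1]    # n eigenvalues
mods_small = np.sort(np.abs(np.linalg.eigvals(Y @ X)))[::-1]  # k eigenvalues

# The n eigenvalues of XY are the k eigenvalues of YX plus n - k zeros.
print(np.allclose(mods_big[:k], mods_small))   # True
print(np.all(mods_big[k:] < 1e-8))             # True: the remaining ones vanish up to round-off
```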

2.1.4. Case $(nm, g) = 1$

When $nm$ and $g$ are coprime, the following lemma provides a straightforward formula for computing the moduli of the eigenvalues of a block $g$-circulant matrix $C_{nm}^{(g)}$. The method draws upon the classical eigenvalue computation for the block circulant matrix $C_{nm}$.
Lemma 10.
Let $C_{nm}^{(g)} \in \mathbb{C}^{nm \times nm}$ be a block $g$-circulant matrix such that $(nm, g) = 1$, and write
$$C_{nm}^{(g)} = Z_{nm}^{(g)} (F_n \otimes F_m)^{*} D_{nm} (F_n \otimes F_m)$$
with
$$D_{nm} = \operatorname{diag}_{\substack{j = 0, \dots, n-1 \\ p = 0, \dots, m-1}} \lambda_j^{(p)}.$$
Then the moduli of the eigenvalues of $C_{nm}^{(g)}$ are given by
$$\left| \lambda_j^{(p)}(C_{nm}^{(g)}) \right| = \sqrt[s]{\left| \prod_{k=0}^{s-1} \lambda_{(g^{k} j) \bmod n}^{((g^{k} p) \bmod m)} \right|},$$
with $j = 0, 1, \dots, n-1$ and $p = 0, 1, \dots, m-1$, where $s \in \mathbb{N}^{+}$ is such that $g^{s} \equiv 1 \pmod{nm}$.
Proof. 
By Lemma 9, if $C_{nm}^{(g)}$ is a block $g$-circulant matrix, then $(C_{nm}^{(g)})^{r}$ is a block $g^{r}$-circulant matrix for every $r \in \mathbb{Z}^{+}$. By Lemma 5, $(C_{nm}^{(g)})^{r}$ is also a block $\tilde{g}_r$-circulant matrix, where $\tilde{g}_r \equiv g^{r} \pmod{nm}$. Since $(nm, g) = 1$, there is $s \in \mathbb{Z}^{+}$ such that $g^{s} \equiv 1 \pmod{nm}$, so $(C_{nm}^{(g)})^{s}$ is a block circulant matrix. Notice that the moduli of the eigenvalues of $C_{nm}^{(g)}$ are the moduli of the $s$-th roots of the eigenvalues of the block circulant matrix $(C_{nm}^{(g)})^{s}$. From Equations (15) and (20) we obtain
$$(C_{nm}^{(g)})^{s} = (F_n \otimes F_m)^{*} (M_{nm}^{(g)} D_{nm})^{s} (F_n \otimes F_m).$$
Since $(F_n \otimes F_m)(F_n \otimes F_m)^{*} = I$, we have
$$\operatorname{Eig}\big( (C_{nm}^{(g)})^{s} \big) = \operatorname{Eig}\big( (M_{nm}^{(g)} D_{nm})^{s} \big).$$
Using Equation (22), the fact that $D_{nm} = \widetilde{D}_{nm}^{(g^{0})}$, and writing $\widetilde{D}_{nm}^{(g^{k})} = \operatorname{diag}_{\substack{j = 0, \dots, n-1 \\ p = 0, \dots, m-1}} \lambda_{(g^{k} j) \bmod n}^{((g^{k} p) \bmod m)}$, we obtain, by repeatedly moving the diagonal factors to the right through $M_{nm}^{(g)}$,
$$(M_{nm}^{(g)} D_{nm})^{s} = M_{nm}^{(g)} \big( D_{nm} M_{nm}^{(g)} \big)^{s-1} D_{nm} = M_{nm}^{(g)} \big( M_{nm}^{(g)} \widetilde{D}_{nm}^{(g)} \big)^{s-1} \widetilde{D}_{nm}^{(g^{0})} = (M_{nm}^{(g)})^{2} \big( \widetilde{D}_{nm}^{(g)} M_{nm}^{(g)} \big)^{s-2} \widetilde{D}_{nm}^{(g)} \widetilde{D}_{nm}^{(g^{0})} = \cdots = (M_{nm}^{(g)})^{s}\; \widetilde{D}_{nm}^{(g^{s-1})} \cdots \widetilde{D}_{nm}^{(g^{2})}\, \widetilde{D}_{nm}^{(g)}\, \widetilde{D}_{nm}^{(g^{0})}.$$
Using Lemma (9), Equation (2), and the fact that $g^{s} \equiv 1 \pmod{nm}$, we have
$$(M_{nm}^{(g)})^{s} = M_{nm}^{(g)} \cdot M_{nm}^{(g)} \cdots M_{nm}^{(g)} = M_{nm}^{(g^{s})} = M_{nm}^{(g^{s} \bmod nm)} = M_{nm}^{(1)} = I_{nm}.$$
Hence $(M_{nm}^{(g)} D_{nm})^{s}$ is a diagonal matrix, and its eigenvalues are given by the diagonal elements of $\widetilde{D}_{nm}^{(g^{s-1})} \cdots \widetilde{D}_{nm}^{(g^{2})} \widetilde{D}_{nm}^{(g)} \widetilde{D}_{nm}^{(g^{0})}$, i.e., from (22),
$$\lambda_j^{(p)}\big( (M_{nm}^{(g)} D_{nm})^{s} \big) = \lambda_{(g^{s-1} j) \bmod n}^{((g^{s-1} p) \bmod m)} \cdots \lambda_{(g^{2} j) \bmod n}^{((g^{2} p) \bmod m)}\, \lambda_{(g j) \bmod n}^{((g p) \bmod m)}\, \lambda_{(g^{0} j) \bmod n}^{((g^{0} p) \bmod m)} = \prod_{k=0}^{s-1} \lambda_{(g^{k} j) \bmod n}^{((g^{k} p) \bmod m)},$$
with $p = 0, \dots, m-1$ and $j = 0, \dots, n-1$. Hence the moduli of the eigenvalues of $C_{nm}^{(g)}$ are the moduli of the $s$-th roots of the eigenvalues of the block circulant matrix $(C_{nm}^{(g)})^{s}$. □
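The sketch below checks the scalar instance (m = 1) of Lemma 10 numerically: for a g-circulant matrix with (n, g) = 1, the moduli of its eigenvalues equal the s-th roots of the moduli of the products $\prod_{k=0}^{s-1}\lambda_{(g^k j)\bmod n}$, where the $\lambda_j$ are the DFT eigenvalues of the underlying circulant. The choices of n, g, and the generating row are arbitrary.

```python
import numpy as np

n, g = 7, 3                                   # gcd(n, g) = 1
rng = np.random.default_rng(4)
c = rng.standard_normal(n)                    # generating (first) row

# g-circulant matrix: row r is the first row shifted g*r positions to the right.
Cg = np.array([np.roll(c, g * r) for r in range(n)])

# s = multiplicative order of g modulo n.
s = 1
while pow(g, s, n) != 1:
    s += 1

lam = np.fft.fft(c)                           # eigenvalues of the ordinary circulant
predicted = np.array([
    abs(np.prod([lam[(pow(g, k, n) * j) % n] for k in range(s)])) ** (1.0 / s)
    for j in range(n)
])
actual = np.abs(np.linalg.eigvals(Cg))
print(np.allclose(np.sort(actual), np.sort(predicted)))   # True
```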
Next, we present the decomposition of a real block $g$-circulant matrix in terms of an orthogonal matrix $U_{nm}^{(g)}$. We also define an orthogonal matrix $Q_{nm}^{(g)}$ whose product with $U_{nm}^{(g)}$ will be used in the DCT-DST algorithm.
Theorem 4.
Let $C_{nm}^{(g)} \in \mathbb{R}^{nm \times nm}$ be a real block $g$-circulant matrix, and define $U_{nm}^{(g)} \in \mathbb{R}^{nm \times nm}$ as
$$U_{nm}^{(g)} = (U_n \otimes U_m)\,(P_{n,\,n-g+1} \otimes P_{m,\,m-g+1}), \qquad (29)$$
where $U_t$ is the matrix of dimension $t$ defined in (5) and $P_{t,\,t-g+1}$ denotes the identity matrix of dimension $t$ whose $t$th and $(t-g+1)$th columns are exchanged. A straightforward calculation shows that
$$C_{nm}^{(g)} = U_{nm}^{(g)}\, \Omega_{nm}^{(g)}\, (U_{nm}^{(g)})^{T}$$
where
$$\Omega_{nm}^{(g)} = (P_{nm}^{(g)})^{T} N_{nm}^{(g)} P_{nm}^{(g)},$$
$P_{nm}^{(g)} = P_{n,\,n-g+1} \otimes P_{m,\,m-g+1}$, $N_{nm}^{(g)} = U_{nm}^{T} Z_{nm}^{(g)} C_{nm} U_{nm}$, and $U_{nm} = U_n \otimes U_m$.
Definition 5.
Let $U_m$ and $P_{m,\,m-g+1}$ be the matrices defined in (5) and Theorem 4, respectively. The orthogonal matrix $Q_{nm}^{(g)} \in \mathbb{R}^{nm \times nm}$ is defined as
$$Q_{nm}^{(g)} = Q_n \otimes \big( U_m P_{m,\,m-g+1} \big), \qquad (32)$$
where
$$Q_n = \begin{cases} \dfrac{1}{\sqrt{2}} \begin{pmatrix} \sqrt{2} & 0 & 0 & 0 \\ 0 & I_{h-1} & 0 & J_{h-1} \\ 0 & 0 & \sqrt{2} & 0 \\ 0 & J_{h-1} & 0 & -I_{h-1} \end{pmatrix}, & \text{if } n = 2h, \\[3ex] \dfrac{1}{\sqrt{2}} \begin{pmatrix} \sqrt{2} & 0 & 0 \\ 0 & I_{h} & J_{h} \\ 0 & J_{h} & -I_{h} \end{pmatrix}, & \text{if } n = 2h+1, \end{cases}$$
as stated in [28].
Theorem 5.
Let $U_{nm}^{(g)}$ and $Q_{nm}^{(g)}$ be as defined in (29) and (32), respectively. Then we have
$$Q_{nm}^{(g)} U_{nm}^{(g)} = \big( Q_n U_n P_{n,\,n-g+1} \big) \otimes \big( U_m P_{m,\,m-g+1} \big)^{2},$$
where $U_n$ and $U_m$ are as defined in (5) and
$$Q_n U_n = \begin{cases} \begin{pmatrix} C^{I}_{h+1} & 0 \\ 0 & J_{h-1} S^{I}_{h-1} J_{h-1} \end{pmatrix}, & \text{if } n = 2h, \\[2ex] \begin{pmatrix} C^{V}_{h+1} & 0 \\ 0 & J_{h} S^{V}_{h} J_{h} \end{pmatrix}, & \text{if } n = 2h+1. \end{cases}$$

3. Block g-Circulant DCT-DST Transformer

Each encoder or decoder layer of the transformer network comprises one or two multi-head attention units, a position-wise feedforward unit, and sub-layer connections with layer normalization. Both the position-wise feedforward and multi-head attention units multiply their inputs by learned weight matrices. In the original transformer [2], these weight matrices were dense, uncompressed, and randomly initialized. In this study, we compress them using block g-circulant weight matrices.
The block g-circulant DCT-DST transformer discussed in this study is a variation of the original transformer model outlined in [2]. This adaptation involves two essential modifications: firstly, the substitution of a block g-circulant matrix for the conventional dense weight matrix, and secondly, the integration of the DCT-DST matrix-vector multiplication algorithm. This integration enables efficient computation when multiplying the block g-circulant matrix with the vector input of each sublayer (Figure 1).

3.1. Multihead Attention Sublayer

Two input sequences are fed to the multi-head attention sublayer: a key/value sequence $S_K \in \mathbb{R}^{n_K \times d_{model}}$ and a query sequence $S_Q \in \mathbb{R}^{n_Q \times d_{model}}$, where $n_Q$ and $n_K$ are the numbers of queries and keys/values, respectively, and $d_{model}$ denotes the dimensionality of the input and output. Let the number of attention heads be $y$. By projecting the query sequence and key/value sequence with dense weight matrices, we create queries, keys, and values in $\mathbb{R}^{d_k}$, where $d_k = d_{model}/y$. For $i = 1, \dots, y$ and $\alpha = Q, K, V$, we could learn $3y$ separate dense weight matrices $C_{\alpha}^{i} \in \mathbb{R}^{d_k \times d_{model}}$. Instead, for $\alpha = Q, K, V$, we learn three dense weight matrices $C_{\alpha} \in \mathbb{R}^{d_{model} \times d_{model}}$ and slice the resulting products $y$ times. It is important that these weight matrices $C_{\alpha}$ have block size $c_{attn}$ smaller than $d_k$ to avoid correlations between different attention heads. After obtaining these projections, we calculate the scaled dot-product attention for each attention head:
$$\operatorname{attention}\big( Q^{(i)}, K^{(i)}, V^{(i)} \big) = \operatorname{softmax}\!\left( \frac{Q^{(i)} {K^{(i)}}^{T}}{\sqrt{d_k}} \right) V^{(i)}.$$
After computing dot-product attention, the output sequences are concatenated into a sequence of shape $(n_Q, d_{model})$ and projected through a dense projection matrix $C_{proj} \in \mathbb{R}^{d_{model} \times d_{model}}$ with block size $c_{attn}$. Thus, each multi-head attention unit learns four dense weight matrices $C_Q, C_K, C_V, C_{proj}$ of shape $d_{model} \times d_{model}$ and block size $c_{attn} < d_{model}/y$.
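For reference, the scaled dot-product attention above can be written in a few lines of NumPy; this computation is the standard one and does not depend on whether the projection matrices are dense or block g-circulant.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_Q, n_K)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (n_Q, d_k)

n_Q, n_K, d_k = 5, 7, 16
rng = np.random.default_rng(5)
out = scaled_dot_product_attention(rng.standard_normal((n_Q, d_k)),
                                   rng.standard_normal((n_K, d_k)),
                                   rng.standard_normal((n_K, d_k)))
print(out.shape)   # (5, 16): one output row per query
```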

3.2. Block g-Circulant Positionwise Feedforward Sublayer

The block $g$-circulant position-wise feedforward sublayer uses two block $g$-circulant weight matrices. It applies the transformation
$$a = C_2\, \mathrm{ReLU}(C_1 x),$$
where $C_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$ has block size $c_1$, $C_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$ has block size $c_2$, $C_1$ and $C_2$ are block $g$-circulant matrices, and by default $d_{ff} = 4 \times d_{model}$, where $d_{ff}$ denotes the dimensionality of the inner layer. When multiplying $C_1$ and $C_2$ with the input vector, the DCT-DST multiplication algorithm is employed.
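A minimal sketch of the feedforward transformation above, assuming the rectangular weights are parameterized as grids of c × c circulant blocks (the g = 1 case) and applied densely for clarity; in the actual model the products are computed with the DCT-DST algorithm, and all names and sizes here are illustrative.

```python
import numpy as np
from scipy.linalg import circulant

def block_circulant_weight(params):
    """params: array of shape (row_blocks, col_blocks, c) holding one generating
    vector per c x c circulant block; returns the assembled dense weight matrix."""
    rows = [[circulant(params[i, j]) for j in range(params.shape[1])]
            for i in range(params.shape[0])]
    return np.block(rows)

d_model, c = 16, 8                   # block size c divides d_model and d_ff
d_ff = 4 * d_model
rng = np.random.default_rng(6)
C1 = block_circulant_weight(rng.standard_normal((d_ff // c, d_model // c, c)))   # (d_ff, d_model)
C2 = block_circulant_weight(rng.standard_normal((d_model // c, d_ff // c, c)))   # (d_model, d_ff)

x = rng.standard_normal(d_model)
a = C2 @ np.maximum(C1 @ x, 0.0)     # a = C_2 ReLU(C_1 x)
print(a.shape)                       # (16,)
```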

3.3. Block Sizes

The block-size parameters $c_{attn}$, $c_1$, and $c_2$ introduce a three-dimensional hyperparameter space. However, we tie them to a single block size $c_m$ so that we could test models and iterate quickly:
$$c_1 = c_m; \qquad c_2 = 4 \times c_m; \qquad c_{attn} = c_m / y.$$
In this case, the maximum value of $c_m$ is $d_{model}$, at which point the weight matrix is entirely circulant.

3.4. DCT-DST Algorithm

When multiplying the block $g$-circulant weight matrices by an input vector in each transformer layer, we applied the following algorithm, adapted from [28]; a sketch of the data flow is given after the list.
  • Compute $v = Q_{nm}^{(g)} c_1$ directly, where $c_1 = C_{nm}^{(g)} e_1$ and $e_1 = (1, 0, 0, \dots, 0)^{T}$.
  • Compute $\hat{v} = (Q_{nm}^{(g)} U_{nm}^{(g)})^{T} v$ by DCT and DST.
  • Form $\Omega_{nm}^{(g)}$.
  • Compute $y_1 = Q_{nm}^{(g)} x$ directly.
  • Compute $y_2 = (Q_{nm}^{(g)} U_{nm}^{(g)})^{T} y_1$ by DCT and DST.
  • Compute $y_3 = \Omega_{nm}^{(g)} y_2$ directly.
  • Compute $y_4 = (Q_{nm}^{(g)} U_{nm}^{(g)}) y_3$ by DCT and DST.
  • Compute $(Q_{nm}^{(g)})^{T} y_4$, which equals $C_{nm}^{(g)} x$.
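The sketch below mirrors the data flow of the steps above, treating $Q_{nm}^{(g)}$, $U_{nm}^{(g)}$, and $\Omega_{nm}^{(g)}$ as precomputed matrices and applying them densely; in the actual algorithm, steps 2, 5, and 7 are carried out by fast DCT and DST transforms via Theorem 5. The final check uses random orthogonal stand-ins for Q and U and a diagonal stand-in for Ω solely to confirm that the composition reproduces $Cx = U\Omega U^T x$; it does not construct the true factors of a block g-circulant matrix.

```python
import numpy as np

def dct_dst_matvec(Q, U, Omega, x):
    """Data flow of the algorithm above: returns C x for C = U Omega U^T.
    The products with (Q U)^T and (Q U) are the ones realized by fast
    DCT/DST transforms in the actual implementation."""
    QU = Q @ U
    y1 = Q @ x          # step 4
    y2 = QU.T @ y1      # step 5
    y3 = Omega @ y2     # step 6
    y4 = QU @ y3        # step 7
    return Q.T @ y4     # step 8: equals C x

# Consistency check with stand-in factors (NOT the true block g-circulant factors).
rng = np.random.default_rng(7)
N = 12
U, _ = np.linalg.qr(rng.standard_normal((N, N)))       # orthogonal stand-in for U_nm^(g)
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))       # orthogonal stand-in for Q_nm^(g)
Omega = np.diag(rng.standard_normal(N))                # diagonal stand-in for Omega_nm^(g)
x = rng.standard_normal(N)

C = U @ Omega @ U.T
print(np.allclose(dct_dst_matvec(Q, U, Omega, x), C @ x))   # True
```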

4. Experiment and Result

4.1. Data and Experimental Details

This experiment was conducted using the Portuguese-English datasets from the TED Talks Open Translation Project, loaded via TensorFlow Datasets. The TED Open Translation Project is a pioneering endeavor by a prominent media platform to subtitle and comprehensively catalog online video content, and it represents a groundbreaking initiative in leveraging volunteer translation for public and professional purposes. The Portuguese-English transcript is just one of over 2400 talks spanning 109 languages available under the project, and Portuguese and English are among the top three languages with the highest number of talks in the TED Talks collection. The dataset contains approximately 52,000 training examples (Portuguese-English sentence pairs), 1200 validation examples, and 1800 test examples.
The dataset was then tokenized using the Moses tokenizer, as in the original transformer [2]. We chose this tokenizer because it is one of the two widely recognized rule-based tokenizers extensively used in machine translation and natural language processing experiments; according to [35], the Moses tokenizer has demonstrated superior performance over several other tokenizers, specifically in neural machine translation. We used the code from the TensorFlow.org tutorial on neural machine translation with a transformer and Keras. Our base model and optimizer setups differ slightly from the original transformer model. Unlike [2], each model applies four layers instead of six, with eight attention heads and a dropout rate of 0.1. We set a batch size of 64 and trained for 20 epochs. The feedforward dimension is four times the model dimension. As in [2], we used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. The details of the experiment are provided in Table 1.
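For concreteness, a hedged sketch of the training configuration described above in TensorFlow/Keras; the learning-rate value is a placeholder, since our runs follow the warmup schedule of the TensorFlow transformer tutorial rather than a fixed rate.

```python
import tensorflow as tf

# Training hyperparameters used in the experiments (see Table 1).
NUM_LAYERS = 4
NUM_HEADS = 8
DROPOUT_RATE = 0.1
BATCH_SIZE = 64
EPOCHS = 20
D_MODEL = 128            # also run with 16, 32, 64, and 256
D_FF = 4 * D_MODEL

# Adam optimizer as in [2]; the fixed learning rate below is only a placeholder
# for the warmup schedule used in the TensorFlow transformer tutorial.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3,
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```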
The model dimension takes various values depending on the size of the weight matrices being tested. The weight matrix sizes are combinations of $n$ and $m$ such that, in the main setting, a block $g$-circulant matrix of dimension 128 is obtained. Choosing a 128-dimensional matrix follows the findings of [26]. We also experimented with other matrix dimensions for comparison. We applied several values of $g$: $g = 1$ and other values such that $(nm, g) \neq 1$ or $(nm, g) = 1$, where $(nm, g)$ denotes the greatest common divisor of $nm$ and $g$.
The model names refer to the types of matrices and algorithms employed in the multi-head attention and feedforward sublayers. We used two types of matrices (dense and real block circulant) in conjunction with the DCT-DST algorithm. For instance, the Dense-Block 2-Circulant DCT-DST transformer model applies dense matrices in the multi-head attention sublayer and real block 2-circulant matrices with the DCT-DST algorithm in the feedforward sublayer. In this experiment, we trained three transformer models with block $g$-circulant weight matrices and one original transformer model. For each model, experiments were conducted with weight matrices across five dimensions: 16, 32, 64, 128, and 256 (Table 2).

4.2. Evaluation

We evaluated performance on a held-out set of 500 samples using the corpus BLEU (Bilingual Evaluation Understudy) score [36]. A held-out set of this size is large enough to make the reported comparisons meaningful.
BLEU serves as a metric designed to evaluate machine-translated text automatically. It quantifies the similarity between the machine-translated text and a set of high-quality reference translations, generating a score ranging from zero to one. BLEU’s notable strength is its strong correlation with human judgments. It achieves this by averaging individual sentence judgment errors across a test corpus rather than attempting to precisely determine human judgment for every single sentence [36,37]. In our study, the corpus BLEU score employed the English sentence as its single reference and the top English sentence output of beam search as the hypothesis for each pair of Portuguese and English sentences in the evaluation set. Aggregating references and hypotheses across all pairings produced the corpus BLEU.
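A sketch of the corpus-BLEU evaluation described above using the sacreBLEU package [37]; the sentences are placeholders, and the actual evaluation pairs each of the 500 held-out hypotheses with its single English reference.

```python
import sacrebleu

# hypotheses: top beam-search English outputs; references: the single English reference per sentence.
hypotheses = ["the cat sat on the mat .", "he went home ."]
references = ["the cat sat on the mat .", "he returned home ."]

# corpus_bleu takes the list of hypotheses and a list of reference streams
# (here one stream, because each sentence has a single reference).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"corpus BLEU = {bleu.score:.2f}")   # reported on a 0-100 scale
```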

4.3. Result and Discussion

Based on the analysis of multiple model samples displayed in Figure 2, it is evident that the models avoid overfitting, demonstrating their capability for predictive tasks. Reviewing the experimental outcomes outlined in Table 3, the Dense-Block 1-Circulant DCT-DST model notably surpasses the Dense-Dense model in BLEU score and model memory efficiency. Specifically, it achieves a 4.1% higher BLEU score and demonstrates a 22.14% improvement in model memory utilization. However, on the other metrics, the block g-circulant matrix-based models still lag behind the Dense-Dense model. During testing on the test dataset, the Dense-Dense model excels notably in test duration. Furthermore, across similar matrix dimensions, all models achieve loss values that exhibit only marginal differences. Notably, the block g-circulant model group with (nm, g) = 1 tends to outperform the group where (nm, g) ≠ 1, both in terms of loss and accuracy. Figure 3, Figure 4, Figure 5 and Figure 6 provide a holistic view of the comparison among diverse g values, highlighting the variations in loss, accuracy, BLEU score, and model memory size.
In general, the transformer model employing the block g-circulant weight matrix has a more efficient model memory size than the Dense-Dense model. This enhanced efficiency can be credited to the block g-circulant matrix, a structured matrix falling within the category of low displacement rank (LDR) matrices [12]. This finding corroborates earlier research detailed in [26,28,29], which similarly highlighted the benefits of block circulant and circulant matrices through experimental evidence. Block g-circulant matrices allow us to leverage the concept of data sparsity. Data sparsity implies that representing an n × n matrix requires fewer than O(n²) parameters. Unlike traditional sparse matrices, data-sparse matrices are not required to contain zero entries; instead, a relationship exists among the matrix entries. Moreover, efficient algorithms, such as computing the matrix-vector product with any vector, can be achieved with fewer than O(n²) operations [12]. This approach is expected to reduce the number of model training parameters used in the experiment, consequently diminishing the demand for storage space. The reduced storage requirement is also believed to stem from implementing the DCT-DST algorithm, elaborated upon in [28].
Additionally, the relatively long test duration observed in the block g-circulant model experiments likely arises from employing a matrix-vector multiplication algorithm with more intricate procedural steps, namely the DCT-DST algorithm. We initially anticipated that implementing the algorithm would streamline the testing process, but the opposite happened. This can be attributed to the algorithm's relatively intricate structure and the deep transformer architecture, which involves a large number of weight matrix and input vector multiplication operations.

5. Conclusions

Incorporating a structured block g-circulant matrix as a weight matrix, combined with the DCT-DST algorithm for multiplication with the input vector in the transformer model, effectively elevated the BLEU score and conserved storage space. However, this approach resulted in a slight reduction in accuracy and an extension of testing time. In a mathematical context, the Kronecker product operation plays a pivotal role in defining the matrices used in the algorithm, enabling the execution of the DCT-DST algorithm on the multiplication of the weight matrix with the transformer input vector.

Author Contributions

Conceptualization, I.M.-A., A.P. and E.A.; methodology, I.M.-A.; software, E.A. and A.P.; validation, I.M.-A. and A.P.; formal analysis, E.A.; investigation, E.A.; resources, I.M.-A. and A.P.; data curation, A.P.; writing—original draft preparation, E.A.; writing—review and editing, E.A., I.M.-A. and A.P.; visualization, I.M.-A.; supervision, I.M.-A. and A.P.; project administration, E.A.; funding acquisition, I.M.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Research and Community Service Program of the Indonesian Ministry of Education, Culture, Research and Technology 2023. The doctoral-degree scholarship was supported by the Center for Higher Education Fund (Balai Pembiayaan Pendidikan Tinggi), the Indonesia Endowment Fund for Education (LPDP), and Agency for the Assessment and Application of Technology (BPPT) through the Beasiswa Pendidikan Indonesia (BPI) Kemendikbudristek.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors express their gratitude to Mikhael Martin for his invaluable contribution in conceptualizing the programming language.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Mitsuda, K.; Higashinaka, R.; Sugiyama, H.; Mizukami, M.; Kinebuchi, T.; Nakamura, R.; Adachi, N.; Kawabata, H. Fine-Tuning a Pre-trained Transformer-Based Encoder-Decoder Model with User-Generated Question-Answer Pairs to Realize Character-like Chatbots. In Conversational AI for Natural Human-Centric Interaction: Proceedings of the 12th International Workshop on Spoken Dialogue System Technology, Singapore, IWSDS 2021; Springer Nature: Singapore, 2022; pp. 277–290. [Google Scholar]
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; dan Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  3. Ranganathan, J.; Abuka, G. Text summarization using transformer model. In Proceedings of the 2022 Ninth International Conference on Social Networks Analysis, Management and Security (SNAMS), Milan, Italy, 29 November–1 December 2022; pp. 1–5. [Google Scholar]
  4. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lucic, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar]
  5. Zeng, P.; Zhang, H.; Song, J.; Gao, L. S2 transformer for image captioning. In Proceedings of the International Joint Conferences on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 1608–1614. [Google Scholar]
  6. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
  7. Toral, A.; Oliver, A.; Ballestín, P.R. Machine translation of novels in the age of transformer. arXiv 2020, arXiv:2011.14979. [Google Scholar]
  8. Araabi, A.; Monz, C. Optimizing transformer for low-resource neural machine translation. arXiv 2020, arXiv:2011.02266. [Google Scholar]
  9. Tian, T.; Song, C.; Ting, J.; Huang, H. A French-to-English machine translation model using transformer network. Procedia Comput. Sci. 2022, 199, 1438–1443. [Google Scholar] [CrossRef]
  10. Ahmed, K.; Keskar, N.S.; Socher, R. Weighted transformer network for machine translation. arXiv 2017, arXiv:1711.02132. [Google Scholar]
  11. Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning deep transformer models for machine translation. arXiv 2019, arXiv:1906.01787. [Google Scholar]
  12. Kissel, M.; Diepold, K. Structured Matrices and Their Application in Neural Networks: A Survey. New Gener. Comput. 2023, 41, 697–722. [Google Scholar] [CrossRef]
  13. Keles, F.D.; Wijewardena, P.M.; Hegde, C. On the computational complexity of self-attention. In Proceedings of the 34th International Conference on Algorithmic Learning Theory, Singapore, 20–23 February 2022; PMLR:2023. pp. 597–619. [Google Scholar]
  14. Pan, Z.; Chen, P.; He, H.; Liu, J.; Cai, J.; Zhuang, B. Mesa: A memory-saving training framework for transformers. arXiv 2021, arXiv:2111.11124. [Google Scholar]
  15. Yang, H.; Zhao, M.; Yuan, L.; Yu, Y.; Li, Z.; Gu, M. Memory-efficient Transformer-based network model for Traveling Salesman Problem. Neural Netw. 2023, 161, 589–597. [Google Scholar] [CrossRef]
  16. Sohoni, N.S.; Aberger, C.R.; Leszczynski, M.; Zhang, J.; Ré, C. Low-memory neural network training: A technical report. arXiv 2019, arXiv:1904.10631. [Google Scholar]
  17. Sainath, T.N.; Kingsbury, B.; Sindhwani, V.; Arisoy, E.; Ramabhadran, B. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6655–6659. [Google Scholar]
  18. Sindhwani, V.; Sainath, T.; Kumar, S. Structured transforms for small-footprint deep learning. arXiv 2015, arXiv:1510.01722. [Google Scholar]
  19. Cheng, Y.; Yu, F.X.; Feris, R.S.; Kumar, S.; Choudhary, A.; Chang, S. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2857–2865. [Google Scholar]
  20. Ding, C.; Liao, S.; Wang, Y.; Li, Z.; Liu, N.; Zhuo, Y.; Wang, C.; Qian, X.; Bai, Y.; Yuan, G.; et al. Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Boston, MA, USA, 14–17 October 2017; pp. 395–408. [Google Scholar]
  21. Yang, Z.; Moczulski, M.; Denil, M.; Freitas, N.D.; Song, L.; Wang, Z. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  22. Thomas, A.; Gu, A.; Dao, T.; Rudra, A.; Ré, C. Learning compressed transforms with low displacement rank. arXiv 2018, arXiv:1810.02309. [Google Scholar]
  23. Dao, T.; Gu, A.; Eichhorn, M.; Rudra, A.; Ré, C. Learning fast algorithms for linear transforms using butterfly factorizations. Proc. Mach. Learn. Res. 2019, 97, 1517–1527. [Google Scholar] [PubMed]
  24. Pan, V. Structured Matrices and Polynomials: Unified Superfast Algorithms; Springer Science and Business Media: Boston, MA, USA; New York, NY, USA, 2001. [Google Scholar]
  25. Davis, P.J. Circulant Matrices; Wiley: New York, NY, USA, 1979; Volume 2. [Google Scholar]
  26. Asriani, E.; Muchtadi-Alamsyah, I.; Purwarianti, A. Real Block-Circulant Matrices and DCT-DST Algorithm for Transformer Neural Network. Front. Appl. Math. Stat. 2023, 9, 1260187. [Google Scholar] [CrossRef]
  27. Asriani, E.; Muchtadi-Alamsyah, I.; Purwarianti, A. g-Circulant Matrices and Its Matrix-Vector Multiplication Algorithm for Transformer Neural Networks. AIP Conf. 2024. post-acceptance. [Google Scholar]
  28. Liu, Z.; Chen, S.; Xu, W.; Zhang, Y. The eigen-structures of real (skew) circulant matrices with some applications. Comput. Appl. Math. 2019, 38, 1–13. [Google Scholar] [CrossRef]
  29. Reid, S.; dan Mistele, M. Fast Fourier Transformed Transformers: Circulant Weight Matrices for NMT Compression. 2019. Available online: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15722831.pdf (accessed on 23 May 2024).
  30. Saxena, A.; Fernandes, F.C. DCT/DST-based transform coding for intra prediction in image/video coding. IEEE Trans. Image Process. 2013, 22, 3974–3981. [Google Scholar] [CrossRef]
  31. Park, W.; Lee, B.; Kim, M. Fast computation of integer DCT-V, DCT-VIII, and DST-VII for video coding. IEEE Trans. Image Process. 2019, 28, 5839–5851. [Google Scholar] [CrossRef]
  32. Olson, B.; Shaw, S.; Shi, C.; Pierre, C.; Parker, R. Circulant matrices and their application to vibration analysis. Appl. Mech. Rev. 2014, 66, 040803. [Google Scholar] [CrossRef]
  33. Serra-Capizzano, S.; Debora, S. A note on the eigenvalues of g-circulants (and of g-Toeplitz, g-Hankel matrices). Calcolo 2014, 51, 639–659. [Google Scholar] [CrossRef]
  34. Wilkinson, J.H. The Algebraic Eigenvalue Problem; Clarendon: Oxford, UK, 1965; Volume 662. [Google Scholar]
  35. Domingo, M.; Garcıa-Martınez, M.; Helle, A.; Casacuberta, F.; Herranz, M. How much does tokenization affect neural machine translation? arXiv 2018, arXiv:1812.08621. [Google Scholar]
  36. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  37. Post, M. A call for clarity in reporting BLEU scores. arXiv 2018, arXiv:1804.08771. [Google Scholar]
Figure 1. The encoder-decoder structure of the block g-circulant DCT-DST transformer.
Figure 2. (a) The loss of the Dense-Block 2-Circulant DCT-DST model; (b) the accuracy of the Dense-Block 2-Circulant DCT-DST model; (c) the loss of the Dense-Block 3-Circulant DCT-DST model; (d) the accuracy of the Dense-Block 3-Circulant DCT-DST model.
Figure 3. The loss values for the four transformer models.
Figure 4. The accuracy values for the four transformer models.
Figure 5. The BLEU scores for the four transformer models.
Figure 6. The model memory size for the four transformer models.
Table 1. The experiment details of models training.
Dataset: Portuguese-English translation dataset from the TED Talks Open Translation Project
Tokenizer: Moses tokenizer
Training hyperparameters: number of epochs = 20; batch size = 64; number of layers = 4; $d_{model}$ = 16, 32, 64, 128, 256; $d_{ff} = 4 \times d_{model}$; number of heads = 8; dropout rate = 0.1
Optimizer: Adam optimizer, $\beta_1 = 0.9$; $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$
Weight matrix dimension: 16, 32, 64, 128, 256
g: 0, 1, 2, 3
Table 2. The transformer model being tested along with the g value and size of the weight matrix.
Model | g | d_model
Dense-Dense (A) | 0 | 16, 32, 64, 128, 256
Dense-Block 1-Circulant DCT-DST (B) | 1 | 16, 32, 64, 128, 256
Dense-Block 2-Circulant DCT-DST (C) | 2 | 16, 32, 64, 128, 256
Dense-Block 3-Circulant DCT-DST (D) | 3 | 16, 32, 64, 128, 256
Table 3. The evaluated metrics across the four transformer models include accuracy, loss, test duration, BLEU score, and model memory consumption.
Model | g | d_model | Loss | Accuracy | Test Duration (s) | BLEU (%) | Model Memory (KB)
A | 0 | 16 | 4.027 | 0.327 | 7.831 | 2.78 | 1751
A | 0 | 32 | 2.963 | 0.485 | 7.907 | 2.71 | 3546
A | 0 | 64 | 2.404 | 0.569 | 8.260 | 2.71 | 7540
A | 0 | 128 | 2.320 | 0.607 | 8.895 | 25.43 | 18,394
A | 0 | 256 | 2.219 | 0.616 | 9.348 | 25.43 | 50,855
B | 1 | 16 | 4.1058 | 0.3186 | 8.9704 | 1.75 | 1714
B | 1 | 32 | 3.3718 | 0.4201 | 9.5401 | 11.64 | 3227
B | 1 | 64 | 2.651 | 0.5244 | 11.566 | 22.23 | 6542
B | 1 | 128 | 2.373 | 0.579 | 27.839 | 26.47 | 14,322
B | 1 | 256 | 2.3089 | 0.5855 | 121.908 | 24.65 | 34,491
C | 2 | 16 | 4.151 | 0.317 | 17.591 | 2.29 | 1732
C | 2 | 32 | 3.338 | 0.425 | 24.934 | 9.06 | 3246
C | 2 | 64 | 2.619 | 0.528 | 31.145 | 20.8 | 6560
C | 2 | 128 | 2.448 | 0.555 | 59.918 | 21.69 | 14,340
C | 2 | 256 | 2.323 | 0.577 | 180.119 | 24.18 | 34,509
D | 3 | 16 | 4.0974 | 0.3232 | 18.4132 | 2.54 | 1732
D | 3 | 32 | 3.2157 | 0.4404 | 24.5743 | 11.64 | 3246
D | 3 | 64 | 2.6102 | 0.5312 | 36.9716 | 21.35 | 6560
D | 3 | 128 | 2.373 | 0.570 | 68.211 | 24.12 | 14,340
D | 3 | 256 | 2.3079 | 0.5798 | 283.2846 | 24.98 | 34,509