Article

On Block g-Circulant Matrices with Discrete Cosine and Sine Transforms for Transformer-Based Translation Machine

by Euis Asriani 1,*, Intan Muchtadi-Alamsyah 2,3 and Ayu Purwarianti 3,4
1 Doctoral Program of Mathematics, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Bandung 40132, Indonesia
2 Algebra Research Group, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Bandung 40132, Indonesia
3 University Center of Excellence Artificial Intelligence on Vision, Natural Language Processing and Big Data Analytics (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung 40132, Indonesia
4 Informatics Research Group, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung 40132, Indonesia
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(11), 1697; https://doi.org/10.3390/math12111697
Submission received: 29 February 2024 / Revised: 3 May 2024 / Accepted: 24 May 2024 / Published: 29 May 2024
(This article belongs to the Special Issue Applications of Mathematics in Neural Networks and Machine Learning)

Abstract:
The transformer has emerged as one of the modern neural networks applied in numerous applications. However, the large and deep architecture of transformers makes them computationally and memory-intensive. In this paper, we propose block g-circulant matrices to replace the dense weight matrices in the feedforward layers of the transformer and leverage the DCT-DST algorithm to multiply these matrices with the input vector. Our tests using Portuguese-English datasets show that the suggested method improves model memory efficiency compared to the dense transformer, at the cost of a slight drop in accuracy. We found that the 128-dimensional Dense-Block 1-Circulant DCT-DST model achieved the highest model memory efficiency, at 22.14%. We further show that the same model achieved a BLEU score of 26.47%.

1. Introduction

Across a wide range of applications, including chatbots [1], machine translation [2], text summarization [3], and video and image processing [4,5,6], transformer neural networks have brought significant improvements. As demonstrated by [7,8,9,10,11], transformers and their modifications have proven superior, particularly in machine translation. The Transformer [12] is one of the more advanced neural networks, with a deep topology, and the computational power required for its training and inference keeps increasing. Conversely, the depth of the transformer architecture gives rise to several constraints and challenges, including high computational complexity [13], substantial demands on computational resources [14], and memory consumption [15] that is quadratic in the input sequence length. Therefore, methods are required to retain excellent performance at a lower cost, particularly when transformers are used as translation machines. Several approaches have been put forth to address this problem, including weight matrix decomposition, matrix-vector algorithm selection, and weight matrix replacement.
Utilizing a structured weight matrix is considered a crucial tactic among these strategies because of its benefits. According to [16], it may lessen the memory needed for training and storing the model (and optimizer). This approach can also mitigate computational complexity, leveraging the properties of the selected structured weight matrix [12]. This fact has encouraged the creation of a wide variety of structured matrices, including low-rank matrices [17], Toeplitz-like matrices [18], block-circulant matrices [19,20], fast-food transforms [21], low displacement rank matrices [22], and butterfly matrices [23], among others. One of the Low Displacement Rank (LDR) matrices is the block g-circulant matrix [24,25]. The block g-circulant matrix combines features from block-circulant and g-circulant matrices, and its adoption as a replacement for the transformer weight matrix is believed to enhance transformer performance, building on findings from prior research [26,27]. This matrix generalizes the block circulant matrix by shifting each block row g positions to the right. This property makes more intricate patterns and interactions within the matrix possible, expanding the range of potential uses. Among the benefits of structured matrices is the availability of efficient algorithms, such as algorithms for multiplying them by an arbitrary vector, which reduce computational complexity [28]. This is why structured matrices are used as transformer weight matrices.
The Fast Fourier Transform (FFT) algorithm has proven dependable when the structured matrix is a circulant matrix [29]. The FFT Transformer, which replaces the dense weight matrices of the feedforward layer with block circulant matrices, has been shown to compress model memory by up to 100 times. Liu et al. [28] introduced the Discrete Cosine Transform-Discrete Sine Transform (DCT-DST) algorithm for circulant matrix-vector products, which can save storage compared to the FFT. To our knowledge, the DCT-DST algorithm has not previously been employed in a text processing system; its more common use is in image and video processing [30,31]. In this paper, we present a novel method for executing a transformer whose weight matrices are real block g-circulant matrices, using a DCT-DST algorithm, for the translation machine. Our objective is to impose a block g-circulant structure on transformer model topologies through the elegant mathematical characteristics of the block g-circulant matrix. The contributions of this paper are summarized as follows:
  • We define the block g-circulant matrix and derive several lemmas and theorems regarding its characteristics and possible eigenvalues, and we define the matrices used in carrying out the DCT-DST algorithm for multiplication by block g-circulant matrices.
  • We propose a new approach to using structured matrices as a replacement for dense weight matrices, combined with a matrix multiplication algorithm; in this case, a combination of block g-circulant matrices and the DCT-DST algorithm.
  • Our research is the first study to apply the DCT-DST algorithm to weight matrix-vector multiplication in a transformer-based translation machine.
In this article, we present a structured exploration of key concepts. Firstly, we delve into the foundational motivation driving our exploration of block g-circulant matrices and the DCT-DST algorithm. Section 2 explains the theories and characteristics of block g-circulant matrices, including a comprehensive examination of the potential eigenvalues of real block g-circulant matrices. Section 3 describes the workings of the block g-circulant DCT-DST transformer, including a comprehensive explanation of the associated algorithm. Section 4 presents our experimental methodology, results, and a thorough discussion of the research findings. Finally, we summarize our insights in a concise conclusion in the closing section.

2. Block g-Circulant Matrix

Definition 1.
Let $C_i$ be an $m \times m$ matrix for each $i = 1, \dots, n$. Then an $nm \times nm$ block-circulant matrix $C_{nm}$ is generated from the ordered set $C_1, C_2, \dots, C_n$ and is of the form
$$C_{nm} = \begin{pmatrix} C_1 & C_2 & \cdots & C_n \\ C_n & C_1 & \cdots & C_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ C_2 & C_3 & \cdots & C_1 \end{pmatrix}.$$
The set of all such matrices of order $nm \times nm$ will be denoted by $\mathcal{BC}_{nm}$, whereas $\mathcal{B}_m$ represents the set of all circulant matrices of dimension $m$.
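To make Definition 1 concrete, the following NumPy sketch assembles a block-circulant matrix from an ordered list of blocks; the block count, block size, and data are illustrative only, not taken from the paper.

```python
import numpy as np

def block_circulant(blocks):
    """Assemble the nm x nm block-circulant matrix of Definition 1.

    blocks: list [C_1, ..., C_n] of m x m arrays. Block row i (0-based)
    contains C_{(j - i) mod n + 1} in block column j, so the first block
    row is [C_1 C_2 ... C_n] and the second is [C_n C_1 ... C_{n-1}].
    """
    n = len(blocks)
    return np.block([[blocks[(j - i) % n] for j in range(n)] for i in range(n)])

# Illustrative example: n = 3 blocks of size m = 2.
rng = np.random.default_rng(0)
C_blocks = [rng.standard_normal((2, 2)) for _ in range(3)]
C_nm = block_circulant(C_blocks)
print(C_nm.shape)  # (6, 6)
```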
Theorem 1
([32]). Let $C_{nm} \in \mathcal{BC}_{nm}$ have generating elements $C_1, C_2, \dots, C_n \in \mathcal{B}_m$. If $c_i^{(1)}, c_i^{(2)}, \dots, c_i^{(m)}$ are the generating elements of $C_i$, then
$$(F_n \otimes F_m)\, C_{nm}\, (F_n \otimes F_m)^{*} = D_{nm} = \operatorname{diag}_{i=1,\dots,n} \begin{pmatrix} \lambda_i^{(1)} & 0 & \cdots & 0 \\ 0 & \lambda_i^{(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_i^{(m)} \end{pmatrix} \qquad (1)$$
is a diagonal matrix of dimension $nm \times nm$ with
$$\lambda_i^{(p)} = \sum_{k=1}^{n} \sum_{l=1}^{m} c_k^{(l)}\, \omega^{(p-1)(l-1)}\, \omega^{(i-1)(k-1)}, \qquad (2)$$
for $i = 1, 2, \dots, n$ and $p = 1, 2, \dots, m$, where
$$F_n = \frac{1}{\sqrt{n}} \left[\, \omega^{jk} \,\right]_{j,k = 0, 1, \dots, n-1} \qquad (3)$$
with $\omega = e^{\frac{2\pi}{n} i}$ and $i = \sqrt{-1}$.
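As a numerical sanity check of Theorem 1, the sketch below builds a block-circulant matrix whose blocks are themselves circulant and verifies that conjugating it by $F_n \otimes F_m$ yields a matrix that is diagonal up to round-off; the sizes and data are arbitrary.

```python
import numpy as np
from scipy.linalg import circulant, dft

n, m = 4, 3
rng = np.random.default_rng(1)

# Circulant blocks C_1, ..., C_n generated from random first columns.
blocks = [circulant(rng.standard_normal(m)) for _ in range(n)]

# Block-circulant arrangement of Definition 1.
C_nm = np.block([[blocks[(j - i) % n] for j in range(n)] for i in range(n)])

# Unitary DFT matrices; scale='sqrtn' gives the 1/sqrt(n) normalization.
F = np.kron(dft(n, scale='sqrtn'), dft(m, scale='sqrtn'))

D = F @ C_nm @ F.conj().T                      # should be diagonal
off_diagonal = D - np.diag(np.diag(D))
print(np.max(np.abs(off_diagonal)) < 1e-12)    # True: F_n (x) F_m diagonalizes C_nm
```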
In case $C_{nm} \in \mathbb{R}^{nm \times nm}$, we can decompose the diagonalization of $C_{nm}$ [26] as follows:
$$(U_n \otimes U_m)^{T}\, C_{nm}\, (U_n \otimes U_m) = \Omega_{nm} \qquad (4)$$
with
$$U_n = \begin{cases} \left[\, t_0,\ \sqrt{2}\,t_1,\ \dots,\ \sqrt{2}\,t_{h-1},\ t_h,\ \sqrt{2}\,s_{h-1},\ \dots,\ \sqrt{2}\,s_1 \,\right], & \text{if } n = 2h, \\ \left[\, t_0,\ \sqrt{2}\,t_1,\ \dots,\ \sqrt{2}\,t_h,\ \sqrt{2}\,s_h,\ \dots,\ \sqrt{2}\,s_1 \,\right], & \text{if } n = 2h+1, \end{cases} \qquad (5)$$
where $t_k$ and $s_k$ are the real and imaginary parts of the columns of $F_n = [f_0, f_1, \dots, f_h, \dots]$, and
$$\Omega_{nm} = \begin{pmatrix} q_1 & & & & & & & \\ & q_2 & & & & & & s_2 \\ & & \ddots & & & & ⋰ & \\ & & & q_h & & s_h & & \\ & & & & q_{h+1} & & & \\ & & & -s_h & & q_h & & \\ & & ⋰ & & & & \ddots & \\ & -s_2 & & & & & & q_2 \end{pmatrix}$$
for $n = 2h$, while for $n = 2h+1$ it is
$$\Omega_{nm} = \begin{pmatrix} q_1 & & & & & & \\ & q_2 & & & & & s_2 \\ & & \ddots & & & ⋰ & \\ & & & q_{h+1} & s_{h+1} & & \\ & & & -s_{h+1} & q_{h+1} & & \\ & & ⋰ & & & \ddots & \\ & -s_2 & & & & & q_2 \end{pmatrix}$$
with $q_i$ the $m \times m$ matrix whose diagonal carries the entries $\sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(1)},\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(2)},\ \dots,\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(r)},\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(r+1)},\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(r)},\ \dots,\ \sum_{k=1}^{n} a_k^{(i)} \alpha_k^{(2)}$ and whose anti-diagonal carries $\sum_{k=1}^{n} a_k^{(i)} \beta_k^{(2)},\ \dots,\ \sum_{k=1}^{n} a_k^{(i)} \beta_k^{(r)},\ -\sum_{k=1}^{n} a_k^{(i)} \beta_k^{(r)},\ \dots,\ -\sum_{k=1}^{n} a_k^{(i)} \beta_k^{(2)}$ when $m = 2r$ (with the analogous arrangement, running up to the index $r+1$, when $m = 2r+1$), and $s_i$ the matrix of the same form with $b_k^{(i)}$ in place of $a_k^{(i)}$, where $\alpha, \beta$ and $a, b$ denote the real and imaginary parts of the eigenvalues of the decomposition factors of $C_{nm}$.
Definition 2
([28]). The Discrete Cosine Transform (DCT) I and V matrices are defined as follows:
$$C^{I}_{n+1} = \sqrt{\frac{2}{n}} \left[\, \tau_j \tau_k \cos\frac{jk\pi}{n} \,\right]_{j,k=0}^{n}$$
$$C^{V}_{n} = \sqrt{\frac{2}{2n-1}} \left[\, \tau_j \tau_k \cos\frac{2jk\pi}{2n-1} \,\right]_{j,k=0}^{n-1}$$
with
$$\tau_l\ (l = j, k) = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } l = 0 \text{ or } l = n, \\ 1, & \text{otherwise}, \end{cases} \qquad \iota_k = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } k = n-1, \\ 1, & \text{otherwise}. \end{cases}$$
Definition 3
([28]). The Discrete Sine Transform (DST) I and V matrices are defined as follows:
$$S^{I}_{n-1} = \sqrt{\frac{2}{n}} \left[\, \sin\frac{jk\pi}{n} \,\right]_{j,k=1}^{n-1}$$
$$S^{V}_{n-1} = \sqrt{\frac{2}{2n-1}} \left[\, \sin\frac{2jk\pi}{2n-1} \,\right]_{j,k=1}^{n-1}$$
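The sketch below transcribes the entry-wise formulas of Definitions 2 and 3 into NumPy for the type-I matrices and checks that they are orthogonal; it is a plain construction, not a fast transform.

```python
import numpy as np

def dct_I(n):
    """(n+1) x (n+1) DCT-I matrix C^I_{n+1} of Definition 2."""
    tau = lambda l: 1 / np.sqrt(2) if l in (0, n) else 1.0
    weights = np.array([[tau(j) * tau(k) for k in range(n + 1)] for j in range(n + 1)])
    j, k = np.meshgrid(np.arange(n + 1), np.arange(n + 1), indexing='ij')
    return np.sqrt(2 / n) * weights * np.cos(j * k * np.pi / n)

def dst_I(n):
    """(n-1) x (n-1) DST-I matrix S^I_{n-1} of Definition 3."""
    j, k = np.meshgrid(np.arange(1, n), np.arange(1, n), indexing='ij')
    return np.sqrt(2 / n) * np.sin(j * k * np.pi / n)

n = 8
C, S = dct_I(n), dst_I(n)
print(np.allclose(C @ C.T, np.eye(n + 1)))  # True: C^I_{n+1} is orthogonal
print(np.allclose(S @ S.T, np.eye(n - 1)))  # True: S^I_{n-1} is orthogonal
```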
In the following theorem, we will see that the matrix $U_n$, defined in (5), can be partitioned into a matrix generated by the DCT-DST matrices.
Theorem 2
([28]). Let $U_n$ be the matrix stated in (5). Then $U_n$ can be partitioned into the following form:
$$U_n = \begin{cases} \begin{pmatrix} \sigma_1 q_{h+1}^{T} & 0 \\ C & \frac{\sqrt{2}}{2} S^{I}_{h-1} J_{h-1} \\ \sigma_1 v_{h+1}^{T} & 0 \\ J_{h-1} C & \frac{\sqrt{2}}{2} J_{h-1} S^{I}_{h-1} J_{h-1} \end{pmatrix}, & \text{if } n = 2h, \\[2ex] \begin{pmatrix} \sigma_1 p_{h+1}^{T} & 0 \\ C & \frac{\sqrt{2}}{2} S^{V}_{h} J_{h} \\ J_{h} C & \frac{\sqrt{2}}{2} J_{h} S^{V}_{h} J_{h} \end{pmatrix}, & \text{if } n = 2h+1. \end{cases}$$
Definition 4.
Let $C_i^{(p)}$ be an $m \times m$ $g$-circulant matrix, i.e., a classical circulant matrix with each row shifted $g$ positions to the right. An $nm \times nm$ matrix $C_{nm}^{(g)}$ is a block $g$-circulant matrix if it is generated by $C_i^{(p)}$, $i = 1, 2, \dots, n$; $p = 1, 2, \dots, m$, with the block rows likewise shifted $g$ positions to the right.
Lemma 1.
Let $C_{nm}^{(g)} \in \mathbb{C}^{nm \times nm}$ be a block $g$-circulant matrix with $n$ blocks, each block of order $m$. Then $C_{nm}^{(g)}$ can be written as
$$C_{nm}^{(g)} = Z_{nm}^{(g)} C_{nm} \qquad (11)$$
where $C_{nm}$ is the block $g$-circulant matrix with $g = 1$ and
$$Z_{nm}^{(g)} = Z_{n,g} \otimes Z_{m,g} \qquad (12)$$
where
$$Z_{n,g} = \left[\, \delta_{(gr - s) \bmod n} \,\right]_{r,s=0}^{n-1} \qquad (13)$$
$$\delta_k = \begin{cases} 1, & \text{if } k \equiv 0 \pmod n, \\ 0, & \text{otherwise}, \end{cases} \qquad (14)$$
and by (1) we have
$$C_{nm}^{(g)} = Z_{nm}^{(g)} (F_n \otimes F_m)^{*} D_{nm} (F_n \otimes F_m). \qquad (15)$$
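To illustrate Lemma 1 in the simplest scalar setting (m = 1), the sketch below builds $Z_{n,g}$ from its definition and checks that $Z_{n,g} C$ equals the g-circulant matrix obtained by shifting each row of the ordinary circulant $C$ by g positions; the values of n, g, and the generating row are arbitrary.

```python
import numpy as np
from scipy.linalg import circulant

def Z(n, g):
    """Z_{n,g}[r, s] = 1 if s = g*r (mod n), else 0."""
    M = np.zeros((n, n))
    for r in range(n):
        M[r, (g * r) % n] = 1.0
    return M

def g_circulant(first_row, g):
    """g-circulant: row r is the first row cyclically shifted g*r positions to the right."""
    n = len(first_row)
    return np.array([np.roll(first_row, g * r) for r in range(n)])

n, g = 6, 2                          # (n, g) need not be coprime
c = np.arange(1.0, n + 1.0)
C1 = circulant(c).T                  # ordinary (1-)circulant with first row c
print(np.allclose(Z(n, g) @ C1, g_circulant(c, g)))   # True: C^(g) = Z_{n,g} C
```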
Lemma 2.
Let $n, m \in \mathbb{Z}$ with $n, m > 2$, and let $n_g = n/(n,g)$ and $m_g = m/(m,g)$, where $(n,g)$ and $(m,g)$ denote the greatest common divisors of $n$ and $m$, respectively, with $g$. Then we have
$$Z_{nm}^{(g)} = \left[\, \widetilde{Z}_{nm}^{(g)} \;\big|\; \widetilde{Z}_{nm}^{(g)} \;\big|\; \cdots \;\big|\; \widetilde{Z}_{nm}^{(g)} \,\right]^{T}, \qquad (n,g) \times (m,g) \text{ times},$$
where $Z_{nm}^{(g)}$ is the matrix defined in (12) and $\widetilde{Z}_{nm}^{(g)} \in \mathbb{C}^{(n_g m_g) \times nm}$ is the submatrix of $Z_{nm}^{(g)}$ obtained by considering only its first $n_g m_g$ rows.
Lemma 3
([33]). Let $g \geq n$. Then
(a) $Z_{n,g} = Z_{n,g_o}$,
(b) $C_{n,g} = C_{n,g_o}$,
(c) $M_{n,g} = M_{n,g_o}$,
(d) $M_{n,g} M_{n,h} = M_{n,gh}$,
where $g_o \equiv g \pmod n$.
Lemma 4.
Let $Z_{n,g}$ be the matrix defined in (13) and let $F_n$ be the Fourier matrix defined in (3). Then we have
$$Z_{n,g} = F_n^{*} M_{n,g} F_n \qquad (17)$$
where
$$M_{n,g} = \left[\, \delta_{r - gc} \,\right]_{r,c=0}^{n-1}$$
and $\delta_k$ is defined as in (14).
Lemma 5.
Let $g \geq n, m$. Then $Z_{nm}^{(g)} = Z_{nm}^{(g_o)}$, where $g_o = g \bmod (nm)$. As a consequence we have $C_{nm}^{(g)} = C_{nm}^{(g_o)}$.
Proof. 
From (12) and by Lemma 3, we have that
$$Z_{nm}^{(g)} = Z_{n,g} \otimes Z_{m,g} = Z_{n,g_o} \otimes Z_{m,g_o} = Z_{nm}^{(g_o)}$$
and so from (11) we have the equality $C_{nm}^{(g)} = Z_{nm}^{(g)} C_{nm} = Z_{nm}^{(g_o)} C_{nm} = C_{nm}^{(g_o)}$. □
Lemma 6.
Let $Z_{nm}^{(g)}$ be the matrix defined in (12) and let $F_n$ be as defined in (3). Then
$$Z_{nm}^{(g)} = (F_n \otimes F_m)^{*} M_{nm}^{(g)} (F_n \otimes F_m) \qquad (20)$$
where
$$M_{nm}^{(g)} = M_{n,g} \otimes M_{m,g} = \left[\, \delta_{r - gs} \,\right]_{r,s=0}^{n-1} \otimes \left[\, \delta_{r - gs} \,\right]_{r,s=0}^{m-1} \qquad (21)$$
and $\delta_k$ is defined as in (14).
Proof. 
It suffices to show that $(F_n \otimes F_m) Z_{nm}^{(g)} = M_{nm}^{(g)} (F_n \otimes F_m)$.
Using (12), Kronecker product properties, and (17), we have
$$(F_n \otimes F_m) Z_{nm}^{(g)} = (F_n \otimes F_m)(Z_{n,g} \otimes Z_{m,g}) = F_n Z_{n,g} \otimes F_m Z_{m,g} = M_{n,g} F_n \otimes M_{m,g} F_m = (M_{n,g} \otimes M_{m,g})(F_n \otimes F_m) = M_{nm}^{(g)} (F_n \otimes F_m).$$
In the following lemma, we will see that Lemma 5 also holds for the matrix $M_{nm}^{(g)}$.
Lemma 7.
If $g \geq n, m$, then $M_{nm}^{(g)} = M_{nm}^{(g_o)}$ with $g_o = g \bmod (nm)$.
Proof. 
By using (20) we have
$$Z_{nm}^{(g)} = (F_n \otimes F_m)^{*} M_{nm}^{(g)} (F_n \otimes F_m), \qquad Z_{nm}^{(g_o)} = (F_n \otimes F_m)^{*} M_{nm}^{(g_o)} (F_n \otimes F_m).$$
By Lemma 5, we deduce
$$(F_n \otimes F_m)^{*} M_{nm}^{(g)} (F_n \otimes F_m) = (F_n \otimes F_m)^{*} M_{nm}^{(g_o)} (F_n \otimes F_m)$$
and so $M_{nm}^{(g)} = M_{nm}^{(g_o)}$. □
Lemma 8.
Let $D_{nm} \in \mathbb{C}^{nm \times nm}$ be a diagonal matrix
$$D_{nm} = \operatorname{diag}_{\substack{j = 0, \dots, n-1 \\ p = 0, \dots, m-1}} \lambda_j^{(p)},$$
and let $M_{nm}^{(g)}$ be the matrix defined in (21). Then
$$D_{nm} M_{nm}^{(g)} = M_{nm}^{(g)} \widetilde{D}_{nm}^{(g)} \qquad (22)$$
where $\widetilde{D}_{nm}^{(g)} = \operatorname{diag}_{\substack{j = 0, \dots, n-1 \\ p = 0, \dots, m-1}} \lambda_{(gj) \bmod n}^{((gp) \bmod m)}$.
Lemma 9.
Let $C_{nm}^{(g)} \in \mathbb{C}^{nm \times nm}$ be a block $g$-circulant matrix and $C_{nm}^{(h)} \in \mathbb{C}^{nm \times nm}$ a block $h$-circulant matrix. Then $C_{nm}^{(g)} C_{nm}^{(h)} \in \mathbb{C}^{nm \times nm}$ is a block $gh$-circulant matrix.
Proof. 
Using (11) and (15) we have
$$C_{nm}^{(g)} = Z_{nm}^{(g)} C_{nm} = Z_{nm}^{(g)} (F_n \otimes F_m)^{*} D_{nm}^{(1)} (F_n \otimes F_m)$$
and
$$C_{nm}^{(h)} = Z_{nm}^{(h)} C_{nm} = Z_{nm}^{(h)} (F_n \otimes F_m)^{*} D_{nm}^{(2)} (F_n \otimes F_m).$$
By (20), it follows that
$$C_{nm}^{(g)} C_{nm}^{(h)} = (F_n \otimes F_m)^{*} M_{nm}^{(g)} D_{nm}^{(1)} M_{nm}^{(h)} D_{nm}^{(2)} (F_n \otimes F_m).$$
Using (22), the last equation becomes
$$C_{nm}^{(g)} C_{nm}^{(h)} = (F_n \otimes F_m)^{*} M_{nm}^{(g)} M_{nm}^{(h)} \widetilde{D}_{nm}^{(h)(1)} D_{nm}^{(2)} (F_n \otimes F_m),$$
where $\widetilde{D}_{nm}^{(h)(1)}$ denotes the diagonal matrix obtained from $D_{nm}^{(1)}$ by applying (22) with $M_{nm}^{(h)}$.
Furthermore, by (21) and Kronecker product properties, we have
$$M_{nm}^{(g)} M_{nm}^{(h)} = (M_{n,g} \otimes M_{m,g})(M_{n,h} \otimes M_{m,h}) = (M_{n,g} M_{n,h}) \otimes (M_{m,g} M_{m,h}) = M_{n,gh} \otimes M_{m,gh} = M_{nm}^{(gh)}.$$
So by (20) we obtain
$$C_{nm}^{(g)} C_{nm}^{(h)} = Z_{nm}^{(gh)} (F_n \otimes F_m)^{*} \widetilde{D}_{nm}^{(h)(1)} D_{nm}^{(2)} (F_n \otimes F_m).$$
Since $\widetilde{D}_{nm}^{(h)(1)} D_{nm}^{(2)}$ is a diagonal matrix, the last equation is the representation of a block $gh$-circulant matrix. □

2.1. Eigenvalues of Block g-Circulant Matrices

In this section, we explore the eigenvalues of a block $g$-circulant matrix in the cases $g = 0$, $g = 1$, $(nm, g) \neq 1$, and $(nm, g) = 1$.

2.1.1. Case g = 0

If $g = 0$, then $C_{nm}^{(g)}$ has constant rows within each block and identical block rows; therefore, it has rank 1. Then, remembering that the trace $\operatorname{tr}(\cdot)$ of a matrix is the sum of its eigenvalues, we can conclude that $C_{nm}^{(g)}$ has $nm - 1$ zero eigenvalues and one eigenvalue $\lambda$ different from zero, given by
$$\lambda(C_{nm}^{(g)}) = \sum_{i=1}^{n} \operatorname{tr}(C_i) = \sum_{i=1}^{n} \sum_{p=1}^{m} c_i^{(p)},$$
where $(c_i^{(1)}, c_i^{(2)}, \dots, c_i^{(m)})$ is the generating row of $C_i$ and $c_i^{(p)}$ is the entry of the $i$th block in the $p$th column.
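A quick numerical illustration of the g = 0 case: a block 0-circulant matrix consists of nm identical rows, hence has rank 1, and its single nonzero eigenvalue equals its trace. The sizes and values below are arbitrary.

```python
import numpy as np

n, m = 3, 4
rng = np.random.default_rng(2)

# With g = 0, every row repeats the concatenation of the generating rows of C_1, ..., C_n.
generating_row = rng.standard_normal(n * m)
C0 = np.tile(generating_row, (n * m, 1))

eigvals = np.linalg.eigvals(C0)
lam = eigvals[np.argmax(np.abs(eigvals))]          # the single nonzero eigenvalue
print(np.linalg.matrix_rank(C0))                   # 1
print(np.isclose(lam, np.trace(C0)))               # True: lambda = sum_i tr(C_i)
```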

2.1.2. Case g = 1

If $g = 1$, then $C_{nm}^{(g)}$ is a "classical" block circulant matrix as defined in Definition 1, and its eigenvalues are given by (2):
$$\lambda_i^{(p)} = \sum_{k=1}^{n} \sum_{l=1}^{m} c_k^{(l)}\, \omega^{(p-1)(l-1)}\, \omega^{(i-1)(k-1)}.$$

2.1.3. Case $(nm, g) \neq 1$

We will use the following theorem to calculate the eigenvalues of a singular matrix $A \in \mathbb{C}^{n \times n}$ with $\operatorname{rank}(A) = r \leq n$ by computing the eigenvalues of a smaller matrix in $\mathbb{C}^{k \times k}$ with $r \leq k \leq n$.
Theorem 3
([34]). Let $A$ be a matrix of dimension $n \times n$, $A \in \mathbb{C}^{n \times n}$, which can be written as $A = XY$, where $X \in \mathbb{C}^{n \times k}$ and $Y \in \mathbb{C}^{k \times n}$, with $k \leq n$. Then the eigenvalues of the matrix $A$ are given by the $k$ eigenvalues of the matrix $YX \in \mathbb{C}^{k \times k}$ together with $n - k$ zero eigenvalues:
$$\operatorname{Eig}(A) = \operatorname{Eig}(YX) \cup \{\, 0 \ \text{with geometric multiplicity } n - k \,\}.$$
We will apply Theorem 3 while considering that the matrix $C_{nm}^{(g)}$ is singular. By Lemma 2 we can write
$$Z_{nm}^{(g)} = \left[\, \widetilde{Z}_{nm}^{(g)} \,\big|\, \cdots \,\big|\, \widetilde{Z}_{nm}^{(g)} \,\right]^{T} = \left( \left[\, I_{n_g} \,|\, \cdots \,|\, I_{n_g} \,\right]^{T} \otimes \left[\, I_{m_g} \,|\, \cdots \,|\, I_{m_g} \,\right]^{T} \right) \left( \widetilde{Z}_{n,g} \otimes \widetilde{Z}_{m,g} \right) = I_{nm}^{(g)} \widetilde{Z}_{nm}^{(g)},$$
where $\widetilde{Z}_{nm}^{(g)} \in \mathbb{C}^{(n_g m_g) \times nm}$ and $I_{nm}^{(g)}$ is the $nm \times (n_g m_g)$ matrix formed by stacking copies of the identity matrix. Now we can rewrite (11) as below:
$$C_{nm}^{(g)} = Z_{nm}^{(g)} C_{nm} = I_{nm}^{(g)} \widetilde{Z}_{nm}^{(g)} C_{nm} = I_{nm}^{(g)} \widetilde{C}_{nm}^{(g)},$$
where $\widetilde{C}_{nm}^{(g)} = \widetilde{Z}_{nm}^{(g)} C_{nm} \in \mathbb{C}^{(n_g m_g) \times nm}$.
By Theorem 3 we find that the eigenvalues of $C_{nm}^{(g)}$ are equal to those of $\widetilde{C}_{nm}^{(g)} I_{nm}^{(g)} \in \mathbb{C}^{(n_g m_g) \times (n_g m_g)}$, plus $nm - n_g m_g$ null eigenvalues:
$$\operatorname{Eig}(C_{nm}^{(g)}) = \operatorname{Eig}(\widetilde{C}_{nm}^{(g)} I_{nm}^{(g)}) \cup \{\, 0 \ \text{with geometric multiplicity } nm - n_g m_g \,\}.$$
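Theorem 3 is the standard fact that XY and YX share their nonzero spectrum, which is exactly what reduces the eigenvalue problem for $C_{nm}^{(g)}$ to the smaller matrix $\widetilde{C}_{nm}^{(g)} I_{nm}^{(g)}$. The sketch below checks the statement numerically on random factors of arbitrary size by comparing eigenvalue moduli.

```python
import numpy as np

n, k = 8, 3
rng = np.random.default_rng(3)
X = rng.standard_normal((n, k))      # tall factor, playing the role of I_nm^(g)
Y = rng.standard_normal((k, n))      # wide factor, playing the role of C~_nm^(g)

mods_big = np.sort(np.abs(np.linalg.eigvals(X @ Y)))[::-1]    # n eigenvalues
mods_small = np.sort(np.abs(np.linalg.eigvals(Y @ X)))[::-1]  # k eigenvalues

# The n eigenvalues of XY are the k eigenvalues of YX plus n - k zeros.
print(np.allclose(mods_big[:k], mods_small))   # True
print(np.all(mods_big[k:] < 1e-8))             # True: the remaining ones vanish up to round-off
```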

2.1.4. Case $(nm, g) = 1$

When $nm$ and $g$ are coprime, the following lemma provides a straightforward formula for computing the moduli of the eigenvalues of a block $g$-circulant matrix $C_{nm}^{(g)}$. The method draws upon the classical eigenvalue computation for the block circulant matrix $C_{nm}$.
Lemma 10.
Let $C_{nm}^{(g)} \in \mathbb{C}^{nm \times nm}$ be a block $g$-circulant matrix such that $(nm, g) = 1$, and write
$$C_{nm}^{(g)} = Z_{nm}^{(g)} (F_n \otimes F_m)^{*} D_{nm} (F_n \otimes F_m)$$
with
$$D_{nm} = \operatorname{diag}_{\substack{j = 0, \dots, n-1 \\ p = 0, \dots, m-1}} \lambda_j^{(p)}.$$
Then the moduli of the eigenvalues of $C_{nm}^{(g)}$ are given by
$$\left| \lambda_j^{(p)}(C_{nm}^{(g)}) \right| = \sqrt[s]{\left| \prod_{k=0}^{s-1} \lambda_{(g^{k} j) \bmod n}^{((g^{k} p) \bmod m)} \right|},$$
with $j = 0, 1, \dots, n-1$ and $p = 0, 1, \dots, m-1$, where $s \in \mathbb{N}^{+}$ is such that $g^{s} \equiv 1 \pmod{nm}$.
Proof. 
By Lemma 9, if $C_{nm}^{(g)}$ is a block $g$-circulant matrix, then $(C_{nm}^{(g)})^{r}$ is a block $g^{r}$-circulant matrix for every $r \in \mathbb{Z}^{+}$. By Lemma 5, $(C_{nm}^{(g)})^{r}$ is also a block $\tilde{g}_r$-circulant matrix, where $\tilde{g}_r \equiv g^{r} \pmod{nm}$. Since $(nm, g) = 1$, there is $s \in \mathbb{Z}^{+}$ such that $g^{s} \equiv 1 \pmod{nm}$, so $(C_{nm}^{(g)})^{s}$ is a block circulant matrix. Notice that the moduli of the eigenvalues of $C_{nm}^{(g)}$ are the moduli of the $s$-th roots of the eigenvalues of the block circulant matrix $(C_{nm}^{(g)})^{s}$. From Equations (15) and (20) we obtain
$$(C_{nm}^{(g)})^{s} = (F_n \otimes F_m)^{*} (M_{nm}^{(g)} D_{nm})^{s} (F_n \otimes F_m).$$
Since $(F_n \otimes F_m)(F_n \otimes F_m)^{*} = I$, we have
$$\operatorname{Eig}\big( (C_{nm}^{(g)})^{s} \big) = \operatorname{Eig}\big( (M_{nm}^{(g)} D_{nm})^{s} \big).$$
Using Equation (22), the fact that $D_{nm} = \widetilde{D}_{nm}^{(g^{0})}$, and writing $\widetilde{D}_{nm}^{(g^{k})} = \operatorname{diag}_{\substack{j = 0, \dots, n-1 \\ p = 0, \dots, m-1}} \lambda_{(g^{k} j) \bmod n}^{((g^{k} p) \bmod m)}$, we obtain, by repeatedly moving the diagonal factors to the right through $M_{nm}^{(g)}$,
$$(M_{nm}^{(g)} D_{nm})^{s} = M_{nm}^{(g)} \big( D_{nm} M_{nm}^{(g)} \big)^{s-1} D_{nm} = M_{nm}^{(g)} \big( M_{nm}^{(g)} \widetilde{D}_{nm}^{(g)} \big)^{s-1} \widetilde{D}_{nm}^{(g^{0})} = (M_{nm}^{(g)})^{2} \big( \widetilde{D}_{nm}^{(g)} M_{nm}^{(g)} \big)^{s-2} \widetilde{D}_{nm}^{(g)} \widetilde{D}_{nm}^{(g^{0})} = \cdots = (M_{nm}^{(g)})^{s}\; \widetilde{D}_{nm}^{(g^{s-1})} \cdots \widetilde{D}_{nm}^{(g^{2})}\, \widetilde{D}_{nm}^{(g)}\, \widetilde{D}_{nm}^{(g^{0})}.$$
Using Lemma (9), Equation (2), and the fact that $g^{s} \equiv 1 \pmod{nm}$, we have
$$(M_{nm}^{(g)})^{s} = M_{nm}^{(g)} \cdot M_{nm}^{(g)} \cdots M_{nm}^{(g)} = M_{nm}^{(g^{s})} = M_{nm}^{(g^{s} \bmod nm)} = M_{nm}^{(1)} = I_{nm}.$$
Hence $(M_{nm}^{(g)} D_{nm})^{s}$ is a diagonal matrix, and its eigenvalues are given by the diagonal elements of $\widetilde{D}_{nm}^{(g^{s-1})} \cdots \widetilde{D}_{nm}^{(g^{2})} \widetilde{D}_{nm}^{(g)} \widetilde{D}_{nm}^{(g^{0})}$, i.e., from (22),
$$\lambda_j^{(p)}\big( (M_{nm}^{(g)} D_{nm})^{s} \big) = \lambda_{(g^{s-1} j) \bmod n}^{((g^{s-1} p) \bmod m)} \cdots \lambda_{(g^{2} j) \bmod n}^{((g^{2} p) \bmod m)}\, \lambda_{(g j) \bmod n}^{((g p) \bmod m)}\, \lambda_{(g^{0} j) \bmod n}^{((g^{0} p) \bmod m)} = \prod_{k=0}^{s-1} \lambda_{(g^{k} j) \bmod n}^{((g^{k} p) \bmod m)},$$
with $p = 0, \dots, m-1$ and $j = 0, \dots, n-1$. Hence the moduli of the eigenvalues of $C_{nm}^{(g)}$ are the moduli of the $s$-th roots of the eigenvalues of the block circulant matrix $(C_{nm}^{(g)})^{s}$. □
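The sketch below checks the scalar instance (m = 1) of Lemma 10 numerically: for a g-circulant matrix with (n, g) = 1, the moduli of its eigenvalues equal the s-th roots of the moduli of the products $\prod_{k=0}^{s-1}\lambda_{(g^k j)\bmod n}$, where the $\lambda_j$ are the DFT eigenvalues of the underlying circulant. The choices of n, g, and the generating row are arbitrary.

```python
import numpy as np

n, g = 7, 3                                   # gcd(n, g) = 1
rng = np.random.default_rng(4)
c = rng.standard_normal(n)                    # generating (first) row

# g-circulant matrix: row r is the first row shifted g*r positions to the right.
Cg = np.array([np.roll(c, g * r) for r in range(n)])

# s = multiplicative order of g modulo n.
s = 1
while pow(g, s, n) != 1:
    s += 1

lam = np.fft.fft(c)                           # eigenvalues of the ordinary circulant
predicted = np.array([
    abs(np.prod([lam[(pow(g, k, n) * j) % n] for k in range(s)])) ** (1.0 / s)
    for j in range(n)
])
actual = np.abs(np.linalg.eigvals(Cg))
print(np.allclose(np.sort(actual), np.sort(predicted)))   # True
```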
Next, we present the decomposition of a real block $g$-circulant matrix in terms of an orthogonal matrix $U_{nm}^{(g)}$. We also define an orthogonal matrix $Q_{nm}^{(g)}$ whose product with $U_{nm}^{(g)}$ will be used in the DCT-DST algorithm.
Theorem 4.
Let $C_{nm}^{(g)} \in \mathbb{R}^{nm \times nm}$ be a real block $g$-circulant matrix, and define $U_{nm}^{(g)} \in \mathbb{R}^{nm \times nm}$ as
$$U_{nm}^{(g)} = (U_n \otimes U_m)\,(P_{n,\,n-g+1} \otimes P_{m,\,m-g+1}), \qquad (29)$$
where $U_t$ is the matrix of dimension $t$ defined in (5) and $P_{t,\,t-g+1}$ denotes the identity matrix of dimension $t$ whose $t$th and $(t-g+1)$th columns are exchanged. A straightforward calculation shows that
$$C_{nm}^{(g)} = U_{nm}^{(g)}\, \Omega_{nm}^{(g)}\, (U_{nm}^{(g)})^{T}$$
where
$$\Omega_{nm}^{(g)} = (P_{nm}^{(g)})^{T} N_{nm}^{(g)} P_{nm}^{(g)},$$
$P_{nm}^{(g)} = P_{n,\,n-g+1} \otimes P_{m,\,m-g+1}$, $N_{nm}^{(g)} = U_{nm}^{T} Z_{nm}^{(g)} C_{nm} U_{nm}$, and $U_{nm} = U_n \otimes U_m$.
Definition 5.
Let $U_m$ and $P_{m,\,m-g+1}$ be the matrices defined in (5) and Theorem 4, respectively. The orthogonal matrix $Q_{nm}^{(g)} \in \mathbb{R}^{nm \times nm}$ is defined as
$$Q_{nm}^{(g)} = Q_n \otimes \big( U_m P_{m,\,m-g+1} \big), \qquad (32)$$
where
$$Q_n = \begin{cases} \dfrac{1}{\sqrt{2}} \begin{pmatrix} \sqrt{2} & 0 & 0 & 0 \\ 0 & I_{h-1} & 0 & J_{h-1} \\ 0 & 0 & \sqrt{2} & 0 \\ 0 & J_{h-1} & 0 & -I_{h-1} \end{pmatrix}, & \text{if } n = 2h, \\[3ex] \dfrac{1}{\sqrt{2}} \begin{pmatrix} \sqrt{2} & 0 & 0 \\ 0 & I_{h} & J_{h} \\ 0 & J_{h} & -I_{h} \end{pmatrix}, & \text{if } n = 2h+1, \end{cases}$$
as stated in [28].
Theorem 5.
Let $U_{nm}^{(g)}$ and $Q_{nm}^{(g)}$ be as defined in (29) and (32), respectively. Then we have
$$Q_{nm}^{(g)} U_{nm}^{(g)} = \big( Q_n U_n P_{n,\,n-g+1} \big) \otimes \big( U_m P_{m,\,m-g+1} \big)^{2},$$
where $U_n$ and $U_m$ are as defined in (5) and
$$Q_n U_n = \begin{cases} \begin{pmatrix} C^{I}_{h+1} & 0 \\ 0 & J_{h-1} S^{I}_{h-1} J_{h-1} \end{pmatrix}, & \text{if } n = 2h, \\[2ex] \begin{pmatrix} C^{V}_{h+1} & 0 \\ 0 & J_{h} S^{V}_{h} J_{h} \end{pmatrix}, & \text{if } n = 2h+1. \end{cases}$$

3. Block g-Circulant DCT-DST Transformer

Each encoder or decoder layer of the transformer network comprises one or two multi-head attention units, a position-wise feedforward unit, and sub-layer connections with layer normalization. Both the position-wise feedforward and multi-head attention units multiply their inputs by learned weight matrices. In the original transformer [2], these weight matrices were dense, uncompressed, and randomly initialized. In this study, we compress them using block g-circulant weight matrices.
The block g-circulant DCT-DST transformer discussed in this study is a variation of the original transformer model outlined in [2]. This adaptation involves two essential modifications: firstly, the substitution of a block g-circulant matrix for the conventional dense weight matrix, and secondly, the integration of the DCT-DST matrix-vector multiplication algorithm. This integration enables efficient computation when multiplying the block g-circulant matrix with the vector input of each sublayer (Figure 1).

3.1. Multihead Attention Sublayer

Two input sequences are fed to the multi-head attention sublayer: a key/value sequence $S_K \in \mathbb{R}^{n_K \times d_{model}}$ and a query sequence $S_Q \in \mathbb{R}^{n_Q \times d_{model}}$, where $n_Q$ and $n_K$ are the numbers of queries and keys/values, respectively, and $d_{model}$ denotes the dimensionality of the input and output. Let the number of attention heads be $y$. By projecting the query sequence and key/value sequence with dense weight matrices, we create queries, keys, and values in $\mathbb{R}^{d_k}$, where $d_k = d_{model}/y$. For $i = 1, \dots, y$ and $\alpha = Q, K, V$, we could learn $3y$ separate dense weight matrices $C_{\alpha}^{i} \in \mathbb{R}^{d_k \times d_{model}}$. Instead, for $\alpha = Q, K, V$, we learn three dense weight matrices $C_{\alpha} \in \mathbb{R}^{d_{model} \times d_{model}}$ and slice the resulting products $y$ times. It is important that these weight matrices $C_{\alpha}$ have block size $c_{attn}$ smaller than $d_k$ to avoid correlations between different attention heads. After obtaining these projections, we calculate the scaled dot-product attention for each attention head:
$$\operatorname{attention}\big( Q^{(i)}, K^{(i)}, V^{(i)} \big) = \operatorname{softmax}\!\left( \frac{Q^{(i)} {K^{(i)}}^{T}}{\sqrt{d_k}} \right) V^{(i)}.$$
After computing dot-product attention, the output sequences are concatenated into a sequence of shape $(n_Q, d_{model})$ and projected through a dense projection matrix $C_{proj} \in \mathbb{R}^{d_{model} \times d_{model}}$ with block size $c_{attn}$. Thus, each multi-head attention unit learns four dense weight matrices $C_Q, C_K, C_V, C_{proj}$ of shape $d_{model} \times d_{model}$ and block size $c_{attn} < d_{model}/y$.
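For reference, the scaled dot-product attention above can be written in a few lines of NumPy; this computation is the standard one and does not depend on whether the projection matrices are dense or block g-circulant.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_Q, n_K)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (n_Q, d_k)

n_Q, n_K, d_k = 5, 7, 16
rng = np.random.default_rng(5)
out = scaled_dot_product_attention(rng.standard_normal((n_Q, d_k)),
                                   rng.standard_normal((n_K, d_k)),
                                   rng.standard_normal((n_K, d_k)))
print(out.shape)   # (5, 16): one output row per query
```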

3.2. Block g-Circulant Positionwise Feedforward Sublayer

The block $g$-circulant position-wise feedforward sublayer uses two block $g$-circulant weight matrices. It applies the transformation
$$a = C_2\, \mathrm{ReLU}(C_1 x),$$
where $C_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$ has block size $c_1$, $C_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$ has block size $c_2$, $C_1$ and $C_2$ are block $g$-circulant matrices, and by default $d_{ff} = 4 \times d_{model}$, where $d_{ff}$ denotes the dimensionality of the inner layer. When multiplying $C_1$ and $C_2$ with the input vector, the DCT-DST multiplication algorithm is employed.
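A minimal sketch of the feedforward transformation above, assuming the rectangular weights are parameterized as grids of c × c circulant blocks (the g = 1 case) and applied densely for clarity; in the actual model the products are computed with the DCT-DST algorithm, and all names and sizes here are illustrative.

```python
import numpy as np
from scipy.linalg import circulant

def block_circulant_weight(params):
    """params: array of shape (row_blocks, col_blocks, c) holding one generating
    vector per c x c circulant block; returns the assembled dense weight matrix."""
    rows = [[circulant(params[i, j]) for j in range(params.shape[1])]
            for i in range(params.shape[0])]
    return np.block(rows)

d_model, c = 16, 8                   # block size c divides d_model and d_ff
d_ff = 4 * d_model
rng = np.random.default_rng(6)
C1 = block_circulant_weight(rng.standard_normal((d_ff // c, d_model // c, c)))   # (d_ff, d_model)
C2 = block_circulant_weight(rng.standard_normal((d_model // c, d_ff // c, c)))   # (d_model, d_ff)

x = rng.standard_normal(d_model)
a = C2 @ np.maximum(C1 @ x, 0.0)     # a = C_2 ReLU(C_1 x)
print(a.shape)                       # (16,)
```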

3.3. Block Sizes

The block-size parameters $c_{attn}$, $c_1$, and $c_2$ introduce a three-dimensional hyperparameter space. However, we tie them to a single block size $c_m$ so that we could test models and iterate quickly:
$$c_1 = c_m; \qquad c_2 = 4 \times c_m; \qquad c_{attn} = c_m / y.$$
In this case, the maximum value of $c_m$ is $d_{model}$, at which point the weight matrix is entirely circulant.

3.4. DCT-DST Algorithm

When multiplying the block $g$-circulant weight matrices by an input vector in each transformer layer, we applied the following algorithm, adapted from [28]; a sketch of the data flow is given after the list.
  • Compute $v = Q_{nm}^{(g)} c_1$ directly, where $c_1 = C_{nm}^{(g)} e_1$ and $e_1 = (1, 0, 0, \dots, 0)^{T}$.
  • Compute $\hat{v} = (Q_{nm}^{(g)} U_{nm}^{(g)})^{T} v$ by DCT and DST.
  • Form $\Omega_{nm}^{(g)}$.
  • Compute $y_1 = Q_{nm}^{(g)} x$ directly.
  • Compute $y_2 = (Q_{nm}^{(g)} U_{nm}^{(g)})^{T} y_1$ by DCT and DST.
  • Compute $y_3 = \Omega_{nm}^{(g)} y_2$ directly.
  • Compute $y_4 = (Q_{nm}^{(g)} U_{nm}^{(g)}) y_3$ by DCT and DST.
  • Compute $(Q_{nm}^{(g)})^{T} y_4$, which equals $C_{nm}^{(g)} x$.
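The sketch below mirrors the data flow of the steps above, treating $Q_{nm}^{(g)}$, $U_{nm}^{(g)}$, and $\Omega_{nm}^{(g)}$ as precomputed matrices and applying them densely; in the actual algorithm, steps 2, 5, and 7 are carried out by fast DCT and DST transforms via Theorem 5. The final check uses random orthogonal stand-ins for Q and U and a diagonal stand-in for Ω solely to confirm that the composition reproduces $Cx = U\Omega U^T x$; it does not construct the true factors of a block g-circulant matrix.

```python
import numpy as np

def dct_dst_matvec(Q, U, Omega, x):
    """Data flow of the algorithm above: returns C x for C = U Omega U^T.
    The products with (Q U)^T and (Q U) are the ones realized by fast
    DCT/DST transforms in the actual implementation."""
    QU = Q @ U
    y1 = Q @ x          # step 4
    y2 = QU.T @ y1      # step 5
    y3 = Omega @ y2     # step 6
    y4 = QU @ y3        # step 7
    return Q.T @ y4     # step 8: equals C x

# Consistency check with stand-in factors (NOT the true block g-circulant factors).
rng = np.random.default_rng(7)
N = 12
U, _ = np.linalg.qr(rng.standard_normal((N, N)))       # orthogonal stand-in for U_nm^(g)
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))       # orthogonal stand-in for Q_nm^(g)
Omega = np.diag(rng.standard_normal(N))                # diagonal stand-in for Omega_nm^(g)
x = rng.standard_normal(N)

C = U @ Omega @ U.T
print(np.allclose(dct_dst_matvec(Q, U, Omega, x), C @ x))   # True
```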

4. Experiment and Result

4.1. Data and Experimental Details

This experiment was conducted using the Portuguese-English datasets from the TED Talks Open Translation Project, loaded via TensorFlow Datasets. The TED Open Translation Project is a pioneering endeavor by a prominent media platform to subtitle and comprehensively catalog online video content, and it represents a groundbreaking initiative in leveraging volunteer translation for public and professional purposes. The Portuguese-English transcript is just one of over 2400 talks spanning 109 languages available under the project, and Portuguese and English are among the top three languages with the highest number of talks in the TED Talks collection. The dataset contains approximately 52,000 training examples (Portuguese-English sentence pairs), 1200 validation examples, and 1800 test examples.
The dataset was then tokenized using the Moses tokenizer, as in the original transformer [2]. We chose this tokenizer because it is one of the two widely recognized rule-based tokenizers extensively used in machine translation and natural language processing experiments; according to [35], the Moses tokenizer has demonstrated superior performance over several other tokenizers, specifically in neural machine translation. We used the code from the TensorFlow.org tutorial on neural machine translation with a transformer and Keras. Our base model and optimizer setups differ slightly from the original transformer model. Unlike [2], each model applies four layers instead of six, with eight attention heads and a dropout rate of 0.1. We set a batch size of 64 and trained for 20 epochs. The feedforward dimension is four times the model dimension. As in [2], we used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. The details of the experiment are provided in Table 1.
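For concreteness, a hedged sketch of the training configuration described above in TensorFlow/Keras; the learning-rate value is a placeholder, since our runs follow the warmup schedule of the TensorFlow transformer tutorial rather than a fixed rate.

```python
import tensorflow as tf

# Training hyperparameters used in the experiments (see Table 1).
NUM_LAYERS = 4
NUM_HEADS = 8
DROPOUT_RATE = 0.1
BATCH_SIZE = 64
EPOCHS = 20
D_MODEL = 128            # also run with 16, 32, 64, and 256
D_FF = 4 * D_MODEL

# Adam optimizer as in [2]; the fixed learning rate below is only a placeholder
# for the warmup schedule used in the TensorFlow transformer tutorial.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3,
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```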
The model dimension takes various values depending on the size of the weight matrices being tested. The weight matrix sizes are combinations of $n$ and $m$ such that, in the main setting, a block $g$-circulant matrix of dimension 128 is obtained. Choosing a 128-dimensional matrix follows the findings of [26]. We also experimented with other matrix dimensions for comparison. We applied several values of $g$: $g = 1$ and other values such that $(nm, g) \neq 1$ or $(nm, g) = 1$, where $(nm, g)$ denotes the greatest common divisor of $nm$ and $g$.
The model names refer to the types of matrices and algorithms employed in the multi-head attention and feedforward sublayers. We used two types of matrices (dense and real block circulant) in conjunction with the DCT-DST algorithm. For instance, the Dense-Block 2-Circulant DCT-DST transformer model applies dense matrices in the multi-head attention sublayer and real block 2-circulant matrices with the DCT-DST algorithm in the feedforward sublayer. In this experiment, we trained three transformer models with block $g$-circulant weight matrices and one original transformer model. For each model, experiments were conducted with weight matrices across five dimensions: 16, 32, 64, 128, and 256 (Table 2).

4.2. Evaluation

We evaluated performance on a held-out set of 500 samples using the corpus BLEU (Bilingual Evaluation Understudy) score [36]. A held-out set of this size is large enough to make the reported comparisons meaningful.
BLEU serves as a metric designed to evaluate machine-translated text automatically. It quantifies the similarity between the machine-translated text and a set of high-quality reference translations, generating a score ranging from zero to one. BLEU’s notable strength is its strong correlation with human judgments. It achieves this by averaging individual sentence judgment errors across a test corpus rather than attempting to precisely determine human judgment for every single sentence [36,37]. In our study, the corpus BLEU score employed the English sentence as its single reference and the top English sentence output of beam search as the hypothesis for each pair of Portuguese and English sentences in the evaluation set. Aggregating references and hypotheses across all pairings produced the corpus BLEU.
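A sketch of the corpus-BLEU evaluation described above using the sacreBLEU package [37]; the sentences are placeholders, and the actual evaluation pairs each of the 500 held-out hypotheses with its single English reference.

```python
import sacrebleu

# hypotheses: top beam-search English outputs; references: the single English reference per sentence.
hypotheses = ["the cat sat on the mat .", "he went home ."]
references = ["the cat sat on the mat .", "he returned home ."]

# corpus_bleu takes the list of hypotheses and a list of reference streams
# (here one stream, because each sentence has a single reference).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"corpus BLEU = {bleu.score:.2f}")   # reported on a 0-100 scale
```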

4.3. Result and Discussion

Based on the analysis of multiple model samples displayed in Figure 2, it is evident that the models avoid overfitting, demonstrating their capability for predictive tasks. Reviewing the experimental outcomes outlined in Table 3, the Dense-Block 1-Circulant DCT-DST model notably surpasses the Dense-Dense model in BLEU score and model memory efficiency. Specifically, it achieves a 4.1% higher BLEU score and demonstrates a 22.14% improvement in model memory utilization. However, on the other metrics, the block g-circulant matrix-based models still lag behind the Dense-Dense model. During testing on the test dataset, the Dense-Dense model excels notably in test duration. Furthermore, across similar matrix dimensions, all models achieve loss values that exhibit only marginal differences. Notably, the block g-circulant model group with (nm, g) = 1 tends to outperform the group where (nm, g) ≠ 1, both in terms of loss and accuracy. Figure 3, Figure 4, Figure 5 and Figure 6 provide a holistic view of the comparison among diverse g values, highlighting the variations in loss, accuracy, BLEU score, and model memory size.
In general, the transformer model employing the block g-circulant weight matrix has a more efficient model memory size than the Dense-Dense model. This enhanced efficiency can be credited to the block g-circulant matrix, a structured matrix falling within the category of low displacement rank (LDR) matrices [12]. This finding corroborates earlier research detailed in [26,28,29], which similarly highlighted the benefits of block circulant and circulant matrices through experimental evidence. Block g-circulant matrices allow us to leverage the concept of data sparsity. Data sparsity implies that representing an n × n matrix requires fewer than O(n²) parameters. Unlike traditional sparse matrices, data-sparse matrices are not required to contain zero entries; instead, a relationship exists among the matrix entries. Moreover, efficient algorithms, such as computing the matrix-vector product with any vector, can be achieved with fewer than O(n²) operations [12]. This approach is expected to reduce the number of model training parameters used in the experiment, consequently diminishing the demand for storage space. The reduced storage requirement is also believed to stem from implementing the DCT-DST algorithm, elaborated upon in [28].
Additionally, the relatively long test duration observed in the block g-circulant model experiments likely arises from employing a matrix-vector multiplication algorithm with more intricate procedural steps, namely the DCT-DST algorithm. We initially anticipated that implementing the algorithm would streamline the testing process, but the opposite happened. This can be attributed to the algorithm's relatively intricate structure and the deep transformer architecture, which involves a large number of weight matrix and input vector multiplication operations.

5. Conclusions

Incorporating a structured block g-circulant matrix as a weight matrix, combined with the DCT-DST algorithm for multiplication with the input vector in the transformer model, effectively elevated the BLEU score and conserved storage space. However, this approach resulted in a slight reduction in accuracy and an extension of testing time. In a mathematical context, the Kronecker product operation plays a pivotal role in defining the matrices used in the algorithm, enabling the execution of the DCT-DST algorithm on the multiplication of the weight matrix with the transformer input vector.

Author Contributions

Conceptualization, I.M.-A., A.P. and E.A.; methodology, I.M.-A.; software, E.A. and A.P.; validation, I.M.-A. and A.P.; formal analysis, E.A.; investigation, E.A.; resources, I.M.-A. and A.P.; data curation, A.P.; writing—original draft preparation, E.A.; writing—review and editing, E.A., I.M.-A. and A.P.; visualization, I.M.-A.; supervision, I.M.-A. and A.P.; project administration, E.A.; funding acquisition, I.M.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Research and Community Service Program of the Indonesian Ministry of Education, Culture, Research and Technology 2023. The doctoral-degree scholarship was supported by the Center for Higher Education Fund (Balai Pembiayaan Pendidikan Tinggi), the Indonesia Endowment Fund for Education (LPDP), and Agency for the Assessment and Application of Technology (BPPT) through the Beasiswa Pendidikan Indonesia (BPI) Kemendikbudristek.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors express their gratitude to Mikhael Martin for his invaluable contribution in conceptualizing the programming language.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Mitsuda, K.; Higashinaka, R.; Sugiyama, H.; Mizukami, M.; Kinebuchi, T.; Nakamura, R.; Adachi, N.; Kawabata, H. Fine-Tuning a Pre-trained Transformer-Based Encoder-Decoder Model with User-Generated Question-Answer Pairs to Realize Character-like Chatbots. In Conversational AI for Natural Human-Centric Interaction: Proceedings of the 12th International Workshop on Spoken Dialogue System Technology, Singapore, IWSDS 2021; Springer Nature: Singapore, 2022; pp. 277–290. [Google Scholar]
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; dan Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  3. Ranganathan, J.; Abuka, G. Text summarization using transformer model. In Proceedings of the 2022 Ninth International Conference on Social Networks Analysis, Management and Security (SNAMS), Milan, Italy, 29 November–1 December 2022; pp. 1–5. [Google Scholar]
  4. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lucic, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar]
  5. Zeng, P.; Zhang, H.; Song, J.; Gao, L. S2 transformer for image captioning. In Proceedings of the International Joint Conferences on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 1608–1614. [Google Scholar]
  6. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
  7. Toral, A.; Oliver, A.; Ballestín, P.R. Machine translation of novels in the age of transformer. arXiv 2020, arXiv:2011.14979. [Google Scholar]
  8. Araabi, A.; Monz, C. Optimizing transformer for low-resource neural machine translation. arXiv 2020, arXiv:2011.02266. [Google Scholar]
  9. Tian, T.; Song, C.; Ting, J.; Huang, H. A French-to-English machine translation model using transformer network. Procedia Comput. Sci. 2022, 199, 1438–1443. [Google Scholar] [CrossRef]
  10. Ahmed, K.; Keskar, N.S.; Socher, R. Weighted transformer network for machine translation. arXiv 2017, arXiv:1711.02132. [Google Scholar]
  11. Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning deep transformer models for machine translation. arXiv 2019, arXiv:1906.01787. [Google Scholar]
  12. Kissel, M.; Diepold, K. Structured Matrices and Their Application in Neural Networks: A Survey. New Gener. Comput. 2023, 41, 697–722. [Google Scholar] [CrossRef]
  13. Keles, F.D.; Wijewardena, P.M.; Hegde, C. On the computational complexity of self-attention. In Proceedings of the 34th International Conference on Algorithmic Learning Theory, Singapore, 20–23 February 2022; PMLR:2023. pp. 597–619. [Google Scholar]
  14. Pan, Z.; Chen, P.; He, H.; Liu, J.; Cai, J.; Zhuang, B. Mesa: A memory-saving training framework for transformers. arXiv 2021, arXiv:2111.11124. [Google Scholar]
  15. Yang, H.; Zhao, M.; Yuan, L.; Yu, Y.; Li, Z.; Gu, M. Memory-efficient Transformer-based network model for Traveling Salesman Problem. Neural Netw. 2023, 161, 589–597. [Google Scholar] [CrossRef]
  16. Sohoni, N.S.; Aberger, C.R.; Leszczynski, M.; Zhang, J.; Ré, C. Low-memory neural network training: A technical report. arXiv 2019, arXiv:1904.10631. [Google Scholar]
  17. Sainath, T.N.; Kingsbury, B.; Sindhwani, V.; Arisoy, E.; Ramabhadran, B. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6655–6659. [Google Scholar]
  18. Sindhwani, V.; Sainath, T.; Kumar, S. Structured transforms for small-footprint deep learning. arXiv 2015, arXiv:1510.01722. [Google Scholar]
  19. Cheng, Y.; Yu, F.X.; Feris, R.S.; Kumar, S.; Choudhary, A.; Chang, S. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2857–2865. [Google Scholar]
  20. Ding, C.; Liao, S.; Wang, Y.; Li, Z.; Liu, N.; Zhuo, Y.; Wang, C.; Qian, X.; Bai, Y.; Yuan, G.; et al. Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Boston, MA, USA, 14–17 October 2017; pp. 395–408. [Google Scholar]
  21. Yang, Z.; Moczulski, M.; Denil, M.; Freitas, N.D.; Song, L.; Wang, Z. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  22. Thomas, A.; Gu, A.; Dao, T.; Rudra, A.; Ré, C. Learning compressed transforms with low displacement rank. arXiv 2018, arXiv:1810.02309. [Google Scholar]
  23. Dao, T.; Gu, A.; Eichhorn, M.; Rudra, A.; Ré, C. Learning fast algorithms for linear transforms using butterfly factorizations. Proc. Mach. Learn. Res. 2019, 97, 1517–1527. [Google Scholar] [PubMed]
  24. Pan, V. Structured Matrices and Polynomials: Unified Superfast Algorithms; Springer Science and Business Media: Boston, MA, USA; New York, NY, USA, 2001. [Google Scholar]
  25. Davis, P.J. Circulant Matrices; Wiley: New York, NY, USA, 1979; Volume 2. [Google Scholar]
  26. Asriani, E.; Muchtadi-Alamsyah, I.; Purwarianti, A. Real Block-Circulant Matrices and DCT-DST Algorithm for Transformer Neural Network. Front. Appl. Math. Stat. 2023, 9, 1260187. [Google Scholar] [CrossRef]
  27. Asriani, E.; Muchtadi-Alamsyah, I.; Purwarianti, A. g-Circulant Matrices and Its Matrix-Vector Multiplication Algorithm for Transformer Neural Networks. AIP Conf. 2024. post-acceptance. [Google Scholar]
  28. Liu, Z.; Chen, S.; Xu, W.; Zhang, Y. The eigen-structures of real (skew) circulant matrices with some applications. Comput. Appl. Math. 2019, 38, 1–13. [Google Scholar] [CrossRef]
  29. Reid, S.; dan Mistele, M. Fast Fourier Transformed Transformers: Circulant Weight Matrices for NMT Compression. 2019. Available online: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15722831.pdf (accessed on 23 May 2024).
  30. Saxena, A.; Fernandes, F.C. DCT/DST-based transform coding for intra prediction in image/video coding. IEEE Trans. Image Process. 2013, 22, 3974–3981. [Google Scholar] [CrossRef]
  31. Park, W.; Lee, B.; Kim, M. Fast computation of integer DCT-V, DCT-VIII, and DST-VII for video coding. IEEE Trans. Image Process. 2019, 28, 5839–5851. [Google Scholar] [CrossRef]
  32. Olson, B.; Shaw, S.; Shi, C.; Pierre, C.; Parker, R. Circulant matrices and their application to vibration analysis. Appl. Mech. Rev. 2014, 66, 040803. [Google Scholar] [CrossRef]
  33. Serra-Capizzano, S.; Debora, S. A note on the eigenvalues of g-circulants (and of g-Toeplitz, g-Hankel matrices). Calcolo 2014, 51, 639–659. [Google Scholar] [CrossRef]
  34. Wilkinson, J.H. The Algebraic Eigenvalue Problem; Clarendon: Oxford, UK, 1965; Volume 662. [Google Scholar]
  35. Domingo, M.; Garcıa-Martınez, M.; Helle, A.; Casacuberta, F.; Herranz, M. How much does tokenization affect neural machine translation? arXiv 2018, arXiv:1812.08621. [Google Scholar]
  36. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  37. Post, M. A call for clarity in reporting BLEU scores. arXiv 2018, arXiv:1804.08771. [Google Scholar]
Figure 1. The encoder-decoder structure of the block g-circulant DCT-DST transformer.
Figure 2. (a) The loss of the Dense-Block 2-Circulant DCT-DST model; (b) the accuracy of the Dense-Block 2-Circulant DCT-DST model; (c) the loss of the Dense-Block 3-Circulant DCT-DST model; (d) the accuracy of the Dense-Block 3-Circulant DCT-DST model.
Figure 3. The loss values for the four transformer models.
Figure 4. The accuracy values for the four transformer models.
Figure 5. The BLEU scores for the four transformer models.
Figure 6. The model memory size for the four transformer models.
Table 1. The experiment details of models training.
Dataset: Portuguese-English translation dataset from the TED Talks Open Translation Project
Tokenizer: Moses tokenizer
Training hyperparameters: number of epochs = 20; batch size = 64; number of layers = 4; $d_{model}$ = 16, 32, 64, 128, 256; $d_{ff} = 4 \times d_{model}$; number of heads = 8; dropout rate = 0.1
Optimizer: Adam optimizer, $\beta_1 = 0.9$; $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$
Weight matrix dimension: 16, 32, 64, 128, 256
g: 0, 1, 2, 3
Table 2. The transformer model being tested along with the g value and size of the weight matrix.
Model | g | d_model
Dense-Dense (A) | 0 | 16, 32, 64, 128, 256
Dense-Block 1-Circulant DCT-DST (B) | 1 | 16, 32, 64, 128, 256
Dense-Block 2-Circulant DCT-DST (C) | 2 | 16, 32, 64, 128, 256
Dense-Block 3-Circulant DCT-DST (D) | 3 | 16, 32, 64, 128, 256
Table 3. The evaluated metrics across the four transformer models include accuracy, loss, test duration, BLEU score, and model memory consumption.
Model | g | d_model | Loss | Accuracy | Test Duration (s) | BLEU (%) | Model Memory (KB)
A | 0 | 16 | 4.027 | 0.327 | 7.831 | 2.78 | 1751
A | 0 | 32 | 2.963 | 0.485 | 7.907 | 2.71 | 3546
A | 0 | 64 | 2.404 | 0.569 | 8.260 | 2.71 | 7540
A | 0 | 128 | 2.320 | 0.607 | 8.895 | 25.43 | 18,394
A | 0 | 256 | 2.219 | 0.616 | 9.348 | 25.43 | 50,855
B | 1 | 16 | 4.1058 | 0.3186 | 8.9704 | 1.75 | 1714
B | 1 | 32 | 3.3718 | 0.4201 | 9.5401 | 11.64 | 3227
B | 1 | 64 | 2.651 | 0.5244 | 11.566 | 22.23 | 6542
B | 1 | 128 | 2.373 | 0.579 | 27.839 | 26.47 | 14,322
B | 1 | 256 | 2.3089 | 0.5855 | 121.908 | 24.65 | 34,491
C | 2 | 16 | 4.151 | 0.317 | 17.591 | 2.29 | 1732
C | 2 | 32 | 3.338 | 0.425 | 24.934 | 9.06 | 3246
C | 2 | 64 | 2.619 | 0.528 | 31.145 | 20.8 | 6560
C | 2 | 128 | 2.448 | 0.555 | 59.918 | 21.69 | 14,340
C | 2 | 256 | 2.323 | 0.577 | 180.119 | 24.18 | 34,509
D | 3 | 16 | 4.0974 | 0.3232 | 18.4132 | 2.54 | 1732
D | 3 | 32 | 3.2157 | 0.4404 | 24.5743 | 11.64 | 3246
D | 3 | 64 | 2.6102 | 0.5312 | 36.9716 | 21.35 | 6560
D | 3 | 128 | 2.373 | 0.570 | 68.211 | 24.12 | 14,340
D | 3 | 256 | 2.3079 | 0.5798 | 283.2846 | 24.98 | 34,509