In this section, we first formulate the problem mathematically. Then, we give a general overview of the model's framework. Finally, we discuss each component of the model in detail.
3.1. Problem Statement
First, we denote a user as $u \in \mathcal{U}$, an item as $v \in \mathcal{V}$, a category as $c \in \mathcal{C}$, and user $u$'s historical purchase history as $S_u = [s_1^u, s_2^u, \ldots, s_T^u]$, where $s_j^u = (v_j^u, c_j^u)$ and $T$ is the length of the action sequence. Here, $s_j^u$ represents a two-tuple consisting of the item $v_j^u$ purchased by user $u$ at the $j$th moment, and $c_j^u$ represents a category belonging to $v_j^u$. The model employs the sequential representation obtained by the encoders to predict the next possible item. Define $v_{T+1}^u$ as the item user $u$ interacts with at the $(T+1)$th moment; then, the probability value of user $u$ interacting with each item $v$ at the $(T+1)$th moment can be obtained:
$$p\left(v_{T+1}^u = v \mid S_u\right).$$
Then, by sorting the probability values corresponding to each item, we can obtain a top-$K$ candidate set for the user according to the chosen value of $K$. The main abbreviations and notations used throughout this paper are summarized in Table 1 and Table 2, respectively.
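As a concrete illustration of this last step, the following minimal Python sketch (not taken from the paper; the score array and function name are hypothetical) derives a top-$K$ candidate set from per-item probability scores:

```python
import numpy as np

def top_k_candidates(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k items with the highest predicted
    interaction probability, sorted from most to least likely."""
    # argsort is ascending, so take the last k indices and reverse them.
    return np.argsort(scores)[-k:][::-1]

# Hypothetical probability scores over a catalog of 6 items.
scores = np.array([0.05, 0.40, 0.10, 0.25, 0.15, 0.05])
print(top_k_candidates(scores, k=3))  # -> [1 3 4]
```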
3.2. Overview of the GAT4Rec Model
In this section, the GAT4Rec model, which combines a Transformer and a GRU, is introduced to model the prediction task defined above, as shown in Figure 1. The whole model is divided into the following layers: (1) user embedding layer; (2) gated filter layer; (3) item embedding layer; and (4) Transformer layer.
Denote $S_u^k = [s_{T-k+1}^u, \ldots, s_T^u]$ as the most recent $k$ actions in the whole sequence, which are selected as the input to the user interest encoding layer. According to the literature [9], user interest mostly focuses on the most recently purchased products. For example, when choosing a cell phone, users usually tend to check another cell phone as a reference. In order to balance diversity and generality, we select the categories of purchased items as the representation of the user's interest tendency.
According to the user interest tendency representation, the gated filter layer retains the historical items that support the current interest vector, i.e., those whose category embeddings are more closely aligned with the learned user interest tendency representation in the vector space. The filtered item sequences are fed into an encoding layer consisting of $L$ stacked Transformer layers, each containing $H$ attention heads, with information passed between layers through fully connected sublayers. Unlike an RNN, the Transformer allows the whole model to be trained in parallel, and each layer effectively re-encodes the input items. Finally, the output at the position of the mask token is used to predict the final set of recommended items.
3.3. User Embedding Layer
Because user IDs may be hidden (anonymous users), modeling the IDs directly can introduce null values that negatively affect the model, so we need other information to help construct user representations. Compared with other features, the categories of items are easier to obtain, and using them to represent users' interests allows this representation to double as a user embedding. A category-based feature representation also keeps the model compact when the number of users is small, yielding better scalability and diversity in the recommendation results.
A GRU [45] is a gating mechanism in a recurrent network with fewer parameters; unlike Long Short-Term Memory (LSTM), it lacks an output gate. A GRU has only two gates and applies a reset gate directly to the previous hidden state, achieving better performance on smaller, frequent datasets in certain tasks such as natural language processing [46,47]. Thus, considering the limited length $k$ and the fact that a GRU is more efficient than an LSTM in training, we use a GRU to model the category sequences and obtain the current user representations.
Specifically, we have the sequence $S_u^k$ with its corresponding category sequence $C_u^k = [c_{T-k+1}, c_{T-k+2}, \ldots, c_T]$, where $c_i$ denotes the category corresponding to the $i$-th item in the sequence and $T$ denotes the last index. By mapping transformations, we can obtain the embedding of the category sequence $E_c = [e_{c_{T-k+1}}, \ldots, e_{c_T}]$, where $e_{c_i} \in \mathbb{R}^d$. We input this sequence of categories into the structure composed of GRU units to obtain the user embedding representations. The nodes of the GRU are updated as follows:
$$z_t = \sigma\left(W_z e_{c_t} + U_z h_{t-1}\right),$$
$$r_t = \sigma\left(W_r e_{c_t} + U_r h_{t-1}\right),$$
$$\tilde{h}_t = \tanh\left(W_h e_{c_t} + U_h \left(r_t \odot h_{t-1}\right)\right),$$
$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$
where $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, and $U_h$ are the weight matrices to be learned; $\odot$ denotes the Hadamard product (e.g., for two matrices $A$ and $B$ of the same dimensions, the Hadamard product $A \odot B$ is a matrix of the same dimensions as the operands, with elements $(A \odot B)_{ij} = A_{ij} B_{ij}$); and $\sigma(\cdot)$ denotes the sigmoid activation function. Finally, we take the hidden layer representation obtained from the $k$th GRU unit as the user interest tendency representation, whose expression is
$$u = h_k,$$
where $u$ is the user interest tendency representation. The network parameters of the GRU are trained by feeding $u$ to a multilayer perceptron and a softmax operation. Considering the complexity of the overall network, the training time of this network needs to be reduced to prevent overfitting.
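To make the data flow concrete, the following PyTorch-style sketch (our illustration, not the authors' code; module names and dimensions are assumptions) encodes a category sequence with a GRU and takes the last hidden state as the user interest representation $u$:

```python
import torch
import torch.nn as nn

class UserInterestEncoder(nn.Module):
    """Encodes the recent-k category sequence into a user embedding u."""
    def __init__(self, num_categories: int, embed_dim: int = 64):
        super().__init__()
        self.cat_embedding = nn.Embedding(num_categories, embed_dim)
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, category_ids: torch.Tensor) -> torch.Tensor:
        # category_ids: (batch, k) integer category indices.
        e_c = self.cat_embedding(category_ids)  # (batch, k, d)
        _, h_k = self.gru(e_c)                  # h_k: (1, batch, d)
        return h_k.squeeze(0)                   # u: (batch, d)

encoder = UserInterestEncoder(num_categories=500)
u = encoder(torch.randint(0, 500, (2, 10)))     # two users, k = 10
print(u.shape)                                  # torch.Size([2, 64])
```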
3.4. Gated Filter Layer
The gated filter layer in our model serves a crucial role in selectively refining users' historical interaction sequences. Inspired by the gating mechanisms in LSTM networks [48], our layer dynamically adjusts the information flow, retaining elements that are significant and discarding those deemed less relevant. As users accumulate a history of shopping items over time, modeling the entire sequence may lead to the inclusion of items that no longer align with their current interests, potentially degrading the recommendation quality. Therefore, it is imperative to filter the historical sequence to identify items that resonate with the users' current preferences.
In recent years, attention mechanisms [14], including self-attention, soft attention, and hard attention, have become prevalent in sequential modeling. Hard attention is specifically employed to highlight the most salient items within a sequence. Our model adopts the concept of hard attention to pinpoint the items in the sequence that best align with users' current interest tendencies.
Given the hidden state $u$ obtained from our model, we proceed to filter the eligible historical interactions. We focus on items with an index less than or equal to $T-k$, i.e., the earlier portion of the sequence that precedes the most recent $k$ actions used to encode the user's interest. This historical sequence is mapped to a corresponding category sequence and, subsequently, to category word embeddings $e_{c_i}$.
Building upon Cai's method [49], which selects items based on the top-k softmax values of their interaction with the user interest representation, we introduce a hyperparameter $\epsilon$ to refine this selection process. We define the relevance score for each category embedding $e_{c_i}$ as follows:
$$r_i = \sigma\left(u \cdot e_{c_i}\right),$$
where $\sigma$ is the sigmoid function and $u$ is the user interest representation obtained from the model. Only items with a relevance score exceeding $\epsilon$ are considered, which is mathematically expressed as follows:
$$r_i > \epsilon.$$
This condition results in a subsequence $\tilde{S}_u$ that is more attuned to users' current interests:
$$\tilde{S}_u = \left[\, s_i \mid \sigma\left(u \cdot e_{c_i}\right) > \epsilon \,\right].$$
This refined subsequence serves as the input for our model's recommendation process.
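A minimal sketch of this hard-attention filter, assuming the user vector $u$ and per-item category embeddings produced by the previous layer (names and shapes are our assumptions, not the paper's code):

```python
import torch

def gated_filter(item_ids: torch.Tensor,
                 cat_embeds: torch.Tensor,
                 u: torch.Tensor,
                 epsilon: float = 0.5) -> torch.Tensor:
    """Keep only historical items whose sigmoid-activated dot product
    with the user interest vector u exceeds the threshold epsilon."""
    # cat_embeds: (n, d) category embeddings of n historical items.
    # u: (d,) user interest representation from the GRU encoder.
    relevance = torch.sigmoid(cat_embeds @ u)  # (n,) scores in (0, 1)
    return item_ids[relevance > epsilon]       # filtered subsequence

item_ids = torch.tensor([11, 42, 7, 90])
cat_embeds = torch.randn(4, 64)
u = torch.randn(64)
print(gated_filter(item_ids, cat_embeds, u))
```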
3.5. Item Embedding Layer
The number of items is often in the thousands, and item IDs are usually one-hot encoded over the integer domain, which greatly increases the number of parameters in the model. So, we first map the one-hot encoding to a low-dimensional embedding vector to reduce the dimensionality and improve the representation learning ability of the model. We denote the size of the item set as $|\mathcal{V}|$ and the dimensionality of the embedding vector as $d$, so the number of parameters the model needs to learn can be represented as $O(|\mathcal{V}| \times d)$. It should be noted that here we use factorized embedding parameterization [16] to further reduce the number of parameters in the model, which helps in expanding the model's capabilities. In simple terms, embedding factorization means adding another layer to the original mapping matrix. Assuming that the added embedding dimension is $E$, the number of parameters becomes $O(|\mathcal{V}| \times E + E \times d)$, and if $E \ll d$, the reduction in the number of parameters is obvious. Thus, the item embedding can be expressed as $e_v = W^{(2)} W^{(1)} v$, where $W^{(1)} \in \mathbb{R}^{E \times |\mathcal{V}|}$ and $W^{(2)} \in \mathbb{R}^{d \times E}$. The factorization is schematically shown in Figure 2.
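The following sketch illustrates the parameter counting behind this factorization (a hedged illustration with assumed sizes, not the paper's implementation):

```python
import torch.nn as nn

num_items, d, E = 50_000, 256, 32

# Direct embedding: |V| x d parameters.
direct = nn.Embedding(num_items, d)

# Factorized embedding: |V| x E lookup followed by an E x d projection.
factorized = nn.Sequential(
    nn.Embedding(num_items, E),
    nn.Linear(E, d, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(direct), count(factorized))  # 12800000 vs 1608192
```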
The sequential order of items can reflect the changes in users' behavior, but unlike recurrent neural networks, the Transformer module does not inherently encode ordering information, so an additional positional embedding is required to ensure that the model can learn the importance of position. We denote the position embedding as $P \in \mathbb{R}^{L \times d}$, where $L$ is the maximum length of the sequence. Here, the position embeddings are learned by the model rather than fixed.
To make full use of the hidden layer information in user representations, we choose to add it to the Transformer layer for joint learning. The vector combination process is shown in Figure 3. First, the item embedding $e_{v_i}$ and the position embedding $p_i$ are added to obtain $m_i = e_{v_i} + p_i$. Then, $m_i$ is concatenated with the user embedding $u$ and linearly transformed to obtain the final input embedding $x_i$ for the model.
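A minimal sketch of this combination step, assuming learned position embeddings and a linear projection after concatenation (all names and sizes are our assumptions):

```python
import torch
import torch.nn as nn

d, max_len = 64, 50
pos_embedding = nn.Embedding(max_len, d)          # learned position table P
proj = nn.Linear(2 * d, d)                        # linear map after concat

def build_input(e_v: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    # e_v: (batch, seq, d) item embeddings; u: (batch, d) user embedding.
    positions = torch.arange(e_v.size(1))
    m = e_v + pos_embedding(positions)            # add positional embedding
    u_tiled = u.unsqueeze(1).expand_as(m)         # broadcast u to each step
    return proj(torch.cat([m, u_tiled], dim=-1))  # (batch, seq, d)

x = build_input(torch.randn(2, 10, d), torch.randn(2, d))
print(x.shape)  # torch.Size([2, 10, 64])
```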
3.6. Transformer Layer
Learning item-to-item transition patterns from sequences is the primary purpose of sequential recommendation. Compared with RNN-based feature extraction networks, Transformers not only perform better in handling long-sequence tasks [50] but also support parallelized processing, which ensures that their training speed is better than that of RNNs [51]. The Transformer layer is the main structure of this model and consists of multiple stacked Transformers. The Transformer [14] is a Sequence-to-Sequence (Seq2Seq) model based on multi-head self-attention, which consists of two main components: an encoder and a decoder. As shown in Figure 4, the Transformer consists of a multi-head attention layer, a feedforward network layer, and a normalization layer. The key component of this module is the multi-head attention layer, which employs a self-attention mechanism with multiple heads to assist in learning representation vectors in different subspaces. The attention formula is shown in Equation (7):
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \quad (7)$$
where $Q$, $K$, and $V$ represent the query, key, and value, respectively, and $d$ is the vector dimensionality. Because the self-attention mechanism is used in the Transformer, $Q$, $K$, and $V$ are generated from the same vector. In multi-head attention, the vector is divided into several parts that are calculated independently, as shown in Formula (8):
$$\text{head}_i = \text{Attention}\left(H^{L} W_i^{Q},\, H^{L} W_i^{K},\, H^{L} W_i^{V}\right), \qquad \text{MultiHead}\left(H^{L}\right) = \text{Concat}\left(\text{head}_1, \ldots, \text{head}_n\right) W^{O}, \quad (8)$$
where $H^{L}$ represents the hidden layer representation output of the $L$-th layer. The attention for each head is obtained using independent weight matrices $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$, which are not shared across heads. Finally, the obtained $n$ heads are concatenated together and then multiplied by a weight matrix $W^{O}$ to obtain the current multi-head attention value of the $L$-th layer.
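For reference, multi-head self-attention over the hidden states can be exercised with PyTorch's built-in module (an illustrative usage, not the paper's implementation; dimensions are assumed):

```python
import torch
import torch.nn as nn

d, n_heads = 64, 4
mha = nn.MultiheadAttention(embed_dim=d, num_heads=n_heads, batch_first=True)

h = torch.randn(2, 10, d)     # hidden states H^L: (batch, seq, d)
# Self-attention: query, key, and value all come from the same tensor.
attn_out, attn_weights = mha(h, h, h)
print(attn_out.shape)         # torch.Size([2, 10, 64])
```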
The feedforward network layer mainly uses the Gaussian Error Linear Unit (GELU) function, defined in Equation (11) [52], to activate the multi-head attention value. Compared with ReLU, GELU is smoother [52]. The activation formulas are shown in Equations (10) and (11):
$$\text{FFN}(x) = \text{GELU}\left(x W^{(1)} + b^{(1)}\right) W^{(2)} + b^{(2)}, \quad (10)$$
$$\text{GELU}(x) = x \, \Phi(x), \quad (11)$$
where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution; $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, and $b^{(2)}$ are the learned parameters; and $\text{FFN}(\cdot)$ denotes the feedforward network. These parameters are shared across each Transformer.
In the normalization layer, we use the residual network to ensure the parameter learning effect of the deep network. Combining the multi-head attention layer and the feedforward network layer, the overall process of the Transformer is as follows:
$$A = \text{LayerNorm}\left(H^{L} + \text{MultiHead}\left(H^{L}\right)\right),$$
$$H^{L+1} = \text{LayerNorm}\left(A + \text{FFN}\left(A\right)\right).$$
The entire encoding layer is composed of many Transformers, and the parameters are shared among layers, which greatly reduces the overall number of model parameters, allowing for the possibility of extending the model.
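A compact sketch of one such Transformer block, with a GELU feedforward sublayer, residual connections, layer normalization, and cross-layer parameter sharing as described above (a hedged illustration; sizes and the sharing loop are our assumptions):

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One Transformer block: multi-head self-attention + GELU feedforward,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        a = self.norm1(h + self.mha(h, h, h)[0])  # attention sublayer
        return self.norm2(a + self.ffn(a))        # feedforward sublayer

# Share one layer's parameters across L applications (cross-layer sharing).
layer, L = TransformerLayer(), 2
h = torch.randn(2, 10, 64)
for _ in range(L):
    h = layer(h)
print(h.shape)  # torch.Size([2, 10, 64])
```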
We select the Transformer output $h_l^{L}$ corresponding to the end of the sequence as the item's representation vector for subsequent recommendations, where $L$ represents the $L$-th layer and $l$ represents the output corresponding to the $l$-th node, as follows:
$$o = h_l^{L}.$$
The flowchart of the GAT4Rec model is depicted in Figure 5. The flowchart begins with the initialization of user and item embeddings, which are essential for capturing the characteristics of a user's behavior and item features. It then depicts the transition through the GRU layer, emphasizing the role of this layer in modeling the dynamic evolution of the user's interests. Next, the flowchart illustrates the gating mechanism, a distinctive feature of our model that filters out irrelevant subsequences to focus on the most salient parts of a user's interaction history. This selective process is crucial for enhancing the accuracy and relevance of our recommendations. Following the gating mechanism, the flowchart leads to the Transformer layer, highlighting the self-attention mechanism that allows for the parallel processing of sequence elements, thus improving the model's efficiency and effectiveness in handling long-range dependencies. Finally, the flowchart culminates in the recommendation generation step, where the filtered and transformed user sequence is used to predict and rank potential items of interest to the user.
We summarize our GAT4Rec model in Algorithm 1 with pseudocode. In Algorithm 1, Lines 5 and 6 represent the hard attention process within the gated filter layer. For each category $c_i$ in the sequence, the dot product of the gating signal $u$, which represents the latent representation of users' interests, and the category embedding vector $e_{c_i}$ is calculated. If this result is less than a predefined threshold $\epsilon$, the operation described in Line 7 is executed: the corresponding item $v_i$ is removed from the sequence. This constitutes the implementation of hard attention, which selectively focuses on the most relevant parts of the sequence. By applying this mechanism, the model filters out items that are deemed less important, thereby enhancing the model's ability to concentrate on users' current interests.
Algorithm 1: Bidirectional self-attention model for recommendation: GAT4Rec.
3.7. Model Training
Due to the bidirectional modeling property of our model, it needs to be trained through mask learning. The general idea is that, for the sequence $S_u$, we sample a subset of positions and perform the mask operation on them. Here, 80% of the sampled positions are replaced with a mask token, 10% are replaced with a random item from the item set, and 10% remain unchanged. After the mask operation, the model input can be expressed as $\hat{S}_u$. The purpose of this model is to predict the original items at the masked positions using the obtained output features. Therefore, the learning task of the model can be considered a multiclass classification task. Here, we choose to use the softmax function to obtain the probability values of the occurrence of each item in the item set. Considering that the model focuses only on whether the predicted items match the true labels, the training on the encoding layer uses the cross-entropy loss function, as follows:
$$\mathcal{L}_{\text{mask}} = -\frac{1}{N} \sum_{n=1}^{N} \frac{1}{M} \sum_{i=1}^{M} \log p\left(y_i \mid \hat{S}_u\right),$$
where $M$ denotes the number of items affected by the mask, $N$ is the total number of input sequences, and $y_i$ is the label value of item $i$, i.e., the item present before the mask operation.
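The 80/10/10 masking scheme can be sketched as follows (our illustration; the ratios follow the text, while the token id, sampling rate, and function name are assumptions):

```python
import random

MASK = 0  # hypothetical id reserved for the mask token

def mask_sequence(seq, item_set, sample_rate=0.2):
    """Randomly sample positions; of those, 80% become the mask token,
    10% become a random item, and 10% are left unchanged."""
    masked, labels = list(seq), {}
    for pos, item in enumerate(seq):
        if random.random() >= sample_rate:
            continue
        labels[pos] = item                         # remember the original item
        r = random.random()
        if r < 0.8:
            masked[pos] = MASK                     # replace with mask token
        elif r < 0.9:
            masked[pos] = random.choice(item_set)  # replace with random item
        # else: keep the original item unchanged
    return masked, labels

seq = [11, 42, 7, 90, 3]
print(mask_sequence(seq, item_set=list(range(1, 100))))
```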
In addition, the GRU module needs to be trained to predict the category information of the next item. Since the category label may contain more than one category, the task is multi-label classification, so the binary cross-entropy loss function is used, as follows:
$$\mathcal{L}_{\text{cat}} = -\frac{1}{C} \sum_{c=1}^{C} \left[ y_c \log \hat{y}_c + \left(1 - y_c\right) \log\left(1 - \hat{y}_c\right) \right],$$
where $C$ denotes the number of categories contained in the training samples.
Thus, the loss function consists of the encoding layer loss and the category prediction loss, where $\lambda$ is the weight value balancing the two loss functions, which can be adjusted according to the actual training situation. The expression of the loss function is as follows:
$$\mathcal{L} = \mathcal{L}_{\text{mask}} + \lambda \mathcal{L}_{\text{cat}}.$$
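A minimal sketch of this joint objective, assuming logits from the encoding layer and the GRU category head (names, shapes, and the default weight are our assumptions):

```python
import torch
import torch.nn.functional as F

def total_loss(item_logits, item_labels, cat_logits, cat_labels, lam=0.5):
    """Joint objective: masked-item cross-entropy plus weighted
    multi-label binary cross-entropy for category prediction."""
    l_mask = F.cross_entropy(item_logits, item_labels)       # encoder loss
    l_cat = F.binary_cross_entropy_with_logits(cat_logits,   # GRU loss
                                               cat_labels)
    return l_mask + lam * l_cat

item_logits = torch.randn(8, 1000)       # 8 masked positions, 1000 items
item_labels = torch.randint(0, 1000, (8,))
cat_logits = torch.randn(8, 20)          # 20 categories, multi-label
cat_labels = torch.randint(0, 2, (8, 20)).float()
print(total_loss(item_logits, item_labels, cat_logits, cat_labels))
```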