In this section, we first formulate the problem mathematically. Then, we give a general overview of the model's framework. Finally, we discuss each component of the model in detail.
3.1. Problem Statement
First, we denote a user as $u \in \mathcal{U}$, an item as $v \in \mathcal{V}$, a category as $c \in \mathcal{C}$, and user $u$'s historical purchase history as $S_u = [s_1^u, s_2^u, \ldots, s_T^u]$, where $s_j^u = (v_j^u, c_j^u)$ and $T$ is the length of the action sequence. Here, $s_j^u$ represents a two-tuple consisting of the item $v_j^u$ purchased by user $u$ at the $j$th moment, and $c_j^u$ represents a category belonging to $v_j^u$. The model employs the sequential representation obtained by the encoders to predict the next possible item. Define $v_{T+1}^u$ as the item user $u$ interacts with at the $(T+1)$th moment; then, the probability value of user $u$ interacting with each item $v$ at the $(T+1)$th moment can be obtained:
$$p\left(v_{T+1}^u = v \mid S_u\right).$$
Then, by sorting the probability values corresponding to each item, we can obtain a top-$K$ candidate set for the user according to the chosen value of $K$. The main abbreviations and notations used throughout this paper are summarized in Table 1 and Table 2, respectively.
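As a concrete illustration of this last step, the following minimal Python sketch (not taken from the paper; the score array and function name are hypothetical) derives a top-$K$ candidate set from per-item probability scores:

```python
import numpy as np

def top_k_candidates(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k items with the highest predicted
    interaction probability, sorted from most to least likely."""
    # argsort is ascending, so take the last k indices and reverse them.
    return np.argsort(scores)[-k:][::-1]

# Hypothetical probability scores over a catalog of 6 items.
scores = np.array([0.05, 0.40, 0.10, 0.25, 0.15, 0.05])
print(top_k_candidates(scores, k=3))  # -> [1 3 4]
```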
3.2. Overview of the GAT4Rec Model
In this section, the GAT4Rec model, which combines a Transformer and a GRU, is introduced to model the prediction task defined above, as shown in Figure 1. The whole model is divided into the following layers: (1) user embedding layer; (2) gated filter layer; (3) item embedding layer; and (4) Transformer layer.
Denote $S_u^k = [s_{T-k+1}^u, \ldots, s_T^u]$ as the most recent $k$ actions in the whole sequence, which are selected as the input to the user interest encoding layer. According to the literature [9], user interest mostly focuses on the most recently purchased products. For example, when choosing a cell phone, users usually tend to check another cell phone as a reference. In order to balance diversity and generality, we select the categories of purchased items as the representation of the user's interest tendency.
According to the user interest tendency representation, the gated filter layer retains the historical items that support the current interest vector, i.e., those whose category embeddings are more closely aligned with the learned user interest tendency representation in the vector space. The filtered item sequences are fed into an encoding layer consisting of $L$ stacked Transformer layers, each containing $H$ attention heads, with information passed between layers through fully connected sublayers. Unlike an RNN, the Transformer allows the whole model to be trained in parallel, and each layer effectively re-encodes the input items. Finally, the output at the position of the mask token is used to predict the final set of recommended items.
3.3. User Embedding Layer
Because user IDs may be hidden (anonymous users), modeling the IDs directly can introduce null values that negatively affect the model, so we need other information to help construct user representations. Compared with other features, the categories of items are easier to obtain, and using them to represent users' interests allows this representation to double as a user embedding. A category-based feature representation also keeps the model compact when the number of users is small, yielding better scalability and diversity in the recommendation results.
A GRU [45] is a gating mechanism in a recurrent network with fewer parameters; unlike Long Short-Term Memory (LSTM), it lacks an output gate. A GRU has only two gates and applies a reset gate directly to the previous hidden state, achieving better performance on smaller, frequent datasets in certain tasks such as natural language processing [46,47]. Thus, considering the limited length $k$ and the fact that a GRU is more efficient than an LSTM in training, we use a GRU to model the category sequences and obtain the current user representations.
Specifically, we have the sequence $S_u^k$ with its corresponding category sequence $C_u^k = [c_{T-k+1}, c_{T-k+2}, \ldots, c_T]$, where $c_i$ denotes the category corresponding to the $i$-th item in the sequence and $T$ denotes the last index. By mapping transformations, we can obtain the embedding of the category sequence $E_c = [e_{c_{T-k+1}}, \ldots, e_{c_T}]$, where $e_{c_i} \in \mathbb{R}^d$. We input this sequence of categories into the structure composed of GRU units to obtain the user embedding representations. The nodes of the GRU are updated as follows:
$$z_t = \sigma\left(W_z e_{c_t} + U_z h_{t-1}\right),$$
$$r_t = \sigma\left(W_r e_{c_t} + U_r h_{t-1}\right),$$
$$\tilde{h}_t = \tanh\left(W_h e_{c_t} + U_h \left(r_t \odot h_{t-1}\right)\right),$$
$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$
where $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, and $U_h$ are the weight matrices to be learned; $\odot$ denotes the Hadamard product (e.g., for two matrices $A$ and $B$ of the same dimensions, the Hadamard product $A \odot B$ is a matrix of the same dimensions as the operands, with elements $(A \odot B)_{ij} = A_{ij} B_{ij}$); and $\sigma(\cdot)$ denotes the sigmoid activation function. Finally, we take the hidden layer representation obtained from the $k$th GRU unit as the user interest tendency representation, whose expression is
$$u = h_k,$$
where $u$ is the user interest tendency representation. The network parameters of the GRU are trained by feeding $u$ to a multilayer perceptron and a softmax operation. Considering the complexity of the overall network, the training time of this network needs to be reduced to prevent overfitting.
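To make the data flow concrete, the following PyTorch-style sketch (our illustration, not the authors' code; module names and dimensions are assumptions) encodes a category sequence with a GRU and takes the last hidden state as the user interest representation $u$:

```python
import torch
import torch.nn as nn

class UserInterestEncoder(nn.Module):
    """Encodes the recent-k category sequence into a user embedding u."""
    def __init__(self, num_categories: int, embed_dim: int = 64):
        super().__init__()
        self.cat_embedding = nn.Embedding(num_categories, embed_dim)
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, category_ids: torch.Tensor) -> torch.Tensor:
        # category_ids: (batch, k) integer category indices.
        e_c = self.cat_embedding(category_ids)  # (batch, k, d)
        _, h_k = self.gru(e_c)                  # h_k: (1, batch, d)
        return h_k.squeeze(0)                   # u: (batch, d)

encoder = UserInterestEncoder(num_categories=500)
u = encoder(torch.randint(0, 500, (2, 10)))     # two users, k = 10
print(u.shape)                                  # torch.Size([2, 64])
```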
3.4. Gated Filter Layer
The gated filter layer in our model serves a crucial role in selectively refining users' historical interaction sequences. Inspired by the gating mechanisms in LSTM networks [48], our layer dynamically adjusts the information flow, retaining elements that are significant and discarding those deemed less relevant. As users accumulate a history of shopping items over time, modeling the entire sequence may lead to the inclusion of items that no longer align with their current interests, potentially degrading the recommendation quality. Therefore, it is imperative to filter the historical sequence to identify items that resonate with the users' current preferences.
In recent years, attention mechanisms [14], including self-attention, soft attention, and hard attention, have become prevalent in sequential modeling. Hard attention is specifically employed to highlight the most salient items within a sequence. Our model adopts the concept of hard attention to pinpoint the items in the sequence that best align with users' current interest tendencies.
Given the hidden state $u$ obtained from our model, we proceed to filter the eligible historical interactions. We focus on items with an index less than or equal to $T-k$, i.e., the earlier portion of the sequence that precedes the most recent $k$ actions used to encode the user's interest. This historical sequence is mapped to a corresponding category sequence and, subsequently, to category word embeddings $e_{c_i}$.
Building upon Cai's method [49], which selects items based on the top-k softmax values of their interaction with the user interest representation, we introduce a hyperparameter $\epsilon$ to refine this selection process. We define the relevance score for each category embedding $e_{c_i}$ as follows:
$$r_i = \sigma\left(u \cdot e_{c_i}\right),$$
where $\sigma$ is the sigmoid function and $u$ is the user interest representation obtained from the model. Only items with a relevance score exceeding $\epsilon$ are considered, which is mathematically expressed as follows:
$$r_i > \epsilon.$$
This condition results in a subsequence $\tilde{S}_u$ that is more attuned to users' current interests:
$$\tilde{S}_u = \left[\, s_i \mid \sigma\left(u \cdot e_{c_i}\right) > \epsilon \,\right].$$
This refined subsequence serves as the input for our model's recommendation process.
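A minimal sketch of this hard-attention filter, assuming the user vector $u$ and per-item category embeddings produced by the previous layer (names and shapes are our assumptions, not the paper's code):

```python
import torch

def gated_filter(item_ids: torch.Tensor,
                 cat_embeds: torch.Tensor,
                 u: torch.Tensor,
                 epsilon: float = 0.5) -> torch.Tensor:
    """Keep only historical items whose sigmoid-activated dot product
    with the user interest vector u exceeds the threshold epsilon."""
    # cat_embeds: (n, d) category embeddings of n historical items.
    # u: (d,) user interest representation from the GRU encoder.
    relevance = torch.sigmoid(cat_embeds @ u)  # (n,) scores in (0, 1)
    return item_ids[relevance > epsilon]       # filtered subsequence

item_ids = torch.tensor([11, 42, 7, 90])
cat_embeds = torch.randn(4, 64)
u = torch.randn(64)
print(gated_filter(item_ids, cat_embeds, u))
```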
3.5. Item Embedding Layer
The number of items is often in the thousands, and item IDs are usually one-hot encoded over the integer domain, which greatly increases the number of parameters in the model. So, we first map the one-hot encoding to a low-dimensional embedding vector to reduce the dimensionality and improve the representation learning ability of the model. We denote the size of the item set as $|\mathcal{V}|$ and the dimensionality of the embedding vector as $d$, so the number of parameters the model needs to learn can be represented as $O(|\mathcal{V}| \times d)$. It should be noted that here we use factorized embedding parameterization [16] to further reduce the number of parameters in the model, which helps in expanding the model's capabilities. In simple terms, embedding factorization means adding another layer to the original mapping matrix. Assuming that the added embedding dimension is $E$, the number of parameters becomes $O(|\mathcal{V}| \times E + E \times d)$, and if $E \ll d$, the reduction in the number of parameters is obvious. Thus, the item embedding can be expressed as $e_v = W^{(2)} W^{(1)} v$, where $W^{(1)} \in \mathbb{R}^{E \times |\mathcal{V}|}$ and $W^{(2)} \in \mathbb{R}^{d \times E}$. The factorization is schematically shown in Figure 2.
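The following sketch illustrates the parameter counting behind this factorization (a hedged illustration with assumed sizes, not the paper's implementation):

```python
import torch.nn as nn

num_items, d, E = 50_000, 256, 32

# Direct embedding: |V| x d parameters.
direct = nn.Embedding(num_items, d)

# Factorized embedding: |V| x E lookup followed by an E x d projection.
factorized = nn.Sequential(
    nn.Embedding(num_items, E),
    nn.Linear(E, d, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(direct), count(factorized))  # 12800000 vs 1608192
```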
The sequential order of items can reflect the changes in users' behavior, but unlike recurrent neural networks, the Transformer module does not inherently encode ordering information, so an additional positional embedding is required to ensure that the model can learn the importance of position. We denote the position embedding as $P \in \mathbb{R}^{L \times d}$, where $L$ is the maximum length of the sequence. Here, the position embeddings are learned by the model rather than fixed.
To make full use of the hidden layer information in user representations, we choose to add it to the Transformer layer for joint learning. The vector combination process is shown in Figure 3. First, the item embedding $e_{v_i}$ and the position embedding $p_i$ are added to obtain $m_i = e_{v_i} + p_i$. Then, $m_i$ is concatenated with the user embedding $u$ and linearly transformed to obtain the final input embedding $x_i$ for the model.
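A minimal sketch of this combination step, assuming learned position embeddings and a linear projection after concatenation (all names and sizes are our assumptions):

```python
import torch
import torch.nn as nn

d, max_len = 64, 50
pos_embedding = nn.Embedding(max_len, d)          # learned position table P
proj = nn.Linear(2 * d, d)                        # linear map after concat

def build_input(e_v: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    # e_v: (batch, seq, d) item embeddings; u: (batch, d) user embedding.
    positions = torch.arange(e_v.size(1))
    m = e_v + pos_embedding(positions)            # add positional embedding
    u_tiled = u.unsqueeze(1).expand_as(m)         # broadcast u to each step
    return proj(torch.cat([m, u_tiled], dim=-1))  # (batch, seq, d)

x = build_input(torch.randn(2, 10, d), torch.randn(2, d))
print(x.shape)  # torch.Size([2, 10, 64])
```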
3.6. Transformer Layer
Learning item-to-item transition patterns from sequences is the primary purpose of sequential recommendation. Compared with RNN-based feature extraction networks, Transformers not only perform better in handling long-sequence tasks [50] but also support parallelized processing, which ensures that their training speed is better than that of RNNs [51]. The Transformer layer is the main structure of this model and consists of multiple stacked Transformers. The Transformer [14] is a Sequence-to-Sequence (Seq2Seq) model based on multi-head self-attention, which consists of two main components: an encoder and a decoder. As shown in Figure 4, the Transformer consists of a multi-head attention layer, a feedforward network layer, and a normalization layer. The key component of this module is the multi-head attention layer, which employs a self-attention mechanism with multiple heads to assist in learning representation vectors in different subspaces. The attention formula is shown in Equation (7):
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \quad (7)$$
where $Q$, $K$, and $V$ represent the query, key, and value, respectively, and $d$ is the vector dimensionality. Because the self-attention mechanism is used in the Transformer, $Q$, $K$, and $V$ are generated from the same vector. In multi-head attention, the vector is divided into several parts that are calculated independently, as shown in Formula (8):
$$\text{head}_i = \text{Attention}\left(H^{L} W_i^{Q},\, H^{L} W_i^{K},\, H^{L} W_i^{V}\right), \qquad \text{MultiHead}\left(H^{L}\right) = \text{Concat}\left(\text{head}_1, \ldots, \text{head}_n\right) W^{O}, \quad (8)$$
where $H^{L}$ represents the hidden layer representation output of the $L$-th layer. The attention for each head is obtained using independent weight matrices $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$, which are not shared across heads. Finally, the obtained $n$ heads are concatenated together and then multiplied by a weight matrix $W^{O}$ to obtain the current multi-head attention value of the $L$-th layer.
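For reference, multi-head self-attention over the hidden states can be exercised with PyTorch's built-in module (an illustrative usage, not the paper's implementation; dimensions are assumed):

```python
import torch
import torch.nn as nn

d, n_heads = 64, 4
mha = nn.MultiheadAttention(embed_dim=d, num_heads=n_heads, batch_first=True)

h = torch.randn(2, 10, d)     # hidden states H^L: (batch, seq, d)
# Self-attention: query, key, and value all come from the same tensor.
attn_out, attn_weights = mha(h, h, h)
print(attn_out.shape)         # torch.Size([2, 10, 64])
```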
The feedforward network layer mainly uses the Gaussian Error Linear Unit (GELU) function, defined in Equation (11) [52], to activate the multi-head attention value. Compared with ReLU, GELU is smoother [52]. The activation formulas are shown in Equations (10) and (11):
$$\text{FFN}(x) = \text{GELU}\left(x W^{(1)} + b^{(1)}\right) W^{(2)} + b^{(2)}, \quad (10)$$
$$\text{GELU}(x) = x \, \Phi(x), \quad (11)$$
where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution; $W^{(1)}$, $b^{(1)}$, $W^{(2)}$, and $b^{(2)}$ are the learned parameters; and $\text{FFN}(\cdot)$ denotes the feedforward network. These parameters are shared across each Transformer.
In the normalization layer, we use the residual network to ensure the parameter learning effect of the deep network. Combining the multi-head attention layer and the feedforward network layer, the overall process of the Transformer is as follows:
$$A = \text{LayerNorm}\left(H^{L} + \text{MultiHead}\left(H^{L}\right)\right),$$
$$H^{L+1} = \text{LayerNorm}\left(A + \text{FFN}\left(A\right)\right).$$
The entire encoding layer is composed of many Transformers, and the parameters are shared among layers, which greatly reduces the overall number of model parameters, allowing for the possibility of extending the model.
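A compact sketch of one such Transformer block, with a GELU feedforward sublayer, residual connections, layer normalization, and cross-layer parameter sharing as described above (a hedged illustration; sizes and the sharing loop are our assumptions):

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One Transformer block: multi-head self-attention + GELU feedforward,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        a = self.norm1(h + self.mha(h, h, h)[0])  # attention sublayer
        return self.norm2(a + self.ffn(a))        # feedforward sublayer

# Share one layer's parameters across L applications (cross-layer sharing).
layer, L = TransformerLayer(), 2
h = torch.randn(2, 10, 64)
for _ in range(L):
    h = layer(h)
print(h.shape)  # torch.Size([2, 10, 64])
```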
We select the Transformer output $h_l^{L}$ corresponding to the end of the sequence as the item's representation vector for subsequent recommendations, where $L$ represents the $L$-th layer and $l$ represents the output corresponding to the $l$-th node, as follows:
$$o = h_l^{L}.$$
The flowchart of the GAT4Rec model is depicted in Figure 5. The flowchart begins with the initialization of user and item embeddings, which are essential for capturing the characteristics of a user's behavior and item features. It then depicts the transition through the GRU layer, emphasizing the role of this layer in modeling the dynamic evolution of the user's interests. Next, the flowchart illustrates the gating mechanism, a distinctive feature of our model that filters out irrelevant subsequences to focus on the most salient parts of a user's interaction history. This selective process is crucial for enhancing the accuracy and relevance of our recommendations. Following the gating mechanism, the flowchart leads to the Transformer layer, highlighting the self-attention mechanism that allows for the parallel processing of sequence elements, thus improving the model's efficiency and effectiveness in handling long-range dependencies. Finally, the flowchart culminates in the recommendation generation step, where the filtered and transformed user sequence is used to predict and rank potential items of interest to the user.
We summarize our GAT4Rec model in Algorithm 1 with pseudocode. In Algorithm 1, Lines 5 and 6 represent the hard attention process within the gated filter layer. For each category $c_i$ in the sequence, the dot product of the gating signal $u$, which represents the latent representation of users' interests, and the category embedding vector $e_{c_i}$ is calculated. If this result is less than a predefined threshold $\epsilon$, the operation described in Line 7 is executed: the corresponding item $v_i$ is removed from the sequence. This constitutes the implementation of hard attention, which selectively focuses on the most relevant parts of the sequence. By applying this mechanism, the model filters out items that are deemed less important, thereby enhancing the model's ability to concentrate on users' current interests.
Algorithm 1: Bidirectional self-attention model for recommendation: GAT4Rec.
3.7. Model Training
Due to the bidirectional modeling property of our model, it needs to be trained through mask learning. The general idea is that, for the sequence $S_u$, we sample a subset of positions and perform the mask operation on them. Here, 80% of the sampled positions are replaced with a mask token, 10% are replaced with a random item from the item set, and 10% remain unchanged. After the mask operation, the model input can be expressed as $\hat{S}_u$. The purpose of this model is to predict the original items at the masked positions using the obtained output features. Therefore, the learning task of the model can be considered a multiclass classification task. Here, we choose to use the softmax function to obtain the probability values of the occurrence of each item in the item set. Considering that the model focuses only on whether the predicted items match the true labels, the training on the encoding layer uses the cross-entropy loss function, as follows:
$$\mathcal{L}_{\text{mask}} = -\frac{1}{N} \sum_{n=1}^{N} \frac{1}{M} \sum_{i=1}^{M} \log p\left(y_i \mid \hat{S}_u\right),$$
where $M$ denotes the number of items affected by the mask, $N$ is the total number of input sequences, and $y_i$ is the label value of item $i$, i.e., the item present before the mask operation.
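The 80/10/10 masking scheme can be sketched as follows (our illustration; the ratios follow the text, while the token id, sampling rate, and function name are assumptions):

```python
import random

MASK = 0  # hypothetical id reserved for the mask token

def mask_sequence(seq, item_set, sample_rate=0.2):
    """Randomly sample positions; of those, 80% become the mask token,
    10% become a random item, and 10% are left unchanged."""
    masked, labels = list(seq), {}
    for pos, item in enumerate(seq):
        if random.random() >= sample_rate:
            continue
        labels[pos] = item                         # remember the original item
        r = random.random()
        if r < 0.8:
            masked[pos] = MASK                     # replace with mask token
        elif r < 0.9:
            masked[pos] = random.choice(item_set)  # replace with random item
        # else: keep the original item unchanged
    return masked, labels

seq = [11, 42, 7, 90, 3]
print(mask_sequence(seq, item_set=list(range(1, 100))))
```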
In addition, the GRU module needs to be trained to predict the category information of the next item. Since the category label may contain more than one category, the task is multi-label classification, so the binary cross-entropy loss function is used, as follows:
$$\mathcal{L}_{\text{cat}} = -\frac{1}{C} \sum_{c=1}^{C} \left[ y_c \log \hat{y}_c + \left(1 - y_c\right) \log\left(1 - \hat{y}_c\right) \right],$$
where $C$ denotes the number of categories contained in the training samples.
Thus, the loss function consists of the encoding layer loss and the category prediction loss, where $\lambda$ is the weight value balancing the two loss functions, which can be adjusted according to the actual training situation. The expression of the loss function is as follows:
$$\mathcal{L} = \mathcal{L}_{\text{mask}} + \lambda \mathcal{L}_{\text{cat}}.$$
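A minimal sketch of this joint objective, assuming logits from the encoding layer and the GRU category head (names, shapes, and the default weight are our assumptions):

```python
import torch
import torch.nn.functional as F

def total_loss(item_logits, item_labels, cat_logits, cat_labels, lam=0.5):
    """Joint objective: masked-item cross-entropy plus weighted
    multi-label binary cross-entropy for category prediction."""
    l_mask = F.cross_entropy(item_logits, item_labels)       # encoder loss
    l_cat = F.binary_cross_entropy_with_logits(cat_logits,   # GRU loss
                                               cat_labels)
    return l_mask + lam * l_cat

item_logits = torch.randn(8, 1000)       # 8 masked positions, 1000 items
item_labels = torch.randint(0, 1000, (8,))
cat_logits = torch.randn(8, 20)          # 20 categories, multi-label
cat_labels = torch.randint(0, 2, (8, 20)).float()
print(total_loss(item_logits, item_labels, cat_logits, cat_labels))
```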