Deep recurrent neural networks perform well in natural language processing (NLP). Since the LSTM, a recurrent architecture, has recently been applied successfully to sequence prediction problems, this paper leverages a deep Bi-LSTM together with self-attention to learn user preference information at a deeper level.
2.1. Item Embedding
Given all user interaction item sets $S = \{S_1, S_2, \ldots, S_n\}$, where $S_i$ represents the item sequence of user $i$, the purpose of item embedding is to generate a low-dimensional vector for each item. This paper only selects sequences that reflect feedback on user preferences. For example, if a user gives an item a low rating, the user is not interested in that item. Traditional item embedding often considers only the second-order correlation between items and ignores the relationship between item attributes and content. We embed the class labels of an item into the item vector, which allows the relevance between items to be computed more accurately and the user preferences to be learned better.
Item2vec [13] is an important extension of Skip-gram with negative sampling [14] to item embedding for item-based collaborative filtering recommendations. To introduce the class label into the model, this paper makes some improvements to Item2vec: the class label of an item is one-hot encoded to obtain a vector, and this vector is concatenated with the embedding vector learned by Item2vec to obtain the final embedded representation of the item. Similar to Word2vec, this paper treats each item as a word. The sequence of items a user has interacted with is treated as a sentence, and each item is embedded into a vector of fixed dimension. Each user has a distinct item sequence of interactions. Finally, by embedding each user's item sequence of interactions, we obtain a fixed-dimensional vector for every item; the closer two vectors are in the embedding space, the more similar the corresponding items are. The process of item embedding is shown at the top of Figure 1.
Given a user's item sequence of interactions, the Skip-gram objective is to maximize the following function:

$$\frac{1}{M} \sum_{i=1}^{M} \sum_{j \neq i}^{M} \log p(w_j \mid w_i) \quad (1)$$

where $M$ is the length of the item sequence of interactions and $p(w_j \mid w_i)$ is the SoftMax function, which is approximated with negative sampling as:

$$p(w_j \mid w_i) = \sigma\left(u_i^{\top} v_j\right) \prod_{k=1}^{N} \sigma\left(-u_i^{\top} v_k\right) \quad (2)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the commonly used Sigmoid activation function and $N$ is the number of negative samples drawn per positive sample. By item embedding, we get the item sequence $E = \{e_1, e_2, \ldots, e_M\}$.
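As an illustration, the following is a minimal sketch of this embedding step, using the gensim library as a stand-in for the Item2vec training procedure; the item IDs, interaction sequences, and class labels are placeholders, and gensim's fixed context window only approximates Item2vec's use of the whole sequence as context.

```python
# A minimal sketch of the item-embedding step, assuming gensim as a
# stand-in for Item2vec; all IDs, sequences, and labels are placeholders.
import numpy as np
from gensim.models import Word2Vec

# Each user's interaction history is treated as a "sentence" of item IDs.
user_sequences = [
    ["item_1", "item_2", "item_3"],
    ["item_2", "item_4", "item_1"],
]
# Hypothetical class labels per item, used to build one-hot vectors.
item_labels = {"item_1": "comedy", "item_2": "action",
               "item_3": "comedy", "item_4": "drama"}
classes = sorted(set(item_labels.values()))

# Skip-gram with negative sampling (sg=1, negative=N), as in Item2vec.
model = Word2Vec(sentences=user_sequences, vector_size=32,
                 sg=1, negative=5, window=5, min_count=1)

def embed(item_id):
    """Concatenate the learned Item2vec vector with the one-hot class label."""
    one_hot = np.zeros(len(classes))
    one_hot[classes.index(item_labels[item_id])] = 1.0
    return np.concatenate([model.wv[item_id], one_hot])

print(embed("item_1").shape)  # (32 + number_of_classes,)
```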
2.2. Weight Update
Attention mechanisms have recently been widely employed as a powerful tool in sequence-data scenarios such as machine translation, speech recognition, and part-of-speech tagging. An attention mechanism can be used alone or combined with other models. It connects two different parts of the data through automatically learned weights to highlight the key data, so that the entire model achieves better performance. The attention mechanism resembles the way the human brain observes things. For example, when people look at a painting in order to describe its content, they first notice any words on the painting, and then purposefully examine the part of the picture that, in their judgment, represents the theme. When describing the painting, they often first describe the content most relevant to it, and only then other aspects. The self-attention mechanism works similarly: it assigns sufficient attention to key information and highlights locally important information.
In real life, user preferences are not static. When users focus on certain products, they may ignore other products. For example, as shown in Figure 2, the sequence of items the target user interacts with is "Movie A, Movie B, Movie C, Movie D, Movie E", and we want to predict the user's next favorite movie. If the most recent interactions (Movie C, Movie D, Movie E) are regarded as the context and given higher priority, the items recommended to the user are likely to be "comedy" movies, such as "Movie F". However, the actual interaction record shows that the user chooses "Movie G" as the next item, because the choice of "Movie G" may depend on the first two items (Movie A and Movie B) that the user actually interacted with. This indicates that a good recommender system should pay more attention to the items (Movie A and Movie B) that are more related to the target item (Movie G), rather than to more recent but less relevant items such as "Movie C, Movie D, Movie E".
Therefore, this paper further proposes an item-aware weight update model based on the principle of the self-attention mechanism. The model uses self-attention to capture the internal relationships between user-interacted items when learning the user's latent preference representation, making that representation more effective, as shown in the middle of Figure 1.
The weight update introduces the self-attention mechanism, which we elaborate here. First, for a fixed target item, it traverses the states of all encoders and compares the state of the target item with that of each source item (i.e., the relationship between the target item and the source item), so as to generate a score for each state in the encoder. Second, the SoftMax function normalizes all scores into a probability distribution given the target item state. Finally, we obtain the item weights from this distribution.
In essence, the item feature representation is mapped from the $d$-dimensional space to a $z$-dimensional space. The relationship mapping is shown as follows:

$$H = \tanh\left(E W + b\right) \quad (3)$$

$$A = \mathrm{softmax}\left(H H^{\top}\right) \quad (4)$$

where $W$ is a weight matrix, $b$ is a bias, and $E$ is the feature representation after item embedding. Equation (3) indicates that the user interaction item features in the $d$-dimensional space are mapped to the $z$-dimensional space. Equation (4) calculates the contribution weight of all user interaction items in the $d$-dimensional space to each user interaction item in the $z$-dimensional space. The model automatically adjusts the weight matrix $W$ through the loss function during training, and the matrix $A$ is normalized by the SoftMax function. The items in the $z$-dimensional space are then weighted: after weighting, the feature representation of each item in the $z$-dimensional space is jointly determined by the item itself and all the items associated with it. The final output

$$Y = A H \quad (5)$$

is the representation of the items after weighting by the self-attention mechanism. By the weight update, we obtain the weighted item representations $Y = \{y_1, y_2, \ldots, y_M\}$.
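As a concrete illustration, below is a minimal numpy sketch of this weight update following Equations (3)-(5) as reconstructed above; the dimensions $M$, $d$, $z$ and the random inputs are placeholder assumptions.

```python
# A minimal numpy sketch of the weight update (Equations (3)-(5));
# M, d, z, and the random matrices are illustrative placeholders.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

M, d, z = 5, 16, 8          # sequence length and space dimensions
E = np.random.randn(M, d)   # embedded item sequence from Section 2.1

W = np.random.randn(d, z)   # weight matrix, adjusted via the loss in training
b = np.zeros(z)             # bias

H = np.tanh(E @ W + b)          # Eq. (3): map d-dim features to z-dim
A = softmax(H @ H.T, axis=-1)   # Eq. (4): pairwise contribution weights
Y = A @ H                       # Eq. (5): weighted item representations
print(Y.shape)                  # (M, z)
```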
2.3. User Preference Learning
RNNs (recurrent neural networks) play a vital role in predicting the next target of a sequence. Inspired by the literature [15,16], the user's item sequence of interactions is treated as a sentence, and each item is treated as a word. We use deep recurrent neural networks to learn the relevance of each item in the interaction sequence to its adjacent items. This paper builds on a deep bidirectional LSTM (i.e., deep Bi-LSTM), as shown in Figure 1, enabling the model to better exploit forward and backward context representations; deep recurrent networks can also better extract user characteristics.
Figure 1 shows the preference modeling, which has two hidden layers; the information of each upper layer in the structure is provided by its lower layer. As the figure shows, in this network structure the previous time step generates a set of parameters and passes it to the neurons of the same Bi-LSTM layer at the later time step $t$. At the same time, each neuron receives two sets of related parameters from the previous Bi-LSTM hidden layer at time step $t$. The input sequence of each hidden layer in the model is processed in two directions: from left to right and from right to left.
The relations of the deep Bi-LSTM structure are denoted in Equations (6) and (7). At the same time step $t$, each output of the Bi-LSTM at layer $r-1$ serves as an input to each intermediate neuron of layer $r$. At each time step during training, the result is produced by hidden-layer propagation connecting all the input parameters. The last hidden layer produces the final output $P$ (Equation (8)).

$$\overrightarrow{h}_t^{(r)} = \mathrm{LSTM}\left(\overrightarrow{W}^{(r)} h_t^{(r-1)} + \overrightarrow{V}^{(r)} \overrightarrow{h}_{t-1}^{(r)} + \overrightarrow{b}^{(r)}\right) \quad (6)$$

$$\overleftarrow{h}_t^{(r)} = \mathrm{LSTM}\left(\overleftarrow{W}^{(r)} h_t^{(r-1)} + \overleftarrow{V}^{(r)} \overleftarrow{h}_{t+1}^{(r)} + \overleftarrow{b}^{(r)}\right) \quad (7)$$

$$P = f\left(W_P \left[\overrightarrow{h}_T^{(r)}; \overleftarrow{h}_T^{(r)}\right] + b_P\right) \quad (8)$$

where $\overrightarrow{W}^{(r)}$, $\overrightarrow{V}^{(r)}$, and $\overrightarrow{b}^{(r)}$ are the weight matrices and offset vector generated in forward propagation at layer $r$ of the model; $\overleftarrow{W}^{(r)}$, $\overleftarrow{V}^{(r)}$, and $\overleftarrow{b}^{(r)}$ are the weight matrices and offset vector generated in backward propagation at layer $r$ of the model; $P$ is the output vector; and $\overrightarrow{h}_t^{(r)}$ and $\overleftarrow{h}_t^{(r)}$ are, respectively, the intermediate representations of the past and the future used to discriminate the input vector.
The forward propagation of layer $r$ computes the forward hidden state of the current layer at the current moment from the hidden state of the previous layer at the current moment and the forward hidden state of the current layer at the previous moment; in contrast, backward propagation uses the hidden state of the previous layer at the current moment and the future backward state of the current layer to compute the updated backward hidden state. Thus, each hidden representation can be computed by concatenating the forward and backward hidden representations. The last hidden layer outputs the preference vector through a fully connected layer.
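To make the structure concrete, here is a minimal PyTorch sketch of such a two-layer Bi-LSTM followed by a fully connected output layer; the class name and all dimensions are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal PyTorch sketch of the preference model described above:
# a two-layer bidirectional LSTM followed by a fully connected layer.
# Class name and dimensions are illustrative, not from the paper.
import torch
import torch.nn as nn

class PreferenceModel(nn.Module):
    def __init__(self, input_dim=16, hidden_dim=32, pref_dim=16):
        super().__init__()
        # Two stacked Bi-LSTM layers: each layer feeds the one above,
        # and each reads the sequence in both directions.
        self.bilstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Fully connected layer maps the concatenated forward/backward
        # hidden states to the user preference vector P.
        self.fc = nn.Linear(2 * hidden_dim, pref_dim)

    def forward(self, x):
        out, _ = self.bilstm(x)        # (batch, seq_len, 2 * hidden_dim)
        return self.fc(out[:, -1, :])  # preference vector from last step

# Usage: a batch holding one weighted item sequence (Y from Section 2.2).
Y = torch.randn(1, 5, 16)
P = PreferenceModel()(Y)
print(P.shape)  # torch.Size([1, 16])
```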
The embedded vector of each item is used as the input of the model. During training, the mean squared error (MSE) loss and the Adagrad optimizer are used to optimize the model, so that it can learn the preferences of each user well and better understand and represent the user's long-term stable preferences.
The MSE equation is shown as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} w_i \left(y_i - \hat{y}_i\right)^2 \quad (9)$$

where $y_i$ is the actual user interaction item in the test set, $\hat{y}_i$ is the predicted item, and $w_i > 0$ is the item weight. The more similar the predicted item and the actual interaction item are, the better the model performs, meaning the more accurate its prediction is.
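As a brief illustration, the following sketches one training step with the weighted MSE of Equation (9) and Adagrad, reusing the hypothetical PreferenceModel and input Y sketched above; the target vector and item weights are placeholders.

```python
# One illustrative training step with weighted MSE (Eq. (9)) and Adagrad,
# reusing the hypothetical PreferenceModel defined above.
import torch

model = PreferenceModel()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)

def weighted_mse(pred, target, item_weights):
    # Equation (9): item-weighted mean squared error.
    return (item_weights * (target - pred) ** 2).mean()

Y = torch.randn(1, 5, 16)            # weighted item sequence (placeholder)
pred = model(Y)                      # predicted item vector
target = torch.randn_like(pred)      # actual interacted item (placeholder)
loss = weighted_mse(pred, target, torch.ones_like(pred))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```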
2.4. Algorithms
The entire process of our recommender system comprises Algorithm 1 (Item Embedding) and Algorithm 2 (Weight Update and User Preference Learning), as follows.
Algorithm 1 Item Embedding
Input:
  $S$ - all user item sequences;
  $c$ - class label vector.
Output:
  $E$ - a vector representation of each item in a low-dimensional space;
  $E_i$ - the matrix corresponding to user $i$'s item sequence.
1: for each $S_i \in S$ do
2:   Feed $S_i$ of user $i$ into Item2vec;
3: end for
4: for each item $v$ do
5:   $e_v \leftarrow [\mathrm{Item2vec}(v); \mathrm{onehot}(c_v)]$;
6: end for
7: return $E$
In Algorithm 1, the user item sequences and the content of the items are fed into Item2vec to train the items' embedding vectors.
First, we extract the item set and the items' label set from the user rating data.
Then, we convert the item set and the label set into per-user item sequences $S_i$ and one-hot encoded label vectors $c$, respectively.
Finally, the item sequences and label vectors are learned by the Item2vec model as the embedded vector representations of the items in the low-dimensional space (Lines 1-5).
Algorithm 2 feeds the embedding vectors corresponding to the user's item sequence into the deep Bi-LSTM, which generates the user preference vector through optimization of the model.
First, we use the deep Bi-LSTM to learn the user preference vector $P$, adding multiple hidden layers to enhance the model's expressive ability (Lines 1-11).
Then, we calculate the similarity $sim$ between the preference vector of the target user and the vector of each item learned in the low-dimensional space (Lines 12-16).
Last, we select the set of items that the target user has not interacted with and sort them according to the similarity $sim$; the Top-$k$ items are recommended to the target user (Lines 17-18).
Algorithm 2 Weight Update and User Preference Learning
Input:
  $E_i$ - the embedded item sequence of each user;
  $a_i$ - impact weight of item $i$ on the next item selection;
  $length$ - the length of the item sequence.
Output:
  $P$ - the preference vector of the target user $u$;
  a Top-$k$ recommendation list.
1: for each $e_j \in E_i$ do
2:   if $j < length - 1$ then
3:     $e_j$ and $a_j$ are input to the deep Bi-LSTM;
4:   else if $j = length - 1$ then
5:     $e_j$ serves as the target item of the deep Bi-LSTM;
6:   else
7:     break;
8:   end if
9: end for
10: MSE optimization, parameter update;
11: return $P$;
12: for each user $u$ do
13:   for each item $v$ do
14:     compute $sim(P_u, e_v)$;
15:   end for
16:   Sort items according to $sim$;
17: end for
18: return a Top-$k$ recommendation list.
Note that an item sequence consists of items that the user has interacted with, denoted by $E_i$, and the output $P$ of the model training represents the user preference vector.
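To round out the pipeline, the following is a small illustrative sketch of the recommendation step at the end of Algorithm 2, ranking non-interacted items by similarity to the user preference vector; cosine similarity and all variable names are assumptions, not stated in the paper.

```python
# A small sketch of the Top-k recommendation step in Algorithm 2:
# rank items the user has not interacted with by similarity to P.
# Cosine similarity and names are assumptions, not from the paper.
import numpy as np

def recommend_top_k(P, item_vectors, interacted, k=5):
    """Return the k item IDs most similar to the preference vector P."""
    scores = {}
    for item_id, e in item_vectors.items():
        if item_id in interacted:
            continue  # skip items the user has already interacted with
        scores[item_id] = P @ e / (np.linalg.norm(P) * np.linalg.norm(e))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Usage with placeholder vectors.
items = {f"item_{i}": np.random.randn(16) for i in range(10)}
print(recommend_top_k(np.random.randn(16), items, {"item_0", "item_1"}, k=3))
```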