1. Introduction
With the dramatic increase in information available online, the Recommender System (RS) is playing an increasingly important role in many customer-oriented applications and services by alleviating information overload. Facing this information explosion, Internet users rely on RS to filter out uninteresting information and directly access helpful information in classical application scenarios such as listening to music, watching movies, and online shopping. Meanwhile, recommender systems can also bring traffic and generate profit for online companies. For example, it is reported in [
1] that recommender systems in Netflix accounted for about 80% of movies watched and generated immense economic value of more than
$1 billion (this paper is an extended version of our previous conference publication [
2]).
The key to recommender systems lies in the modeling of users’ preferences [
3]. Collaborative Filtering (CF) plays a pivotal role in modern recommender systems and many promising approaches in the literature adopt the idea of CF to predict the users’ preference [
4]. The basic assumption of CF is that the users who have similar behavior such as historical clicking or rating tend to share similar preference and make the same choice. As a representative approach in CF, Probabilistic Matrix Factorization (PMF) has become the most famous and popular technique in RS [
5]. PMF projects the users and items into a common latent space and constructs the latent vectors through factorization of the rating matrix. The elements of a latent vector can be seen as contributions of underlying factors to the preference of a user or the property of an item. In PMF, the preference of users on items is modeled as the inner product of the corresponding vectors.
Although the CF is a clever and compact idea, there are still several significant challenges in practical scenarios [
6]. The first one comes from the sparsity of rating matrix [
7]. The sparsity problem arises from the imbalance between the total number of items and the average number of items that a user can rate. Since the rating matrix has many missing values, it is difficult for CF to make predictions and give recommendations. In addition, CF techniques are weak at providing explanations. Some research [
8,
9] has pointed out that it is beneficial for RS to give some explanation for the sake of persuasion and convenience. Credible explanations of the recommended items can increase the users’ confidence and help them make decisions more efficiently. Another challenge is caused by the way the interaction between users’ and items’ latent vectors is modeled. The traditional factorization approaches in RS, including Matrix Factorization (MF) [
10], Factorization Machine (FM) [
11] and Pairwise Interaction Tensor Factorization (PITF) [
12], apply an inner product to model the pair-wise interaction between latent vectors and make rating prediction. However, as shown in [
13,
14], the simple inner product may miss high-order and non-linear parts in user–item interactions and lead to a deviation from original rating.
To address the sparsity problem and improve the explainability of RS, we explore the use of review text in RS. On most online commerce websites, users are encouraged not only to give a numerical rating on items but also to write text reviews. The reviews contain rich information that reflects the properties and features of items to some extent and can serve as a useful reference for other users [
15]. Recently, some studies have reported that importing the review text into RS can mitigate the effect of sparsity problem and strengthen the ability of providing explanation [
16,
17].
There are many models to extract the feature vector of review text in RS. However, it is still a challenge and significant issue for RS to extract the most informative feature from a tremendous amount of reviews. Existing RS models utilize reviews to boost the model of latent factor [
17,
18,
19,
20] and give explanation [
21,
22,
23]. Although these methods have achieved impressive improvement, they still have some limitations in the design of attention mechanism. First, they do not model the contribution of each review and that of words in a single review text jointly. Some methods focus on the extraction of keywords and phrases but miss the contribution of each review [
18,
19], while some approaches only consider the usefulness of reviews [
21]. Second, these methods mainly generate explanations through simple extractions of keywords or reviews, which may mislead users.
In real life, when reading review text, people are attracted not only to keywords or phrases but also to useful reviews, especially those that reflect the properties of items among plenty of reviews. Inspired by this common phenomenon, we design a multi-level attention mechanism for RS to extract the most helpful information from all related reviews. The weights of words and reviews are defined as their informativeness and usefulness, respectively. The weights of reviews are an extension of the definition of
usefulness in [
21] and the weights of words are an extension of the
weights in [
23]. To provide helpful information about items and give suggestions based on the interest of users, the weights are learned according to the preference of users or the property of items in two deep neural networks.
To capture the high-order and non-linear user–item interaction and avoid a deviation from original rating, we combine Deep Neural Network with FM [
11]. As a popular method to extract feature representation, deep neural networks show the potential to handle the sophisticated feature interaction in the prediction of click-through rate (CTR) [
14,
24,
25,
26]. FM captures pair-wise interactions between latent factor vectors and shows impressive results. We extend the coupling of FM and deep neural networks in CTR [
14,
26] to the prediction layer of our RS model to capture both high-order and low-order interactions.
In this paper, we propose a novel Multi-level Attentional and Hybrid-prediction-based Recommender (MAHR) model. Specifically, we apply both word-level and full-text-level attention mechanisms on reviews through two deep neural networks, which extract feature vectors for each user and item. The word-level attention mechanism is realized by a CNN text processor with a word-level attention layer, which learns the weights of words from a sliding window. The full-text-level attention mechanism applies a two-layer neural network with a softmax function to measure the weights of reviews. In the prediction layer, we integrate FM and deep neural networks to model low-order user–item interactions as in FM and capture the high-order interactions as in deep neural networks. Experimental results show that our MAHR model consistently outperforms state-of-the-art methods. In addition, the results show that MAHR alleviates the sparsity problem and improves the explainability of RS.
The main contributions of this paper are:
We propose a Multi-level Attentional and Hybrid-prediction-based Recommender (MAHR) model. To the best of our knowledge, we are the first to introduce a multi-level attention mechanism in RS, which alleviates the sparsity problem and improves the explainability of RS.
In the prediction layer, we combine FM and DNN to overcome the shortcomings of the traditional handling of user–item interactions by a simple inner product and to capture both high-order and low-order interactions, achieving better prediction performance.
Experimental results on benchmark datasets show that our MAHR model obtains better prediction results than the state-of-the-art models and the average relative improvement over PMF is up to 30.06%.
The extension of our journal paper to our previous conference publication [
2] is as follows. In this article, we add the ID embedding of users and items to model the quality of users and items, which helps to identify users and items. In addition, the hybrid prediction layer is tuned to improve the memory efficiency. For the completeness of the model, we give a more precise description and analysis of our proposed method (
Section 3). To make our experiments more solid, we extended our experiments from the comparison methods (CTR and D-Attn in
Section 4.1) to the parameter sensitivity analysis (Figures in
Section 4.3). We also detail the experimental methodology in
Section 4.2. The discussion on comparison performance (Table in
Section 4.4) is extended in a variable-controlling approach (
Section 4.4). We also carried out a test to evaluate our full-text level mechanism quantitatively (Table in
Section 4.7).
The remainder of our work is organized as follows. We present related work in
Section 2.
Section 3 describes our models in detail. Subsequently,
Section 4 shows experimental analysis, followed by the conclusion and future work in
Section 5.
3. Our Approach
3.1. Overview
To select both keywords and useful reviews, we utilize the multi-level attention mechanism to automatically assign weights to reviews and words when modeling users and items.
Figure 1 illustrates the architecture of
MAHR. The model contains two similar attention-based neural networks, one for users (the user network) and one for items (the item network), followed by a vector concatenation layer and a hybrid prediction layer. Note that the hybrid prediction layer consists of a Factorization Machine and a Deep Neural Network to capture the features of the concatenated vector more efficiently. Reviews written by the same user or on the same item are fed into the user network and the item network, respectively. The corresponding rating is calculated as the sum of the outputs of FM and DNN (other methods can also be used to combine the two outputs). In the following subsections, due to the similarity between the two networks, we only give the details of the item network. The same analysis also applies to the user network.
In the first stage of the item network, the word-attention CNN processor is applied to the textual reviews of item i. The processor handles each review of item i separately. Specifically, each review is first transformed into a matrix of word vectors by the word-embedding technique, which we denote as , , ⋯, . Then, these matrices are sent to the word-attention CNN, and their feature vectors are obtained from the output. These feature vectors are denoted as , , ⋯, . Note that keywords obtain higher weights in the word-attention CNN processor, which results in more precise feature vectors for reviews than a normal CNN without a word-level attention mechanism.
The second stage of
is the full-text-attention layer. The inputs are the feature vectors from the word-attention CNN and the respective ID embeddings of the users who wrote the reviews for item
i. In the full-text-attention layer, we calculate the contribution of each review and aggregate these vectors to get the representation of item
i:
where
is the corresponding weight of the
l-th review for item
i, which is learned by the full-text-attention layer.
After we obtain the feature vector of item i, we apply a fully connected layer to adjust the dimension of the vector to the number of latent factors. The output of the item network is the latent vector of item i, which is denoted as . Recall that the output of the left network is the latent vector of user u, which is denoted as .
The final structure is the prediction layer. In MAHR, we adopt the assumption of the latent factor model that the ratings of users on items result from the interaction of users’ and items’ latent vectors. We extend the original handling of user–item interaction by replacing the inner product with the coupling of FM and DNN. The objective function is the -norm of recorded rating matrix. To avoid overfitting, we adopt the dropout technique to improve the generalization ability of MAHR.
3.2. Word-Level Attention-Based Mechanism
Inspired by the human visual attention, we further develop a word attention mechanism for reviews. When we read text or see images, we probably focus on a certain part of the input to understand or recognize them more efficiently. Generally, people tend to be attracted by the significant word in a local part. Here, we introduce a word-level attention module to measure the significance of one word.
Suppose
X is a review with
T word embeddings
and
is a sliding window. Here,
is the center word and
w is the width of the local window. The significance of each center word in the window is given by:
where
is the informative score of the
i-th word and
is a parameter matrix and
is a bias. The * is the operation of element-wise multiplication followed by summation. The score can be directly used as a weight for the
i-th word embedding or we can apply a threshold to remove “trivial” words and only consider informative attention words. In this work, we use scores as weights. We use sigmoid for the activation function
g. The weighted word embedding is:
where
. To learn a global semantic representation, the weighted word embedding matrix
is then fed into textual CNN module, which consists of a 1-D convolutional layer and max pooling layer. The output is the feature vector
of the
l-th review in item
i.
The textual CNN module includes a convolutional layer and a max-pooling layer. Suppose the number of filters is
. Let
be a filter and
be a bias where
. Then,
is obtained by:
where ⊙ is the convolution operation and
is the max-pooling operation.
is a nonlinear activation function. The
is the feature obtained by convolution filter
and max-pool layer.
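As an illustration of the two steps in this subsection, the following minimal Python sketch scores each word from a window centered on it, reweights the word embeddings, and then applies a 1-D convolution with max pooling. It is a sketch under stated assumptions rather than the authors' implementation: the function names, the zero padding at the review boundaries, and the choice of ReLU as the nonlinear activation are all made here for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def word_attention_scores(X, W, b):
    """Score every word of a review from a window centered on it.

    X: T x d list of word embeddings; W: w x d parameter matrix (w odd);
    b: scalar bias. Positions outside the review are zero-padded (assumption).
    """
    T, d, w = len(X), len(X[0]), len(W)
    half = w // 2
    zero = [0.0] * d
    scores = []
    for i in range(T):
        total = b
        for k in range(w):
            j = i - half + k
            row = X[j] if 0 <= j < T else zero
            # Element-wise multiplication of the window with W, then sum.
            total += sum(x * p for x, p in zip(row, W[k]))
        scores.append(sigmoid(total))
    return scores


def apply_attention(X, scores):
    """Scale each word embedding by its attention score."""
    return [[s * x for x in row] for row, s in zip(X, scores)]


def conv_max_pool(X, filters, biases, act=lambda z: max(0.0, z)):
    """1-D convolution over word positions followed by max pooling.

    filters: list of h x d matrices (requires h <= T); returns one feature
    per filter. ReLU is an assumed activation.
    """
    T = len(X)
    features = []
    for K, bias in zip(filters, biases):
        h = len(K)
        responses = []
        for t in range(T - h + 1):
            z = bias + sum(x * p for r in range(h)
                           for x, p in zip(X[t + r], K[r]))
            responses.append(act(z))
        features.append(max(responses))
    return features
```

With the settings reported later in the paper, `W` would have 5 rows (window size 5) and `filters` would contain 100 kernels; the kernel height `h` is not stated in this section and is left to the caller.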
3.3. Full-Text Level Attention-Based Mechanism
The goal of the Full-Text Level Attention-based Layer in
is to calculate the significance of one review for the features of item
i and then aggregate all weighted reviews to characterize item
i. A two-layer network is used for the attention score
. The input contains the feature vector of the
l-th review of item
i (
) and the user who wrote it (ID embedding,
). The ID embedding is added to model the quality of users, which helps identify users who always write less-useful reviews. Formally, the attention network is defined as:
where
,
,
,
, and
are parameters. The feature vector
is obtained by the word-attention CNN. In addition,
is calculated from the one-hot code of the user ID by the lookup operation in TensorFlow.
t denotes the size of hidden layer.
The final weights of reviews are predicted by the softmax function to normalize the above attention scores. The contribution of the
l-th review to the final feature vector of item
i is given by:
After we obtain the attention weight of each review, the feature vector of item
i is calculated as Equation (
1).
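The attention scoring, softmax normalization, and weighted aggregation described above can be sketched in plain Python as follows. This is a hedged illustration, not the authors' code: the parameter names (`W_o`, `W_u`, `b1`, `h`, `b2`), the ReLU activation in the hidden layer, and the additive way the review vector and the user ID embedding are combined are assumptions, chosen to match a two-layer network with five parameters as described in the text.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def attention_score(o, u, W_o, W_u, b1, h, b2):
    """Unnormalized score of one review.

    o: feature vector of the review; u: ID embedding of its author.
    The hidden layer has len(b1) units (the hidden size t in the text).
    """
    hidden = [max(0.0, a + c + bias)  # ReLU is an assumption
              for a, c, bias in zip(matvec(W_o, o), matvec(W_u, u), b1)]
    return sum(hk * x for hk, x in zip(h, hidden)) + b2


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_reviews(review_vecs, user_vecs, params):
    """Return (review weights, attention-weighted sum of review vectors)."""
    scores = [attention_score(o, u, *params)
              for o, u in zip(review_vecs, user_vecs)]
    weights = softmax(scores)
    dim = len(review_vecs[0])
    pooled = [sum(w * o[k] for w, o in zip(weights, review_vecs))
              for k in range(dim)]
    return weights, pooled
```

When all parameters are zero, every review receives the same weight and the pooled vector reduces to the plain average of the review vectors, which corresponds to the constant-weight variant used in the ablation study.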
Then,
is sent to a fully connected layer with weight matrix
and bias
to customize the dimension of latent vector. The final representation of item
i is given by:
3.4. Hybrid Prediction Layer
We aim to capture both linear and non-linear interactions between users and items. We develop a hybrid prediction layer in MAHR. As shown in
Figure 1 , the prediction layer consists of two components,
FM component and
Deep component, which share the same input. First, let us concatenate
and
into a single feature vector
. For the feature vector
, the predicted rating
is given by:
where
and
are, respectively, the output of FM component and deep component.
The FM component is a factorization machine [
11]. In addition to modeling first-order linear interactions among features, FM also uses a dot product between vectors to model second-order pairwise feature interactions. The output of FM is the sum of a bias and the two kinds of interactions.
is given by:
where
is the global bias,
measures the impact of the
i-th variable in
and
models the second-order interactions.
The deep component is a deep neural network for learning the high-order feature interactions. The architecture of the network is given by:
where
l is the depth of neural network and
is an activation function.
,
, and
are the input, weight matrix, and bias of the
l-th layer, respectively.
The prediction layer for
in deep component is given by:
where
H is the number of hidden layers and
is the sigmoid function.
Based on DeepFM, all parameters, including , , and the network parameters ( and ), are trained jointly for the combined prediction model.
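To make the hybrid prediction concrete, the sketch below computes a standard second-order factorization machine output and a small feed-forward network on the same concatenated vector, and sums the two. It is a minimal illustration under assumptions: the function and parameter names are hypothetical, the hidden layers use ReLU, and the output unit of the deep component is left linear, whereas the text mentions a sigmoid in the deep component's prediction layer.

```python
def fm_output(z, w0, w, V):
    """Second-order FM: bias + linear terms + pairwise interactions.

    z: input vector; w0: global bias; w: first-order weights;
    V: one latent vector per input dimension.
    """
    linear = sum(wi * zi for wi, zi in zip(w, z))
    pairwise = 0.0
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * z[i] * z[j]
    return w0 + linear + pairwise


def dnn_output(z, layers, w_out, b_out):
    """Feed-forward network: hidden layers (assumed ReLU) and a linear output."""
    a = z
    for W, b in layers:
        a = [max(0.0, sum(wk * ak for wk, ak in zip(row, a)) + bk)
             for row, bk in zip(W, b)]
    return sum(wk * ak for wk, ak in zip(w_out, a)) + b_out


def hybrid_predict(p_u, q_i, fm_params, dnn_params):
    """Concatenate user and item latent vectors and sum both components."""
    z = list(p_u) + list(q_i)
    return fm_output(z, *fm_params) + dnn_output(z, *dnn_params)
```

Both components read the same vector `z`, so gradients from the rating loss reach the FM parameters and the network parameters in a single backward pass, which is what joint training of the combined model requires.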
3.5. Learning
The task that we focus on in this paper is rating prediction, which actually is a regression problem. For regression, a popular objective function is the squared loss. The objective function is given by:
where
denotes the set of instances for training and
is the rating assigned by user
u to item
i. To optimize the objective function, we adopt Adaptive Moment Estimation (Adam) as the optimizer. It adapts the learning rate during the training phase, which avoids manually choosing an efficient learning rate and results in faster convergence than SGD.
To alleviate overfitting, we consider dropout [
37], a widely used method in deep learning models. Dropout is disabled during testing, and we use the whole network for prediction. Through dropout, we can prevent complex co-adaptations of neurons on training data. Moreover, dropout may improve the performance of the whole neural network due to the side effect of performing model averaging with smaller neural networks.
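The difference between training and testing behavior can be illustrated with a short Python sketch. It uses the common "inverted dropout" convention, in which kept activations are rescaled during training so that the full network can be used unchanged at test time; whether MAHR uses exactly this convention is an assumption, and the function name is hypothetical.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Inverted dropout on a list of activations.

    rate: probability of dropping a unit (0 <= rate < 1).
    During testing the input is returned unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Kept units are scaled by 1/keep so the expected value is preserved.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```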
4. Experiment
We performed extensive experiments on three popular rating datasets with reviews to verify the effectiveness of MAHR model in comparison with other state-of-the-art approaches. We first give the details of experimental settings in
Section 4.1, including datasets description, comparison methods, evaluation metric, and parameter settings. The experimental methodology is presented in
Section 4.2, followed by a parameter sensitivity analysis in
Section 4.3. Besides, we present the performance comparison in
Section 4.4. In
Section 4.5 and
Section 4.6, we present the validation of the effectiveness of multi-level attention mechanism and hybrid prediction structure, respectively. Afterwards, the explainability analysis is presented.
4.1. Experimental Settings
4.1.1. Datasets
We tested the MAHR method on three popular datasets. The first dataset is
Yelp Challenge 2015 (
https://www.yelp.com/dataset/challenge), a dataset of restaurant ratings and comments. Yelp contains more than one million reviews and thirty thousand users. The second and third datasets are both from
Amazon (
http://jmcauley.ucsd.edu/data/amazon/) product data with five cores [
38]:
Books (8,898,041 reviews and 22,507,155 ratings) and
Electronics (1,689,188 reviews and 7,824,482 ratings). These two datasets were selected to cover the most classic scenarios for online services and applications, i.e., reading books and online shopping.
Table 1 shows the statistics of three datasets. In
Table 1, we can find that the three datasets have different sparsity in rating matrix. All of these datasets have more than one million reviews, while users in Yelp and Electronics provide fewer reviews than Books on average, which indicates that Yelp and Electronics are much sparser than Books. As mentioned in the Introduction, the sparsity problem could deteriorate the prediction performance of recommender systems. Note that the lengths of reviews in all datasets are less than 150 words on average.
4.1.2. Comparison Methods
To show the superiority of our MAHR method, we selected seven comparison methods: Probabilistic Matrix Factorization (PMF) [
5], Non-negative Matrix Factorization (NMF) [
39], Latent Dirichlet Allocation (LDA) [
40], Collaborative Topic Regression (CTR) [
41], Deep Cooperative Neural Networks (DeepCoNN) [
19], Neural Attentional Regression model with Review-level Explanations (NARRE) [
21], and dual attention-based model (D-Attn) [
23]. These methods can be divided into three categories: (1) traditional CF models that do not leverage reviews (PMF and NMF); (2) topic-modeling-based approaches that use review information (LDA and CTR); and (3) deep recommender systems that use reviews (DeepCoNN, NARRE, and D-Attn). The first category was the blank group, used to validate whether review information is helpful for recommender systems. The second category was the topic-model group, used to compare deep recommender models with topic-modeling-based recommender systems. The third category was the deep-model group, used to compare our MAHR model with other deep RS that use review information. The topic-model group and deep-model group import the review information to improve the prediction performance. The characteristics of the comparison methods are listed in
Table 2.
PMF: Probabilistic Matrix Factorization models the rating matrix as the product of two lower-rank user and item matrices with a Gaussian error distribution and obtains good performance on large, sparse, and very imbalanced datasets.
NMF: Non-negative Matrix Factorization is another useful matrix factorization with only rating matrix involved.
LDA: Latent Dirichlet Allocation is a famous topic modeling algorithm. By employing LDA on review text, we can learn a topic distribution for each item and obtain the topic preference for each user.
CTR: Collaborative Topic Regression combines CF with topic modeling in a probabilistic model, which shows better performance than matrix factorization methods.
DeepCoNN: Deep Cooperative Neural Networks learns hidden latent features for users and items jointly using two coupled neural networks. The prediction layer introduces a Factorization Machine as the estimator of the corresponding rating. DeepCoNN is a state-of-the-art method in deep RS that utilizes review information.
NARRE: Neural Attentional Regression model with Review-level Explanations uses a novel attention mechanism to assign weights to reviews and learns latent features for users and items jointly using two parallel neural networks. The prediction layer is based on the Latent Factor Model. NARRE is a novel deep RS model with a full-text level attention mechanism.
D-Attn: Dual Attention-based model is an advanced deep RS model with dual word level attention mechanism. The weights of words are learned from a local or global window. The local window extracts keywords and the global one selects noisy words. The predicted rating is calculated by the inner product of user feature vector and item vector.
4.1.3. Parameter Settings
Each dataset was randomly split into a training set (80%), a validation set for tuning hyper-parameters (10%), and a test set (10%). The hyper-parameters for the comparison methods were initialized according to the corresponding papers and tuned based on the performance on the validation set to obtain optimal performance.
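A random 80/10/10 split of this kind can be sketched as follows. The function name, the fixed seed, and the use of integer percentages are illustrative assumptions; the paper does not describe how the split was implemented.

```python
import random


def split_dataset(records, seed=0, train_pct=80, valid_pct=10):
    """Shuffle records and split them into train, validation, and test lists.

    The test set receives whatever remains after the first two parts.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * train_pct // 100
    n_valid = n * valid_pct // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Hyper-parameters would then be selected on the validation part only, with the test part held out for the final RMSE comparison.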
For PMF and NMF, we selected the optimal parameter for the number of latent factors from
, and regularization parameter from
. For LDA and CTR, we chose the number of topics from
. After searching, the number of latent factors for PMF and NMF was set to 60 and the number of topics for LDA and CTR was set to 10. Following the experiments in [
19], the hyper-parameters for CTR were:
,
and
.
For deep models, we reused the parameters for deep neural networks reported in the corresponding papers. Moreover, word embeddings pre-trained on Google News were used, with an embedding dimension of 300. For the MAHR model, we searched the dropout ratio from and the number of hidden layers from . The number of latent factors was searched from . The selected number of hidden layers was 2 and the selected dropout ratio was . The window size for the word attention layer was 5 and the number of convolutional kernels was 100. The number of latent factors was 60.
4.1.4. Evaluation Metric
In the experiments, we adopted the Root Mean Square Error (RMSE) to evaluate the prediction performance of our MAHR model. RMSE has been used in plenty of works on RS, including for both traditional and deep learning methods [
5,
21]. RMSE can be defined as follows:
where
is the set of observations.
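For reference, RMSE in its standard form is the square root of the mean squared difference between predicted and observed ratings over the evaluated set. A minimal Python sketch, with a hypothetical function name, is:

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equally long rating sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Lower values indicate better predictions, and an RMSE of zero means every predicted rating matches the observed one.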
4.2. Methodology
We followed a four-step methodology to demonstrate the performance of our proposed MAHR model:
To explore the optimal parameters of the MAHR model and its parameter sensitivity, we carried out a parameter study on the validation set. We evaluated our approach based on the optimal parameters.
To analyze the prediction performance of MAHR, we compared the RMSE of different methods based on the test set. We made some observations based on the results.
To validate the effectiveness of MAHR model, we produced some variant MAHR models (MAHR-multi, MAHR-word, MAHR-full-text, MAHR-non, MAHR-hybrid, MAHR-fm, MAHR-nn, and MAHR-dot-product) by controlling the application of multi-level attention mechanism and hybrid prediction structure. We also observed the advantage of our methods in comparison with variant methods.
To show the explainability, we visualized our multi-level attention layers and the corresponding weights. We also calculated the precision and recall of our full-text level attention for predicting useful reviews on the Amazon datasets (Books and Electronics). The comparison results with simple strategies (random, latest, and longest) are reported.
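The quantitative check in the last step can be illustrated with a top-k evaluation. The sketch below assumes that reviews are ranked by attention weight and compared against a set of reviews labeled as useful; the function name, the choice of a top-k cutoff, and the form of the ground truth are assumptions, since this section does not state how the metrics were computed.

```python
def precision_recall_at_k(ranked_ids, useful_ids, k):
    """Precision and recall of the top-k ranked reviews against a useful set.

    ranked_ids: review identifiers sorted by descending attention weight.
    useful_ids: identifiers of reviews considered useful (ground truth).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    useful = set(useful_ids)
    top_k = ranked_ids[:k]
    hits = sum(1 for rid in top_k if rid in useful)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(useful) if useful else 0.0
    return precision, recall
```

The simple baselines mentioned above (random, latest, and longest) would only change how `ranked_ids` is ordered, so the same function can score all strategies.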
4.3. Parameter Sensitivity Analysis
Based on the Yelp and Books dataset, we analyzed the influence of different hyper-parameters of traditional approaches and deep models including DeepCoNN, NARRE, D-Attn, and MAHR. The order was: (1) dropout rate; (2) number of hidden layers; and (3) number of latent factors.
We first studied the impact of the dropout rate. We searched the dropout rate from
. The results of deep models with respect to different dropout ratios are shown in
Figure 2. We found that all methods benefited from a proper value of dropout ratio and could reach the peak of prediction performance when the dropout rate was
, which demonstrates that the dropout technique could prevent overfitting and improve the prediction accuracy. Basically, all of these curves show a decrease in RMSE as the dropout rate increased after
. These results prove that a proper dropout rate could enhance the robustness of model.
In addition, we observed that the RMSE curve on the Yelp dataset changed more sharply than that on the Books dataset, which indicates that the prediction accuracy on Yelp was more sensitive to the dropout rate than the prediction accuracy on Books. We think this difference in parameter sensitivity was caused by the difference in the sparsity of the datasets, given the common phenomenon in deep learning that a model trained on a small dataset is more likely to overfit without dropout.
We then studied the effect of the number of hidden layers in hybrid prediction structure. Because the DeepCoNN, NARRE, and D-Attn have no hidden layers in prediction layer, we only carried out the second parameter study on MAHR. As presented in
Figure 3, increasing the number of hidden layers in the hybrid prediction layer raised the accuracy of the model at first. However, as we kept increasing the number of hidden layers, the performance degraded as a result of overfitting.
We finally explored the effect of the number of latent factors. The results are shown in
Figure 4. For the deep model group (DeepCoNN, NARRE, D-Attn, and MAHR), the number of latent factors was equal to the dimension of the input vector for the prediction layer. Generally, we observed that the RMSE curves of all deep methods changed only slightly, while the blank group (PMF and NMF) and the topic-model group (LDA and CTR) changed more than the deep model group. This demonstrates the benefits of deep learning for extracting the features of review text. We found that MAHR achieved the best RMSE performance on both datasets and under all latent factor numbers, in comparison with the other deep models, which shows the advantages of the multi-level attention mechanism and the hybrid prediction layer.
4.4. Performance Comparison
The performance of MAHR and the baselines are reported in terms of RMSE in
Table 3. The best performance is shown in bold and the averages on three datasets are reported. From the results, several observations can be made.
First, we found that both traditional matrix factorization methods (PMF and NMF) in the blank group did not obtain performance comparable to that of the methods in the control group that utilize reviews. The gap in performance between the blank group methods and the control group methods validated our hypothesis that review text could supply additional information and that considering reviews in models further improves the accuracy of rating prediction.
Secondly, although the simple employment of topic modeling methods (LDA and CTR) to learn topic characteristics from item reviews could improve the performance of the recommender system, the deep model group captured the features of reviews better than the topic modeling group. By modeling ratings and reviews together and using supervised learning for regression tasks, DeepCoNN, NARRE, and MAHR obtained additional improvements. In these experiments, LDA modeled reviews without the feedback from users’ ratings. Thus, the unsupervised features learned by LDA might not be as effective as expected. Compared with LDA, CTR obtained additional improvements by jointly modeling reviews and ratings. However, CTR could not beat the deep methods on any dataset because of its weaker capacity for joint modeling and semantic feature extraction.
Thirdly, as shown in
Table 3, our method MAHR consistently outperformed all baseline methods. Although review information was useful in recommendation, the performance varied depending on how the review information was utilized. Our model uses a multi-level attention mechanism to extract both word-level and full-text-level information. This allows a review to be modeled at a finer granularity, which leads to better performance according to the results. Compared to PMF, our approach gained a 30.06% improvement on average. In
Table 3, all methods obtained better performance on the Books dataset than on Yelp and Electronics. We think this was caused by the sparsity of Yelp and Electronics. We found that the methods using review information could effectively alleviate the sparsity problem on all datasets, which validates the hypothesis that review information can help model users and items, leading to additional improvement.
4.5. Effect of Multi-Level Attention
We next focused on validating the effectiveness of the multi-level attention mechanism. We produced several variant MAHR models by assigning normalized constant weights. The variants include the original MAHR model with the multi-level attention mechanism (MAHR-multi), the MAHR model with only the word-level attention mechanism (MAHR-word), the model with only the full-text-level attention mechanism (MAHR-full-text), and the model without any attention mechanism (MAHR-non). Note that, when we did not use the full-text-level attention mechanism, a normalized constant weight was assigned to each review, and, when we did not consider the word-level attention, the word-attention CNN degenerated to a normal textual CNN module. For the MAHR-word (MAHR-full-text) model, we assigned normalized constant weights to reviews (words). For MAHR-non, we assigned constant weights to both words and reviews. In
Figure 5, we compare the average RMSE on three datasets.
In the figure, we make several observations. First, all three methods with an attention layer (MAHR, MAHR-word, and MAHR-full-text) outperformed the MAHR-non model, which did not apply any attention mechanism. Regardless of the type of attention mechanism, the prediction accuracy benefited from the information extracted from review text, which justifies the common assumption that the significance of different reviews and words varies and that proper modeling of different weights results in additional improvement. Second, the multi-level attention approach made the most precise prediction, which validates the usefulness of our approach. From the better performance of the full-text-level approach compared to the word-level one, we found that the full-text-level attention contributes more to the improvement. Third, the average RMSE of the MAHR-non model was lower than that of DeepCoNN, which indicates that the prediction accuracy of MAHR without any attention mechanism is better than that of DeepCoNN. This improvement was probably caused by the effective hybrid prediction layer in MAHR.
4.6. Effect of Hybrid Prediction
In this section, we present a controlled experiment on the effect of the hybrid prediction layer. Three new variants were used. The first one (MAHR-dot-product) uses the traditional dot-product prediction layer, in which the predicted rating is the inner product of the user and item latent vectors. The other two use FM (MAHR-fm) or NN (MAHR-nn) alone as the prediction layer. We implemented the MAHR-dot-product model by switching off the hybrid prediction layer and replacing the vector concatenation layer with the dot-product output layer. We implemented MAHR-fm and MAHR-nn by switching off the Deep Neural Networks and the Factorization Machine, respectively. We reused the hyper-parameter setting of the MAHR model in the last two variants. The average RMSE values on the three datasets are shown in Figure 6.
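To make the contrast between the three prediction layers concrete, the following sketch places a dot-product predictor next to a standard second-order factorization machine and a one-hidden-layer network. All parameter names and toy values are hypothetical, the FM follows the usual textbook formulation, and how MAHR couples the two components is not reproduced here.

```python
def dot_product_predict(p_u, q_i):
    """Rating as the inner product of user and item latent vectors."""
    return sum(a * b for a, b in zip(p_u, q_i))

def fm_predict(x, w0, w, V):
    """Second-order factorization machine on input x.

    V[i] is the k-dimensional factor vector of feature i. The pairwise term
    uses the identity 0.5 * ((sum_i v_i x_i)^2 - sum_i (v_i x_i)^2) per factor.
    """
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise

def nn_predict(x, W1, b1, w2, b2):
    """One hidden ReLU layer followed by a linear output (non-linear in x)."""
    hidden = [max(0.0, sum(wj * xj for wj, xj in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(wi * hi for wi, hi in zip(w2, hidden)) + b2

# Toy latent vectors (hypothetical); FM and NN take their concatenation.
p_u, q_i = [1.0, 2.0], [3.0, 0.5]
x = p_u + q_i
V = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.2, 0.2]]

r_dot = dot_product_predict(p_u, q_i)
r_fm = fm_predict(x, 0.0, [0.0] * 4, V)
r_nn = nn_predict(x, [[1, 0, 0, 0], [0, -1, 0, 0]], [0.0, 0.0], [1.0, 1.0], 0.5)
```

The dot product only multiplies matching dimensions of the two vectors, the FM adds all pairwise feature interactions, and the network can represent interactions beyond second order.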
As shown in Figure 6, MAHR-dot-product obtained the worst performance among all methods. The difference in RMSE suggests that a simple dot product as the prediction layer might miss information about the non-linear interaction between users and items and lead to a loss in prediction accuracy. Furthermore, we found that MAHR-hybrid made the most precise prediction, which validates the usefulness of our hybrid prediction layer. Compared with MAHR-fm, MAHR-nn obtained higher prediction accuracy. We attribute this to two reasons. First, the deep neural networks could capture the non-linear user–item interactions, while our implemented FM could only model the second-order interactions, which is its major limitation as a prediction layer. Second, the dropout technique in deep learning could mitigate potential overfitting and yield additional improvement. In addition, the RMSE of MAHR-dot-product was clearly lower than that of LDA and CTR, which also use the inner product of latent vectors to predict ratings. This decrease in RMSE probably resulted from the robust multi-level attention mechanism.
4.7. Explainability Analysis
4.7.1. Word Level Explanation
To validate our design of the word-level attention module, we highlighted keywords with high weights in the attention module. Colored words were considered informative words, and green words had higher attention scores than blue words. In Figure 7, we show one review randomly selected from Yelp as highlighted by the user network and by the item network.
We made two key observations. First, similar keywords were highlighted in the user network and the item network. These keywords were mostly words that describe properties of the item or more personalized words. Second, the two networks nonetheless chose different attentional words, because they were trained with different sets of reviews and each network decided the keywords based on reviews and ID embedding. For example, the user network gave high scores to keywords expressing subjective feelings, e.g., “good” and “excited”. The item network tended to focus on nouns that describe properties of an item.
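The highlighting procedure can be sketched as follows, assuming that each network outputs one attention weight per word. The two thresholds, the color tags, and the toy weights are hypothetical choices for illustration only.

```python
def highlight(tokens, weights, high=0.2, low=0.1):
    """Tag each token by its attention weight.

    Thresholds are hypothetical; 'green' marks higher weights than 'blue',
    and None marks words that are not highlighted.
    """
    tagged = []
    for tok, w in zip(tokens, weights):
        if w >= high:
            tagged.append((tok, "green"))
        elif w >= low:
            tagged.append((tok, "blue"))
        else:
            tagged.append((tok, None))
    return tagged

# The same review scored by two networks (toy weights, not model output).
tokens = ["the", "pizza", "was", "good"]
user_view = highlight(tokens, [0.05, 0.15, 0.05, 0.75])
item_view = highlight(tokens, [0.05, 0.75, 0.05, 0.15])
```

In this toy example both networks highlight the same two words, but the user view ranks the sentiment word highest while the item view ranks the noun highest.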
4.7.2. Full-Text Level Explanation
To analyze the explainability of the full-text-level attention module, we first provide some reviews and their final attention weights in Figure 8 and then compare the prediction performance of the full-text-level attention module in Table 4.
Figure 8 shows examples of the high-weight and low-weight reviews selected by our model, each shown with its attention weight. The first reviews for Item 1 and Item 2 had higher weights, while the second reviews for Item 1 and Item 2 were less helpful reviews with lower weights. Generally, the reviews with high attention weights contain more information about the item. For example, buyers can easily learn the features of each item from Reviews 1a and 2a, which is highly instructive for making purchasing decisions. In contrast, the low-attention reviews only contain the authors’ general opinions and give fewer details to help make a decision.
We carried out a precision and recall test on the Books and Electronics datasets, which contain some reviews that have been rated useful by other users. We treated the rated reviews as ground truth to study the performance of the full-text-level attention module, and we retained only the items having at least one rated review. We selected three comparison methods: Latest (the review list is generated by selecting the latest K reviews), Random (the review list is generated randomly), and Length (the review list is generated by selecting the longest K reviews). Our MAHR selected the K reviews with the highest weights. We calculated the precision and recall for a review list with K reviews according to Precision@K and Recall@K in [21].
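The four selection strategies can be sketched as follows. The record fields (`time`, `text`, `weight`), the use of word count as the length measure, and the toy reviews are assumptions made for illustration.

```python
import random

def select_latest(reviews, k):
    """Latest baseline: the k most recent reviews."""
    return sorted(reviews, key=lambda r: r["time"], reverse=True)[:k]

def select_longest(reviews, k):
    """Length baseline: the k longest reviews (word count is an assumption)."""
    return sorted(reviews, key=lambda r: len(r["text"].split()), reverse=True)[:k]

def select_random(reviews, k, seed=0):
    """Random baseline: k reviews drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(reviews, min(k, len(reviews)))

def select_by_attention(reviews, k):
    """Attention-based selection: the k reviews with the highest weights."""
    return sorted(reviews, key=lambda r: r["weight"], reverse=True)[:k]

# Toy reviews for one item (all values hypothetical).
reviews = [
    {"id": "a", "time": 3, "text": "great battery and sharp screen", "weight": 0.6},
    {"id": "b", "time": 5, "text": "ok", "weight": 0.1},
    {"id": "c", "time": 1, "text": "I like it a lot, would buy it again for sure", "weight": 0.3},
]
```

Each function returns a review list of length at most K, so all four methods can be scored against the same ground truth.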
Precision@K and Recall@K are computed from a binary indicator, which equals 1 if the j-th review in the Top-K list is rated helpful and 0 otherwise, and from the number of rated reviews of item i. To evaluate the effect of the length of the review list, we set K = 1 and 10.
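A minimal sketch of these two metrics is given below. It assumes the common definitions, in which the number of helpful reviews in the Top-K list is divided by K for precision and by the number of rated reviews of the item for recall, and it assumes that per-item scores are averaged over items; the exact formulation in [21] may differ, and all names are hypothetical.

```python
def precision_recall_at_k(ranked_ids, helpful_ids, k):
    """Precision@K and Recall@K for one item.

    ranked_ids: review ids ordered by the selection method.
    helpful_ids: non-empty set of reviews rated helpful (ground truth).
    Precision divides by k even if fewer than k reviews are available.
    """
    hits = sum(1 for rid in ranked_ids[:k] if rid in helpful_ids)
    return hits / k, hits / len(helpful_ids)

def mean_over_items(per_item_scores):
    """Average (precision, recall) pairs over items (averaging is an assumption)."""
    n = len(per_item_scores)
    return (sum(p for p, _ in per_item_scores) / n,
            sum(r for _, r in per_item_scores) / n)

# Toy ranking for one item with two helpful reviews (hypothetical ids).
ranked = ["a", "b", "c", "d"]
helpful = {"a", "d"}
p1, r1 = precision_recall_at_k(ranked, helpful, 1)
p4, r4 = precision_recall_at_k(ranked, helpful, 4)
```

Because only items with at least one rated review are retained, the recall denominator is never zero.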
In Table 4, we can see that the precision and recall of MAHR were considerably better than those of the three other methods. This shows that the weights obtained by the full-text-level attention module are consistent with the users’ needs and perceptions. By applying the full-text-level attention mechanism, the significance of different reviews can be learned effectively.
5. Conclusions and Future Work
In this paper, we propose a Multi-level Attentional and Hybrid-prediction-based Recommender (MAHR) model that not only leverages a hybrid prediction structure to replace the simple inner product of two latent vectors, but also implements a multi-level attention mechanism to combine word-level significance and full-text-level usefulness. It selects both useful words and useful reviews automatically to provide word-level and full-text-level explanations and to make a more precise prediction. In addition, it models the non-linear interaction between user and item in a hybrid prediction layer, which couples the factorization machine to a deep neural network. Extensive experiments were conducted on three real-life datasets from Amazon and Yelp. The visualization and analysis of keywords and useful reviews validated the soundness of our multi-level attention mechanism. In terms of recommendation performance, the proposed MAHR consistently outperformed the state-of-the-art recommendation models based on matrix factorization and deep learning in rating prediction. We believe this work offers a new approach to capturing the context of recommender systems.
In the future, we plan to combine transfer learning and the latent factor model to build a more robust prediction layer for recommender systems. Moreover, we are interested in exploring more advanced neural networks, e.g., the Long Short-Term Memory (LSTM) network, which uses sequence learning, to handle sequence and sentiment analysis in the review texts.