1. Introduction
Recommendation systems have been widely used in various online services, such as search engines, e-commerce, online news, and social media sites, and have become one of the most powerful ways to solve the problem of information overload [1,2]. However, a large number of recommendation methods are still black boxes that do not provide explanations to users. In recent years, explainable recommendation has attracted increasing attention in the academic and industrial communities. Explainable recommendation systems not only unveil the recommendation process, but also help to improve the effectiveness, persuasiveness, and satisfaction of the recommendations.
Traditional recommendation methods, e.g., matrix factorization, mainly infer users' preferences for items from implicit or explicit user–item interaction data [3]. The key to generating accurate recommendation results is to obtain representations of users and items with rich expressive power, yet these traditional methods suffer from the sparseness of interaction data [4,5]. A common way to alleviate data sparseness is to introduce auxiliary information into the recommendation system. Auxiliary information can make up for sparse or missing interaction data, enrich the preferences of users and the features of items, and effectively enhance the performance of the recommendation system [6]. Moreover, traditional recommendation methods only provide simple explanations, such as “Customers Who Bought This Item Also Bought…”, with which users are generally not satisfied.
Fortunately, various kinds of auxiliary information have become increasingly available in online services. This auxiliary information can be easily organized into heterogeneous information networks (HINs). Heterogeneous information networks contain rich attribute information and semantic associations, and can therefore provide potential relations between users and items for recommender systems [7]. By connecting different kinds of relations in heterogeneous information networks, latent higher-order interaction information between users and items can be discovered. The emerging success of mining heterogeneous information networks may shed some light on solving the issues of data sparseness and overly simple explanations in recommendation systems. Many existing models [8] regard reviews, item aspects, and meta-paths as contextual information about the user–item interaction and leverage them to improve recommendation performance and generate recommendation explanations. Explainable recommendation has also attracted remarkable attention in recent years [9,10].
Although the above methods have achieved better performance, there are two challenges in applying heterogeneous information networks to recommender systems: (1) how to extract effective information for recommendation from heterogeneous information networks; (2) how to effectively integrate high-order interaction information for better recommendation results and explanations. To address the first challenge, we design multiple types of meta-paths over the heterogeneous information network to produce the corresponding similarity matrices [11]. This auxiliary information tackles the sparseness of the original user–item interaction matrix. The latent representations of users and items are then obtained through matrix decomposition methods [12]. To address the second challenge, we present a dual-attention network to distinguish the contribution of each representation from different meta-paths to the final representations of users and items. The dual-attention networks then aggregate the representations from multiple meta-paths through the attention coefficients to generate the final representations of users and items.
In this paper, we propose a framework of explainable recommendation by exploiting dual-attention networks in heterogeneous information networks (DANER), to capture the latent representations of user preferences and item features, and to learn the joint representation of user–item interactions using the dual-attention networks for the recommendation predictions and explanations. The contributions of this paper are summarized as follows:
In order to alleviate the problem of data sparseness, we extract multiple kinds of meta-paths between users and items from the heterogeneous information networks and generate multiple similarity matrices, which are used as complements of the rating matrix. Then, we decompose the similarity matrices by matrix decomposition to obtain multiple representations of users and items corresponding to the different meta-paths;
We propose a novel dual-attention network for explainable recommendation in heterogeneous information networks (DANER). It leverages a local attention layer to learn the representations of users and items, and a global attention layer to learn the joint representations of user–item interactions, both of which integrate multiple groups of different meta-path information. An attention mechanism helps to improve the explainability of the recommendation;
We demonstrate better rating prediction accuracy than the state-of-the-art methods by performing comprehensive experiments on two benchmark datasets. In addition, by providing a critical meta-path based on attention coefficient, we show a case study on the explainability of DANER.
The rest of this paper is organized as follows: Section 2 highlights the related work on typical recommendation methods, HIN-based recommendation, and attention mechanisms. Section 3 introduces the definitions and problem formulation. Section 4 presents the details of our proposed DANER model. Section 5 shows the experimental results. Finally, Section 6 concludes this paper.
3. Problem Statement
3.1. Definitions
In this paper, we introduce three definitions related to heterogeneous information networks: the HIN itself, the network schema, and the meta-path. We illustrate these three definitions in detail below.
Definition 1 (Heterogeneous Information Network). An HIN is defined as a graph G = (V, E) with an object type mapping function φ: V → A and a relation type mapping function ψ: E → R, where each object v ∈ V belongs to a specific object type φ(v) ∈ A, each relation e ∈ E corresponds to a specific relation type ψ(e) ∈ R, and the number of object types |A| > 1 or the number of relation types |R| > 1.
An example of a heterogeneous information network is shown in Figure 1. There are four object types and three relation types in the heterogeneous information network. The four object types are group, user, business and category. The relation between group and user indicates that a user belongs to a group. The relation between user and business indicates that a user prefers a business. The relation between business and category indicates that a business belongs to a category.
Definition 2 (Network Schema). The network schema is a meta template of a heterogeneous network, including the object type mapping function φ: V → A and the relation type mapping function ψ: E → R. The network schema is defined as a directed graph composed of the object types A and relation types R, denoted as T_G = (A, R).
Figure 2 illustrates the network schema corresponding to the Yelp dataset. The Yelp dataset has five object types and five relation types. There may be more than one meta-path between two objects in an HIN; for example, user and business can be connected through different sequences of object types and relations. Such paths are called meta-paths, as defined in Definition 3.
Definition 3 (Meta-path). A meta-path ρ is a path defined on the network schema with a starting node type and a target node type, written in the form A1 →(R1) A2 →(R2) … →(Rl) A(l+1), where each Ai is an object type and each Ri is the relation between two object types. Apparently, the complex relation between node types A1 and A(l+1) can be represented by the meta-path, denoted as the composite relation R = R1 ∘ R2 ∘ … ∘ Rl, where the number of relations l is the length of the meta-path.
3.2. Problem Statement
For inputs to our framework, we have the user set U, the business set B, and the relation set R, where each relation in R connects two objects, which can be a user, business, category, city, and so on. When a relation represents a user–business interaction, the weight between them indicates the rating of the user on the business. We design multiple meta-paths ρ1, …, ρL and obtain multiple similarity matrices through these meta-paths. For the output of our framework, we provide the predicted rating of a user on a business and a meta-path-level explanation.
Accordingly, the two main tasks of DANER can be summarized as: (1) obtaining more expressive representations of user preferences and item features through auxiliary information in heterogeneous information networks; (2) using the attention mechanism to aggregate these representations to obtain better recommendation results while simultaneously providing explanations based on the attention coefficients.
4. Framework
In this section, we mainly introduce the use of auxiliary information in the HIN and the recommendation model based on the dual-attention mechanism. The overall structure of DANER is shown in Figure 3. DANER mainly includes three parts: similarity matrix generation, matrix decomposition, and the attention-based recommendation model.
4.1. Similarity Matrix Based on Meta-Path
For a recommendation system, the starting node of the meta-path is user u, and the target node is item i. The meta-path represents a high-order relation between user u and item i. For example, a meta-path in the Amazon dataset may indicate that a user has purchased one item, and that this item and another item belong to the same category. The similarity matrix based on the meta-path is defined as M = W(A1,A2) ⊗ W(A2,A3) ⊗ … ⊗ W(Al,Al+1), where W(Ai,Ai+1) represents the relation matrix between the object types Ai and Ai+1, and ⊗ is the matrix multiplication operation between two relation matrices.
L user–item similarity matrices can be obtained from L pre-designed meta-paths. The meta-paths used for the different datasets in our experiments are shown in Table 1. For example, one meta-path used on the Amazon dataset indicates that users will buy other items that have been purchased by users with the same preferences, which can be regarded as user-based collaborative filtering; the similarity matrix corresponding to this meta-path is obtained by chaining the relation matrices along the path. Other meta-paths capture the categories to which an item belongs, the brand to which an item belongs, which other items have been viewed by users who have rated an item, the number of compliments a user receives from other users, and the city in which a restaurant is located. The calculation for each meta-path follows the same procedure described above.
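The chained multiplication that produces a meta-path similarity matrix can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the toy interaction matrix and the user-based collaborative-filtering-style path (user → item → user → item) are assumptions for demonstration.

```python
import numpy as np

def metapath_similarity(relation_matrices):
    """Chain-multiply the relation matrices along a meta-path to obtain
    the similarity matrix M = W1 @ W2 @ ... @ Wl."""
    M = relation_matrices[0]
    for W in relation_matrices[1:]:
        M = M @ W
    return M

# Toy user-item interaction matrix: 2 users, 3 items (1 = interaction).
W_ui = np.array([[1, 0, 1],
                 [0, 1, 1]], dtype=float)

# A user -> item -> user -> item style path: entries of M count the
# paths connecting a user to an item through users with shared items.
M = metapath_similarity([W_ui, W_ui.T, W_ui])
```

Each entry of the resulting matrix counts meta-path instances, so items reachable through many shared interactions receive larger similarity scores.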
4.2. Latent Representation by Matrix Decomposition
The recommendation learning process can be regarded as a representation learning process [
41]. After obtaining user–item similarity matrices corresponding to
L meta-paths, we adopted matrix decomposition to obtain the latent representations of users and items. By using low-dimensional vector of the latent representation, we can reduce noise and alleviate the data sparseness problem of the original rating matrix [
42]. Based on the theory of matrix decomposition, the similarity matrix
M can be decomposed into two low rank matrices
and
, where
represents the latent features of users’ preferences and
represents the latent features of items. Then we can use
to generate the prediction similarity matrix
. By reducing the difference between
M and
, we can obtain the latent representation matrices
and
, which can represent the latent features of users and items better. To be specific, low-dimensional representations of users and items can be obtained by solving the following optimization problem:
where
and
are dynamic parameters, which are used to control the influence of Frobenius norm regularization to avoid overfitting. The goal of optimization is to make
and
restore the similarity matrix
M as complete as possible.
For L similarity matrices based on meta-paths, we can obtain L groups of feature representations of users and items by performing a matrix decomposition operation.
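The decomposition above can be sketched with plain gradient descent on the regularized Frobenius objective. This is a minimal illustration under assumed hyperparameters (dimension, learning rate, epoch count); the paper does not specify its optimizer, and the constant 2 from the gradient is folded into the learning rate.

```python
import numpy as np

def factorize(M, d=8, lam=0.01, lr=0.01, epochs=500, seed=0):
    """Decompose similarity matrix M into low-rank factors U (users) and
    V (items) by gradient descent on ||M - U V^T||_F^2 + lam(||U||_F^2 + ||V||_F^2)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.normal(scale=0.1, size=(m, d))
    V = rng.normal(scale=0.1, size=(n, d))
    for _ in range(epochs):
        E = M - U @ V.T                  # reconstruction residual
        U += lr * (E @ V - lam * U)      # gradient step for user factors
        V += lr * (E.T @ U - lam * V)    # gradient step for item factors
    return U, V

# Toy 5-user x 3-item similarity matrix.
M = np.array([[5., 3., 0.],
              [4., 0., 0.],
              [1., 1., 0.],
              [0., 0., 5.],
              [0., 1., 4.]])
U, V = factorize(M, d=3)
```

Each row of U (resp. V) is then a low-dimensional latent representation of one user (resp. item) for the corresponding meta-path.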
4.3. Recommendation Model Based on Attention Mechanism
After obtaining the L groups of representations of users and items, we need to fuse them to obtain more expressive representations. Thus, we design a model including two attention networks to integrate these representations. The local attention network is oriented to each user (item) and is used to distinguish the importance of each user (item) representation corresponding to the different meta-paths. Through the weighted combination given by the attention coefficients, the representations of users and items integrating the L groups of meta-path information can be obtained. The global attention network is oriented to each meta-path and is capable of discriminating the importance of each user–item joint representation corresponding to the different meta-paths. Besides, the attention coefficients can be used to select the meta-path that has the most influence on the final prediction results. By way of the global attention network, we obtain the user–item joint representations integrating the L groups of meta-path information. Then, the representations obtained from the two attention networks are concatenated as the input of the next part. Finally, we utilize a multi-layer perceptron to generate the predicted ratings. The specific recommendation model is shown in Figure 4, mainly including three parts, which will be introduced separately below.
4.3.1. Local Attention Network
The goal of the local attention network is to learn representations of users and items that integrate the L groups of representations corresponding to the different meta-paths. The input of the local attention network is the L groups of user and item representations obtained by matrix decomposition. Each group contains a user representation u_l and an item representation v_l. For the L groups of user representations u_1, …, u_L, we feed them into the user-oriented attention neural network to obtain the attention coefficient α_l corresponding to u_l:

α_l = softmax(f_u(u_l)) = exp(f_u(u_l)) / Σ_{j=1..L} exp(f_u(u_j)),

where f_u(·) is a user-oriented attention neural network. To be specific, the input of f_u(·) is the user representations from the different meta-paths, and its output is the attention scores. W_i and b_i are the parameter matrix and bias term of layer i of the fully connected neural network, and we use ReLU as the activation function of each layer. Then, to compute the attention coefficient α_l, the softmax function is introduced to normalize the L output values of the neural network.
By adopting the same operation for items, the attention coefficient β_l corresponding to the item representation v_l from the different meta-paths can be obtained as follows:

β_l = softmax(f_v(v_l)) = exp(f_v(v_l)) / Σ_{j=1..L} exp(f_v(v_j)).

Then, according to the obtained attention coefficients α_l and β_l, we combine the L groups of user (item) representations from the different meta-paths to produce ũ (ṽ). The adopted combination method is to multiply each user (item) representation with the corresponding attention coefficient α_l (β_l), and then directly concatenate the L weighted groups:

ũ = (α_1 u_1) ⊕ (α_2 u_2) ⊕ … ⊕ (α_L u_L),  ṽ = (β_1 v_1) ⊕ (β_2 v_2) ⊕ … ⊕ (β_L v_L),

where ⊕ denotes concatenation.
The local attention network layer generates a user representation ũ and an item representation ṽ, which contain different meta-path information and focus on the critical meta-path information. The degree to which each meta-path's information is retained depends on the value of its attention coefficient: the larger the attention coefficient is, the more of that meta-path's information will be retained.
Finally, we concatenate the user representation ũ and the item representation ṽ to obtain the local user–item joint representation x_local = ũ ⊕ ṽ, which forms part of the input vector of the multi-layer perceptron in the interaction model.
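The local attention layer can be sketched as follows. This is a minimal NumPy illustration with assumed shapes: the scorer f(x) = v · relu(W x + b) is a hypothetical one-hidden-layer attention network standing in for the paper's fully connected scorer, and the toy inputs are random.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention(reps, W, b, v):
    """Score each of the L meta-path representations with the attention
    network f(x) = v . relu(W x + b), softmax-normalize the scores, and
    concatenate the weighted representations."""
    scores = np.array([v @ np.maximum(W @ r + b, 0.0) for r in reps])
    alpha = softmax(scores)                          # attention coefficients
    fused = np.concatenate([a * r for a, r in zip(alpha, reps)])
    return fused, alpha

# Toy example: L = 3 meta-paths, d = 4 dimensional user representations.
rng = np.random.default_rng(0)
reps = [rng.normal(size=4) for _ in range(3)]
W, b, v = rng.normal(size=(4, 4)), np.zeros(4), rng.normal(size=4)
fused, alpha = local_attention(reps, W, b, v)
```

Concatenating the weighted vectors (rather than summing them) keeps each meta-path's contribution in its own slot of the fused representation, scaled by its coefficient.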
4.3.2. Global Attention Network
The global attention network focuses on distinguishing the contributions of the user–item joint representations corresponding to the different meta-paths. Firstly, for each meta-path l, we concatenate the representations u_l and v_l to obtain the user–item joint representation h_l = u_l ⊕ v_l, where l = 1, …, L. Then, we feed the L groups of h_l into the path-oriented attention neural network f_p(·) to compute the corresponding attention coefficient γ_l:

γ_l = softmax(f_p(h_l)) = exp(f_p(h_l)) / Σ_{j=1..L} exp(f_p(h_j)),

where W_i and b_i are the parameter matrix and bias terms of layer i of the fully connected neural network, the input of f_p(·) is the L groups of user–item joint representations h_l, and the output is the attention scores. Besides, ReLU is used as the activation function of each layer in the neural network. After that, to obtain the attention coefficients γ_l, we introduce the softmax function to normalize the L output values of the neural network.
Finally, according to the obtained attention coefficients γ_l, we combine the user–item joint representations h_l from the L groups of meta-paths to obtain the global user–item joint representation x_global. Specifically, we multiply the L groups of user–item joint representations h_l with the corresponding attention coefficients γ_l, and then directly concatenate the L weighted groups:

x_global = (γ_1 h_1) ⊕ (γ_2 h_2) ⊕ … ⊕ (γ_L h_L).
Based on the attention coefficients of the global attention network, we can explain the recommendation results more convincingly: the meta-path with the largest attention coefficient contributes most to the recommendation result.
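The global attention layer and the coefficient-based explanation can be sketched as follows. This is a minimal illustration: the linear scorer w and the hand-picked toy inputs are assumptions standing in for the paper's path-oriented network, chosen so the first meta-path clearly dominates.

```python
import numpy as np

def global_attention(user_reps, item_reps, w):
    """Form joint representations h_l = [u_l ; v_l] per meta-path, score
    them with a linear attention head w, softmax-normalize the scores,
    and return the fused representation plus the index of the most
    influential meta-path (the meta-path-level explanation)."""
    joints = [np.concatenate([u, v]) for u, v in zip(user_reps, item_reps)]
    scores = np.array([w @ h for h in joints])
    gamma = np.exp(scores - scores.max())
    gamma /= gamma.sum()                              # attention coefficients
    fused = np.concatenate([g * h for g, h in zip(gamma, joints)])
    return fused, gamma, int(np.argmax(gamma))

# Toy example: L = 2 meta-paths; the first has an active joint signal.
user_reps = [np.ones(2), np.zeros(2)]
item_reps = [np.ones(2), np.zeros(2)]
w = np.ones(4)
fused, gamma, best_path = global_attention(user_reps, item_reps, w)
```

Here `best_path` is the index of the meta-path with the largest coefficient, which is exactly what the model reports as its explanation.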
4.3.3. Interaction Model
After obtaining the local user–item joint representation x_local by way of the local attention network and the global user–item joint representation x_global by way of the global attention network, we need to integrate them as the input of the subsequent interaction model [43,44]. There are two kinds of combination methods:

x = μ1 x_local + μ2 x_global   or   x = (μ1 x_local) ⊕ (μ2 x_global),

where μ1 and μ2 are weighting parameters for x_local and x_global. The first method adds the local and global user–item joint representations weighted by μ1 and μ2, and the second uses concatenation instead of addition. Based on these two methods, we design two variants of the model in the experiment section. Both add and concat are common operations for aggregating feature information in neural networks. The concat operation stacks the feature vectors along the feature dimension: the information contained in each dimension does not change, but the dimensionality of the vector is doubled. The add operation adds the corresponding values of the feature vectors: the dimensionality does not change, but the information contained in each dimension is increased. Add enriches the representation information of each feature, while concat increases the number of features. After obtaining the combined user–item joint representation x, we need an interaction model to fuse the feature information of the representation and generate the rating prediction. Traditional methods mostly use the Factorization Machine, which has the advantages of simple operation and low computational cost, but it can only fuse first-order and second-order features, making it difficult to model high-order feature interactions. Therefore, in this paper, a multi-layer perceptron is adopted as the interaction model, due to its powerful capability of automatically combining high-order features. The input of the multi-layer perceptron is the combined user–item joint representation x, and the output is the predicted rating:

r̂ = MLP(x).
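The add/concat combination and the MLP interaction model can be sketched as below. This is a minimal illustration with assumed layer sizes; the demo uses zero weights so that the predicted rating reduces to the output bias, purely to show the wiring (note that the add mode requires x_local and x_global to have matching dimensions).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def predict_rating(x_local, x_global, params, mu1=0.5, mu2=0.5, mode="concat"):
    """Combine the local and global joint representations (weighted add
    or weighted concat), then pass the result through a one-hidden-layer
    MLP to produce the rating prediction."""
    if mode == "add":
        x = mu1 * x_local + mu2 * x_global      # same dimensionality
    else:
        x = np.concatenate([mu1 * x_local, mu2 * x_global])  # doubled dim
    (W1, b1), (W2, b2) = params
    h = relu(W1 @ x + b1)                       # hidden layer
    return float(W2 @ h + b2)                   # scalar rating

# Toy demo: 4-dim local/global vectors, 6 hidden units, zero weights,
# so the prediction equals the output bias b2.
x_local, x_global = np.ones(4), np.ones(4)
params = [(np.zeros((6, 8)), np.zeros(6)), (np.zeros(6), 3.5)]
r_hat = predict_rating(x_local, x_global, params)
```

In a trained model the weights would of course be learned; the point here is only the data flow from the two attention outputs to a scalar rating.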
4.4. Model Optimization
The task of this paper is rating prediction based on explicit data. Here, the squared loss function is used as the optimization objective [17]:

L = Σ_{(u,i)} ( r̂_{u,i} − r_{u,i} )² + λ ‖Θ‖²_2,

where r̂_{u,i} is the predicted rating obtained by the proposed framework, r_{u,i} is the real rating of user u on item i, and Θ denotes the trainable parameters of the neural network. The first term measures the difference between r̂_{u,i} and r_{u,i}, and the second term is the L2 norm regularization, in which the coefficient λ controls the regularization intensity to prevent overfitting.
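The objective can be computed directly from the predicted and observed ratings plus the parameter tensors. A minimal sketch, with a toy parameter list standing in for the network's trainable weights Θ:

```python
import numpy as np

def squared_loss(preds, ratings, params, lam=0.001):
    """Squared error between predicted and observed ratings plus an L2
    penalty on the trainable parameters; lam controls its strength."""
    err = np.sum((np.asarray(preds) - np.asarray(ratings)) ** 2)
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return float(err + reg)

# Toy check: one exact prediction, one off by 1, two unit weights,
# lam = 0.5 -> loss = (0^2 + 1^2) + 0.5 * (1 + 1) = 2.0
loss = squared_loss([3.0, 4.0], [3.0, 5.0], [np.ones(2)], lam=0.5)
```

In training, this scalar would be minimized over Θ with a standard gradient-based optimizer.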