In this section, we present the self-attentive subset learning model (SaSLM), illustrated in Figure 2. It is composed of three components: (1) the rating prediction module, which predicts item-level ratings by collaborative filtering (CF); (2) the subset learning module, which selects the most representative subset from the original set; and (3) the rating aggregation module, which integrates the results of (1) and (2) to predict the user's set-level rating.
The goal of SPL is to learn item-level preferences from set-based feedback. Therefore, SPL requires the explicit generation of item-level prediction scores (rating prediction module). However, due to the absence of item-level supervision signals, the predicted scores of individual items must be further aggregated to form set-level rating predictions (rating aggregation module). Since set-level ratings are better revealed by a subset of the items in the set [5], SaSLM learns a subset of items from the set to reduce the bias between set-level and item-level ratings (subset learning module).
4.2. Self-Attentive Subset Learning
As mentioned above, to simulate the decision-making process of set rating, it is reasonable to select a subset of items so as to reduce the bias between set-level and item-level preferences [5]. The set rating is then predicted based on the chosen subset.
To choose subsets, we propose a self-attentive policy network (SaPN), which learns a personalized policy for how users select subsets from sets. Figure 3 illustrates the network structure. Note that we borrow the concept of the policy network from reinforcement learning (RL), where a network takes the current state as input and outputs a probability distribution over all possible actions [32]. Intrinsically, rather than generating probabilities, SaPN learns to generate the selected subset directly based on the Gumbel-softmax trick [33]. The SaPN consists of three parts: (1) an embedding layer that implicitly models user preferences and item attributes; (2) a self-attention layer that takes all items in the set into consideration when choosing subsets; and (3) a policy network layer that samples the chosen subset via reparameterized sampling. The SaPN is a key part of SaSLM; below, we describe it layer by layer.
Embedding layer. Given a user u and a set of items, the embedding layer takes their one-hot representations as input. In the one-hot vector of user u, only the element corresponding to u is one, and all other elements are zero; the one-hot vectors of the items are defined analogously. The embedding layer maintains two matrices, a user matrix and an item matrix, from which the embeddings are retrieved. Its output is the embeddings of user u and of each item in the set:
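As a minimal sketch of the lookup described above (the dimensions and matrix names are illustrative assumptions, not taken from the paper), multiplying a one-hot vector with the embedding matrix is equivalent to selecting one row of that matrix:

```python
import numpy as np

# Hypothetical sizes for illustration only.
n_users, n_items, d = 4, 6, 8
rng = np.random.default_rng(0)

# The embedding layer maintains a user matrix and an item matrix.
user_matrix = rng.normal(size=(n_users, d))
item_matrix = rng.normal(size=(n_items, d))

def embed(one_hot, matrix):
    """Multiplying a one-hot vector with the matrix retrieves one row,
    i.e., the embedding of the corresponding user or item."""
    return one_hot @ matrix

u = 2                                  # user index
one_hot_u = np.eye(n_users)[u]         # one-hot vector of user u
p_u = embed(one_hot_u, user_matrix)
assert np.allclose(p_u, user_matrix[u])  # lookup == row selection
```

In practice, frameworks implement this lookup as a direct indexing operation rather than a dense matrix product.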
Note that for rating prediction, we also learn user and item matrices P and Q. One option would be to share these embeddings. However, to keep the learning of subsets independent from the prediction of item-level ratings, we learn a separate set of embeddings for SaPN.
Self-attention layer. Whether each item is to be selected into the subset is determined in turn. However, when choosing the subset, it is important to take all items in the set into consideration. To capture the correlations among the contained items, we apply self-attention: by transforming the item embeddings through a self-attention layer, the other items in the set are also considered when deciding whether to select an item.
Aside from the correlations with other items, we also want user preferences to be considered when choosing the subset. More specifically, given the same set, the contained item embeddings should be tailored to different users; the intuition is that users focus on different aspects of the same set when they rate it. Therefore, we first multiply the item embeddings with the user embedding and then pack the resulting item embeddings into a matrix:
Following [34], we use scaled dot-product attention to calculate the self-attention scores:
where the result is a self-attention score matrix that indicates the similarities among the items in set S. The scaling factor is used to prevent vanishing gradients, and the softmax function normalizes the self-attention scores. The output of the self-attention layer is the matrix product of the attention scores with the packed item embeddings. To stabilize training, as in [35], residual connections [36] are applied to produce the final output:
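The computation above can be sketched as follows (a simplified version: the learned projection matrices of full attention layers are omitted, which is an assumption for brevity, not the paper's exact parameterization):

```python
import numpy as np

def self_attention_with_residual(E):
    """Scaled dot-product self-attention over a set of item embeddings
    E (n items x d), followed by a residual connection."""
    n, d = E.shape
    scores = E @ E.T / np.sqrt(d)          # pairwise similarities, scaled by sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)      # row-wise softmax normalization
    return E + A @ E                       # attention output plus residual

rng = np.random.default_rng(1)
E = rng.normal(size=(5, 8))                # 5 items, embedding dimension 8
out = self_attention_with_residual(E)
assert out.shape == E.shape                # set-aware embeddings, same shape
```

Each output row thus mixes information from every item in the set, weighted by similarity, while the residual term preserves the original embedding.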
Policy network. After transforming the item embeddings through the self-attention layer, we obtain set-aware item embeddings. To learn the user's personalized subset selection strategy, each item embedding is concatenated with the user embedding. The concatenated embeddings are then fed into the policy network.
Typically, a policy network outputs a probability distribution over all possible actions. In our case, it gives, for each item, the probability of being chosen in the subset or not. The policy network then selects items by sampling from these probabilities. However, sampling items involves discrete variables, so the gradient cannot propagate backwards. A straightforward alternative is a naïve Monte Carlo gradient estimator [37], which samples the discrete variable multiple times to estimate the gradient; it can be understood as the REINFORCE algorithm [38] with a single decision. However, as pointed out in [39], the naïve Monte Carlo gradient estimator exhibits very high variance, which destabilizes the sampling process.
To reduce the variance, the reparameterization trick is commonly used [39]. Rather than sampling from the data-dependent distribution, reparameterization samples a random variable from a data-independent distribution, e.g., a standard normal distribution. The discrete variable is then obtained by a deterministic function of both the data-dependent parameters and the data-independent noise. Since the sampled noise is independent and identically distributed, the variance of the estimate can be greatly reduced. To reparameterize the categorical variables in our case, we use Gumbel-softmax [33]. Given the probabilities of an item being chosen or not, Gumbel-softmax performs reparameterization in the following way:
Note that the resulting selection variable is not exactly binary but close to zero or one. The hyper-parameter named "temperature" controls how close the output is to binary.
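The standard Gumbel-softmax trick can be sketched as below. The per-item (select, skip) probabilities and the temperature value are illustrative assumptions; the noise is data-independent Gumbel(0, 1), which is what enables low-variance reparameterized gradients:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw a differentiable, near-one-hot sample from a categorical
    distribution. A smaller temperature tau pushes the output closer
    to a binary (one-hot) selection."""
    # Gumbel(0, 1) noise via the inverse-CDF transform of a uniform draw.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)     # numerical stability
    y = np.exp(y)
    return y / y.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
probs = np.array([[0.9, 0.1],              # item 1: (select, skip) probabilities
                  [0.3, 0.7]])             # item 2
sample = gumbel_softmax(np.log(probs), tau=0.1, rng=rng)
assert np.allclose(sample.sum(axis=-1), 1.0)  # each row stays a distribution
```

With tau = 0.1 each row is nearly one-hot, so thresholding its first entry yields the (approximately) discrete selection while gradients still flow through the softmax.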
At this point, we obtain the personalized optimal subset from the self-attentive subset learning module. By doing so, we model how item-level preferences influence users' set-level feedback, which is the key to addressing the SPL problem of inferring item-level feedback from set-based preferences.
4.3. Position-Aware Rating Aggregation
When predicting the scores of individual items in Section 4.1, there are no item-level supervision signals for learning. Instead, SPL uses set ratings as supervision, so we need to predict set-level ratings in order to train the model. To this end, Section 4.2 presents a subset selection strategy that reproduces the process by which users rate sets according to their item-level preferences. Having predicted the item-level scores and decided on the subset, we now discuss how to aggregate the item-level scores in the position-aware rating aggregation module.
Since the aggregation operates at the granularity of ratings, item ratings are generated explicitly. Given the item ratings, rating aggregation estimates the set rating. In theory, the estimator can take various forms, e.g., a regression or a neural network. However, such high-level aggregations can introduce bias between item ratings and set ratings. Since the ultimate goal of SPL is to estimate item-level preferences, an unbiased estimation is required. In this paper, we propose to select a subset of items to estimate the set rating. To ensure unbiasedness as well, we define the ⊕ operation as follows:
where ⊕ is an aggregation operation that combines the items in the set, and the indicator variable denotes whether item i is selected in the subset or not. It is straightforward to see that Equation (6) is unbiased.
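As one possible instantiation of the ⊕ operation (an assumption for illustration, since the exact form is given by Equation (6)), the selected items' ratings can be averaged, with the binary indicators masking out unselected items:

```python
import numpy as np

def aggregate(ratings, z):
    """Hypothetical instantiation of the ⊕ aggregation: the mean rating
    over the items whose selection indicator z_i equals one. Unselected
    items (z_i = 0) contribute nothing to the set-level estimate."""
    z = np.asarray(z, dtype=float)
    return float((ratings * z).sum() / z.sum())

ratings = np.array([4.0, 2.0, 5.0])   # predicted item-level ratings
z = np.array([1, 0, 1])               # items 0 and 2 selected in the subset
assert aggregate(ratings, z) == 4.5   # (4.0 + 5.0) / 2
```

Averaging keeps the set-level estimate on the same scale as the item ratings, which is consistent with the unbiasedness requirement stated above.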
While the selection variable indicates whether item i is selected in the subset or not, such hard selection makes training unstable. To ensure smoothness during training, one option is to learn a soft selection, i.e., a weight for each item i in the set. However, soft selection deviates from the idea of subset learning. Instead of learning soft assignments, we propose to learn a folded selection variable, defined in the following way:
Equation (7) means that the selection variable is a learnable weight as long as item i is selected, and is folded if i is not selected.
Learning a separate weight for each user–item pair is infeasible due to the large parameter space. Thus, we propose to learn user-specific weights. We further introduce personalized positional weights to aggregate the ratings of the items in the subset: for each position p in the set, we learn a weight. The positions of items are determined by sorting the items according to their predicted ratings. Since the weights are not specific to items, the parameter space is greatly reduced.
Aside from smoothness, learning positional weights also helps to better fit the set ratings given the item ratings. For example, the positional weights corresponding to larger item ratings can be lowered if a user tends to overrate sets. The position-aware weights therefore require a much smaller parameter space than per-item weights.
We illustrate the proposed position-aware rating aggregation in Figure 4. Given the ratings of the items in the selected subset, e.g., three item ratings, we first sort them to obtain the ordered ratings. Accordingly, we retrieve the three corresponding positional weights. We can then predict the rating of user u for set S as:
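The aggregation step can be sketched as follows. The descending sort order and the weighted-sum form are assumptions made for illustration, as are the concrete rating and weight values:

```python
import numpy as np

def position_aware_rating(subset_ratings, positional_weights):
    """Sketch of position-aware aggregation: sort the predicted ratings
    of the selected subset (highest first, by assumption), pair each
    position with its learned user-specific weight, and take the
    weighted sum as the set-level rating."""
    ordered = np.sort(subset_ratings)[::-1]      # ratings in descending order
    w = positional_weights[: len(ordered)]       # one weight per position
    return float(w @ ordered)

ratings = np.array([3.0, 5.0, 4.0])   # predicted ratings of the subset items
weights = np.array([0.5, 0.3, 0.2])   # hypothetical positional weights
r_set = position_aware_rating(ratings, weights)
assert abs(r_set - (0.5 * 5.0 + 0.3 * 4.0 + 0.2 * 3.0)) < 1e-9
```

Because the weights depend only on position (not on item identity), the same learned weights apply to every set the user rates.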
These user-specific positional weights are used to aggregate the item-level rating predictions (rating prediction module, Section 4.1) according to the subset selection results (self-attentive subset learning module, Section 4.2). The aggregated set ratings can then be used to update the model parameters. As the loss on set ratings diminishes, SaSLM naturally learns more precise item-level ratings and personalized aggregation patterns, thereby reaching the goal of SPL: inferring item-level preferences from set-based feedback.