In this section, we present the results and analysis of our work. In general, we aim to answer the following research questions:
6.1. RQ1: Overall Performance
In this section, we answer research question RQ1: whether our model can provide calibrated and accurate recommendations. The performances of the baselines and our model are listed in
Table 2, where the best performance is marked in bold.
Recommendation Accuracy We first analyze performance from the perspective of recommendation accuracy (i.e., Rec@K and MRR@K). In general, on both datasets, our model achieves the best prediction accuracy in terms of Recall and MRR. By considering calibration, users’ preference distributions are incorporated. In addition, our model decouples the two objectives with two sequence encoders and aggregates their outputs. The preference distribution therefore contributes to the prediction of the next item, improving accuracy. For example, on the Ml-1m dataset, the Recall and MRR of our model are higher than those of the original SASRec model (e.g., 0.1338 vs. 0.1271 in terms of MRR@20). In contrast, the post-processing-based models break the connection between accuracy and calibration and therefore reduce accuracy. For example, on the Ml-1m dataset, the MRR@20 of our model is 0.1338, while it is 0.1168 and 0.0784 for the CaliRec and CaliRec-GC models, corresponding to relative improvements of about 14.6% and 70.7%, respectively. On the Tmall dataset, the CaliRec model also decreases prediction accuracy.
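As a concrete reference, Rec@K and MRR@K for next-item prediction can be computed as in the following sketch (illustrative Python, not the paper's code; each test sequence has a single ground-truth next item):

```python
def recall_at_k(ranked_items, target, k):
    """1.0 if the target item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def mrr_at_k(ranked_items, target, k):
    """Reciprocal rank of the target within the top-k list, else 0.0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / rank
    return 0.0

# Averaged over all test sequences (hypothetical top-3 lists):
ranked = [[3, 7, 1], [5, 2, 9]]
targets = [7, 9]
rec = sum(recall_at_k(r, t, 3) for r, t in zip(ranked, targets)) / len(targets)
mrr = sum(mrr_at_k(r, t, 3) for r, t in zip(ranked, targets)) / len(targets)
# rec = 1.0, mrr = (1/2 + 1/3) / 2
```

Both metrics are averaged over all test sequences, which is how the values reported in Table 2 are obtained.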
Calibrated Recommendation Our model provides more calibrated recommendation lists than the original sequential recommendation model. On both datasets, the calibration scores of our model are lower (i.e., better) than those of the original SASRec model. For example, on the Ml-1m dataset, the calibration score of our model is 0.7262, which is better than the 0.8548 of the SASRec model. On the Tmall dataset, our model also achieves a clear improvement in terms of calibration (1.6240 vs. 2.1103, a relative improvement of about 23%). Compared to the post-processing-based CaliRec model, our model still achieves competitive calibration performance. For example, on the Ml-1m dataset, the calibration scores of our model and the CaliRec model are 0.7262 and 0.7322, respectively. On the Tmall dataset, our model also improves over CaliRec in terms of calibration. These comparisons show that our model achieves better accuracy while obtaining competitive calibration compared to the post-processing-based models. As for possible reasons, on the one hand, the proposed loss function calibrates the preference distribution of the highest-scored items toward the historical preference distribution. On the other hand, the decoupled-aggregated framework preserves accuracy while improving calibration.
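The decoupled-aggregated design can be illustrated structurally. The sketch below is only a toy sketch, not the paper's architecture: a single dense layer stands in for each SASRec encoder and for the extraction net, and all names and dimensions are illustrative.

```python
import math
import random

random.seed(0)
d = 4  # toy dimension; the paper uses 64 per sequence encoder

def rand_mat(rows, cols):
    return [[random.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

def dense(W, x):
    """One dense layer with tanh, standing in for a full sequence encoder."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

W_acc = rand_mat(d, d)      # encoder trained toward accuracy
W_cal = rand_mat(d, d)      # encoder trained toward calibration
W_ext = rand_mat(d, 2 * d)  # extraction net over the concatenation

seq = [random.gauss(0, 1) for _ in range(d)]  # pooled sequence representation
h_acc = dense(W_acc, seq)
h_cal = dense(W_cal, seq)
h = dense(W_ext, h_acc + h_cal)  # aggregated representation used for scoring
```

The key point is that the two objectives do not share encoder parameters; only the extraction net sees both outputs.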
We also observe that the CaliRec-GC model performs differently on the two datasets. On the Ml-1m dataset, the CaliRec-GC model achieves the lowest calibration score among all models, including our proposed model (e.g., 0.3847 vs. 0.7262 of our DACSR model). On the Tmall dataset, however, the CaliRec-GC model cannot provide calibrated recommendation lists; its calibration score is close to that of the original SASRec model. We attribute this phenomenon to two aspects. On the one hand, the Tmall dataset has far more item attributes than the Ml-1m dataset: Ml-1m contains 18 distinct item attributes, while Tmall has 70. On the other hand, the average length of the user behavior sequences in the Tmall dataset is shorter than in the Ml-1m dataset, as shown in Table 1. Shorter sequences and a larger item attribute set lead to lower coverage of item attributes. The CaliRec-GC model adaptively selects the trade-off factor for calibration based on the coverage of item attributes: greater coverage leads to a higher value of the trade-off factor. Therefore, it performs best in terms of calibration on the Ml-1m dataset but barely works on the Tmall dataset.
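The coverage effect can be made concrete with a sketch. The linear scaling rule below is only an illustrative assumption, since CaliRec-GC's exact adaptive rule is not reproduced here; the contrast between an Ml-1m-like and a Tmall-like history is the point.

```python
def attribute_coverage(sequence_attrs, num_attrs):
    """Fraction of all item attributes that appear in a user's history."""
    seen = set()
    for attrs in sequence_attrs:
        seen.update(attrs)
    return len(seen) / num_attrs

def adaptive_tradeoff(coverage, lam_max=0.9):
    """Hypothetical rule: scale the calibration trade-off with coverage."""
    return lam_max * coverage

# A longer Ml-1m-like history over 18 attributes vs. a short
# Tmall-like history over 70 attributes (toy data).
ml1m_cov = attribute_coverage([{0, 1}, {2}, {3, 4}, {5}], 18)
tmall_cov = attribute_coverage([{0}, {0, 1}], 70)
```

With short sequences over a large attribute set, coverage stays near zero, so the adaptive trade-off factor effectively disables calibration, matching the Tmall behavior described above.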
The differences between the two datasets also lead to different calibration performance on them. In general, calibration performance on the Ml-1m dataset is better than on the Tmall dataset. For example, the calibration score of our DACSR model is 1.6240 on the Tmall dataset, which is much higher than the 0.7262 on the Ml-1m dataset. The same holds for the original SASRec model, with 2.1103 vs. 0.8548 on the two datasets. A possible reason is that the lower coverage of item attributes mentioned above results in higher divergence. The large number of zero entries in the historical preference distribution makes calibration difficult to achieve, especially when accuracy must also be maintained.
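Calibration metrics of this kind are typically a KL divergence from the historical attribute distribution p to the recommended-list distribution q, with q smoothed toward p so that attributes absent from the list (the zero entries discussed above) do not make the divergence infinite. The following is a hedged sketch of such a metric, not necessarily the paper's exact definition:

```python
import math

def attr_distribution(items, item_attrs):
    """Attribute distribution of an item list; each item spreads unit
    mass evenly over its attributes."""
    p = {}
    for it in items:
        for a in item_attrs[it]:
            p[a] = p.get(a, 0.0) + 1.0 / len(item_attrs[it])
    total = sum(p.values())
    return {a: v / total for a, v in p.items()}

def c_kl(p, q, alpha=0.01):
    """KL(p || q~), smoothing q toward p so zero entries stay finite."""
    div = 0.0
    for a, pa in p.items():
        q_tilde = (1 - alpha) * q.get(a, 0.0) + alpha * pa
        div += pa * math.log(pa / q_tilde)
    return div
```

When p concentrates on a few attributes and the recommendation list cannot cover them proportionally, the divergence grows, which is consistent with the higher scores observed on Tmall.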
Time Consumption In addition, we compared the response time of our model against the post-processing-based CaliRec model and the original SASRec model. We focused on the average time required to generate the recommendation list for each sequence. For the SASRec model, we report the time consumption when the dimension of the hidden states equals 64 and 128 (denoted SASRec-64 and SASRec-128, respectively), since 64 is the setting of each sequence encoder of our DACSR model. We conducted the experiments on the same device and disabled GPU acceleration for a fair comparison. The performance is listed in
Table 3.
Compared to the original SASRec model, our model needs more computation. For example, on the Ml-1m dataset, the time consumption per sequence of the SASRec-64 model is 2.01 s, approximately half that of our DACSR model. This is because our model incorporates two SASRec encoders and an extraction net, which is more complex than a single SASRec model. In return, our model provides more accurate and calibrated recommendations than the original SASRec model. SASRec-128 requires more time than SASRec-64 because it contains more parameters. Compared to the CaliRec model, our model needs much less time to generate recommendation lists. For a single sequence, our model only needs 4.24 and 5.76 s on the Ml-1m and Tmall datasets, respectively, while the CaliRec model requires approximately 200 times more time. This is because the CaliRec model needs an extra ranking stage: the original SASRec model provides scores for all items and selects the top-100 items with the highest scores, and the post-processing-based CaliRec model then re-ranks these top-100 items in K steps (K stands for the top-K recommendations). At each step, it computes the gain of each candidate item when added to the recommendation list. In contrast, our model follows an end-to-end framework with only a sorting stage to select the top-K items after the scores of all items are computed. Therefore, our model obtains better performance and requires less time than the CaliRec model for each sequence.
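The re-ranking stage described above can be sketched as follows. This is illustrative only: the gain function here linearly trades off the accuracy score against a KL calibration term, which may differ in detail from CaliRec's exact formulation. The nested loop, K steps each scanning all remaining candidates and recomputing the list's attribute distribution, is what makes the post-processing approach costly.

```python
import math

def _kl(p, q, eps=1e-6):
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def _attr_dist(items, item_attrs, num_attrs):
    d = [0.0] * num_attrs
    for it in items:
        for a in item_attrs[it]:
            d[a] += 1.0 / len(item_attrs[it])
    s = sum(d)
    return [x / s for x in d] if s else d

def greedy_rerank(candidates, scores, item_attrs, target, num_attrs, k, lam):
    """K-step greedy re-ranking: each step scans every remaining candidate
    and adds the one with the best score/calibration trade-off."""
    chosen, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        def gain(it):
            dist = _attr_dist(chosen + [it], item_attrs, num_attrs)
            return (1 - lam) * scores[it] - lam * _kl(target, dist)
        best = max(pool, key=gain)
        chosen.append(best)
        pool.remove(best)
    return chosen
```

With ~100 candidates and K selection steps, the gain evaluation runs on the order of 100·K times per sequence, whereas the end-to-end model only sorts the score vector once.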
6.2. RQ2: Generalization of the DACSR Model
We are also interested in whether our model remains effective when the sequence encoder changes. We incorporated the GRU4Rec model [1] as the sequence encoder, which is also a widely used sequential recommendation model. The experimental settings were the same as in the previous section. We use DACSR(G) to denote our DACSR model with GRU4Rec as the sequence encoder, and CaliRec(G) to denote the post-processing-based CaliRec model with candidates provided by the GRU4Rec model. The performances are listed in
Table 4.
As shown in the table, our model can still achieve calibrated and accurate recommendation lists when we use GRU4Rec as the sequence encoder. On the Ml-1m dataset, the calibration scores are 0.6840 and 0.8356 for our DACSR(G) model and the GRU4Rec model, respectively. On the Tmall dataset, our model also obtains an improvement in terms of calibration. In terms of recommendation accuracy, our model still outperforms the original GRU4Rec model, although the improvement is not as large as that of the DACSR model with the SASRec sequence encoder. We believe this is because GRU4Rec's ability to model sequences is weaker than SASRec's: the SASRec model, with its self-attention mechanism, can better capture the user's preference and represent the sequence. The CaliRec(G) model also sacrifices ranking performance to improve calibration, similar to the CaliRec model based on SASRec. Compared to the CaliRec(G) model, our DACSR(G) model again achieves better performance in terms of both accuracy and calibration, as listed in
Table 4. The performance comparisons indicate that our model can be applied to other basic sequence encoders and is not specifically designed for the SASRec model.
6.3. RQ3: Ablation Studies
To answer research question RQ3, we conduct ablation experiments in this section by comparing our model with several variants. The first variant is the original SASRec model optimized by the combined accuracy-calibration loss function; we aim to investigate the performance of a single sequence encoder optimized for both accuracy and calibration. We report its performance when the dimension of the hidden states equals 64 and 128. In addition, we directly add the extraction nets to this single-encoder model. The final variant is the direct concatenation of the sequence representations and item embedding matrices without extraction nets (namely DACSR-C). We compare our model with these variants under the same settings. The performances are listed in
Table 5. In general, our DACSR model obtains the best performance in terms of recommendation accuracy and calibration.
The effectiveness of our designed loss function for calibration is reflected by the performance of the single-encoder variants. By applying the combined loss function, the SASRec model provides more calibrated recommendation lists than the original SASRec model optimized only for accuracy. For example, the calibration score of SASRec on the Tmall dataset decreases from 2.1103 to 1.8845. It also achieves performance close to our DACSR model on the Ml-1m dataset in terms of calibration. The calibration performance of the SASRec model optimized by the weighted loss function verifies the effectiveness of our proposed loss function.
The comparison between our model and the variants also demonstrates the effectiveness of the decoupled-aggregated framework. For example, on the Ml-1m dataset, the MRR@20 of our DACSR model is 0.1338, versus 0.1269 for the single-encoder variant, while the calibration performances are close (0.7262 vs. 0.7253). On the Tmall dataset, our DACSR model achieves competitive recommendation accuracy and provides more calibrated recommendations (e.g., 1.6240 vs. 1.8845 in terms of calibration). Compared to the single-encoder variants, which share parameters between the two objectives, the decoupled-aggregated framework achieves better performance. We believe such a framework can learn the information of the two objectives separately and combine them to obtain better representations of sequences and items, whereas a single sequence encoder that improves performance on one objective may hurt the other because the parameters are shared. In addition, the DACSR-C model removes the extraction net and directly concatenates the representations of sequences and items from the two sequence encoders. It obtains worse performance than the DACSR model, showing the importance of the extraction net. On the Ml-1m dataset, the calibration score of the DACSR model is 0.7262, slightly better than the 0.7433 of the DACSR-C model, and the recommendation accuracy of the DACSR model is higher (e.g., 0.1338 vs. 0.1257 in terms of MRR@20). On the Tmall dataset, our DACSR model also obtains better accuracy and close calibration performance. The extraction net takes the concatenation of sequence/item representations as input and provides more suitable representations for the two objectives.
6.4. RQ4: Distribution Modification
In this section, we answer research question RQ4 about the effectiveness of the proposed distribution modification approaches. We proposed two modified distributions to further improve diversity and mitigate the imbalanced interest problem. Because these approaches relate to diversity, we adopt the ILD metric with Jaccard similarity to measure the diversity of the recommendation list:

ILD(R_s) = (2 / (|R_s| (|R_s| − 1))) Σ_{i, j ∈ R_s, i < j} (1 − |A_i ∩ A_j| / |A_i ∪ A_j|),

where A_i is the item attribute set of item i, and R_s is the generated recommendation list for sequence s. A larger ILD value represents higher diversity of the recommendation list. We set the factor for each modified distribution, using 2 for the masked distribution.
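The ILD metric with Jaccard similarity over attribute sets can be computed as in this sketch (illustrative Python; `item_attrs` maps an item to its attribute set):

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def ild(rec_list, item_attrs):
    """Intra-list distance: average pairwise (1 - Jaccard) over the list."""
    n = len(rec_list)
    if n < 2:
        return 0.0
    total = sum(1.0 - jaccard(item_attrs[i], item_attrs[j])
                for idx, i in enumerate(rec_list)
                for j in rec_list[idx + 1:])
    return 2.0 * total / (n * (n - 1))
```

A list of items with disjoint attribute sets yields ILD = 1, while a list of attribute-identical items yields ILD = 0.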
We first list the performance of our DACSR model with the raw historical preference distribution and with the modified preference distribution for diversity, along with the original SASRec model, in
Table 6.
On both datasets, the diversity of our model is improved at the cost of calibration. For example, on the Ml-1m dataset, the ILD values are 0.7012 and 0.6654 for the normalized distribution and the original distribution, respectively. However, the calibration score increases from 1.0615 to 1.1347, which means our model's ability to calibrate is weakened. On the Tmall dataset, the comparison is similar. This is because applying the normalized distribution amplifies the effect of item attributes with which the user did not interact in the behavior sequence. Although it does not greatly alter the true distribution, it deviates from calibration to a certain degree.
We also observe that our DACSR model performs differently on the two datasets. On the Ml-1m dataset, the diversity of our DACSR model is higher than that of the original SASRec model, while the opposite holds on the Tmall dataset, where our DACSR model achieves worse diversity (e.g., 0.6714 vs. 0.7086 for the DACSR and SASRec models). This is possibly due to the differences between the two datasets. On the Ml-1m dataset, the coverage of item attributes is higher and users have more historical behaviors than on the Tmall dataset. On the Tmall dataset, users tend to interact with only a few types of items, so the scores of most attributes equal 0. These limited interest areas result in less diversified recommendation lists under the calibration objective.
We also investigate the imbalanced interest problem and find that main interests are amplified on the Tmall dataset. As illustrated in Figure 5, attribute A occupies 80% of the sequence, while attributes B and C account for only the remaining 20%. For such an imbalanced distribution, our DACSR model amplifies the major interest: as shown in Figure 5, the recommended list contains only items with attribute A, which negatively impacts our model in terms of diversity. By applying the masked distribution, diversity increases while the calibration performance remains stable. As shown in Table 7, diversity is improved by the masked distribution. This indicates that the distribution modification with the mask mechanism can mitigate the amplification of the major interest.
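The two modifications could plausibly be realized as below. This is a sketch under assumptions: a temperature-style flattening with factor 2 standing in for the diversity-oriented distribution, and a cap-and-redistribute rule standing in for the mask mechanism; the paper's exact formulations may differ.

```python
def flatten(p, beta=2.0):
    """Raise each probability to 1/beta and renormalize: minor interests
    gain relative mass while the dominant one loses it."""
    q = [x ** (1.0 / beta) for x in p]
    s = sum(q)
    return [x / s for x in q]

def mask_dominant(p, cap=0.5):
    """Cap the dominant attribute's probability and redistribute the
    excess mass proportionally over the remaining attributes."""
    i_max = max(range(len(p)), key=lambda i: p[i])
    excess = p[i_max] - cap
    rest = 1.0 - p[i_max]
    if excess <= 0 or rest == 0:
        return list(p)  # nothing to mask, or nowhere to move the mass
    return [cap if i == i_max else x * (1.0 + excess / rest)
            for i, x in enumerate(p)]
```

For the imbalanced example above, an 80/10/10 distribution capped at 0.5 becomes 50/25/25, so items with attributes B and C regain a chance of entering the calibrated recommendation list.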
In conclusion, calibrated recommendations do not always improve diversity. Calibration confines recommendations to users' previously interacted interests, so the diversity of recommendations may decrease. For users with homogeneous interests, their main interests are amplified by our end-to-end framework. By applying the modified preference distribution for diversity, our model further increases diversity and can explore new interests, and the proposed mask-based distribution modification can mitigate the problem of imbalanced interests. This also raises the question of whether it is necessary to provide such users with diversified recommendations, which we believe is worth investigating in the future.