1. Introduction
The advent of the digital age has led to an explosion of data, particularly in the realm of user–item interactions. This wealth of data has opened up new opportunities for recommendation systems [1], which aim to predict user preferences and recommend items that are most likely to be of interest. However, the sheer volume and complexity of the data present significant challenges. Traditional recommendation systems often struggle to capture the intricate structure of user–item interactions and fail to fully leverage the rich information embedded in these interactions.
In this paper, we propose a novel hybrid recommendation model that addresses these challenges by integrating Singular Value Decomposition (SVD) [2] and an optimized version of Neighborhood-enriched Contrastive Learning (NCL) [3]. Our method aims to capture both the global structure [4] and the local neighborhood information [3] inherent in the user–item interaction graph [5], thereby enhancing the recommendation performance.
The primary contributions of this paper are as follows:
Novel Hybrid Recommendation Model: We propose a novel hybrid recommendation model that integrates Singular Value Decomposition (SVD) and an optimized version of neighborhood-enriched contrastive learning. This model is designed to capture both the global structure and local neighborhood information inherent in the user–item interaction graph, thereby enhancing the recommendation performance.
SVD-based Embedding Initialization: We introduce a novel approach to initializing user and item embeddings using SVD. This method captures the global structure of the user–item interaction graph and provides a robust starting point for the learning process. It also expedites the convergence of the training process, leading to improved efficiency.
Optimized Neighborhood-enriched Contrastive Learning: We present several key refinements to the NCL approach, including an adaptive neighborhood structure, unified optimization of contrastive objectives, and prototype regularization. These refinements allow our model to adapt to changing user–item interactions, balance the trade-off between different types of neighborhood information, and enhance the discriminative power of the prototypes.
Empirical Validation: We conduct extensive experiments on several benchmark datasets to validate the effectiveness of our proposed method. The results demonstrate that our method outperforms state-of-the-art recommendation models, thereby confirming its practical utility.
Insights into User–Item Interactions: Our work provides valuable insights into the structure of user–item interactions. By leveraging both global and local information, our model offers a more comprehensive understanding of user–item interactions, which can inform the design of future recommendation systems.
2. Background and Related Work
In this section, we review the techniques on which our method builds: Singular Value Decomposition (SVD), contrastive learning, and neighborhood-enriched methods in graph learning. Together, these components underpin the hybrid recommendation model introduced in Section 3, which combines the global structure and the local neighborhood information of the user–item interaction graph to improve recommendation performance.
2.1. Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) [2] is a renowned technique in linear algebra used for matrix factorization. It has found extensive applications in diverse domains, such as recommendation systems, data compression, and image processing. Given a matrix A of dimensions m × n, SVD decomposes the matrix into three matrices: U, Σ, and V⊤. Here, U is an m × m orthogonal matrix, Σ is an m × n diagonal matrix containing the singular values of A, and V is an n × n orthogonal matrix. This decomposition can be mathematically represented as follows:
$$A = U \Sigma V^{\top}$$
In the realm of recommendation systems, matrix A symbolizes the user–item interaction matrix, where users are represented in rows and items in columns. The entries of matrix A could be explicit ratings or implicit feedback, contingent on the data available. SVD is employed to decompose the user–item interaction matrix into latent factors, thereby capturing the underlying structure in the data and reducing the dimensionality of the original matrix.
Low-rank approximations of A can be achieved by retaining only the top N singular values in Σ and the corresponding columns in U and V. This truncated SVD can be represented as follows:
$$A \approx U_N \Sigma_N V_N^{\top}$$
In this equation, U_N and V_N are the column-truncated versions of U and V, respectively, with only the first N columns retained, and Σ_N is the top N × N diagonal submatrix of Σ.
The low-rank approximation of A, obtained through truncated SVD, is instrumental for dimensionality reduction in collaborative filtering. By retaining only the most significant singular values and corresponding latent factors, SVD can capture the essential structure and relations between users and items, while discarding the noise and less informative components in the data. This attribute of SVD allows for improved generalization and robustness in recommendation systems, while simultaneously reducing the complexity of the models involved.
2.2. Contrastive Learning
Contrastive learning [6,7] has recently been adopted in graph collaborative filtering [8,9] to enhance performance, especially in scenarios with data sparsity [10]. In this study, we build on the Neighborhood-enriched Contrastive Learning (NCL) method [3], which distinctively integrates potential neighbors into contrastive pairs by drawing neighbors from both the graph structure and the semantic space for any given user (or item).
2.2.1. Contrastive Learning with Structural Neighbors
Current graph collaborative filtering models are predominantly trained using observed interactions, such as user–item pairings. However, these models often overlook potential relationships between users or items that are not evident in the observed data. To harness the full potential of contrastive learning, we contrast each user (or item) with its structural neighbors, whose representations are aggregated through the layer-wise propagation in the GNN. The initial user/item features or learnable embeddings in the graph collaborative filtering model are represented as z^(0) [11]. The final model output is essentially a fusion of embeddings from a subgraph encompassing multiple neighbors at varied hops. Specifically, the l-th layer’s output z^(l) of the base GNN model is the weighted sum of the l-hop structural neighbors of each node, assuming no transformations or self-loops during propagation [11].
Given that our interaction graph, denoted as G, is a bipartite graph, running a GNN-based model for an even number of propagation steps facilitates the accumulation of information from homogeneous structural neighbors. This is useful for identifying potential neighbors among users or among items. By employing this method, we can extract representations of homogeneous neighborhoods from even layers (e.g., 2, 4, 6) of the GNN model. These representations enable more effective modeling of relationships between users/items and their consistent structural neighbors. In particular, we consider the user’s own embedding and the corresponding embedding from the even-layered GNN output as a positive pair. Building on InfoNCE [12], we propose the structure contrastive learning objective to minimize the distance between them, as follows:
$$\mathcal{L}_S^U = \sum_{u \in \mathcal{U}} -\log \frac{\exp\!\left(\mathbf{z}_u^{(k)} \cdot \mathbf{z}_u^{(0)} / \tau\right)}{\sum_{v \in \mathcal{U}} \exp\!\left(\mathbf{z}_u^{(k)} \cdot \mathbf{z}_v^{(0)} / \tau\right)}$$
Here, z_u^(k) denotes the normalized output of the k-th GNN layer, where k is an even integer, and τ is the temperature parameter of the softmax. Similarly, we can derive the structure contrastive loss for the item side, denoted as L_S^I. The total structure contrastive objective function is given by the weighted sum of the two losses:
$$\mathcal{L}_S = \mathcal{L}_S^U + \alpha\, \mathcal{L}_S^I$$
where α acts as a hyperparameter to regulate the balance between the two losses in structure contrastive learning.
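For concreteness, the following PyTorch sketch computes the user-side structure contrastive loss from a batch of layer-0 and even-layer embeddings. The function name, the use of in-batch negatives, and the default temperature are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def structure_contrastive_loss(z_k: torch.Tensor,
                               z_0: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style structure contrastive loss for one side (users or items).

    z_k: [B, d] outputs of an even-numbered GNN layer for a batch of users.
    z_0: [B, d] initial (layer-0) embeddings of the same users.
    Other users in the batch serve as negatives (an in-batch approximation
    of contrasting against all users).
    """
    z_k = F.normalize(z_k, dim=-1)
    z_0 = F.normalize(z_0, dim=-1)
    logits = z_k @ z_0.t() / temperature                  # [B, B] similarities
    labels = torch.arange(z_k.size(0), device=z_k.device)
    return F.cross_entropy(logits, labels)                # positives on the diagonal

# usage sketch:
# loss_s = structure_contrastive_loss(z_user_l2, z_user_l0) \
#          + alpha * structure_contrastive_loss(z_item_l2, z_item_l0)
```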
2.2.2. Contrastive Learning with Semantic Neighbors
The structure contrastive loss specifically exploits the neighbors defined by the interaction graph. Nonetheless, it treats all neighbors of a user/item equally, which can introduce unnecessary noise into the contrastive pairs. To counteract this noise from structural neighbors, we consider enriching the contrastive pairs with semantic neighbors: nodes that are not directly linked on the graph, yet bear similar characteristics (for items) or preferences (for users).
Drawing from prior studies [13], we discern these neighbors by determining the latent prototype of every user and item. Building on this notion, we introduce the prototype contrastive objective, aiming to explore potential semantic neighbors. This objective is then woven into contrastive learning, ensuring a more nuanced grasp of the semantic characteristics of users and items in collaborative filtering. Specifically, users/items with similar characteristics tend to cluster in neighboring regions of the embedding space, with prototypes serving as the centers of these clusters, each representing a collection of semantic neighbors. Consequently, we employ a clustering technique on the user and item embeddings to pinpoint the prototypes of both. Given that this procedure is not conducive to end-to-end optimization, we harness the EM algorithm to optimize the proposed prototype contrastive objective. In a formal sense, the aim of the GNN model is to maximize the following log-likelihood function:
$$\sum_{u \in \mathcal{U}} \log p(\mathbf{e}_u \mid \Theta, R) = \sum_{u \in \mathcal{U}} \log \sum_{\mathbf{c}_i \in C} p(\mathbf{e}_u, \mathbf{c}_i \mid \Theta, R)$$
where Θ is the set of model parameters, R is the interaction matrix, and c_i is the latent prototype of user u. Similarly, we can define the optimization objective for items.
Subsequently, the aim of the proposed prototype contrastive learning approach is to minimize the following function derived from InfoNCE [12]:
$$\mathcal{L}_P^U = \sum_{u \in \mathcal{U}} -\log \frac{\exp\!\left(\mathbf{e}_u \cdot \mathbf{c}_i / \tau\right)}{\sum_{\mathbf{c}_j \in C} \exp\!\left(\mathbf{e}_u \cdot \mathbf{c}_j / \tau\right)}$$
where c_i is the prototype of user u, obtained by clustering over all the user embeddings with the K-means algorithm into k clusters. The objective on the item side is defined analogously:
$$\mathcal{L}_P^I = \sum_{i \in \mathcal{I}} -\log \frac{\exp\!\left(\mathbf{e}_i \cdot \mathbf{c}_j / \tau\right)}{\sum_{\mathbf{c}_t \in C} \exp\!\left(\mathbf{e}_i \cdot \mathbf{c}_t / \tau\right)}$$
where c_j is the prototype of item i. The final prototype contrastive objective is the weighted sum of the user objective and the item objective:
$$\mathcal{L}_P = \mathcal{L}_P^U + \alpha\, \mathcal{L}_P^I$$
In this approach, we deliberately integrate the semantic relationships of users/items into contrastive learning to alleviate the issue of data sparsity.
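As an illustration, the sketch below computes the prototype contrastive loss for one side (users or items), given embeddings, K-means centroids, and cluster assignments; the function and argument names and the default temperature are assumptions for exposition rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(emb: torch.Tensor,
                               centroids: torch.Tensor,
                               assignments: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """Prototype (semantic-neighbor) contrastive loss for users or items.

    emb:         [B, d] user or item embeddings.
    centroids:   [K, d] cluster centers produced by K-means (the prototypes).
    assignments: [B] index of the cluster each embedding was assigned to.
    """
    emb = F.normalize(emb, dim=-1)
    centroids = F.normalize(centroids, dim=-1)
    logits = emb @ centroids.t() / temperature  # [B, K] similarity to every prototype
    # the positive prototype of each embedding is the centroid of its own cluster
    return F.cross_entropy(logits, assignments)
```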
2.3. Neighborhood-Enriched Methods in Graph Learning
Neighborhood-enriched methods in graph learning aim to exploit the local structure of graphs by incorporating information from the neighbors of a given node [13]. In the context of recommendation systems, this can refer to the relationships between users or items in a user–item interaction graph. Neighborhood-enriched methods can be highly beneficial in capturing the complex dependencies and patterns in the data, which can lead to more accurate and effective recommendations.
Graph neural networks (GNNs) [14] are a class of deep learning models specifically designed to handle graph-structured data [15]. GNNs operate on graph data by iteratively aggregating and transforming the features of neighboring nodes to generate node representations that capture both local and global information [16]. The aggregation function of a GNN can be represented as follows:
$$\mathbf{h}_v^{(l+1)} = \sigma\!\left( \sum_{u \in \mathcal{N}(v)} \frac{1}{c_{vu}} W^{(l)} \mathbf{h}_u^{(l)} \right)$$
where h_v^(l+1) is the feature vector of node v at layer l+1, σ is a nonlinear activation function, N(v) is the set of neighbors of node v, c_{vu} is a normalization constant, W^(l) is a learnable weight matrix at layer l, and h_u^(l) is the feature vector of node u at layer l.
By incorporating neighborhood information, GNNs can learn powerful and expressive representations of users and items, which can be used to make personalized recommendations.
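To make the aggregation step above concrete, the following sketch implements one such layer with symmetric degree normalization (one common choice of the normalization constant) and a ReLU nonlinearity; it is a minimal dense-matrix illustration, not an efficient implementation.

```python
import numpy as np

def gnn_layer(adj: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One neighborhood-aggregation layer of the form given above.

    adj: [n, n] binary adjacency matrix of the graph.
    h:   [n, d_in] node features at layer l.
    w:   [d_in, d_out] learnable weight matrix W at layer l.
    The normalization constant c_vu is taken as sqrt(deg(v) * deg(u)),
    i.e., symmetric normalization, and sigma is a ReLU.
    """
    deg = adj.sum(axis=1).clip(min=1.0)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = adj * np.outer(d_inv_sqrt, d_inv_sqrt)  # entry (v, u) = 1 / c_vu
    agg = norm_adj @ (h @ w)                           # aggregate transformed neighbors
    return np.maximum(agg, 0.0)                        # nonlinearity
```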
Neighborhood-enriched contrastive learning combines the strengths of contrastive learning and neighborhood-enriched methods to enhance recommendation performance. By incorporating neighbors into contrastive pairs, neighborhood-enriched contrastive learning methods can effectively exploit the potential information contained in the user–item interaction graphs, leading to more accurate and robust recommendations, even in the presence of data sparsity.
3. Proposed Method
In this section, we present our proposed method, a hybrid recommendation model that integrates Singular Value Decomposition (SVD) and an optimized version of Neighborhood-enriched Contrastive Learning (NCL). The objective of our method is to leverage both the global structure and local neighborhood information inherent in the user–item interaction graph, thereby enhancing the recommendation performance.
3.1. Embedding Initialization via Low-Rank Approximation
The traditional methods [11,17] of initializing user and item embeddings in recommendation systems often rely on random or heuristic techniques. These approaches, however, may not adequately capture the intrinsic structure of the user–item interaction data, which can lead to suboptimal performance in the early stages of training and slower convergence.
To address this, we suggest an alternative method for initializing user and item embeddings using Singular Value Decomposition (SVD), a powerful tool from the field of linear algebra that provides a low-rank approximation of a matrix. The SVD of matrix A is given by
$$A = U \Sigma V^{\top}$$
In this equation, U and V are orthogonal matrices that contain the left and right singular vectors, respectively, and Σ = diag(σ_1, σ_2, …) is a diagonal matrix that holds the singular values, with diag(·) denoting the diagonalization operation. Written column-wise, the interaction matrix decomposes into a sum of rank-one components A = Σ_s σ_s u_s v_s⊤, where the columns u_s and v_s of U and V are left and right singular vectors and σ_s is the corresponding singular value. Components with larger (smaller) singular values contribute more (less) to the interactions, allowing us to approximate A with only the K largest singular values.
In the realm of recommendation systems, matrix A represents the user–item interaction matrix, with each entry indicating the interaction between user u and item i. By applying SVD to A, we obtain a low-rank approximation that captures the most significant structure in the user–item interactions.
We use the first k columns of U and V as the initial embeddings for users and items, where k is the dimension of the embedding. This strategy offers two main advantages: First, the SVD-based initialization encapsulates the global structure [16] of the user–item interaction graph, providing a solid foundation for the learning process. Second, it can potentially speed up the convergence of the training process, as the initial embeddings are already a good approximation of the final embeddings.
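A minimal sketch of this initialization, assuming a sparse user–item matrix and SciPy’s truncated SVD, is given below; scaling the singular vectors by the square root of the singular values is a design choice we adopt for illustration, not a prescription of the method.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def svd_init(interactions: sp.csr_matrix, dim: int = 64):
    """Initialize user/item embeddings from a truncated SVD of the
    user-item interaction matrix (rows: users, columns: items)."""
    u, s, vt = svds(interactions.astype(np.float32), k=dim)
    scale = np.sqrt(s)              # distribute the singular values to both sides
    user_emb = u * scale            # [n_users, dim]
    item_emb = vt.T * scale         # [n_items, dim]
    return user_emb, item_emb
```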
Alternatively, we can dynamically learn low-rank representations [4] through matrix factorization [18]:
$$\min_{\{\mathbf{e}_u\},\{\mathbf{e}_i\}} \; \sum_{(u,i)} \left( R_{ui} - \mathbf{e}_u^{\top} \mathbf{e}_i \right)^2 + \lambda \left( \|\mathbf{e}_u\|_2^2 + \|\mathbf{e}_i\|_2^2 \right)$$
where λ is the regularization strength. Each user/item is considered as a node on the graph and parameterized as an embedding vector e_u (or e_i) with dimension d, which is much smaller than the numbers of users and items. By optimizing this objective function, the model is expected to learn important features from the interactions (e.g., the components corresponding to the d largest singular values).
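The sketch below illustrates this alternative with plain stochastic gradient descent on the regularized squared-error objective; the learning rate, epoch count, and triple format are illustrative assumptions.

```python
import numpy as np

def mf_sgd(interactions, n_users, n_items, d=64, lam=0.01, lr=0.05, epochs=10):
    """Learn low-rank user/item embeddings by SGD on the regularized
    squared-error matrix-factorization objective above.

    interactions: list of (u, i, r) triples with observed values r
                  (e.g., r = 1.0 for implicit feedback).
    lam:          regularization strength (the lambda in the objective).
    """
    rng = np.random.default_rng(0)
    e_user = rng.normal(scale=0.1, size=(n_users, d))
    e_item = rng.normal(scale=0.1, size=(n_items, d))
    for _ in range(epochs):
        for u, i, r in interactions:
            err = r - e_user[u] @ e_item[i]
            # gradient of (r - e_u.e_i)^2 + lam (||e_u||^2 + ||e_i||^2)
            grad_u = -2 * err * e_item[i] + 2 * lam * e_user[u]
            grad_i = -2 * err * e_user[u] + 2 * lam * e_item[i]
            e_user[u] -= lr * grad_u
            e_item[i] -= lr * grad_i
    return e_user, e_item
```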
In conclusion, the SVD-based initialization provides a systematic and effective way to initialize the user and item embeddings in recommendation systems, potentially leading to enhanced performance and quicker convergence.
3.2. Enhancing Collaborative Filtering with Contrastive Learning
As mentioned in Section 2.3, GNN-based methods produce user and item representations by applying the propagation and prediction functions on the interaction graph G. In NCL, we utilize a GNN to model the observed interactions between users and items. Specifically, following LightGCN [11], we discard the nonlinear activation and feature transformation in the propagation function as follows:
$$\mathbf{z}_u^{(l+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|\,|\mathcal{N}_i|}}\, \mathbf{z}_i^{(l)}, \qquad \mathbf{z}_i^{(l+1)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|\,|\mathcal{N}_u|}}\, \mathbf{z}_u^{(l)}$$
After propagating with L layers, we adopt the weighted sum function as the readout function to combine the representations of all layers and obtain the final representations as follows:
$$\mathbf{z}_u = \frac{1}{L+1} \sum_{l=0}^{L} \mathbf{z}_u^{(l)}, \qquad \mathbf{z}_i = \frac{1}{L+1} \sum_{l=0}^{L} \mathbf{z}_i^{(l)}$$
With the final representations, we adopt the inner product to predict how likely a user u would interact with an item i:
$$\hat{y}_{u,i} = \mathbf{z}_u^{\top} \mathbf{z}_i$$
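The following sketch summarizes the propagation, readout, and prediction steps, assuming users and items are stacked into a single node set with a precomputed normalized adjacency matrix; uniform layer weights 1/(L+1) are assumed for the readout.

```python
import torch

def propagate_and_readout(norm_adj: torch.Tensor, emb0: torch.Tensor, n_layers: int):
    """LightGCN-style propagation (no nonlinearity, no feature transform),
    followed by the layer-averaged readout.

    norm_adj: [n, n] symmetrically normalized adjacency of the user-item
              bipartite graph, with users and items stacked into one node set.
    emb0:     [n, d] initial embeddings z^(0) (e.g., the SVD initialization).
    """
    layers = [emb0]
    z = emb0
    for _ in range(n_layers):
        z = norm_adj @ z                                 # z^(l+1) = A_norm z^(l)
        layers.append(z)
    return torch.stack(layers, dim=0).mean(dim=0)        # readout with weights 1/(L+1)

def predict(z_final: torch.Tensor, user_row: int, item_row: int) -> torch.Tensor:
    # inner product of the final user and item representations;
    # item_row is the item's row index in the stacked node set
    return z_final[user_row] @ z_final[item_row]
```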
We incorporate an optimized version of the NCL approach into the learning process to capture local neighborhood information. The NCL approach defines two types of neighbors for a user (or an item): structural neighbors and semantic neighbors.
Structural neighbors are those who have interacted with the same items (or users). We introduce a self-supervised learning loss, denoted as L_S, to capture the structural neighborhood information. This loss is defined as follows:
$$\mathcal{L}_S = \mathcal{L}_S^U + \alpha\, \mathcal{L}_S^I$$
where L_S^U and L_S^I are the user-side and item-side structure contrastive losses introduced in Section 2.2.1.
Semantic neighbors are those with similar representations. We use the K-means clustering algorithm to identify the semantic neighbors: each user (or item) is assigned to a cluster, and the centroid of the cluster serves as the prototype representing its semantic neighbors. We introduce a prototype contrastive loss, denoted as L_P, to capture the semantic neighborhood information. This loss is defined as follows:
$$\mathcal{L}_P = \mathcal{L}_P^U + \alpha\, \mathcal{L}_P^I$$
where L_P^U and L_P^I are the user-side and item-side prototype contrastive losses introduced in Section 2.2.2.
To capture the information from interactions directly, we adopt the Bayesian Personalized Ranking (BPR) loss [19], which is a well-designed ranking objective function for recommendation. Specifically, the BPR loss enforces the prediction scores of the observed interactions to be higher than those of the sampled unobserved ones. Formally, the objective function of the BPR loss is as follows:
$$\mathcal{L}_{BPR} = \sum_{(u,i,j) \in \mathcal{O}} -\log \sigma\!\left( \hat{y}_{u,i} - \hat{y}_{u,j} \right)$$
where O denotes the set of triples (u, i, j) in which (u, i) is an observed interaction and (u, j) is a sampled unobserved one, and σ(·) is the sigmoid function.
By optimizing the BPR loss L_BPR together with the neighborhood-enriched contrastive objectives, the model can capture the interplay between users and items. Nonetheless, higher-order connections among users (or items) are equally important for making recommendations; for instance, users often purchase items that their neighbors have bought. In the following subsection, we introduce several optimizations of the two contrastive objectives to better harness the inherent neighborhood connections of both users and items.
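A minimal PyTorch sketch of the BPR objective for a batch of (user, positive item, negative item) triples is shown below; the batching scheme is an assumption for exposition.

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb: torch.Tensor,
             pos_item_emb: torch.Tensor,
             neg_item_emb: torch.Tensor) -> torch.Tensor:
    """BPR loss for a batch of (user, observed item, sampled unobserved item)
    triples; each row of the three tensors belongs to one triple."""
    pos_scores = (user_emb * pos_item_emb).sum(dim=-1)
    neg_scores = (user_emb * neg_item_emb).sum(dim=-1)
    return -F.logsigmoid(pos_scores - neg_scores).mean()  # -log sigmoid(pos - neg)
```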
3.3. Optimization for NCL
Building upon the NCL method, we introduce several key optimizations to further enhance its effectiveness and efficiency in capturing local neighborhood information for recommendation systems.
3.3.1. Dynamic Neighborhood Structure
In the standard NCL approach, the neighborhood structure is often fixed and predefined. This static approach may not adapt well to the dynamic nature of user–item interactions. To address this, we propose an adaptive neighborhood structure [20] that evolves during the learning process.
Specifically, we use the K-means clustering algorithm [21] to dynamically identify the semantic neighbors. The clustering process can be represented as
$$(C, I) = \text{K-means}(X)$$
where X is the set of embeddings, C is the set of cluster centroids, and I is the assignment of each embedding to a cluster. This adaptive neighborhood structure is updated in each iteration of the expectation–maximization (EM) algorithm [22], allowing the model to adapt to changing user–item interactions and capture more accurate neighborhood information.
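In practice, this re-clustering step can be implemented with an off-the-shelf K-means routine, as in the sketch below; the use of scikit-learn (rather than a faster library for large embedding tables) is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def update_prototypes(embeddings: np.ndarray, n_clusters: int, seed: int = 0):
    """Re-cluster the current embeddings and return (centroids C, assignments I).

    Intended to be called once per EM iteration so that the semantic
    neighborhoods track the evolving user/item representations."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    assignments = km.fit_predict(embeddings)   # I: cluster index per embedding
    centroids = km.cluster_centers_            # C: prototype vectors
    return centroids, assignments
```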
3.3.2. Unified Optimization of Contrastive Objectives
The original NCL approach optimizes the contrastive objectives separately, which may not be efficient and can lead to suboptimal solutions. We propose a unified optimization framework that balances the trade-off between the structural and semantic neighborhood information.
The unified optimization objective can be represented as
$$\mathcal{L} = \mathcal{L}_{BPR} + \lambda_1 \mathcal{L}_S + \lambda_2 \mathcal{L}_P$$
where L_S is the self-supervised learning loss for structural neighbors, L_P is the prototype contrastive loss for semantic neighbors, and λ1 and λ2 are weight parameters. This unified optimization can lead to more efficient learning and better performance.
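A sketch of how the three terms are combined into a single objective for one backward pass is given below; the symbol names mirror the formulation above.

```python
import torch

def unified_loss(l_bpr: torch.Tensor,
                 l_struct: torch.Tensor,
                 l_proto: torch.Tensor,
                 lambda_1: float,
                 lambda_2: float) -> torch.Tensor:
    """Combine the BPR loss with the structural and prototype contrastive
    losses into one objective, optimized jointly in a single backward pass."""
    return l_bpr + lambda_1 * l_struct + lambda_2 * l_proto
```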
3.3.3. Regularization on Prototypes
To prevent the prototypes from being too close to each other, which would reduce their discriminative power, we introduce a regularization term on the prototypes in the prototype contrastive objective. The regularized prototype contrastive loss can be represented as
$$\tilde{\mathcal{L}}_P = \mathcal{L}_P - \beta\, \| C - C' \|_F$$
where C and C′ are the current and previous cluster centroids, β is a regularization parameter, and ‖·‖_F denotes the Frobenius norm. Because a larger shift of the centroids between iterations reduces this loss, the regularization encourages the prototypes to move in each iteration rather than stagnate, which can enhance the performance of contrastive learning.
In summary, these refinements provide a more flexible and efficient way to capture the neighborhood information in recommendation systems. They offer a new perspective on how to leverage contrastive learning in recommendation systems, leading to improved performance and faster convergence. By integrating these refinements with the SVD-based initialization, our proposed method provides a comprehensive solution for enhancing the performance of recommendation systems.
4. Experiments and Evaluation
4.1. Datasets
We evaluate the performance of our proposed method on five public datasets: MovieLens-1M (ML-1M) [23], Yelp2018 [24], Amazon Books, Gowalla, and Alibaba-iFashion [1]. These datasets vary in domain, scale, and density. For Yelp2018 and Amazon Books, we filter out users and items with fewer than 15 interactions to ensure data quality. The statistics of the datasets are summarized in Table 1.
The selection of these datasets was driven by a few key considerations:
Variety in Domains: These datasets span multiple domains—movie ratings, restaurant reviews, book reviews, social networking check-ins, and fashion. This wide coverage helps ensure the robustness of the model across diverse domains, which is key to a good machine learning model.
Scale and Density: These datasets vary not only in terms of the number of records (scale), but also in terms of the density of interactions. Some datasets may have a high number of interactions per user/item (dense), whereas others might have fewer interactions per user/item (sparse). Both of these scenarios pose unique challenges in recommendation systems, and dealing with both in training helps ensure the model’s adaptability.
Data Quality: For Yelp2018 and Amazon Books, filters have been applied to exclude users and items with fewer than 15 interactions. This decision is to ensure data quality and reliable signal in the data. It helps avoid cases where the model might overfit to users/items with very few interactions, thus making the evaluation more reliable.
Overall, these datasets were chosen to ensure that the evaluation is both rigorous and representative of various real-world situations.
For each dataset, we randomly select 80% of the interactions as training data and 10% as validation data, and reserve the remaining 10% for performance comparison. We uniformly sample one negative item for each positive instance to form the training set.
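A simple sketch of this splitting and negative-sampling procedure, assuming implicit-feedback (user, item) pairs, is given below.

```python
import numpy as np

def split_and_sample(interactions, n_items, seed=0):
    """80/10/10 random split of (user, item) pairs plus one uniformly sampled
    negative item per positive training instance."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(interactions))
    n_train, n_valid = int(0.8 * len(idx)), int(0.1 * len(idx))
    train = [interactions[i] for i in idx[:n_train]]
    valid = [interactions[i] for i in idx[n_train:n_train + n_valid]]
    test = [interactions[i] for i in idx[n_train + n_valid:]]

    observed = set(interactions)
    triples = []
    for u, i in train:
        j = int(rng.integers(n_items))
        while (u, j) in observed:            # resample until the item is unobserved
            j = int(rng.integers(n_items))
        triples.append((u, i, j))            # (user, positive item, negative item)
    return triples, valid, test
```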
4.2. Experiment Setup
4.2.1. Compared Models
We compare our proposed method with the following state-of-the-art models:
SGL [25]: Incorporates self-supervised learning to improve recommendation systems. Our chosen model for SGL is SGL-ED.
NGCF [8]: Leverages the user–item bipartite graph to include high-order connections and employs a GNN to bolster CF techniques.
NCL [3]: Advances graph collaborative filtering using neighborhood-enriched contrastive learning, which our approach is rooted in. We utilize RUCAIBox/NCL as the model representation of NCL.
4.2.2. Evaluation Metrics
To assess the efficacy of top-N recommendations [26], we employ the commonly utilized metrics of Recall@N and NDCG@N [27,28], with N values of 10, 20, and 50 to maintain uniformity. As per earlier studies, we use the full-ranking approach, ranking all candidate items that the user has not engaged with.
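For reference, the following sketch computes Recall@N and NDCG@N for a single user under the full-ranking protocol; it assumes binary relevance, consistent with implicit feedback.

```python
import numpy as np

def recall_ndcg_at_n(ranked_items, relevant_items, n=10):
    """Recall@N and NDCG@N for one user under full ranking.

    ranked_items:   item ids sorted by predicted score (descending), with
                    training items of this user already excluded.
    relevant_items: set of ground-truth items from the test split.
    """
    hits = [1.0 if item in relevant_items else 0.0 for item in ranked_items[:n]]

    recall = sum(hits) / max(len(relevant_items), 1)

    dcg = sum(h / np.log2(rank + 2) for rank, h in enumerate(hits))
    ideal = min(len(relevant_items), n)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, ndcg
```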
4.3. Implementation Details
We utilize the RecBole [29] open-source framework to develop our model, as well as all baseline algorithms. For a balanced comparison, we employ the Adam optimizer across all methods and meticulously fine-tune the hyperparameters for each baseline. We designate a batch size of 4096 and use the standard Xavier distribution for initializing parameters. The embedding dimensions are configured to 64. To deter overfitting, we apply early stopping after 10 epochs without improvement, using NDCG@10 as the benchmark indicator. We tune the weight hyperparameters λ1 and λ2, the temperature hyperparameter τ in [0.01, 1], and the number of clusters k in [5, 10,000].
4.4. Overall Performance
Table 2 presents a comprehensive performance comparison of the SGL model, NCL model, NGCF model, and our proposed model LoRA-NCL across various datasets. The results are insightful and reveal several key observations.
Firstly, LoRA-NCL consistently outperforms NCL across most datasets. The superior performance of LoRA-NCL can be attributed to its ability to effectively capture both the global structure and local neighborhood information inherent in the user–item interaction graph. This is achieved through the integration of Singular Value Decomposition (SVD) and an optimized version of NCL, which allows LoRA-NCL to leverage both explicit and implicit feedback from users, thereby enhancing the recommendation performance.
Interestingly, there are instances where NCL outperforms LoRA-NCL. This can be attributed to the inherent differences in the learning mechanisms of the two models. NCL, with its focus on capturing local neighborhood information, might be more effective in scenarios where local patterns and dependencies play a more significant role in user–item interactions. On the other hand, LoRA-NCL, which aims to capture both global and local structures, might be less effective when the global structure is sparse or less informative.
In conclusion, while LoRA-NCL generally outperforms NCL, the choice between the two models should be guided by the specific characteristics of the dataset and the computational resources available.
Table 3 presents the performance comparison of LoRA-NCL with different embedding sizes. The results are insightful and reveal several key observations.
LoRA-NCL (256) and LoRA-NCL (128) outperform LoRA-NCL (64) in most of the metrics across all datasets. For instance, in the MovieLens-1M dataset, LoRA-NCL (256) outperforms LoRA-NCL (64) by approximately 1.2% in Recall@10, 1.1% in NDCG@10, 1.8% in Recall@20, 1.4% in NDCG@20, 1.8% in Recall@50, and 4.6% in NDCG@50. Similar trends can be observed in other datasets such as Yelp, Amazon Books, and Gowalla.
The performance difference between LoRA-NCL (64) and LoRA-NCL (256) can be attributed to the increased capacity of the model with a larger embedding size. A larger embedding size allows the model to capture more nuanced features of the user–item interactions, leading to better performance. However, it is important to note that the improvement comes at the cost of increased computational complexity and memory usage.
The possible reason that higher embedding size led to better performance is that a larger embedding size provides a more expressive representation space for the items and users. This allows the model to capture more complex and subtle patterns in the user–item interactions, which can lead to improved recommendation performance. However, it is important to note that the benefits of a larger embedding size should be weighed against the increased computational cost and the risk of overfitting.
4.5. Further Analysis
In this subsection, we delve deeper into the results of our experiments to gain more insights into the performance of our proposed method.
4.5.1. Performance across Different Datasets
Our method demonstrated varying performance across the different datasets. We used Recall and Normalized Discounted Cumulative Gain (NDCG) as our primary performance metrics. Recall, i.e., the proportion of true positives to the combined total of true positives and false negatives, measures how many of the relevant items are successfully retrieved in the recommendation list. NDCG, on the other hand, is a measure of ranking quality, summing up the graded relevance values of all results in the list with a position-dependent discount.
4.5.2. Comparison with Other Methods
Compared to other leading-edge techniques, our suggested approach demonstrated enhanced effectiveness. The NDCG scores of our method were consistently higher than those of the other methods, and the recall scores were higher than those of the other methods in all but one case. Overall, our method shows the clearest advantage in terms of NDCG.
4.6. Impact of Parameter Choices
The performance of our model is influenced by several parameters, one of the most significant being the size of the embeddings. Embedding size refers to the dimensionality of the vectors used to represent items in the recommendation system.
In our experiments, we found that the choice of embedding size had a substantial impact on the performance of our model. Specifically, smaller embedding sizes tended to result in faster training times but at the cost of model accuracy. On the other hand, larger embedding sizes led to more accurate models, but with an increase in computational complexity and training time.
Interestingly, there appeared to be a ‘sweet spot’ for the embedding size. Beyond a certain point, increasing the embedding size did not lead to significant improvements in model performance, and in some cases, it even led to a decrease in performance. This could be due to the model overfitting to the training data when given too many parameters to learn.
Therefore, it is crucial to carefully choose the embedding size when implementing our model. We recommend conducting a thorough parameter tuning process, such as grid search or random search, to find the optimal embedding size for the specific dataset and problem at hand.
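A minimal sketch of such a grid search over candidate embedding sizes is given below; train_fn and evaluate_fn are placeholders for the reader’s own training and validation routines.

```python
def grid_search_embedding_size(train_fn, evaluate_fn, sizes=(32, 64, 128, 256)):
    """Pick the embedding size with the best validation score.

    train_fn(size)     -> a model trained with the given embedding dimension.
    evaluate_fn(model) -> validation metric (e.g., NDCG@10) of the model.
    """
    best_size, best_score = None, float("-inf")
    for size in sizes:
        model = train_fn(size)
        score = evaluate_fn(model)
        if score > best_score:
            best_size, best_score = size, score
    return best_size, best_score
```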
4.7. Limitations and Future Directions
While our proposed method shows promising results, it also has several limitations. One of the main limitations is the sensitivity of the model to the choice of embedding size. The performance of our model can vary significantly with different embedding sizes, and finding the optimal size can be a computationally intensive process. Moreover, the optimal embedding size may not be the same for all datasets, adding another layer of complexity to the problem.
Another limitation is related to parameter tuning. Our model has several hyperparameters that need to be carefully tuned to achieve the best performance. However, the optimal set of parameters can vary depending on the specific characteristics of the dataset and the problem at hand, making the tuning process challenging and time-consuming.
Despite these limitations, our research opens up several avenues for future work. One potential direction is to develop more efficient methods for determining the optimal embedding size and tuning the model parameters. This could involve using more advanced optimization techniques or incorporating additional prior knowledge about the problem into the tuning process. Another interesting direction would be to explore ways to make the model less sensitive to the choice of embedding size and other parameters, thereby making it more robust and easier to use. We believe that addressing these limitations and exploring these future directions can further improve the performance and applicability of our method.
5. Conclusions
While our method shows promising results, it has several limitations that provide directions for future work. First, our method assumes that the user–item interaction graph is static, which may not hold in real-world scenarios where user–item interactions are dynamic. Future work could explore how to incorporate temporal information into our method. Second, our method relies on the K-means algorithm to identify semantic neighbors, which may not be optimal for all datasets. Future work could investigate other clustering algorithms or learn the neighborhood structure in an end-to-end manner. Lastly, our method is designed for explicit feedback data. Adapting it to implicit feedback data, where only positive interactions are observed, is another interesting direction for future research.