1. Introduction
In the era of information explosion, an increasing number of social applications are location based, and most people are willing to use location-based social software (such as Facebook, Instagram, and Weibo) to record their daily lives and share their thoughts and experiences. These applications utilize users’ geographical location information to provide functions such as recommendations, check-ins, and social interactions based on real-time location. For example, users can view the activities of nearby friends on social platforms, share their location, or check in at specific places. This indicates that personalized location-based recommendation services are becoming increasingly important for users. To help users discover Point of Interest (POI) from massive location information and recommend appropriate POI in real-time, personalized POI recommendation systems are essential. POI refers to a specific location in geographic space, typically a place that users are interested in visiting, staying at, or interacting with. POIs can include various types of locations such as restaurants, shopping malls, parks, attractions, historical sites, and more. In location-based social software, POI is a key concept because it directly influences user behavior and the performance of recommendation systems. POI data not only include basic information about the location, such as its name, address, and category, but may also include user ratings, reviews, photos, and other interaction data. By analyzing these data, recommendation systems can infer user preferences and recommend suitable POIs. With the development of Geographic Information Systems (GISs) and location data technology, POIs are playing an increasingly important role in personalized recommendations, navigation, tourism, and marketing. Among POI recommendation systems, the next POI recommendation holds a particularly critical position. 
Simply put, the next POI recommendation provides suitable POI suggestions for users’ next actions based on their check-in data and historical trajectories [1,2,3,4].
Recently, many researchers have delved into POI recommendation methods based on deep learning networks. These methods include Recurrent Neural Networks (RNNs) [5,6,7,8], Transformers [9,10], Graph Convolutional Networks (GCNs) [11,12,13,14,15], and Graph Attention Networks (GATs) [16,17]. These approaches capture latent, nonlinear relationships between users and POIs. To address implicit feedback, some researchers have proposed deep learning methods to model local and global relationships separately, enabling personalized preference learning [18] and simulating user choices. While such methods improve recommendation performance to some extent, significant challenges remain.
One key limitation of these methods is the reliance on user–item matrices to model personalized preferences and latent features, which can be flawed. To mitigate the cold-start problem, many researchers have explored user–POI interaction patterns to learn latent relationships between them. While effective to some degree, these methods often overlook high-order collaborative signals, preventing them from capturing the diversity of relationships between users and POIs. Graph-based deep learning methods offer an advantage by capturing high-order collaborative structure information. Given the advances in GNNs for modeling complex relationships, several studies [19,20,21] have employed Hypergraph Neural Networks (HGNNs) to learn latent user and POI representations.
For example, the Temporal Graph Convolutional Network (T-GCN) combines graph and temporal convolutional advantages to capture spatiotemporal information, making it suitable for modeling temporal dynamics in POI recommendations. Hypergraph Convolutional Networks (HGCNs), inspired by hypergraph structures, capture high-order collaborative relationships between users and POIs and between POIs themselves, enabling them to model spatiotemporal dependencies comprehensively. Compared to traditional graph models, HGCN handles high-order neighborhood relationships more effectively, alleviating data sparsity and over-smoothing problems. Another approach, GAT, introduces attention mechanisms to dynamically adjust relationships between POIs, enhancing its ability to handle sparse data and improving recommendation quality.
Despite the progress made in next POI recommendation, there remain two critical challenges: (1) existing research often ignores the diversity and dynamic changes in user preferences across contexts, resulting in limited and overly complex user representations. For instance, user behaviors are influenced by spatial and temporal factors and other dimensions. Current graph and hypergraph methods often entangle these preferences, failing to capture multi-dimensional and multi-level user behaviors accurately. (2) Existing methods lack in-depth modeling of collaborative relationships across dimensions, limiting their ability to integrate multi-dimensional representations effectively. Different perspectives should complement and enhance each other to improve the overall recommendation performance.
In this paper, we propose an innovative model called Multi-View Contrastive Fusion Hypergraph Learning (MVHGAT) to address these challenges in the next POI recommendation. The model decouples the complex relationships between users and POIs—such as interaction, trajectory, and geographical relationships—to construct multi-view representations. These relationships are crucial for next POI recommendation. Utilizing hypergraphs, which can effectively represent complex relationships and high-order dependencies that traditional graph methods struggle with, we adopt three hypergraphs, interaction, trajectory, and geographical hypergraphs, to capture global dependencies between nodes from different perspectives.
We design specific hypergraph convolutional networks for each view to encode POIs and learn latent factors for interaction, trajectory, and location views. To further integrate multi-view information, we employ an adaptive fusion strategy to combine user representations dynamically. Additionally, multi-view contrastive learning is used to capture high-order relationships across views, leveraging a self-enhancing mechanism to deepen the complementary recommendation effects among views.
Extensive experiments on three publicly available datasets demonstrate the significant advantages of MVHGAT in next POI recommendation tasks. In summary, our main contributions are as follows:
We address two challenging yet practical issues in next POI recommendation and propose an innovative framework, MVHGAT, to enhance the recommendation performance.
We design three distinct hypergraph convolutional structures—interaction, trajectory, and location hypergraphs—and tailor the hypergraph convolutions to suit different learning needs.
We employ multi-view weighted contrastive learning with self-enhancement to collaboratively supervise across views, addressing the difficulty of capturing complementary recommendation effects during learning.
Extensive experiments on three real-world datasets validate the effectiveness of MVHGAT compared to various state-of-the-art methods.
The structure of the paper is as follows. In Section 2, we introduce the related work. In Section 3, we provide a preliminary introduction to the task formulation and the concept of hypergraphs. Section 4 outlines the methodology, describing the proposed MVHGAT model in detail. Section 5 presents the experimental setup, including the datasets, evaluation metrics, and performance comparisons with the baseline methods. Finally, Section 6 concludes the paper and discusses potential future work.
2. Related Work
The next POI recommendation aims to recommend the most suitable next location for users based on their recent visit behaviors. Most existing methods adopt sequential models to address this task, ranging from Markov Chains [5] to Recurrent Neural Networks (RNNs) and their variants [6,8], and more recently, self-attention mechanisms [9,10]. For example, the Markov Chain method predicts the next POI by modeling the probability distribution of user visit sequences, but its limitation is that it can only capture short-term dependencies. To overcome this limitation, RNNs were introduced to model the temporal nature of user behaviors. However, RNNs are prone to the vanishing gradient problem when handling long sequences, so their variants, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), are widely adopted. Additionally, the Spatio-Temporal Gated Network (STGN) further incorporates spatiotemporal information, enhancing the accuracy of recommendations. Despite these advancements, these sequential methods primarily focus on modeling individual user trajectories and often neglect non-continuous POI relationships within a trajectory or across different users.
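To make the short-term nature of the Markov Chain baseline concrete, the following is a minimal sketch of a first-order model that counts observed POI-to-POI transitions and recommends the most frequent successors of the current POI. The function names and the toy trajectories are illustrative assumptions and are not taken from any of the cited methods.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def fit_transitions(trajectories: Sequence[Sequence[Hashable]]) -> Dict[Hashable, Counter]:
    """Count how often each POI is immediately followed by each other POI."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for trajectory in trajectories:
        for current, following in zip(trajectory, trajectory[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts: Dict[Hashable, Counter], current: Hashable, k: int = 1) -> List[Hashable]:
    """Return up to k most frequent successors of `current` (empty if unseen)."""
    successors = counts.get(current, Counter())
    return [poi for poi, _ in successors.most_common(k)]


# Toy check-in sequences (hypothetical data).
trajectories = [
    ["cafe", "office", "gym"],
    ["cafe", "office", "bar"],
    ["home", "cafe", "office"],
    ["office", "gym"],
]
counts = fit_transitions(trajectories)
top_after_office = predict_next(counts, "office", k=2)
```

Because the prediction depends only on the current POI, two users with very different histories receive the same suggestion, which is the short-term limitation noted above.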
With the rapid development of graph convolutional models, many researchers have begun applying graph learning methods to POI recommendation, ranging from traditional graph learning approaches [12] to hypergraph learning methods [22], as well as the recent surge in Graph Neural Networks (GNNs) and Hypergraph Neural Networks (HGNNs) [16,19,20,21]. For example, Graph-Flashback [23] empowers POI representations using spatiotemporal knowledge graphs and combines them with RNN-based methods to capture sequential transition patterns in user behaviors. Graph-Flashback constructs a spatiotemporal knowledge graph to transform users’ spatiotemporal behaviors into graph-structured data and utilizes Graph Neural Networks for modeling, thereby better capturing users’ spatiotemporal preferences. This method has achieved significant results across multiple datasets, demonstrating particularly strong performance on the Gowalla and Foursquare datasets. Lai et al. proposed a multi-view spatiotemporal-enhanced hypergraph network to integrate spatiotemporal information and high-order collaborative signals, demonstrating the strong performance of HGNNs in POI recommendation. Hypergraph Neural Networks (HGNNs) introduce hypergraph structures, enabling the more effective modeling of complex relationships and high-order interactions, thereby enhancing the accuracy and personalization of recommendation systems.
However, most graph- or hypergraph-based methods fail to account for differences in user preferences across multiple dimensions, leading to suboptimal user representations and potential confounding. Although a few studies [24] have attempted to use multi-view learning to model different preferences separately, they often only learn preferences straightforwardly without effectively distinguishing the latent features of different views.
We propose a multi-view hypergraph contrastive learning method to address this issue. This method employs distinct hypergraph convolutional models to learn from different hypergraphs. It aims to decouple user representations across interaction, trajectory, and location views and leverage multi-view contrastive learning to capture the diverse preferences of different users more accurately. Thus, it improves the accuracy and personalization of the next POI recommendations.
3. Preliminary
This section begins by defining the problem of the next POI recommendation, followed by an introduction to the concept of a hypergraph.
3.1. Task Formulation
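Let $U = \{u_1, u_2, \ldots, u_{|U|}\}$ and $\mathcal{L} = \{l_1, l_2, \ldots, l_{|\mathcal{L}|}\}$ represent the set of users and Points of Interest, respectively. Each POI $l \in \mathcal{L}$ is defined by a unique geographic coordinate, denoted by $(lat_l, lon_l)$. For each user $u \in U$, their trajectory is represented as $T_u = \{(l_1, t_1), (l_2, t_2), \ldots, (l_n, t_n)\}$, where each pair $(l_i, t_i)$ indicates that user $u$ visited POI $l_i$ at time $t_i$. Given a target user $u$ and their trajectory sequence $T_u$, the task of the next POI recommendation is to predict the top-K POIs that the user is most likely to visit at the subsequent timestamp.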
3.2. Hypergraph
A hypergraph [22,25] is a generalization of a graph that captures more complex relationships than traditional graphs. In a standard graph, each edge connects exactly two vertices, whereas in a hypergraph, each hyperedge can connect two or more vertices. Formally, a hypergraph is defined as $G = (V, E)$, where $V$ represents the set of vertices, containing all nodes, and $E$ represents the set of hyperedges, with each hyperedge capable of connecting multiple vertices. To describe the topology of the hypergraph, an incidence matrix $H \in \{0, 1\}^{|V| \times |E|}$ is introduced. Specifically, if a node $v \in V$ belongs to a hyperedge $e \in E$, then $H(v, e) = 1$; otherwise, $H(v, e) = 0$.
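As a small illustration of the incidence matrix just defined, the sketch below builds it from a list of hyperedges, each given as a list of node indices. The helper names and the toy hypergraph are assumptions made for this example only.

```python
from typing import List, Sequence


def build_incidence_matrix(num_nodes: int, hyperedges: Sequence[Sequence[int]]) -> List[List[int]]:
    """Return H with H[v][e] = 1 if node v belongs to hyperedge e, else 0."""
    H = [[0] * len(hyperedges) for _ in range(num_nodes)]
    for e, members in enumerate(hyperedges):
        for v in members:
            H[v][e] = 1
    return H


def node_degrees(H: Sequence[Sequence[int]]) -> List[int]:
    """Number of hyperedges each node belongs to (row sums of H)."""
    return [sum(row) for row in H]


def edge_degrees(H: Sequence[Sequence[int]]) -> List[int]:
    """Number of nodes contained in each hyperedge (column sums of H)."""
    return [sum(col) for col in zip(*H)]


# Toy hypergraph: 4 nodes, one hyperedge joining three nodes and one joining two.
hyperedges = [[0, 1, 2], [2, 3]]
H = build_incidence_matrix(4, hyperedges)
```

Note that the first hyperedge joins three nodes at once, which a single edge of an ordinary graph cannot express.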
We construct three hypergraphs to capture the complex relationships between users and POIs from different perspectives. First, the Interactive Hypergraph models user–POI interactions and user collaboration by treating POIs as nodes and user trajectories as hyperedges. An incidence matrix records user visits to POIs, helping to identify users with similar visiting patterns and enriching preference information for recommendations. Second, the Trajectory Hypergraph focuses on transitions between POIs in user trajectories. Directed hyperedges represent user movements, and an incidence matrix describes these transitions, aiding in predicting the next POI by analyzing user dynamics. Lastly, the Location Hypergraph captures geographical relationships between POIs by connecting those within a distance threshold using hyperedges. An incidence matrix indicates proximity, enhancing recommendations by incorporating users’ geographical preferences.
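One possible way to build the Location Hypergraph is sketched below, under the assumption that each POI generates one hyperedge containing itself and every POI within a distance threshold. The text above does not specify the distance measure or how hyperedges are formed, so the haversine distance, the one-hyperedge-per-POI rule, and the sample coordinates are all illustrative choices.

```python
import math
from typing import List, Sequence, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: Coordinate, b: Coordinate) -> float:
    """Great-circle distance between two coordinates in kilometres."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def location_hyperedges(coords: Sequence[Coordinate], threshold_km: float) -> List[List[int]]:
    """One hyperedge per POI: the POI itself plus all POIs within threshold_km."""
    return [
        [j for j, other in enumerate(coords) if haversine_km(center, other) <= threshold_km]
        for center in coords
    ]


# Hypothetical POIs: two close together in Manhattan, one far away near JFK.
coords = [(40.7580, -73.9855), (40.7614, -73.9776), (40.6413, -73.7781)]
edges = location_hyperedges(coords, threshold_km=2.0)
```

The resulting hyperedges can then be turned into an incidence matrix in the same way as for the other two views.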
4. Methodology
This section provides a detailed explanation of our proposed Multi-View Contrastive Fusion Hypergraph Learning method. As illustrated in Figure 1, we first design tailored hypergraph convolutional networks with adjusted aggregation and propagation strategies for different hypergraph structures. This enables effective hypergraph learning to uncover the latent complex relationships between users and POIs. Subsequently, we employ multi-view contrastive learning to capture complementary recommendation effects across different views, enhancing recommendation performance. Finally, we present our prediction and optimization approach to achieve more accurate POI recommendations.
In the next POI recommendation, complex relationships exist between users and POIs, such as user–POI interactions, POI–POI trajectory relationships, and POI–POI location relationships. The previous methods typically used graphs to represent these relationships, where users and POIs are treated as nodes, and their relationships are modeled as edges [26,27]. However, traditional graph structures are limited to pairwise relationships and cannot effectively connect higher-order neighbors within the same semantic context. Inspired by the highly flexible structure of hypergraphs, we innovatively designed three distinct types of hypergraphs: the interaction view hypergraph, the trajectory view hypergraph, and the geographical view hypergraph.
4.1. Anchor Attention Interaction Hypergraph Convolutional Network
In this study, we propose an innovative anchor attention interaction hypergraph convolutional network. The key innovation of this method lies in dynamically weighting node embeddings by leveraging contextual information, thereby enhancing the model’s ability to capture high-order relationships among Points of Interest (POIs). We first outline the basic framework of the hypergraph convolutional network interaction and subsequently introduce the anchor attention mechanism to emphasize critical features, addressing the limitations of traditional Graph Neural Networks (GNNs) in modeling high-order relationships. The proposed approach is built upon an interaction hypergraph convolutional network architecture, which captures high-order relationships between nodes through a two-step information propagation process. In this framework, hyperedges act as intermediaries for node aggregation and facilitate cross-hyperedge propagation. We introduce a contextual anchor attention mechanism to mitigate the potential over-smoothing issue often encountered in traditional graph convolution methods when modeling high-order relationships and improve the expressiveness of node features. This mechanism adaptively adjusts the weights of node embeddings based on global contextual information, enhancing the model’s focus on critical features. The core concept is to generate dynamic attention factors derived from the contextual information of input features, which are then used to weight node embeddings. As an initial step, global average pooling is applied to the input features to extract global contextual information:
$$z = \mathrm{GAP}(X)$$
Here, $X$ denotes the input node features and $z$ the pooled global context. Then, dimensionality reduction and activation are performed: the features are reduced to a lower dimension through a fully connected layer, followed by a nonlinear transformation using the SiLU activation function:
$$s = \mathrm{SiLU}(\mathrm{FC}_1(z))$$
where SiLU is the activation function and $\mathrm{FC}_1$ is the dimensionality-reducing fully connected layer. Next, dimensionality expansion and attention generation are performed: the reduced features are expanded back to the original dimension through another fully connected layer, $\mathrm{FC}_2$, followed by the Sigmoid activation function to generate the attention factors:
$$a = \mathrm{Sigmoid}(\mathrm{FC}_2(s))$$
Finally, the generated attention factors are multiplied element-wise with the node embeddings to perform weighting:
$$X' = a \odot X$$
This allows the model to dynamically adjust the weight of each node’s embedding based on its contextual information, thereby enhancing the expressiveness of key features. We combine the contextual anchor attention mechanism with the interaction hypergraph convolutional network to learn node embeddings. Initially, the POI embeddings are propagated through the graph convolutional network. In each layer, the representation of POI nodes is updated through the node-to-hyperedge aggregation and hyperedge-to-node propagation steps, capturing high-order relationships between users and POIs. Next, the anchor attention mechanism is applied to the outputs of each layer to dynamically adjust the weights of node embeddings using global contextual information, thereby enhancing the model’s focus on key features. Specifically, attention factors are generated through global average pooling, dimensionality reduction activation, and dimensionality expansion and then used to adjust the node feature representations by weighted multiplication with the node embeddings. To avoid the over-smoothing problem, residual connections are applied in each layer, and the embeddings from all layers are fused through mean pooling to generate the final POI node representations. Through this series of operations, the anchor attention interaction hypergraph convolutional network can efficiently and accurately learn the complex relationships within the multi-view interaction hypergraph, thereby improving the performance of the recommendation system.
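The propagation scheme described in this subsection (node-to-hyperedge aggregation, hyperedge-to-node propagation, a residual connection per layer, and mean pooling over layers) can be sketched as follows. This is a simplified illustration rather than the model itself: it assumes plain mean aggregation with no learned weights, and it leaves out the anchor attention step to keep the example short.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def node_to_edge(H: Sequence[Sequence[int]], X: Matrix) -> Matrix:
    """For each hyperedge, average the embeddings of its member nodes."""
    dim = len(X[0])
    out = []
    for e in range(len(H[0])):
        members = [v for v in range(len(H)) if H[v][e]]
        out.append([sum(X[v][k] for v in members) / max(len(members), 1) for k in range(dim)])
    return out


def edge_to_node(H: Sequence[Sequence[int]], E: Matrix) -> Matrix:
    """For each node, average the embeddings of the hyperedges it belongs to."""
    dim = len(E[0])
    out = []
    for v in range(len(H)):
        edges = [e for e in range(len(H[0])) if H[v][e]]
        out.append([sum(E[e][k] for e in edges) / max(len(edges), 1) for k in range(dim)])
    return out


def hypergraph_encode(H: Sequence[Sequence[int]], X0: Matrix, num_layers: int) -> Matrix:
    """Stack propagation layers with residuals, then mean-pool all layer outputs."""
    layers = [X0]
    X = X0
    for _ in range(num_layers):
        message = edge_to_node(H, node_to_edge(H, X))
        X = [[m + x for m, x in zip(m_row, x_row)] for m_row, x_row in zip(message, X)]
        layers.append(X)
    return [
        [sum(layer[v][k] for layer in layers) / len(layers) for k in range(len(X0[0]))]
        for v in range(len(X0))
    ]


# Toy input: 3 POIs, 2 hyperedges, 1-dimensional embeddings.
H = [[1, 0], [1, 1], [0, 1]]
X0 = [[1.0], [3.0], [5.0]]
pooled = hypergraph_encode(H, X0, num_layers=1)
```

In the toy input, the middle POI belongs to both hyperedges, so it receives information from both of its neighbors after a single layer.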
4.2. Compressed Activation Attention Trajectory Hypergraph Convolutional Network
To learn POI representations from trajectory hypergraphs, we propose an innovative directed hypergraph convolutional network that integrates compressed activation attention. This method combines the trajectory hypergraph structure with the compressed activation attention mechanism, capturing the complex relationships between nodes while enhancing the focus on essential features, thus improving the model’s performance. Traditional hypergraph convolutional networks are mainly used for undirected hypergraphs and cannot effectively model the directed relationships between nodes. However, in practical scenarios such as recommendation systems, interactions between nodes are often directional (the user’s visit trajectory to POIs). To address this, we adopt a trajectory hypergraph convolutional layer, which explicitly models the directed relationships between nodes within the hypergraph structure. We introduce the compressed activation attention mechanism into the trajectory hypergraph convolutional layer to further enhance the model’s focus on essential features in node embeddings. The compressed activation module explicitly models the importance coefficients of each feature, dynamically adjusting their weights to improve the model’s ability to express key features. First, dimensionality reduction and activation are applied to the global information of each channel:
$$s = \delta(W_1 x)$$
Here, $x$ is the input feature, $W_1$ is the dimensionality reduction fully connected layer, and $\delta$ is the activation function. Then, the original dimensionality of the channels is restored, and the attention factors are generated:
$$a = \sigma(W_2 s)$$
Here, $\sigma$ is the Sigmoid activation function, $W_2$ is the dimensionality-restoring fully connected layer, and $a$ represents the importance coefficient for each channel. By multiplying the attention factors element-wise with the input features, the weight of each channel is dynamically adjusted:
$$\tilde{x} = a \odot x$$
Here, $\tilde{x}$ represents the weighted node embeddings. In the model, we first process the POI node embeddings through the trajectory hypergraph convolutional layer to capture the directed relationships between nodes. Specifically, the input node embeddings are propagated twice through the directed hypergraph structure: first from the source node to the target node, and then from the target node back to the source node. Initially, the target node’s embeddings are aggregated to the source node. Then, the source node’s embeddings are propagated back to the target node, thus effectively modeling the directed interactions between nodes. We introduce the compressed activation attention mechanism to enhance the model’s focus on essential features. By generating attention weights for different features through a fully connected layer, the model can adaptively adjust the feature channels of node embeddings, focusing on more critical channel features. Finally, the weighted node embeddings are output. The entire model effectively captures the directed relationships between nodes and improves the model’s ability to express key features by dynamically adjusting channel weights.
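The channel-gating step described above (reduce, activate, restore, apply Sigmoid, then rescale the input) follows the general squeeze-and-excitation pattern and can be sketched in a few lines. The text does not name the activation used after the reduction layer, so ReLU is an assumption here; passing `silu` instead gives the variant described for the anchor attention in Section 4.1. All weights below are placeholders rather than trained parameters.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def relu(v: float) -> float:
    return max(0.0, v)


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def linear(W: Sequence[Sequence[float]], b: Sequence[float], x: Sequence[float]) -> Vector:
    """Fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def channel_gate(
    x: Sequence[float],
    W1: Sequence[Sequence[float]], b1: Sequence[float],
    W2: Sequence[Sequence[float]], b2: Sequence[float],
    activation: Callable[[float], float] = relu,  # assumed; not stated in the text
) -> Tuple[Vector, Vector]:
    """Return (gated features, per-channel attention factors in (0, 1))."""
    s = [activation(v) for v in linear(W1, b1, x)]   # reduce dimension, then activate
    a = [sigmoid(v) for v in linear(W2, b2, s)]      # restore dimension, then squash
    return [ai * xi for ai, xi in zip(a, x)], a


# Placeholder weights: 4 channels reduced to 2 and expanded back to 4.
x = [1.0, -2.0, 0.5, 3.0]
W1, b1 = [[0.0] * 4 for _ in range(2)], [0.0, 0.0]
W2, b2 = [[0.0] * 2 for _ in range(4)], [0.0] * 4
gated, factors = channel_gate(x, W1, b1, W2, b2)
```

With all-zero weights every attention factor equals 0.5, so the output is simply the input halved; trained weights would instead emphasize some channels and suppress others.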
4.3. Depth-Separable Geographical Convolution
In this work, we propose Depth-Separable Geographical Convolution, a network architecture that combines geographical information and graph convolution to efficiently model the update of POI embeddings. The network employs a two-stage convolution mechanism: first, information is propagated via geographical graph convolution, and then features are further refined through depth-separable convolution to optimize the POI embeddings. The input to the depth-separable geographical convolution is the embedding representation of POI nodes, $X \in \mathbb{R}^{L \times d}$, where $L$ is the number of POI nodes, and $d$ is the embedding dimension. Additionally, the adjacency matrix of the geographical graph, $A \in \mathbb{R}^{L \times L}$, is required, where each element $A_{ij}$ represents the geographical relationship between node $i$ and node $j$.
In each layer of the graph convolution process, POI embeddings propagate information through multiplication with the adjacency matrix $A$ of the geographical graph. Let the POI embedding at the ℓ-th layer be $X^{(\ell)}$; then the information propagation process can be expressed as
$$X^{(\ell)} = A X^{(\ell-1)}$$
Here, $X^{(0)}$ represents the initial POI embedding, and $X^{(\ell)}$ denotes the embedding at the ℓ-th layer. In this process, the information of the POI nodes is propagated to neighboring nodes through the connectivity of the geographical graph. To alleviate the over-smoothing problem and enhance the expressive power of the representations, we apply residual connections after each convolution layer. Specifically, the output at the ℓ-th layer is obtained by adding the output from the previous layer to the propagated embedding:
$$X^{(\ell)} = A X^{(\ell-1)} + X^{(\ell-1)}$$
This residual connection helps to avoid the excessive smoothing of information while retaining the independent representational capability of each layer. In the final stage of the depth-separable geographical convolution, we incorporate the Depthwise Separable Convolution method to further refine the POI node embedding features and reduce the computational overhead. Depthwise Separable Convolution consists of two stages: Depthwise Convolution and Pointwise Convolution.
Depthwise Convolution: Each input channel is convolved independently, and thus depthwise convolution processes the input features channel by channel. The output after depthwise convolution undergoes a nonlinear transformation through Batch Normalization and the LeakyReLU activation function.
Pointwise Convolution: A 1 × 1 convolution is used to adjust the number of channels, followed by Batch Normalization and the LeakyReLU activation function to enhance the feature representation further. The main advantage of depthwise separable convolution is its low computational cost, significantly reducing the number of parameters while retaining complex feature representations.
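The parameter saving mentioned above can be checked with simple arithmetic. The sketch below compares the weight count of a standard one-dimensional convolution with that of its depthwise separable counterpart, ignoring biases and Batch Normalization parameters. The kernel size and channel counts are arbitrary example values, not settings reported for the model.

```python
def standard_conv_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights of a standard 1D convolution (bias excluded)."""
    return kernel_size * c_in * c_out


def depthwise_separable_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Depthwise stage (one kernel per input channel) plus 1 x 1 pointwise stage."""
    depthwise = kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Example values (assumed): kernel size 3, 64 input and 64 output channels.
standard = standard_conv_params(3, 64, 64)
separable = depthwise_separable_params(3, 64, 64)
ratio = separable / standard
```

For these values the separable version needs 4288 weights instead of 12288, which is about 35% of the original count; in general the ratio equals 1/c_out + 1/kernel_size.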
Let the input to the depthwise separable convolution module be the ℓ-th layer POI embedding $X^{(\ell)}$; the output is
$$\hat{X}^{(\ell)} = \mathrm{DepthwiseSeparableConv}(X^{(\ell)})$$
Here, DepthwiseSeparableConv represents the depthwise separable convolution operation, and its output, $\hat{X}^{(\ell)}$, is the updated POI embedding. After multiple layers of graph convolution and depthwise separable convolution, the embeddings are integrated through average pooling to obtain the final embedding representation:
$$X_{geo} = \frac{1}{L} \sum_{\ell=1}^{L} \hat{X}^{(\ell)}$$
where $L$ is the number of graph convolution layers, each representing information propagation at a different level. The embedding $X_{geo}$ is the final output of the network and can be used for downstream tasks, such as POI recommendation.
4.4. Multi-View Weighted Contrastive Learning
In this section, we propose a multi-view contrastive learning approach to enhance the specific representations of users and POIs (Points of Interest) in each view by exploring the key collaborative effects between the interaction, trajectory, and geographical perspectives. Specifically, the method optimizes the interactions between different views by introducing self-supervised signals, allowing them to complement each other during the learning process. For instance, in the interaction view, the model can use users’ historical behavior data to generate user representations; in the trajectory view, the users’ movement path information can be incorporated to enrich the representations further; and in the geographical view, the model can take into account users’ geographic preferences and location information.
Through this multi-view contrastive learning, the model captures the shared features of users and POIs across different views and effectively distinguishes between various entities, improving the accuracy and robustness of the recommendation system. For example, the model can better understand users’ diverse needs and preferences in recommendation tasks, providing more personalized and precise recommendations. Additionally, applying self-supervised signals enables the model to adapt flexibly to unsupervised or semi-supervised scenarios.
The core of this method is to enhance the consistency of user and POI representations across different views through contrastive learning. Specifically, we treat the representations of the same user or POI in different views as positive pairs, while those of different users or POIs are treated as negative pairs. In this way, the model learns to generate similar representations across different views, thereby enhancing the consistency of the embeddings. This approach also helps the model distinguish between different entities, improving its accuracy in recognition and classification tasks. Through contrastive learning, the model can better capture the shared features of users and POIs across different views, leading to higher performance and robustness in tasks such as recommendation and analysis. Specifically, we achieve this goal through the following steps.
Firstly, we design specific encoders for each view to extract the feature representations within that view. We then require the representations of the same user or POI to be consistent across different views. To this end, we employ a contrastive learning approach to maximize the similarity of representations of the same entity across different views while minimizing the similarity of representations between different entities. Specifically, we first define the user contrastive loss between the collaborative and trajectory views. For each user $u$, we calculate the similarity between their embeddings in the collaborative view, $e_u^{c}$, and the trajectory view, $e_u^{t}$, and define the loss through the following formula:
$$\mathcal{L}_{U}^{c,t} = -\sum_{u \in U} \log \frac{\exp\left(s(e_u^{c}, e_u^{t}) / \tau\right)}{\sum_{u' \in U} \exp\left(s(e_u^{c}, e_{u'}^{t}) / \tau\right)}$$
Here, $s(\cdot, \cdot)$ denotes the cosine similarity, $\tau$ is the temperature coefficient, and $U$ represents the set of all users. Similarly, we define the user contrastive loss between the collaborative view and the geographical view, where $e_u^{g}$ is the embedding of user $u$ in the geographical view:
$$\mathcal{L}_{U}^{c,g} = -\sum_{u \in U} \log \frac{\exp\left(s(e_u^{c}, e_u^{g}) / \tau\right)}{\sum_{u' \in U} \exp\left(s(e_u^{c}, e_{u'}^{g}) / \tau\right)}$$
Similarly, the user contrastive loss between the trajectory view and the geographical view can be expressed as
$$\mathcal{L}_{U}^{t,g} = -\sum_{u \in U} \log \frac{\exp\left(s(e_u^{t}, e_u^{g}) / \tau\right)}{\sum_{u' \in U} \exp\left(s(e_u^{t}, e_{u'}^{g}) / \tau\right)}$$
By introducing weighting coefficients for the pairwise terms, the final user contrastive loss is expressed as a weighted sum of the three pairwise user losses. A similar approach is applied to the POI embeddings: the POI contrastive losses between the collaborative and trajectory views, the collaborative and geographical views, and the trajectory and geographical views are computed in the same way, and the final POI contrastive loss is their weighted sum. Finally, the user and POI contrastive losses are combined to obtain the overall contrastive learning loss.
During the training process, the above contrastive loss is combined with the loss of the main task (such as the recommendation task) for optimization. By minimizing the total loss, the model not only learns the specific representations within each view but also enhances the consistency of representations across different views through contrastive learning. This approach enables the model to capture the shared features of users and POIs across different views, thereby improving the accuracy and robustness of the recommendation system.
In summary, our multi-view contrastive learning method enhances the model’s ability to understand multi-source data by performing contrastive learning on users and POIs across different views, thereby improving the performance of the recommendation system.
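A pairwise cross-view contrastive loss of the kind described in this section can be sketched as follows: the same entity in two views forms the positive pair, and all other entities in the second view act as negatives. This is a minimal illustration, assuming cosine similarity with a temperature and averaging over entities; the temperature, the pairwise weights, and the toy embeddings are hypothetical values, not settings from the experiments.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(view_a: Sequence[Vector], view_b: Sequence[Vector], tau: float = 0.2) -> float:
    """Mean contrastive loss; row i of view_a and row i of view_b are the positive pair."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / tau for other in view_b]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


def weighted_contrastive(inter, traj, geo, w_it=1.0, w_ig=1.0, w_tg=1.0, tau=0.2) -> float:
    """Weighted sum of the three pairwise losses (weights are hypothetical)."""
    return (w_it * info_nce(inter, traj, tau)
            + w_ig * info_nce(inter, geo, tau)
            + w_tg * info_nce(traj, geo, tau))


# Toy embeddings for two users in two views.
view_a: List[List[float]] = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]      # each user close to itself across views
shuffled = [[0.1, 1.0], [1.0, 0.1]]     # users swapped in the second view
loss_aligned = info_nce(view_a, aligned)
loss_shuffled = info_nce(view_a, shuffled)
```

The loss is small when each entity's two views agree and large when they are mismatched, which is the consistency signal the contrastive objective is meant to provide.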
4.5. Prediction and Optimization
In recommendation systems, we learn user and POI (Point of Interest) representations by combining information from different views (e.g., the interaction view, trajectory view, and geographical view). We design a new loss function based on PolyLoss to improve the recommendation task and accelerate convergence. Specifically, we fuse the view-specific embeddings of user $u$ and POI $l$ into the final embeddings $e_u$ and $e_l$, respectively, and compute their interaction score $\hat{y}_{u,l}$ from these two embeddings. PolyLoss is then applied to the interaction scores on top of the original cross-entropy loss: it improves the traditional cross-entropy loss by introducing a polynomial weighting factor $p$, where $p$ is the degree of the polynomial, typically set to a positive integer greater than 1. The standard cross-entropy loss is recovered as a special case of this loss. The recommendation loss function is defined from this polynomial loss formula.
We introduce self-supervised contrastive learning loss to enhance the collaborative information between different views. The contrastive loss improves the model’s robustness by maximizing the similarity between different views. Specifically, we compute the contrastive losses between different views, such as the contrastive loss between the interaction view and the trajectory view
, the contrastive loss between the interaction view and the geographical view
, and the contrastive loss between the trajectory view and the geographical view
, which are expressed as follows:
where
denotes cosine similarity, and
is the temperature hyperparameter. Similarly, we can define
and
to compute the contrastive losses between other views. Finally, the weighted sum of all the contrastive losses forms the total loss for self-supervised learning:
where
are hyperparameters that control the weights of the contrastive losses between different views. Finally, we combine the recommendation loss, contrastive loss, and regularization term into a multi-task learning objective function:
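The objective described in this subsection can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the exact formulation used by MVHGAT: the interaction score is assumed to be an inner product of the final embeddings, the polynomial loss is assumed to take the Poly-1 form (cross-entropy plus epsilon times one minus the predicted probability of the target), and the contrastive term is assumed to be an InfoNCE loss. All function names, weights, and toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; assumes both vectors are non-zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def poly1_loss(scores, target, epsilon=1.0):
    """Assumed Poly-1 form: cross-entropy + epsilon * (1 - p_target)."""
    p_t = softmax(scores)[target]
    return -math.log(p_t) + epsilon * (1.0 - p_t)

def info_nce(view_a, view_b, tau=0.2):
    """Assumed InfoNCE: row i of view_a is the positive of row i of view_b;
    every other row of view_b acts as a negative."""
    loss = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / tau for other in view_b]
        loss += -math.log(softmax(logits)[i])
    return loss / len(view_a)

def total_loss(rec, cl_losses, params, view_weights=(1.0, 1.0, 1.0),
               lam_cl=0.1, lam_reg=1e-4):
    """Recommendation loss + weighted contrastive losses + L2 regularization.
    The weights here are placeholders, not values from the paper."""
    ssl = sum(w * l for w, l in zip(view_weights, cl_losses))
    reg = sum(p * p for p in params)
    return rec + lam_cl * ssl + lam_reg * reg

# Toy example: one user, three candidate POIs, inner-product scores (assumed).
user = [0.5, 1.0]
pois = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [dot(user, p) for p in pois]
rec = poly1_loss(scores, target=2)
```

With `epsilon=0.0`, `poly1_loss` reduces to the ordinary cross-entropy, which mirrors the special case mentioned in the text.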
5. Experiments
5.1. Experimental Setting
- (1) Datasets. To validate the effectiveness of our proposed method, we conducted extensive experiments on three publicly available location-based social network (LBSN) datasets: Foursquare-NYC (referred to as NYC), Foursquare-TKY (referred to as TKY) [28], and Gowalla [29]. We preprocessed the datasets by first filtering out less popular POIs so that the remaining POIs were representative enough for the recommendation task. We then segmented each user’s trajectory data into daily sessions and removed sessions with excessively short durations, which reduces the impact of noisy data on model training. Finally, we adopted a common train–test split strategy, using the first 80% of each user’s sessions for training and the remaining 20% for testing. This ensures that the model learns users’ behavioral patterns during training and is evaluated on later, held-out sessions. The statistical details and characteristics of each dataset are summarized in Table 1.
- (2) Evaluation Metrics. To remain consistent with most existing next POI recommendation methods, we adopted two widely used evaluation metrics: Recall@K and Normalized Discounted Cumulative Gain (NDCG@K). Recall@K measures whether the ground-truth POI is covered by the top K recommended items, while NDCG@K measures the ranking quality of the recommendation list. To ensure the fairness of the experimental results, we conducted 10 runs and reported the average values of Recall@K and NDCG@K at each evaluated cutoff K.
- (3) Baselines. We compared our proposed method with several typical next POI recommendation approaches: (1) a statistical method, UserPop; (2) an RNN-based method, STGN; (3) a self-attention-based method, STAN; (4) methods based on Graph Neural Networks (GNNs) or hypergraphs, including LightGCN, GETNext, and MSTHN; and (5) graph or hypergraph contrastive learning-based methods, such as HCCF and ASTHL.
- (4) Parameter Settings. We conducted our experiments using PyTorch 1.12.1 on a 24 GB Nvidia RTX 3090 GPU. For the baseline methods, we first followed the settings in the original papers and fine-tuned the hyperparameters of each model on the three datasets. For the MVHGAT model, we used the Adam optimizer [31] with a learning rate of , a weight decay of , and selected the hyperedge dropout rate from . The dimensions of the user and POI embeddings were both set to 128. Empirically, we chose 1.5 km as the distance threshold for the datasets. The number of layers in the hypergraph convolutional network was selected from 1 to 4. The hyperparameters and for the regularization terms were chosen from to balance the loss.
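The chronological split in (1) and the metrics in (2) can be sketched as follows. This is a minimal illustration, assuming each test case has a single ground-truth next POI (so the ideal DCG equals 1) and that the 80% cut-off is rounded down; the function names and toy data are hypothetical.

```python
import math

def split_sessions(sessions, train_ratio=0.8):
    """Split one user's chronologically ordered sessions into train and test."""
    cut = int(len(sessions) * train_ratio)  # rounding down is an assumption
    return sessions[:cut], sessions[cut:]

def recall_at_k(ranked, target, k):
    """1 if the ground-truth POI is among the top-k recommendations, else 0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG with one relevant item: 1 / log2(rank + 1) for a 1-based rank."""
    top_k = ranked[:k]
    if target not in top_k:
        return 0.0
    position = top_k.index(target)  # 0-based position in the list
    return 1.0 / math.log2(position + 2)

def evaluate(cases, k):
    """Average Recall@k and NDCG@k over (ranked_list, ground_truth) pairs."""
    n = len(cases)
    recall = sum(recall_at_k(r, t, k) for r, t in cases) / n
    ndcg = sum(ndcg_at_k(r, t, k) for r, t in cases) / n
    return recall, ndcg

# Toy data: five daily sessions for one user, and two test cases.
sessions = [["a", "b"], ["c"], ["d", "e"], ["f"], ["g", "h"]]
train, test = split_sessions(sessions)

cases = [
    (["p3", "p1", "p2"], "p1"),  # hit at position 2
    (["p2", "p3", "p1"], "p9"),  # miss
]
recall, ndcg = evaluate(cases, k=2)
```

Because the split is taken per user and in time order, the test sessions always come after the training sessions of the same user.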
5.2. Performance Comparison
The experimental results of all methods are reported in Table 2. Based on these results, we make the following observations:
Our proposed MVHGAT achieves the best performance on all datasets. Across the three datasets, MVHGAT consistently outperforms other baseline methods on all evaluation metrics. To ensure a more comprehensive comparison, we incorporated additional recent works in our evaluation, providing a more thorough analysis of how MVHGAT compares with state-of-the-art models. We attribute these improvements to the following factors.
Firstly, by learning from the interaction, trajectory, and geographical views, MVHGAT effectively captures users’ multi-view personalized preferences, improving its ability to address the data sparsity problem. By introducing tailored hypergraph convolutional networks for different hypergraphs, MVHGAT captures the latent information within each hypergraph. Compared to recent multi-view POI recommendation models, such as MSTHG [32] and MSHTN [33], MVHGAT demonstrates superior adaptability in modeling complex user–POI relationships, particularly in sparse data scenarios.
Secondly, MVHGAT employs multi-view contrastive learning to enhance each view’s user and POI representations. This strengthens the signals during self-supervised learning, allowing complementary learning across different views and producing fused recommendations. Among various POI recommendation methods, incorporating spatiotemporal information significantly improves the recommendation performance. For example, the GNN-based GETNext, which leverages spatiotemporal information, performs significantly better than LightGCN, which does not use such information, with Recall@10 improving by approximately 23% across the three datasets. Moreover, contrastive learning-based methods, such as HCCF, have demonstrated improved generalization capability in recommendation tasks. Our results indicate that MVHGAT further enhances these advantages by integrating hypergraph structures with contrastive learning, leading to a 6.2% improvement in NDCG@10 over HCCF.
Among the hypergraph neural network-based methods, MSTHN achieves better results than HCCF across all three datasets. However, MVHGAT further learns the latent high-order relationships between users and POIs, resulting in more personalized representations and outperforming existing hypergraph models, including MSTHN, on most metrics. To validate the significance of these improvements, we conducted paired t-tests on our results. The tests confirm that the improvements of MVHGAT over MSTHN and the other baselines are statistically significant (p < 0.05), reinforcing the robustness of our findings. These results highlight the importance of incorporating multi-view representation modeling in recommendations.
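A paired t-test of this kind can be sketched as follows. The per-run scores are made-up placeholders, not values from Table 2, and the variable names are hypothetical. The threshold 2.262 is the two-sided 5% critical value of Student's t distribution with 9 degrees of freedom, which corresponds to 10 paired runs.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n))

# Two-sided critical value at alpha = 0.05 for df = 9 (10 paired runs).
T_CRIT_DF9 = 2.262

# Placeholder per-run metric values for two models (illustrative only).
model_a_runs = [0.412, 0.409, 0.415, 0.411, 0.408,
                0.414, 0.410, 0.413, 0.409, 0.412]
model_b_runs = [0.401, 0.399, 0.404, 0.402, 0.397,
                0.405, 0.400, 0.403, 0.398, 0.402]

t_value = paired_t_statistic(model_a_runs, model_b_runs)
significant = abs(t_value) > T_CRIT_DF9
```

Pairing the runs (for example, by random seed) removes run-to-run variation that both models share, which is why the test operates on the differences rather than on the two samples separately.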
Furthermore, the results show that methods using non-sequential information outperform sequence-based methods. For example, the RNN-based STGN performs worse than STAN, which can handle non-continuous POI sequences. MVHGAT better captures high-order collaborative signals between POIs, mitigating the data sparsity and over-smoothing issues, and surpasses GETNext. This aligns with recent findings in sequential POI recommendation research, where non-sequential models such as STAN+ [34] have demonstrated superior performance in capturing long-range dependencies.
Contrastive learning methods based on graphs or hypergraphs generally achieve better results than traditional GNN methods, such as LightGCN. These methods can capture diverse data features from multiple views, such as user behavior patterns across different times and locations or relationships between different POIs. By integrating this information, these methods provide a more comprehensive understanding of user interests and needs [35]. Compared to single hypergraph convolutional models, MVHGAT adopts tailored hypergraph networks to learn from different types of hypergraphs, effectively capturing the spatiotemporal and contextual information of POIs. As a result, it outperforms ASTHL in recommendation effectiveness.
Although MVHGAT shows slightly weaker performance on datasets with frequent interactions (e.g., TKY), it improves significantly on sparse datasets, particularly in addressing the data sparsity and over-smoothing issues. By combining multi-view contrastive learning with several types of hypergraph convolutional networks, MVHGAT validates the effectiveness of multi-view representation modeling in recommendation systems.
5.3. Ablation Study
In the effectiveness analysis of the MVHGAT model, we conducted ablation experiments to explore the contribution of each component. In the experiments, we removed the following four key components individually:
w/o I: Removing the Anchor Attention Interactive Hypergraph Convolution (Interactive View).
w/o T: Removing the Compressed Activation Attention Trajectory Hypergraph Convolution (Trajectory View).
w/o P: Removing the Depth-Separable Geographical Convolution (Location View).
w/o MCL: Removing Multi-view Weighted Contrastive Learning.
Table 3 presents the experimental results, from which we draw the following conclusions:
First, the model performance drops after removing the interactive view convolution. This demonstrates that this component is crucial for capturing high-order interactive relationships between users and POIs, enhancing the recommendation effectiveness. Second, removing the trajectory view convolution results in a slight performance decline, but its impact is minor compared to the removal of the interactive view convolution. Since the Gowalla dataset is sparser than the NYC and TKY datasets, capturing global trajectory relationships helps alleviate the data sparsity problem. This result aligns with that of GETNext, which also considers global trajectory influences. This indicates that the trajectory view convolution helps capture variations in user behavior paths, though its contribution is weaker than that of the interactive view.
Removing the location view convolution significantly decreases performance on the NYC and TKY datasets because the model then fails to learn the latent high-order relationships in the location view effectively. However, its impact is minimal on the Gowalla dataset, likely because Gowalla is a social location check-in application where check-ins are more spontaneous and users are less sensitive to location preferences. This highlights the importance of the location view convolution when location preferences influence user behavior.
Lastly, removing multi-view weighted contrastive learning also leads to performance degradation, indicating that this component effectively enhances the complementary effects between different views and improves the model’s understanding of user and POI preferences. Through these ablation experiments, we observe the importance of each view and component in MVHGAT. They complement each other and collectively contribute to the model’s performance improvement.
5.4. Hyperparameter Analysis
We analyzed the impact of the number of hypergraph convolutional layers and the learning rate on MVHGAT from a qualitative perspective.
Impact of the number of layers: To evaluate the effect of stacking hypergraph convolutional layers, we conducted experiments with the number of layers ranging from 1 to 5. As shown in Figure 2, when using 3 layers on the NYC and TKY datasets, MVHGAT achieves a good balance in Recall@10 and NDCG@10, demonstrating its ability to capture high-order collaborative signals effectively. On the Gowalla dataset, the model performs best with 1 layer. The performance degradation with more layers may be attributed to the introduction of excessive noise.
Impact of the learning rate: To analyze the impact of the learning rate on the MVHGAT model, we conducted experiments with different learning rates in the range . By observing the model’s performance on the NYC, TKY, and Gowalla datasets under different learning rates, we gained insights into the role of the learning rate during the training process.
The experimental results show that when the learning rate is set to 0.01, MVHGAT performs best in Recall@10 and NDCG@10 on the NYC and TKY datasets. This indicates that a moderate learning rate helps the model converge better during training and capture high-order collaborative signals effectively. A smaller learning rate (e.g., 0.001) results in slow convergence, preventing the model from reaching optimal performance. Conversely, a larger learning rate (e.g., 0.1) causes unstable training with large fluctuations during convergence, ultimately leading to performance degradation.
5.5. In-Depth Analysis of MVHGAT
To explore the effectiveness of our adjusted hypergraph convolutional network aggregation and propagation methods, we kept other parts of the MVHGAT model unchanged and replaced the hypergraph convolutional network in each view with LightGCN [30]. According to the experimental results in Table 4, replacing the specific hypergraph convolutional network in each view with LightGCN results in performance degradation to varying degrees.
When replacing the hypergraph convolutional network in the interaction view with LightGCN, the performance in terms of Recall@10 and NDCG@10 significantly declines. This is likely because LightGCN fails to capture high-order collaborative signals and struggles with the cold start problem. The feature aggregation method in LightGCN uses simple weighted propagation through the adjacency matrix, which prevents it from effectively capturing the higher-order dependencies between users and POIs in complex hypergraph structures, thereby impairing recommendation performance.
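For reference, LightGCN-style propagation can be sketched as below: each layer multiplies the embeddings by the symmetrically normalized adjacency matrix, with no feature transformation or nonlinearity, and the final embedding is the average over all layers. The toy graph and all names are hypothetical; the sketch illustrates the baseline that is swapped in, not the hypergraph convolutions of MVHGAT.

```python
import math

def lightgcn_propagate(adj, emb, num_layers=2):
    """Propagate embeddings over a graph given as a dense 0/1 adjacency matrix.

    Each layer computes, for node u, the sum over neighbors v of
    emb[v] / sqrt(deg(u) * deg(v)). The output is the mean of the
    layer-0..layer-K embeddings.
    """
    n = len(adj)
    dim = len(emb[0])
    deg = [sum(row) for row in adj]
    layers = [emb]
    current = emb
    for _ in range(num_layers):
        nxt = []
        for u in range(n):
            vec = [0.0] * dim
            for v in range(n):
                if adj[u][v]:
                    weight = 1.0 / math.sqrt(deg[u] * deg[v])
                    for d in range(dim):
                        vec[d] += weight * current[v][d]
            nxt.append(vec)
        layers.append(nxt)
        current = nxt
    # Uniform average over all layers, including the input embeddings.
    return [
        [sum(layer[u][d] for layer in layers) / len(layers) for d in range(dim)]
        for u in range(n)
    ]

# Toy bipartite graph: node 0 is a user, nodes 1 and 2 are POIs it visited.
adj = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
out = lightgcn_propagate(adj, emb, num_layers=1)
```

Every message here travels along a pairwise user–POI edge, so relations that involve several nodes at once can only be approximated by stacking layers, whereas a hyperedge connects all of its member nodes in a single step.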
Similarly, replacing the hypergraph convolutional network in the trajectory view with LightGCN also leads to a performance drop. The reason is that LightGCN, based on undirected message passing for feature aggregation, cannot thoroughly learn directed trajectory relationships. As a result, using such models instead of hypergraph convolutional networks causes the model to fail in effectively capturing the directed information in the trajectory view, thereby impacting overall performance.
Our designed Compressed Activation Attention Trajectory Hypergraph Convolution can effectively handle directed hypergraph structures by leveraging the high-order connectivity information in hypergraphs to capture complex relationships between nodes. In contrast, undirected models often cannot fully exploit these advantages in such tasks.
6. Conclusions
The Multi-View Contrastive Fusion Hypergraph Learning Model (MVHGAT) proposed in this study effectively addresses the challenges in the next POI recommendation task, particularly in terms of data sparsity and over-smoothing. By constructing and integrating hypergraphs from three perspectives (interaction, trajectory, and geographical), the model captures the complex relationships and high-order dependencies between users and POIs. Compared to existing recommendation methods, MVHGAT enhances the consistency and discriminative power of user and POI representations through multi-view contrastive learning, significantly improving the accuracy and robustness of the recommendation system.
Experimental results demonstrate that MVHGAT outperforms other traditional and Graph Neural Network-based methods across three public datasets. This outcome validates the importance of multi-view representation modeling in recommendation systems, especially in handling complex recommendation tasks with multidimensional user preferences, leading to more personalized and accurate recommendations. By optimizing the collaborative effects between different views, MVHGAT overcomes the limitations of traditional methods that fail to integrate multi-dimensional information, achieving superior recommendation performance.
In future work, the authors plan to explore complementary feature learning methods to better model the latent intentions underlying user–POI interactions, enhance the interpretability of recommendations, and further improve the model’s performance in complex application scenarios.