Review

Large Language Models Meet Graph Neural Networks: A Perspective of Graph Mining

1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 54th Research Institute of CETC, Shijiazhuang 050081, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(7), 1147; https://doi.org/10.3390/math13071147
Submission received: 16 January 2025 / Revised: 26 March 2025 / Accepted: 27 March 2025 / Published: 31 March 2025
(This article belongs to the Special Issue Advances in Algorithm Design and Machine Learning)

Abstract

Graph mining is an important area in data mining and machine learning that involves extracting valuable information from graph-structured data. In recent years, significant progress has been made in this field through the development of graph neural networks (GNNs). However, GNNs still struggle to generalize to diverse graph data. To address this issue, large language models (LLMs), with their superior semantic understanding, could provide new solutions for graph mining tasks. In this review, we systematically survey techniques for combining and applying LLMs and GNNs and present a novel taxonomy for research in this interdisciplinary field, comprising three main categories: GNN-driving-LLM (GdL), LLM-driving-GNN (LdG), and GNN-LLM-co-driving (GLcd). Within this framework, we describe the capabilities of LLMs in enhancing graph feature extraction as well as in improving the effectiveness of downstream tasks such as node classification, link prediction, and community detection. Although LLMs have demonstrated great potential in handling graph-structured data, their high computational requirements and complexity remain challenges. Future research needs to continue exploring how to efficiently fuse LLMs and GNNs to achieve more powerful graph learning and reasoning capabilities and to provide new impetus for the development of graph mining techniques.

1. Introduction

Graphs are data structures used to describe relationships between objects, and they are widely used in many domains, such as social networks [1], computer networks, and molecular structures. Some of these graphs contain information on hundreds of millions of nodes, yet much of that information is redundant or irrelevant. Graph mining investigates how to extract relevant and valuable knowledge from such large-scale graph data using graph mining models.
Over the past decade, graph mining techniques have evolved considerably, yielding many significant results and contributing greatly to the development of the field. Early research was inspired by word vector techniques [2]: Perozzi et al. [3] first proposed DeepWalk, a random walk-based graph embedding method; Node2Vec [4], proposed by Grover and Leskovec, generates node embeddings through biased random walk strategies for tasks such as node classification and link prediction; and Graph2Vec [5] embeds the entire graph into a vector space for graph classification tasks. For more efficient graph representation learning, the graph neural networks (GNNs) [6] proposed by Kipf and Welling pioneered a new research paradigm. GNNs efficiently capture structural information and dependencies in graph-structured data through information propagation and aggregation mechanisms, which enable the model to make precise predictions on complex graph structures. In recent years, various GNN architectures with different information propagation and aggregation methods have been developed. For example, Graph Convolutional Networks (GCNs) [7] process graph data through spectral convolution and are widely used in node classification and graph classification tasks; Graph Attention Networks (GATs) [8] introduce an attention mechanism to adaptively assign importance to different neighboring nodes, enhancing the expressive power of the model; and GraphSAGE [9] efficiently processes large-scale graph data by sampling and aggregating features from neighboring nodes. In addition, for heterogeneous graph analysis, the Heterogeneous Graph Attention Network (HAN) [10] proposed by Wang et al. aggregates information across different types of nodes and edges through the attention mechanism, while Zhang et al.'s Heterogeneous Graph Neural Network (HetGNN) [11] learns embeddings by sampling various types of nodes and edges, enabling it to manage complex heterogeneous graph structures. These results not only demonstrate the prospect of wide application of graph mining techniques but also promote the development of related research fields.
In 2022, the emergence of large language models (LLMs), represented by ChatGPT [12], revolutionized the field of natural language processing (NLP) as well as the broader domain of artificial intelligence research. Large language models are pre-trained on vast amounts of text, learn rich language expressions and a huge amount of real-world knowledge, and possess excellent semantic understanding capabilities [13]. For example, BERT [14] uses a bidirectional attention mechanism to capture the contextual information of text, which enables the model to perform well on various NLP tasks such as question answering, named-entity recognition, and sentence classification. GPT-3 [15], with 175 billion parameters, is pre-trained in an autoregressive manner and is capable of performing a wide range of tasks, from text generation and translation to dialog systems. Its strength lies in its ability to perform new tasks without fine-tuning, given only a small number of examples or prompts, demonstrating impressive zero-shot and few-shot learning capabilities. These LLMs can be applied to various downstream tasks with little additional training. Although large language models were initially developed for natural language processing, researchers have in recent years also been exploring their application to multimodal data. For instance, the DALL-E [16] model, which combines image and text, can generate images from textual descriptions by performing self-supervised learning on paired image-text corpora. These multimodal models demonstrate outstanding potential for processing and understanding different data types.
Given the disruptive potential of large models, many researchers in graph mining have focused on them in the last two years, expecting them to bring new developments to the field. Zhang et al. [17] discuss the challenges and opportunities presented by the combination of graphs and LLMs and showcase the potential of these models across various application domains. Chen et al. [18] explore the potential of utilizing LLMs in graph learning tasks and investigate two possible approaches: LLMs as Enhancers and LLMs as Predictors. Liu et al. [19] introduce the concept of Graph Foundation Models (GFMs) and provide a taxonomy and review of existing work related to GFMs. These studies suggest that integrating LLMs with graph neural networks can enhance various downstream graph mining tasks. The reason is that, although GNNs excel at capturing structural information, they have limitations in expressiveness and generalizability [20], and their semantically constrained embeddings often fail to fully characterize node features when node information is complex. Conversely, LLMs are good at processing complex texts but are often deficient in handling structural information. Combining the strengths of both can significantly improve the accuracy of graph mining.
To the best of our knowledge, existing reviews on the application of large models in graph mining mainly categorize LLMs simply as enhancers or predictors [18]. However, such a categorization treats LLMs and GNNs as independent models combined in a simple, superficial way and fails to delve into their profound fusion potential in the graph mining domain. Therefore, we propose a new classification framework, as shown in Figure 1, based on the main driving component in graph mining models, with three groups: GNN-driving-LLM (GdL), LLM-driving-GNN (LdG), and GNN-LLM-co-driving (GLcd). In this framework, the GdL mode treats the GNN as the central task processing module, with the LLM playing an assisting role in specific tasks or scenarios, such as natural language interpretation or feature extraction. Conversely, the LdG mode places the LLM at the core and uses the GNN as an auxiliary tool for processing and guiding graph-structured data to enhance the model's performance on complex graph data. In the GLcd mode, the GNN and LLM work closely together to form an interdependent joint model that collaboratively solves the corresponding graph mining tasks. Such a classification not only helps to fully understand the deep integration of LLMs and GNNs but also provides new ideas for future research directions.
We focus mainly on screening the literature that combines large language models with graph neural networks to solve graph mining tasks. The article selection criteria include the following aspects. First, the included literature must explicitly explore the application of large language models and graph neural networks to graph mining; works that only discuss a single technology or are weakly related to the topic are excluded. Second, the retrieved literature is mainly limited to articles published in the past three years to ensure the timeliness and cutting-edge nature of the research content; however, the time range is relaxed for early literature of high relevance and theoretical value. In addition, this study guided the screening and analysis of the literature by pre-set research questions, namely, exploring the collaborative mechanisms of large language models and graph neural networks in graph mining tasks and the scenarios in which their respective advantages come into play. Through these strict and precise criteria, we aim to build a transparent and representative literature base that provides a solid theoretical and empirical foundation for subsequent research.

2. Preliminary

2.1. Graph Mining

Graph mining tasks are important in data knowledge discovery, aiming to extract valuable information from graph-structured data. Graph-structured data consist of nodes and edges, where nodes represent entities and edges represent relationships between those entities. Graph mining is widely applied in fields such as social network analysis [21], recommender systems [22], chemistry [23], and knowledge graphs [24].
Common graph mining tasks include node classification, link prediction, graph clustering, graph matching, community detection, frequent subgraph mining, etc. Node classification utilizes labeled nodes as training data and predicts the classes of unlabeled nodes in the graph. Common techniques for node classification include random walk-based node embeddings [4] and Graph Convolutional Networks (GCNs) [7]. It is widely applied in social network analysis for user attribute prediction [25] and in bioinformatics for protein function prediction [26]. Link prediction is used to predict potential future edges, with important applications in recommender systems (e.g., friend recommendation, item recommendation). Common link prediction algorithms include neighborhood-based algorithms [27], path-dependent models [28], deep learning methods [29], and so on. Graph clustering [30] and graph matching [31] are concerned, respectively, with the grouping of nodes and with subgraph correspondence between different graphs. Graph clustering groups graph nodes such that nodes within the same group are more tightly connected; it is used for user group division in social networks [32] and functional module discovery in biological networks [33]. Graph matching finds corresponding subgraphs between different graphs, which is commonly used in chemical molecular structure comparison [34] and pattern recognition. Community detection [35] identifies communities in the graph, that is, subsets of nodes that are closely connected internally but have relatively few connections to nodes outside the community; commonly used algorithms include the Girvan–Newman algorithm [36], the Louvain method [37], and so on. It can be used to discover user groups with similar interests or backgrounds in social network analysis [38]. Frequent subgraph mining [39] identifies frequently occurring subgraphs in a set of graphs, which can be used in chemoinformatics to discover common molecular structures. By analyzing a large number of molecular diagrams, frequent subgraph mining can reveal which molecular fragments appear repeatedly across multiple compounds, providing valuable insights for fields such as bioinformatics and chemoinformatics [40].
Graph neural networks are neural networks designed specifically for graph-structured data. Unlike traditional neural networks, GNNs can directly handle the complex relationships between nodes and edges and capture the structural information in the graph. The core idea is to update the representation of each node by aggregating the information of neighboring nodes layer by layer through the message-passing mechanism between nodes. In each layer, the node representation is updated by the following formula:
h_u^{(k+1)} = \phi\left( x_u, \bigoplus_{v \in \mathcal{N}_u} \psi(x_u, x_v) \right),
where h_u^{(k+1)} is the representation of node u at layer k + 1, x_u is the node feature, \mathcal{N}_u is the set of neighbors of node u, \phi and \psi are differentiable functions, and \bigoplus is an aggregation operation (e.g., sum, mean, or max). Here, \psi is a trainable message function that computes the message passed from a neighbor v to u, and the aggregation can be viewed as a type of message passing within the graph [41].
By stacking multiple message-passing layers, the graph neural network gradually expands its receptive field, allowing each node to aggregate information from more distant neighbors. This structure allows GNNs to efficiently learn global and local structural features of graphs and is widely used in tasks such as node classification, graph classification, and link prediction.
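To make the message-passing formula above concrete, the following is a minimal NumPy sketch of a mean-aggregation message-passing layer; the weight matrices, the tanh nonlinearity, and the toy path graph are illustrative assumptions rather than a specific architecture from the works discussed here.

```python
import numpy as np

def message_passing_layer(X, adj, W_self, W_neigh):
    """One message-passing layer: aggregate neighbor features (mean),
    then combine with the node's own features and apply a nonlinearity.
    X:    (N, d) node feature matrix
    adj:  (N, N) binary adjacency matrix
    W_self, W_neigh: (d, d') weight matrices (together they play the role of phi)
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)     # avoid division by zero
    neigh_mean = (adj @ X) / deg                         # aggregation over N(u)
    return np.tanh(X @ W_self + neigh_mean @ W_neigh)    # combine self and neighbor messages

# Toy example: 4 nodes on a path graph, 3-dimensional features
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
H1 = message_passing_layer(X, adj, rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
H2 = message_passing_layer(H1, adj, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))  # stacking widens the receptive field
print(H2.shape)  # (4, 8)
```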

2.2. Large Language Model

A large language model is a neural network with a very large number of parameters (usually billions of weights or more). In recent years, thanks to the transformer architecture and pre-training on large-scale text data, LLMs have made significant progress in natural language processing.
The transformer model [42], proposed by Vaswani et al. in 2017, is based on the attention mechanism, which greatly improves training efficiency and performance. Transformers consist of two main components: the encoder and the decoder. The encoder uses multi-head attention and positional encoding to process the input sequence, while the decoder combines the encoder's output with its own multi-head attention to generate the output sequence step by step. The fundamental unit is scaled dot-product attention, which underlies multi-head self-attention and can be expressed by the following formula:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V,
where Q, K, and V denote the query, key, and value matrices, respectively, and d_k is the dimension of the key vectors. The multi-head mechanism computes several different attention heads in parallel and combines them:
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}, \quad \text{where}\ \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),
where the projections are parameter matrices W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}, W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}, W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}, and W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}.
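As a concrete illustration of the two formulas above, here is a minimal NumPy sketch of scaled dot-product attention with a simple multi-head wrapper; splitting one large projection matrix into per-head column slices is an illustrative simplification equivalent to using separate W_i^Q, W_i^K, W_i^V matrices, and the toy dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Project X into h heads, attend per head, concatenate, project with W_o."""
    N, d_model = X.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)               # column slice = per-head projection
        heads.append(attention(X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention over a sequence of 5 tokens with d_model = 16 and h = 4 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h=4).shape)  # (5, 16)
```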
The types and applications of LLMs are constantly evolving with the development of architectures and training methods. Current LLMs can be classified into autoregressive models, masked language models, encoder–decoder models, contrastive learning models, multimodal models, and others. Autoregressive models generate text by predicting the next word in a sequence; such models produce coherent text and perform very well in generative tasks, with representative examples including GPT [43], GPT-2 [44], GPT-3 [15], and GPT-4 [45]. Masked language models, on the other hand, mask words in a given sentence and train the model to predict these masked words, thereby learning contextual representations. Representative masked language models include BERT [14], RoBERTa [46], and XLNet [47]. Encoder–decoder models achieve a high degree of parallelization and dramatically improve computational efficiency by encoding the input text into a contextual representation and decoding it to generate the target text. Classic models of this type include BART [48] and T5 [49]. Contrastive learning models, such as SimCLR [50], train the model to differentiate between positive and negative samples by constructing pairs of them, so that the model can capture relevant features and similarities in the data. Finally, multimodal models are able to process information from multiple modalities (including images, videos, and texts) to enhance performance on multimodal tasks; examples include CLIP [51] and DALL-E [16]. The continuous development of large language models not only enhances the ability of computers in natural language processing but also provides new ideas and methods for solving broader and more complex data analysis and processing tasks.

3. Techniques for Combining LLMs with GNNs

Regarding techniques for combining LLMs with GNNs, as shown in Figure 2, we propose a novel taxonomy comprising GNN-driving-LLM, LLM-driving-GNN, and GNN-LLM-co-driving.

3.1. GNN-Driving-LLM

In graph mining, GNNs are a class of deep learning models specialized in processing graph data. By combining the structural information and node features of the graph, they are able to effectively learn representations of nodes, edges, and the graph as a whole. Text-attributed graphs (TAGs), in which the attributes of nodes exist in the form of text, are ubiquitous in graph machine learning research, appearing in product networks [52], social networks [1,53], and citation networks [54,55], where textual attributes provide key semantic information for the graph mining task. Therefore, when dealing with such graphs, the structure of the graph, the textual information, and their interrelationships must be considered simultaneously.
Traditionally, the processing of node text attributes often relies on shallow embedding methods, such as the classic Cora dataset [56], which only provides embedding features based on bag-of-words models. This approach is coarse in semantic understanding, leading to the limited performance of GNNs in processing textual attribute graphs. However, with the development of LLMs, their powerful text processing and semantic understanding capabilities provide a new solution to this problem. LLMs can extract richer semantic features from the textual attributes of nodes and generate additional auxiliary information, such as attribute interpretations and pseudo-labels, which provide more meaningful semantic support for node embeddings in graph mining tasks. In previous research [57,58], techniques combining language models (LMs) and GNNs have been investigated to use LMs for encoding and providing them as node features to GNNs, and the introduction of extensive language modeling further enhances the effectiveness of this approach.
In the GdL mode, GNNs remain the central information processing unit, but the text attributes are deeply parsed and embedded by incorporating LLMs, which makes the GNNs more accurate in capturing the semantic information of the nodes and thus improves performance on downstream tasks. As shown in Figure 3a, LLMs generate additional information from the textual attributes of the nodes. These LLM-extracted features are then fed into a smaller language model to generate enhanced node embeddings. A typical example is the TAPE [59] model, which first uses LLMs to make predictions on the node text (such as paper titles and abstracts) and to generate corresponding interpretations. Through task-specific prompts, the LLMs produce categorization predictions and explanation texts. The explanation text generated by the LLMs is then fed into a smaller language model, processed through fine-tuning, and transformed into node features (original text features h_orig, explanation features h_expl, and prediction features h_pred). Here, h_orig and h_expl are the text-embedding matrices obtained by encoding the original text X and the explanation text S generated by the LLMs with pre-trained LMs, and h_pred is the matrix of the top-k prediction rankings provided by the LLMs for each node.
The combination of these three classes of features, h_TAPE = {h_orig, h_expl, h_pred}, is used for downstream GNN training. Finally, the GNN models are trained on the generated features to perform the node classification task. The process can be described by the following formula:
s_i = \mathrm{LLM}(x_i, p), \quad h_{\mathrm{orig}} = \mathrm{LM}_{\mathrm{orig}}(X), \quad h_{\mathrm{expl}} = \mathrm{LM}_{\mathrm{expl}}(S), \quad \hat{y} = \mathrm{GNN}(h_{\mathrm{TAPE}}, A),
where the LLM takes the raw text of node i, x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,q_i}), as input and generates a sequence of explanation text s_i = (s_{i,1}, s_{i,2}, \ldots, s_{i,m_i}) as output, with p being the prompt for the LLM. A is the adjacency matrix of the graph, and \hat{y} is the node representation matrix obtained by the GNN, which is used to predict node labels. In this article, the LLM is treated as an autoregressive language model designed to predict the next word in a sequence s = (s_1, s_2, \ldots, s_m). Its working mechanism can be expressed by the following equation:
p(s \mid \hat{x}) = \prod_{i=1}^{m} p(s_i \mid s_1, \ldots, s_{i-1}, \hat{x}),
where \hat{x} = (p, x_1, \ldots, x_q) is the new sequence formed by concatenating the prompt p with the input sequence.
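To illustrate the GdL flow in Formula (4) end to end, the sketch below wires together stand-ins for the three components; llm_explain, lm_encode, and gnn_predict are hypothetical placeholders for an actual LLM call, a fine-tuned LM encoder, and a trained GNN, and the prediction features h_pred are omitted for brevity.

```python
import numpy as np

def llm_explain(node_text, prompt):
    """Stand-in for s_i = LLM(x_i, p): return an explanation string for the node text."""
    return f"{prompt} This node discusses: {node_text[:40]}"

def lm_encode(texts, dim=16, seed=0):
    """Stand-in for LM_orig / LM_expl: map each text to a dense embedding."""
    rng = np.random.default_rng(seed)
    return np.stack([rng.normal(size=dim) for _ in texts])

def gnn_predict(H, A):
    """Stand-in for GNN(h_TAPE, A): one mean-aggregation layer plus a linear readout."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    return ((A @ H) / deg + H) @ np.ones((H.shape[1], 1))  # toy score per node

texts = ["Graph neural networks for citation data", "Language models as feature encoders"]
prompt = "Classify the paper and explain why."
explanations = [llm_explain(t, prompt) for t in texts]          # s_i
h_orig, h_expl = lm_encode(texts), lm_encode(explanations, seed=1)
h_tape = np.concatenate([h_orig, h_expl], axis=1)               # h_TAPE (h_pred omitted)
A = np.array([[0, 1], [1, 0]], dtype=float)                     # toy 2-node graph
print(gnn_predict(h_tape, A).shape)                             # (2, 1)
```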
Several studies have shown that this strategy of combining LLMs and GNNs has significant advantages in practical applications. For example, LLMRec [60] is the first work on graph enhancement using LLMs; by augmenting user-item interaction edges, item node attributes, and user node profiles, it effectively alleviates the problems of data sparsity and low-quality auxiliary information in recommender systems. Similarly, RLMRec [61] utilizes LLMs to enhance representation learning in existing recommender systems, using LLMs to process the textual information of items as well as user interaction information and the textual attributes of related items to generate item profiles and user profiles. By maximizing mutual information, RLMRec aligns the semantic representations from LLMs with collaborative relationship representations, significantly improving the performance of the recommendation system. PRODIGY [62] is a framework for pre-training in-context learners on prompt graphs. It leverages the powerful zero-shot capability of LLMs to encode the textual information of graph nodes, enabling in-context learning on graphs. Inspired by the remarkable effectiveness of prompt learning in NLP, ALL-in-one [63] proposes a new multitask prompting approach that unifies the formats of graph prompts and language prompts via prompt tokens, token structures, and insertion patterns, allowing NLP prompting concepts to be seamlessly introduced into the graph domain.
When using embeddable or open-source LLMs, the text embeddings they generate can be accessed directly. In this case, the LLMs first extract textual features for each node i, and then the textual features h = {h_1, h_2, \ldots, h_N} are fed into the GNNs as initial node embeddings. Subsequently, the GNNs combine the graph structure information with these augmented node embeddings through their message-passing and feature-aggregation mechanisms, as shown in Figure 3b. In general, the process can be described by the following equation:
h_i = \mathrm{LLM}(x_i, p), \quad \hat{y} = \mathrm{GNN}(h, A).
This approach significantly improves the performance of graph mining tasks, especially in downstream tasks such as node classification, link prediction, and graph classification. OFA [64] embeds textual descriptions of graph datasets from different domains into the same feature space, thus becoming the first cross-domain generalized graph classification model. GaLM [65] pre-trains on large-scale heterogeneous graphs and incorporates structural information into the fine-tuning stage of the LLM to generate higher quality node embeddings, thus improving performance for downstream applications in different graph patterns. LEADING [66] is an end-to-end fine-tuning algorithm that significantly improves LLMs’ computational and data efficiency in processing TAGs by reducing coding redundancy and propagation redundancy. GraphEdit [67] enhances LLMs’ reasoning ability on node relationships in graph data through instruction tuning to solve the noise and sparsity problems in graph structures. Specifically, GraphEdit first uses LLMs to generate semantic embeddings of nodes and filter candidate edges with a lightweight edge predictor; then, in combination with the original graph structure, it uses the inference ability of LLMs to optimize edge additions and deletions and generates the improved graph structure. Eventually, the optimized graph structure is used for GNNs training to support downstream tasks such as node classification, thus realizing denoising and global dependency mining of the graph structure.
With their exceptional ability to process text sequences, LLMs perform excellently on TAGs. They can extract deep semantic information from textual attributes, providing richer feature representations than traditional methods. In addition, unlike traditional GNNs, which require different architectures to be designed for different datasets, combining LLMs and GNNs makes it possible to process data from different domains through a unified feature space and cross-domain embedding methods. This combination of the versatility of language models and the structural understanding of graph neural networks offers flexibility in dealing with complex attributes and structures.
Beyond that, LLMs can also improve the performance of downstream tasks through data augmentation. For example, LLM-GNN [68] achieves efficient and low-cost label-free node classification through LLM-based zero-shot labeling and GNN-based extended learning. LLM4NG [69] represents a typical application of graph generation learning. It utilizes a large language model to generate new nodes and integrates these nodes into the original graph structure via an edge predictor to generate a new graph structure. This approach significantly improves model performance in few-shot scenarios, demonstrating the effectiveness of augmenting model learning capabilities by generating samples. Similarly, OpenGraph [70] uses LLMs for generating synthetic graph data (e.g., nodes and edges) as well as augmenting the pre-training data. It also employs a unified graph tokenizer and an efficient graph transformer, achieving excellent generalization on multi-domain graph data in zero-shot learning.
Although LLMs can enhance model performance, they are extremely demanding in terms of computational resources, especially when processing large amounts of textual data, and may require frequent API calls to the LLMs, which comes at a high cost. Moreover, although LLMs provide rich semantic information, their decision-making process is often opaque and lacks interpretability. Therefore, how to efficiently fuse text embeddings with graph structure information within the GNN framework remains a challenging issue that requires further research.

3.2. LLM-Driving-GNN

In some graph mining tasks, LLMs, with their powerful zero-shot learning capabilities, can directly perform prediction, classification, or reasoning. The computational power of LLMs and the advantages of deep learning algorithms enable them to perform well in these tasks. However, since LLMs can only accept sequential text as input, graph data require an additional step: transforming graph data, whose structure and features have been defined in various ways, into sequential text that can be fed directly into LLMs. In recent years, many studies have explored the applicability of LLMs to downstream tasks involving graph structures, yielding some preliminary results [71,72,73]. These results suggest that LLMs have achieved initial success in processing implicit graph structures. Wang et al. [74] proposed a synthetic benchmark, NLGraph, for evaluating the performance of LLMs on graph-structure reasoning tasks; they found that LLMs demonstrate capability on simple graph reasoning tasks but remain insufficient when dealing with complex problems. Additionally, several studies have explored the capability of LLMs in processing graph-structured data and their effectiveness in extracting and utilizing graph structural information [75,76]. These studies have opened new avenues for exploiting the capabilities of LLMs in graph applications and for further exploring the incorporation of structured data into LLM frameworks. A growing number of researchers are committed to exploring how to better apply LLMs to downstream tasks containing graph structures and further improve their performance.
Typically, to enable LLMs to understand graph data, researchers use specific methods to convert graph data into textual descriptions, which are then used as inputs to LLMs, and predictions are finally extracted from the model's output. Among these methods, flattening functions can directly convert graphs into textual descriptions, which is convenient and intuitive, while GNNs have demonstrated an excellent ability to understand graph structures through information propagation and aggregation among nodes. Therefore, researchers have also explored using GNNs to transform graph data with different structures and features into sequential text, so as to fully leverage the structural information in the graph data for prediction. As shown in Figure 4, the method of encoding graph-structured data into text for use in LLMs was first comprehensively investigated by Fatemi et al. [77] and can be described by the following equation:
\mathrm{Describe} = g(G, Attr), \quad A = \mathrm{LLM}(\mathrm{Describe}, p),
where g represents the flattening function or GNN, which takes the graph G and the text attributes Attr on each node or edge of the graph as input and produces the textual description, i.e., Describe, to be fed to the LLM. p denotes the prompt, and A is the answer given by the LLM, from which the predicted labels for downstream tasks can be extracted in a task-specific way.
The g(\cdot) function and the prompt p in Formula (7) are generally optimized by the following procedure. For a question Q on the graph G, the prompt p can be adapted by a question-rephrasing function q(Q) so that the large model generates a response that best matches the expected answer S. The model searches for the optimal graph encoding g(\cdot) and question-rephrasing function q(\cdot) that maximize the expected answer score (\mathrm{score}_f) on the training dataset D:
\max_{g, q} \; \mathbb{E}_{(G, Q, S) \sim D} \left[ \mathrm{score}_f\big( g(G, Attr), q(Q), S \big) \right].
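As a concrete example of a graph-to-text encoding g(G, Attr) and question rephrasing q(Q), the following is a minimal sketch of a flattening function that serializes a small attributed graph into a prompt; the node/edge-list phrasing and helper names are illustrative choices rather than the encoding of any specific paper.

```python
def flatten_graph(nodes, edges, attrs):
    """g(G, Attr): serialize nodes, edges, and text attributes into a description.
    nodes: list of node ids; edges: list of (u, v) pairs; attrs: dict id -> text."""
    node_lines = [f"Node {n}: {attrs.get(n, 'no attribute')}" for n in nodes]
    edge_lines = [f"Node {u} is connected to node {v}." for u, v in edges]
    return "Graph description:\n" + "\n".join(node_lines + edge_lines)

def build_prompt(description, question):
    """q(Q): rephrase the task as an instruction appended to the graph description."""
    return f"{description}\n\nQuestion: {question}\nAnswer with the node id only."

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
attrs = {0: "paper on graph neural networks", 1: "survey on LLMs", 2: "paper on molecules"}
prompt = build_prompt(flatten_graph(nodes, edges, attrs),
                      "Which node has the highest degree?")
print(prompt)  # this string would be sent to the LLM, whose answer A is then parsed
```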
Research has shown that choosing a suitable graph-encoding method can significantly improve the performance of LLMs in graph reasoning tasks. Thus, one of the current research priorities is to explore suitable graph-encoding methods. For example, GPT4Graph [78] converts graph data into a Graph Description Language (GDL) like GraphML [79]. It generates prompts in conjunction with user queries so that the large language model can understand and process the graph-structured data. GraphText [80] constructs a graph syntax tree from graph data and generates graph prompts through traversal of the tree, expressing them in natural language so that LLMs can treat graph reasoning as a text generation task. An alternative approach is to record the graphical data directly in natural language, i.e., to describe the graphical data with a digitally organized list of nodes and edges [81]. GraphGPT [82] also utilizes Chain-of-Thought (CoT) [83] to augment the model’s reasoning capabilities. However, in terms of transforming graph data, GraphGPT trains a lightweight graph–text projector that is able to align representations between text and graph structure, allowing the model to switch seamlessly during processing. Similarly, MoleculeSTM [84] uses a graph encoder as a molecular graph structure encoder. The method enables large language models to align molecular structures to natural language by aligning molecular graphs and textual representations through contrastive training and then employing a lightweight alignment projector to map graph features into the word embedding space. DGTL [85] encodes the raw textual information in TAGs using a frozen LLM and then captures the graph neighborhood information in TAGs with a set of custom disentangled graph neural network layers. Finally, the features learned from these disentangled layers are used to fine-tune the LLMs to help the model better understand the complex graph structure information in the TAGs, thereby improving the final prediction ability of the LLMs. On the other hand, GraphTranslator [86] proposes a mechanism called Producer for creating graph–text alignment data, which enables a large language model to predict graph data based on language instructions. HiGPT [87] adapts to diverse heterogeneous graph learning tasks without downstream fine-tuning through a heterogeneous graph instruction-tuning paradigm.
Another way to enhance the capability of LLMs on graph-structured data is by fine-tuning LLMs to enhance their graph mining capabilities. Several researchers have explored this area and made notable progress. GIMLET [88] fine-tunes LLMs to output predictive labels directly, thus providing accurate predictions without additional parsing steps. MuseGraph [89] generates compact graph descriptions using neighbor nodes and random walks and creates task-based CoT instruction sets to fine-tune the large language model. The method dynamically allocates instruction packages between tasks and datasets to ensure the effectiveness and generalization of the training process. Eventually, the graph structure data are converted into a format suitable for LLMs, which allows the fine-tuned model to fit downstream tasks such as node classification, link prediction, and graph-to-text generation. InstructGLM [90] designed a series of rule-based, highly scalable natural language prompts for describing graph structures and performing graph tasks, and fine-tunes large language models with these instructions, enabling them to understand and process these descriptions to perform graph tasks. GraphLLM [91] adopts an end-to-end approach that integrates a graph learning module (graph transformer) with LLMs. Specifically, the approach uses a textual transformer encoder–decoder to extract the necessary information from the node descriptions, learns the graph structure through the graph transformer, and generates overall graph representations by aggregating the node representations. Ultimately, these graph representations are used to generate graph-enhanced prefixes injected in each LLM attention layer. This approach allows the LLM to work synergistically with the graph transformer to incorporate structural information critical for graph inference and thus improve performance on graph reasoning tasks.
Unlike the above approaches, the Graph-ToolFormer framework [92] uses API calls to invoke external graph inference tools to complete reasoning tasks. First, ChatGPT is utilized to annotate and expand the manually written graph reasoning task prompts to generate a large dataset of prompts containing graph reasoning API calls. Then, the generated dataset is used to fine-tune pre-trained causal LLMs (e.g., GPT-J [93] and LLaMA [94]) and teach them how to use external graph inference tools in the generated output. Finally, the fine-tuned Graph-ToolFormer models are able to automatically add the corresponding graph reasoning API calls to the output statements when they receive input queries and questions. Through the above steps, the Graph-ToolFormer framework realizes the ability to empower existing LLMs to handle complex graph reasoning tasks, effectively addressing the current limitations of LLMs in handling precise computation, multi-step logical reasoning, spatial and topological awareness, etc. LLMs can also be used to generate new GNN architectures. Wang et al. [95] proposed a novel graph neural network architecture search method, GPT4GNAS, which guides GPT-4 to understand the search space and search strategies of GNAS by designing a new type of prompt. These prompts are iteratively run to generate new GNN architectures, and the evaluation results are used as feedback to optimize the generated architectures further. ChatRule [96] utilizes LLMs to mine logical rules as well as learn and represent graph structures by combining semantic and structural information from knowledge graphs (KGs), helping encode and process graph structures. These rules can be regarded as new graph structures to construct new knowledge graphs by capturing the generative rules and patterns of graphs. Meanwhile, GNP [97] extracts valuable knowledge from knowledge graphs, and through graph neural network encoding, cross-modal pooling, and self-supervised learning, it significantly improves the performance of LLMs in common-sense reasoning and biomedical reasoning tasks.
In conclusion, when LLMs dominate downstream tasks such as prediction, classification, and reasoning, they demonstrate significant advantages over traditional GNNs, especially in zero-shot learning and in processing textual attributes. LLMs can utilize their powerful text generation and comprehension capabilities to predict and classify graph data directly, without the complex structural processing required by GNNs. However, each approach requires careful trade-offs between the advantages and disadvantages of LLMs and GNNs. Converting graph data into textual descriptions simplifies the processing flow and enables LLMs to apply their textual processing capabilities directly to reasoning. Nevertheless, due to input length limitations, this approach may lose graph structure information, and the text conversion process can become so involved that it is unsuitable for complex graph data. On the other hand, incorporating GNNs can fully utilize the information in the graph structure and enhance the processing capability of the model, but this integration increases the complexity of the system. Effective integration of GNNs with LLMs requires careful design and tuning to ensure that both can work together to handle complex graph data. To sum up, how to efficiently combine graph structural information with LLMs for more powerful graph learning and reasoning is the core research question in this area.

3.3. GNN-LLM-Co-Driving

In GdL, LLMs act as the preprocessors of graph data, primarily converting the textual attributes of the graph into rich feature representations through their powerful text comprehension and generation capabilities and then passing these representations to GNNs for further structural processing. In this setup, LLMs play a role in information extraction and feature enhancement. In contrast, in LdG, GNNs are mainly responsible for processing graph structural information and utilizing this structural information to enhance the prediction, classification, reasoning, and generation capabilities of LLMs.
GLcd synthesizes the strengths of GNNs and LLMs, leveraging their collaboration to solve complex tasks. Different from GdL and LdG, the co-driving strategy emphasizes the deep interaction and complementarity between GNNs and LLMs. In this architecture, GNNs and LLMs co-drive the learning process of the model, alternating and complementing each other. GNNs, with their advantage in graph structural information, help LLMs generate semantically deeper features under complex structures; at the same time, LLMs provide GNNs with more accurate node and edge representations through their strong ability in text sequence processing. This bidirectional interaction not only improves the model’s comprehensive processing capability on graph structures and textual information but also enhances the robustness and generalization of the overall model, which can better cope with diverse tasks and datasets.
A typical co-driving model is the GraphFormers framework proposed by Yang et al. [98], as shown in Figure 5a. It creates a unified architecture capable of processing textual and graph structural information simultaneously by embedding GNNs into each transformer layer. In each layer, the GNN first performs graph aggregation, bringing neighborhood information to the central node; the enhanced node features are then processed by the transformer to generate richer node representations. This allows the text encoding and graph aggregation processes to alternate within the same workflow, so that nodes exchange information at each layer and enrich their representations with information from neighboring nodes. PATTON [99] adopts the GraphFormers framework and designs two pre-training strategies on top of it—Network Contextualized Masked Language Modeling (NMLM) and Masked Node Prediction (MNP). By jointly optimizing the NMLM and MNP objective functions during pre-training, this approach enhances the model's semantic understanding at both the vocabulary and document levels, enabling the pre-trained language model to sufficiently understand and represent the complex semantic information in rich textual networks. Zhao et al. [100] propose the GLEM model, which combines GNNs and LLMs and alternately updates them, generating pseudo-labels for each other through a Variational Expectation-Maximization (EM) framework. By integrating textual semantics and graph structural information, GLEM can effectively perform node representation learning on large-scale text-attributed graphs and improve node classification performance. GREASELM [101] uses a modality interaction mechanism to facilitate bidirectional information transfer between each layer of the LM and the GNN, deeply integrating language context with knowledge graph representations to realize joint reasoning for answering complex questions.
Unlike the aforementioned architectures that combine GNN and LLM models, Text2Mol [102] employs two independent encoders for textual and molecular representations. By projecting the data from these two modalities into an aligned semantic embedding space, the model achieves cross-modal search for retrieving molecules from natural language descriptions. To enhance the interpretability of the model, the Text2Mol framework introduces a cross-modal attention model based on the transformer decoder. The model uses the output of SciBERT [103] as the source sequence and the node representations generated by the GCN model as the target sequence, learning association rules between molecular substructures and text keywords through an attention mechanism. To optimize the model, the authors propose a symmetric contrastive loss function. For the two sub-model output embeddings m and t over a mini-batch of size n, this contrastive loss function is defined as follows:
L(m, t) = \mathrm{CCE}(e^{\tau} m t^{T}, I_n) + \mathrm{CCE}(e^{\tau} t m^{T}, I_n),
where \tau is a temperature parameter learned during training, the identity matrix I_n is used as the label, and Categorical Cross-Entropy (CCE) is applied along each of the two axes. Specifically, the Categorical Cross-Entropy can be expressed as:
\mathrm{CCE}(e^{\tau} t m^{T}, I_n) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} I_{i,j} \log \left[ e^{\tau} t m^{T} \right]_{i,j}.
In practice, this contrastive loss function helps the model learn robust, discriminative representations by encouraging embeddings of matching pairs to be closer while pushing non-matching pairs apart, thereby improving the synergy between textual descriptions and molecular graphs for tasks such as property prediction or drug design. With this symmetric contrastive loss function, the model effectively uses the other samples in the mini-batch as negative samples, maximizing the similarity of correctly matched pairs while minimizing the similarity of incorrectly matched pairs, thus improving the model's performance. The architecture is shown in Figure 5a.
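The following is a minimal NumPy sketch of this symmetric contrastive loss; it assumes that the Categorical Cross-Entropy normalizes each row of the scaled similarity matrix with a softmax (the standard CCE convention) and uses randomly generated, L2-normalized toy embeddings purely for illustration.

```python
import numpy as np

def symmetric_contrastive_loss(m, t, tau=1.0):
    """Symmetric contrastive loss over a mini-batch of n paired embeddings.
    m: (n, d) molecule/graph embeddings, t: (n, d) text embeddings.
    Matching pairs share the same row index; the identity matrix I_n is the label."""
    logits = np.exp(tau) * (t @ m.T)                      # (n, n) similarity matrix scaled by e^tau

    def cce(logits):
        # row-wise softmax + cross-entropy; labels are I_n, so only diagonals contribute
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return cce(logits) + cce(logits.T)                    # both directions: t->m and m->t

rng = np.random.default_rng(0)
m = rng.normal(size=(8, 32)); m /= np.linalg.norm(m, axis=1, keepdims=True)
t = m + 0.1 * rng.normal(size=(8, 32)); t /= np.linalg.norm(t, axis=1, keepdims=True)
print(symmetric_contrastive_loss(m, t, tau=2.0))          # low loss: paired rows are most similar
```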
Similarly, methods such as MoleculeSTM [104], CLAMP [105], ConGraT [106], G2P2 [107], and GRENADE [108] also employ independent GNN encoders and LLM encoders to process molecular and textual data separately and then map the embedded representations into a shared joint representation space for contrastive learning. However, these models differ in implementation details or in the specific application scenarios to which they apply. MoleculeSTM focuses on solving new challenges in drug design, such as structure–text retrieval and text-based molecular editing, and constructs PubChemSTM, a multimodal dataset containing a large number of chemical structure–text pairs. CLAMP, on the other hand, is primarily dedicated to zero-shot bioactivity prediction; pre-training on large-scale chemical databases (such as PubChem) containing molecular structures, text descriptions, and bioactivity measurements significantly enhances the generalization ability of activity prediction models. ConGraT attaches an adapter module, consisting of two fully connected layers, after each of the text encoder and graph node encoder to generate text embeddings and graph node embeddings of the same dimension. G2P2 and GRENADE take a further step by employing refined contrastive learning strategies. G2P2 enhances the granularity of contrastive learning by jointly training its graph encoder and text encoder during the pre-training phase with three graph-based contrastive strategies (text–node interaction, text–summary interaction, and node–summary interaction). This aligns graph node and text representations in a bimodal embedding space, enabling better capture of fine-grained semantic information in the text while leveraging graph structures to enhance classification performance. GRENADE, in turn, jointly optimizes the pre-trained language model encoder and the graph neural network encoder through two self-supervised learning algorithms: graph-centered contrastive learning [109] and graph-centered knowledge alignment.
In addition, we compiled a table that selects representative research works from each category to intuitively display the mathematical structure of each model and distill the core mathematical information from the corresponding papers. This table not only facilitates sorting out the mathematical design points of different models but also enables comparison of their similarities and differences in mathematical structure.
To achieve efficient node classification on textual graphs, GraD [110] employs the concept of knowledge distillation. The core idea is to transfer the graph structure information from a GNN teacher model to a graph-free student model through the distillation process. The student model does not need the graph structure during inference, which significantly improves inference efficiency while realizing efficient and accurate node classification. To support different coupling strengths and degrees of flexibility between the teacher and student models, GraD also proposes three different optimization strategies, namely GraD-Joint, GraD-Alt, and GraD-JKD. Similarly, the THLM [111] framework combines BERT with the heterogeneous graph neural network R-HGNN [112], and through topology-aware pre-training tasks and text augmentation strategies, it pre-trains on Text-Attributed Heterogeneous Graphs (TAHGs). After the pre-training phase, the THLM framework retains only the language model for downstream tasks and no longer relies on the auxiliary HGNN, ensuring efficiency and flexibility in processing downstream tasks. Some studies focus on combining LLMs and GNNs in the context of TAGs to reduce training complexity and memory consumption while maintaining the model's expressiveness. GraphAdapter [113] combines GNNs and LLMs specifically for processing TAGs. It first uses a GNN to model the structural information of each node in the TAG, then integrates this structural information with the context-hidden states of the LLM, and finally transforms the original task into a next-word prediction task by adding task-specific prompts. Its lightweight design, residual connections, and task-related prompts enable the method to exhibit high performance across various downstream tasks, validating its effectiveness in TAG modeling. ENGINE [114] combines LLMs and GNNs through a tunable bypass structure, G-Ladder, for efficient fine-tuning and reasoning on TAGs. The framework significantly reduces memory and computational costs while preserving structural information through the lightweight G-Ladder structure, which adds tunable parameters next to each layer of the LLM. To further improve efficiency, a caching mechanism precomputes node embeddings during training, and dynamic early stopping accelerates model inference.
Compared with GdL and LdG, the GLcd mode emphasizes the deep interaction and complementarity between the GNNs and LLMs. In this mode, the GNN and LLM alternate and enhance each other in the learning process, thereby demonstrating higher robustness and generalization in the integrated processing of graph structure and text information. This co-driving strategy can solve complex tasks efficiently by integrating the graph structure processing capability of GNNs and the text comprehension capability of LLMs.

3.4. Comparison of the Significance of Embeddings

The integration of LLMs and GNNs has led to significant advances in various tasks, including node classification, link prediction, and cross-domain generalization. A key aspect of this integration is the role of embeddings: LLM embeddings provide open-domain semantic priors (such as common sense knowledge in ChatGPT), while GNN embeddings encode domain-specific structural patterns. These two types of embeddings complement each other through contrastive learning or attention mechanisms. Table 1 presents a structured overview of how embeddings are utilized in various LLM-GNN integration strategies, their impact on performance, and the mechanisms ensuring interoperability between the two modalities.

4. Summary and Discussion

4.1. Summary

The previous sections have discussed related work on modeling in the graph mining domain using large language models. Through this systematic review, it is evident that the combination of LLMs and GNNs brings new research ideas and directions to graph mining. Specifically, the LLM demonstrates strong capabilities in semantic understanding and text feature extraction, while the GNN is uniquely suited to capturing graph structure and complex node relationships. Accordingly, various studies have integrated the two models through multiple architectures and strategies to enhance the performance of graph mining tasks. The models listed in Table 2 are categorized and compared according to their main driving approaches, summarizing their performance across tasks, datasets, performance metrics, and applicable scenarios. In addition, the mathematical information of several representative models is summarized in Table 3, including key characteristics such as input and output forms, parameter scales, and the loss functions used. This information provides readers with a clear comparison and helps them better understand the structure of each model.
Meanwhile, to help readers understand the relationships within the literature, we sketch a citation network of the important literature discussed in this review, as shown in Figure 6, using directed edges to represent citation relationships between articles. In the figure, green, blue, and red nodes represent the literature categorized as "GdL", "LdG", and "GLcd", respectively; the size of a node reflects its number of citations. It can be observed that there are citation relationships among the different categories, which indicates that the research in these three categories is strongly intersecting and complementary in terms of methodology and application scenarios. It is worth noting that the green nodes (GdL) exhibit a relatively scattered citation pattern, which indicates that studies in this direction emphasize independent applications in specific scenarios and interact less with the other categories. Each category has at least one representative paper (with the highest number of citations); for instance, the representative work of "LdG" is NLGraph [74], and the representative work of "GLcd" is GREASELM [101]. Overall, the graph intuitively visualizes the synergistic and evolutionary relationships between different research directions in the field, providing a valuable reference for future research.

4.2. Evaluation Metrics

This section introduces in detail the evaluation metrics commonly used by the models discussed in this review, including accuracy, recall, AUC-ROC, average precision, MSE, text similarity, and semantic similarity.

4.2.1. GNN-Based Model Evaluation Metrics

TP, FP, TN, and FN are used to represent true positive, false positive, true negative, and false negative, respectively.
Accuracy. Accuracy refers to the correctness of the model's overall predictions, that is, the proportion of correctly predicted samples among all samples. For example, in the node classification task, accuracy represents the proportion of node categories correctly predicted by the model among all nodes, that is, the ratio of the number of correctly classified nodes to the total number of nodes. The calculation formula is:
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Precision. Precision refers to the proportion of samples predicted as positive by the model that are actually positive, and it is used to assess the exactness of the model’s positive predictions. The calculation formula is:
\mathrm{Precision} = \frac{TP}{TP + FP}
Recall. Recall refers to the proportion of all true positive samples correctly predicted as positive by the model. It is used to evaluate the model’s coverage in finding all positive samples in the data. For example, recall in the recommendation task is equal to the proportion of all truly relevant items the model successfully recommends. The calculation formula is:
\mathrm{Recall} = \frac{TP}{TP + FN}
F_1 score. The F_1 score is the harmonic mean of precision and recall, which takes into account both the exactness and the coverage of the model's prediction results and strikes a balance between precision and recall. When both are high, the F_1 value will also be high. The calculation formula is:
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
AUC-ROC. Area Under the Curve (AUC) represents the area under the Receiver Operating Characteristic curve (ROC curve), which reflects the model’s ability to distinguish between positive and negative samples. Its value ranges from 0.5 (random guessing) to 1 (perfect distinction). The larger the value, the better the model performance.
Average Precision. Average precision (AP) measures the precision–recall tradeoff, taking ranking performance into account by averaging precision across recall levels. With recall as the horizontal axis and precision as the vertical axis, we obtain the PR curve; the average precision is the area under this curve. The calculation formula is:
\mathrm{AP} = \int_{0}^{1} p(r) \, dr
Mean Squared Error. Mean squared error (MSE) measures the average squared difference between the predicted values \hat{y} and the actual values y. Lower values indicate better predictions. The calculation formula is:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
R^2 score. The R^2 score, also known as the coefficient of determination, is a statistical indicator used to evaluate the goodness of fit of a regression model. It indicates the proportion of the variance of the dependent variable that is explained by the independent variables. The R^2 score typically lies between 0 and 1: the closer it is to 1, the better the model fits the data, and the closer it is to 0, the worse the fit. The specific calculation formula is:
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
where \hat{y}_i represents the model's prediction for the i-th sample, y_i represents the true value, and \bar{y} is the mean of the true values.
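A minimal sketch of the two regression metrics, assuming NumPy and toy prediction arrays, is given below.

# Minimal sketch: MSE and the R^2 score with NumPy; the arrays are illustrative.
import numpy as np

y_true = np.array([3.0, 2.5, 4.1, 5.0, 3.8])
y_pred = np.array([2.8, 2.7, 3.9, 5.2, 3.5])

mse = np.mean((y_true - y_pred) ** 2)
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MSE = {mse:.4f}, R^2 = {r2:.4f}")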

4.2.2. LLM-Specific Evaluation Metrics

Unlike traditional machine learning models, texts generated by LLMs need to be evaluated not only for correctness but also for relevance, factual accuracy, and other important aspects. To achieve a comprehensive evaluation, researchers have developed various metrics tailored to different aspects of LLM performance. These metrics include n-gram-based measurements to assess lexical overlap; text similarity and semantic similarity metrics to evaluate consistency of surface-level and contextual meanings; logic-based metrics to determine logical consistency; and more. In the following, we will delve into some standard evaluation methods in depth.
N-gram-based Metrics. N-gram-based metrics are widely used to evaluate the quality of generated text by measuring the overlap of n-grams (sequences of n words) between the generated text and the reference text. These metrics are particularly useful for tasks such as machine translation and text summarization, where the goal is to generate text that closely matches the reference in terms of word sequences. One of the most commonly used n-gram-based metrics is Bilingual Evaluation Understudy (BLEU) [115], which computes the precision of n-grams up to a specific length (usually 1 to 4) and includes a brevity penalty that penalizes translations that are too short. The BLEU score is calculated as the geometric mean of the n-gram precisions multiplied by the brevity penalty, as follows:
BLEU = BP \times \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
where BP is the brevity penalty, p_n is the precision for n-grams of order n, and w_n is the weight assigned to each n-gram precision. BLEU scores range from 0 to 1, where higher values indicate greater n-gram overlap with the reference and therefore better performance in generating text similar to the reference.
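The following simplified, sentence-level BLEU sketch (single reference, uniform weights, crude smoothing) is included only to make the formula concrete; practical evaluation normally relies on established implementations such as sacrebleu or NLTK.

# Simplified sentence-level BLEU sketch for illustration only.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())  # clipped counts
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # crude smoothing to avoid log(0)
    # brevity penalty: 1 if the candidate is longer than the reference, else exp(1 - r/c)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# BLEU-2 on a toy pair (unigram and bigram precision only)
print(bleu("the cat sat on the mat", "the cat is on the mat", max_n=2))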
Text Similarity. Text similarity is a key task in NLP that aims to measure how similar two texts are. Text similarity metrics assess similarity by comparing the words or word sequences shared between two text items. They are useful for producing a similarity score when contrasting the output predicted by an LLM with the reference ground-truth text, and they offer insight into the model's performance across different tasks. The Levenshtein Similarity Ratio is a character-level text similarity measure [116]. It evaluates the degree of similarity between two strings by calculating their edit distance (Levenshtein distance), i.e., the minimum number of edit operations, namely character insertions, deletions, and substitutions, required to convert one string into the other. The formula for the Levenshtein Similarity Ratio is given below:
\text{Levenshtein Similarity Ratio}(x, y) = 1 - \frac{\text{Levenshtein Distance}(x, y)}{\max(|x|, |y|)}
where Levenshtein Distance(x, y) is the minimum number of edit operations between the two strings x and y, and |x| and |y| are the lengths of x and y.
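A minimal dynamic-programming sketch of the Levenshtein distance and the similarity ratio defined above is shown next; it is intended for short strings and illustration only.

# Minimal sketch: Levenshtein distance (dynamic programming) and similarity ratio.
def levenshtein_distance(x: str, y: str) -> int:
    prev = list(range(len(y) + 1))               # distances from x[:0] to every prefix of y
    for i, cx in enumerate(x, start=1):
        curr = [i]                               # distance from x[:i] to the empty prefix
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity_ratio(x: str, y: str) -> float:
    if max(len(x), len(y)) == 0:
        return 1.0
    return 1.0 - levenshtein_distance(x, y) / max(len(x), len(y))

print(levenshtein_similarity_ratio("graph mining", "graph minning"))  # about 0.92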
Embedding-based Metrics. The basic idea of embedding-based evaluation metrics is to map the text into a high-dimensional embedding space and reflect the semantic or structural proximity of texts by calculating the similarity between their embedding vectors. Specifically, pre-trained models (e.g., BERT, Word2Vec, GloVe) are first utilized to convert the words or sentences in the text into embedding vectors, which capture the semantic information and contextual relationships of the words. Then, similarity measures (e.g., cosine similarity, Euclidean distance) are used to compare the embedding vectors of the candidate text and the reference text in order to derive the evaluation result. BERTScore [117] is a commonly used metric of this kind: it utilizes contextualized embeddings from pre-trained models such as BERT and, instead of relying on exact word matching, computes the cosine similarity between the word embeddings of the candidate and reference texts.
BERTScore(p, r) = \frac{1}{N} \sum_{i=1}^{N} \max_{j \in \{1, \ldots, M\}} \text{cosine\_similarity}(p_i, r_j)
where p_i and r_j are the word embedding vectors of the generated text and the reference text, respectively, and N and M denote the numbers of words in the generated text and the reference text.
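To make the greedy-matching step concrete, the sketch below averages, for each candidate token, its maximum cosine similarity to the reference tokens; it assumes token embeddings have already been produced by a contextual encoder such as BERT, and random vectors stand in for them here.

# Sketch of BERTScore-style greedy matching over pre-computed token embeddings.
import numpy as np

def greedy_match_score(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    # L2-normalize so that dot products equal cosine similarities
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                    # (N_cand, M_ref) matrix of cosine similarities
    return float(sim.max(axis=1).mean())  # each candidate token matches its best reference token

rng = np.random.default_rng(0)
cand_emb = rng.normal(size=(6, 768))   # 6 candidate tokens, 768-dim embeddings (stand-ins)
ref_emb = rng.normal(size=(8, 768))    # 8 reference tokens
print(greedy_match_score(cand_emb, ref_emb))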
Semantic Similarity. Semantic similarity assessment between sentences involves gauging the closeness of their meanings. The process starts by transforming each sentence into a feature vector that encapsulates its semantic essence. A prevalent method involves generating embeddings for the sentences, and then applying cosine similarity to quantify the similarity between the embedding vectors. Specifically, given two embedding vectors, A (for the target sentence) and B (for the reference sentence), cosine similarity is calculated as the dot product of A and B divided by the product of their magnitudes. This method leverages the geometric proximity of vectors in high-dimensional space to reflect semantic relatedness, with higher cosine values indicating closer meanings. This approach is widely used in tasks like machine translation evaluation, text generation assessment, and semantic textual similarity benchmarking, offering a robust way to evaluate how well the meanings of sentences align.
\text{cosine similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}
This metric calculates the cosine of the angle between two non-zero vectors, with values ranging from −1 to 1. A value of 1 indicates that the vectors point in the same direction (maximum similarity), while −1 signifies maximum dissimilarity.
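A hedged example of this pipeline is sketched below: two sentences are embedded with a pre-trained sentence encoder and compared by cosine similarity. The sentence-transformers package and the all-MiniLM-L6-v2 model are one common choice, not a requirement of the metric.

# Sentence-level semantic similarity via embeddings and cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model = SentenceTransformer("all-MiniLM-L6-v2")   # one common pre-trained encoder
emb = model.encode(["GNNs aggregate information from neighboring nodes.",
                    "Graph neural networks pass messages between adjacent nodes."])
print(cosine_similarity(emb[0], emb[1]))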
Prompt-based Metrics. Prompt-based metrics are a class of automated methods that use an LLM itself as the evaluator. They guide the LLM to score the generated text through carefully designed prompts, enabling the measurement of dimensions such as fluency, consistency, relevance, and factuality. In contrast to traditional reference-based metrics, prompt-based metrics offer better scalability and yield interpretable results. Common prompt frameworks currently in use include Reason-then-Score (RTS) [118], Multiple Choice Question Scoring (MCQ) [119], Head-to-head Scoring (H2H) [119], and G-Eval [120]. These frameworks enable LLMs to assess text in various ways, producing quantitative scores or textual explanations.
Although prompt-based metrics can reduce labor costs in large-scale evaluation tasks and are highly correlated with human evaluation results in some scenarios, they still have certain limitations. For instance, language model evaluation may be influenced by issues like positional bias, length bias, or self-enhancement bias. Furthermore, the effectiveness of prompt design directly impacts the reliability of the assessment results. Therefore, in practical applications, it is often essential to incorporate multiple rounds of prompt optimization (such as Multiple Evidence Calibration, MEC) [121] or manual review to enhance evaluation accuracy.
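As an illustration only, the sketch below builds a Reason-then-Score-style prompt for factual-consistency scoring; the wording, the 1-5 scale, and the score format are assumptions, and the actual LLM call is omitted because it depends on the chosen provider's API.

# Illustrative prompt-based evaluation template (RTS/G-Eval spirit); send the
# resulting string to any LLM client and parse the line starting with "Score:".
def build_consistency_prompt(source: str, summary: str) -> str:
    return (
        "You will be given a source document and a summary.\n"
        "First, explain step by step whether the summary is factually consistent "
        "with the source. Then output a consistency score from 1 (inconsistent) "
        "to 5 (fully consistent) on its own line as 'Score: <n>'.\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n"
    )

prompt = build_consistency_prompt(
    source="GraphSAGE samples and aggregates features from a node's neighbors.",
    summary="GraphSAGE aggregates neighbor features via sampling.",
)
print(prompt)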
Quality-based Metrics. Quality-based metrics are utilized to assess the overall quality of generated text, with a focus on criteria like information completeness, fluency, and readability. These metrics are commonly applied in tasks such as text summarization and question-answering systems to determine if the generated text covers essential information and remains relevant to the input text content. In contrast to traditional reference text-based metrics like ROUGE and BLEU, quality assessment metrics emphasize evaluating the effectiveness and coherence of the text rather than mere vocabulary matching.
BLANC [122] is an unsupervised summary evaluation method that aims to assess the quality of summaries without manual annotation. The core idea is that if a summary can enhance BERT’s predictive ability, it indicates that the summary contains essential information and is of high quality. The quality of the summary is evaluated by measuring how much the summary helps the BERT language model perform the masked token prediction, the cloze task. There are two versions of BLANC, referred to as BLANC-help and BLANC-tune. Specifically, BLANC-help calculates the difference in accuracy of BERT’s prediction of the original text with and without the assistance of summaries, while BLANC-tune involves fine-tuning BERT with summaries and then measuring the improvement in its predictive ability. The formula for BLANC-help is defined as:
BLANC_{help} = \frac{S_{01} - S_{10}}{S_{00} + S_{11} + S_{01} + S_{10}}
where the first index i of S_ij is 0 or 1, representing unsuccessful (0) or successful (1) prediction for the filler-input, and the second index j denotes the corresponding outcome for the summary-input. For example, S_01 is the number of cases in which the filler-input prediction failed while the summary-input prediction succeeded. The BLANC score typically ranges from 0 to 0.3 and indicates how helpful the information provided by the summary is for understanding the text; a higher score signifies more helpful information and better quality.
Entailment-based Metrics. Entailment-based metrics are a class of methods designed to evaluate the factual consistency of text generation tasks, such as summarization and question answering, by leveraging the principles of Natural Language Inference (NLI). In an NLI task, a model determines the relationship between a given text, known as the premise, and another text, the hypothesis, classifying their relationship into one of three categories: entailment, where the hypothesis is fully supported by the premise and the information remains consistent; contradiction, where the hypothesis conflicts with the premise and introduces contradictory details; or neutral, where the hypothesis neither aligns with nor directly opposes the premise [123]. By applying these inference-based techniques, entailment-based metrics help detect factual inconsistencies in generated text, ensuring that outputs remain faithful to the original information.
Dependency Arc Entailment (DAE) [124] is an entailment-based metric used to better evaluate the factual consistency of text generation. DAE analyzes the dependency arcs of the generated text to determine whether the semantic relationship of each arc is implied by the original text, enabling the more accurate detection of factual errors. The model initially utilizes a pre-trained language model like BERT or ELECTRA to encode the input text and generated text. It then extracts the dependency arc representation and predicts whether it is consistent with the original text through binary classification (entailed/non-entailed). Finally, the factuality score of the generated text h is calculated by averaging the entailment scores of all dependency arcs, with the formula as follows:
F(h, x) = \frac{1}{|d(h)|} \sum_{a \in d(h)} F_a(a, x)
where d(h) represents the set of dependency arcs in the generated text h, and F_a(a, x) denotes the score indicating whether dependency arc a is entailed by the original text x (1 indicates entailed, 0 indicates not entailed).

4.3. Discussion Analysis

When evaluating models that combine LLMs with GNNs, the choice of test environment and test data is crucial to ensure the reliability and fairness of the results. In this section, the test environments and test data of existing models are summarized, and their comparison is shown in Table 4. Because different studies use different test datasets covering multiple graph mining tasks, such as node classification, link prediction, and graph classification, we group the results by downstream task type and, where possible, select a common dataset to facilitate comparison. Given the variety of graph-level tasks, which makes comparative analysis difficult, only node classification and recommendation tasks are presented here. In addition, to ensure the reproducibility of the results, we only selected open-source datasets. The table lists the hardware configuration (e.g., GPU model and number) used for each model, the specific test datasets, and the best-performing large language model and graph neural network in the test. This information helps readers understand each model's computational efficiency and applicability and also provides a reliable reference for subsequent performance comparison and technology validation under different experimental conditions.
During the compilation process, we observed that all model evaluation metrics (such as accuracy and recall) improved over the respective baselines, demonstrating that combining LLMs and GNNs offers notable advantages. Specifically, integrating the strong language understanding and generation capabilities of LLMs with the graph-structured data processing strengths of GNNs yields a significant performance boost, and the combined models clearly outperform models that use an LLM or a GNN alone. For example, in node classification on the ogbn-arxiv dataset with accuracy as the metric and RevGAT as the GNN backbone, the best-performing TAPE variant, TAPE-GNN-h_TAPE [59], achieves a relative improvement of 6.67%, OFA-llama2-13b [64] an improvement of 3.85%, and GLEM-GNN [100] an improvement of 2.95%. We reproduced these experiments, and the results were consistent with the conclusions of the original papers, with numerical differences and variances within 0.1%. This shows that the experimental results are highly reproducible and further verifies the robustness of the related methods. This cross-model synergy not only improves the understanding of complex graph data but also enhances performance in tasks such as node classification, showing great potential for joint applications. However, it also introduces challenges, including increased computational cost and scalability concerns, and future research should carefully address issues such as computational cost and model interpretability to fully realize it.
However, while we observe performance improvements across models for specific tasks, the overall improvement remains modest. For example, the recently proposed TAPE [59] model improves by only 1.6% over the strongest baseline. This may stem from the fact that current methods of combining LLMs and GNNs are relatively shallow, amounting more to “stitching” than to true deep fusion. Such simple combinations fail to fully exploit the respective strengths of the two models and do not achieve genuine complementarity in feature extraction and representation learning. As a result, despite some degree of performance improvement, the expected substantial optimization is not achieved. This suggests that future research must explore more sophisticated and efficient integration strategies to achieve deep synergy between LLMs and GNNs, thereby driving further improvement in model performance.

5. Future Direction

Based on the above analysis, it is clear that there are still many directions in this research area that have yet to be fully explored and deeply understood. Therefore, this section will further analyze these issues, focusing on the drawbacks and potential research opportunities in the current study, with a view to providing new ideas and insights for future academic exploration.

5.1. Multimodal Graph Data Processing

In graph data, nodes may be enriched with information in multiple modalities, such as text, images, and videos. These modalities may contain rich information, so understanding these multimodal data can help improve graph learning. A number of recent studies have explored the ability of LLMs to process and integrate multimodal data, and these studies have shown that LLMs exhibit significant capabilities in this area [125,126], which makes it possible to apply LLMs to multimodal graph data. Future research will focus on exploring how to design a unified model to jointly encode data in different modalities such as graphs, text, and images. This will be applied to areas such as social network analysis, product recommendation, and molecular modeling to enhance the performance of models in complex, multimodal scenarios.

5.2. Addressing the Hallucination Problem in Large Language Models

While LLMs have shown a remarkable ability to generate text, they are prone to hallucinations and misinformation because they tend to generate answers in a single pass and lack the ability to adjust dynamically. Hallucination means that the information generated by the model appears reasonable but is actually inaccurate, deviating from the user input, the context, or even the facts [127]. In sophisticated specialized fields, such misinformation is unacceptable. Therefore, future research should focus on mitigating the hallucination problem and reducing the generation of misinformation. On the one hand, researchers have proposed the Retrieval-Augmented Generation (RAG) method [128], which retrieves relevant information from an external knowledge base before generating an answer and then combines it with the generation capability of a large language model, significantly reducing hallucinations and improving answer accuracy. On the other hand, exploiting graph data also provides a new approach to this problem: for example, by incorporating external knowledge graphs, a large language model can reason step by step when generating answers and consult reliable structured data sources to verify the accuracy of the information. Furthermore, multi-hop reasoning and dynamic knowledge retrieval mechanisms enable the model to continuously adjust and correct its output according to the context, providing more accurate and trustworthy answers. By employing these strategies, the model becomes more stable and reliable, especially in application scenarios that require high accuracy.
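A minimal RAG sketch under simplified assumptions is shown below: passages are ranked by embedding cosine similarity, and the top-ranked passages are prepended to the prompt. Random vectors stand in for real embeddings, and the final LLM call is left to the reader's client; none of the names here correspond to a specific system from the surveyed literature.

# Minimal retrieval-augmented generation (RAG) sketch.
import numpy as np

def retrieve(query_emb, doc_embs, docs, k=2):
    # cosine similarity between the query embedding and every document embedding
    sims = (doc_embs @ query_emb) / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def build_rag_prompt(question, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer the question using only the context below. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

docs = ["GCNs use spectral convolutions on graphs.",
        "GraphSAGE samples and aggregates neighbor features.",
        "BLEU measures n-gram overlap with a reference."]
rng = np.random.default_rng(0)
doc_embs, query_emb = rng.normal(size=(3, 64)), rng.normal(size=64)  # stand-in embeddings

passages = retrieve(query_emb, doc_embs, docs)
print(build_rag_prompt("How does GraphSAGE work?", passages))  # send this prompt to the LLM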

5.3. Enhancing the Capability to Solve Complex Graph Tasks

Currently, LLMs are primarily applied to basic graph tasks such as node classification and link prediction, but the remarkable capabilities that LLMs demonstrate in various areas suggest that their potential on graph data extends beyond these tasks. As a result, a growing body of research has explored their application to more complex graph tasks, such as graph generation [129], question answering over knowledge graphs [130], and knowledge graph construction [131]. LLMs can be used to generate novel molecular structures, analyze complex relationship patterns in social networks, or assist in constructing more contextually connected knowledge graphs. Solving these complex tasks will drive the further development of LLMs in a variety of fields, including biomedicine, social network analysis, and natural language processing.

6. Conclusions

In recent years, significant progress has been made in the application of LLMs in the field of graph mining. This study aims to provide an overview, summarize the research in this area, and provide potential directions for future research. We propose a new taxonomy based on different driving modes: GNN-driving-LLM, LLM-driving-GNN, and GNN-LLM-co-driving. Each mode exhibits unique advantages and application potentials, especially when dealing with complex graph structures and textual information. The combination of LLMs and GNNs has brought new opportunities to graph mining. The semantic understanding capability of LLMs complements the structural information processing capability of GNNs, significantly improving the effectiveness of graph mining tasks. Despite the many opportunities presented by the combination of GNNs and LLMs, their high computational demands and model complexity remain challenges. Future research should explore optimizing the integration model of GNNs and LLMs to achieve more powerful graph learning and reasoning capabilities while ensuring computational efficiency, thus advancing the field of graph mining.

Author Contributions

Conceptualization, Z.L. and Y.Y.; formal analysis, Y.Y.; writing—original draft preparation, Y.Y. and Z.L.; writing—review and editing, Z.L., Y.Z. and Y.Y.; visualization, Y.Y.; investigation, X.W. and Y.Z.; supervision, Z.L. and W.A. All authors have read and agreed to the published version of the manuscript.

Funding

Sichuan Provincial Natural Science Foundation (No. 2024NSFSC0517).

Institutional Review Board Statement

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GdL: GNN-driving-LLM
LdG: LLM-driving-GNN
GLcd: GNN-LLM-co-driving
GNN: Graph Neural Network
GCN: Graph Convolutional Network
GAT: Graph Attention Network
HAN: Heterogeneous Graph Attention Network
HetGNN: Heterogeneous Graph Neural Network
LLM: Large Language Model
NLP: Natural Language Processing
GFMs: Graph Foundation Models
TAGs: Text-Attributed Graphs
LM: Language Model
GDL: Graph Description Language
CoT: Chain-of-Thought
KG: Knowledge Graph
NMLM: Network Contextualized Masked Language Modeling
MNP: Masked Node Prediction
EM: Expectation Maximization
TAHGs: Text-Attributed Heterogeneous Graphs

References

  1. Nguyen, V.-H.; Sugiyama, K.; Nakov, P.; Kan, M.-Y. Fang: Leveraging social context for fake news detection using graph representation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 19–23 October 2020. [Google Scholar] [CrossRef]
  2. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  3. Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710. [Google Scholar]
  4. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. arXiv 2016, arXiv:1607.00653. [Google Scholar]
  5. Narayanan, A.; Chandramohan, M.; Venkatesan, R.; Chen, L.; Liu, Y.; Jaiswal, S. graph2vec: Learning distributed representations of graphs. arXiv 2017, arXiv:1707.05005. [Google Scholar]
  6. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2009, 20, 61–80. [Google Scholar]
  7. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2017, arXiv:1609.02907. [Google Scholar]
  8. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. arXiv 2018, arXiv:1710.10903. [Google Scholar]
  9. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive representation learning on large graphs. arXiv 2018, arXiv:1706.02216. [Google Scholar]
  10. Wang, X.; Ji, H.; Shi, C.; Wang, B.; Cui, P.; Yu, P.; Ye, Y. Heterogeneous graph attention network. arXiv 2021, arXiv:1903.07293. [Google Scholar]
  11. Zhang, C.; Song, D.; Huang, C.; Swami, A.; Chawla, N.V. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 793–803. [Google Scholar] [CrossRef]
  12. Manning, C.D. Human Language Understanding & Reasoning. Daedalus 2022, 151, 127–138. [Google Scholar] [CrossRef]
  13. Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A.H.; Riedel, S. Language models as knowledge bases? arXiv 2019, arXiv:1909.01066. [Google Scholar]
  14. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  15. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  16. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. arXiv 2021, arXiv:2102.12092. [Google Scholar]
  17. Zhang, Z.; Li, H.; Zhang, Z.; Qin, Y.; Wang, X.; Zhu, W. Graph meets llms: Towards large graph models. arXiv 2023, arXiv:2308.14522. [Google Scholar]
  18. Chen, Z.; Mao, H.; Li, H.; Jin, W.; Wen, H.; Wei, X.; Wang, S.; Yin, D.; Fan, W.; Liu, H.; et al. Exploring the potential of large language models (llms) in learning on graphs. arXiv 2024, arXiv:2307.03393. [Google Scholar] [CrossRef]
  19. Liu, J.; Yang, C.; Lu, Z.; Chen, J.; Li, Y.; Zhang, M.; Bai, T.; Fang, Y.; Sun, L.; Yu, P.S.; et al. Towards graph foundation models: A survey and beyond. arXiv 2024, arXiv:2310.11829. [Google Scholar]
  20. Yang, L.; Zheng, J.; Wang, H.; Liu, Z.; Huang, Z.; Hong, S.; Zhang, W.; Cui, B. Individual and structural graph information bottlenecks for out-of-distribution generalization. arXiv 2023, arXiv:2306.15902. [Google Scholar] [CrossRef]
  21. Korshunov, A.; Beloborodov, I.; Buzun, N.; Avanesov, V.; Kuznetsov, S. Social network analysis: Methods and applications. Proc. Inst. Syst. Program. RAS 2014, 26, 439–456. [Google Scholar] [CrossRef]
  22. Isinkaye, F.O.; Folajimi, Y.O.; Ojokoh, B. Recommendation systems: Principles, methods and evaluation. Egypt. Inform. J. 2015, 16, 261–273. [Google Scholar] [CrossRef]
  23. Yan, X.; Han, J. gSpan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, 9–12 December 2002; pp. 721–724. [Google Scholar]
  24. Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Yu, P.S. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 494–514. [Google Scholar] [CrossRef]
  25. Ding, Y.; Yan, S.; Zhang, Y.; Dai, W.; Dong, L. Predicting the attributes of social network users using a graph-based machine learning method. Comput. Commun. 2016, 73, 3–11. [Google Scholar] [CrossRef]
  26. You, R.; Yao, S.; Mamitsuka, H.; Zhu, S. Deepgraphgo: Graph neural network for large-scale, multispecies protein function prediction. Bioinformatics 2021, 37, i262–i271. [Google Scholar] [CrossRef] [PubMed]
  27. Liben-Nowell, D.; Kleinberg, J. The link prediction problem for social networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, New Orleans, LA, USA, 3–8 November 2003; pp. 556–559. [Google Scholar]
  28. Backstrom, L.; Leskovec, J. Supervised random walks: Predicting and recommending links in social networks. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, 9–12 February 2011; pp. 635–644. [Google Scholar]
  29. Zhang, M.; Chen, Y. Link prediction based on graph neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 5171–5181. [Google Scholar]
  30. Schaeffer, S.E. Graph clustering. Comput. Sci. Rev. 2007, 1, 27–64. [Google Scholar] [CrossRef]
  31. Conte, D.; Foggia, P.; Sansone, C.; Vento, M. Thirty years of graph matching in pattern recognition. Int. J. Pattern Recognit. Artif. Intell. 2004, 18, 265–298. [Google Scholar] [CrossRef]
  32. Guo, L.; Dai, Q. Graph clustering via variational graph embedding. Pattern Recognit. 2022, 122, 108334. [Google Scholar] [CrossRef]
  33. Wu, H.; Liang, B.; Chen, Z.; Zhang, H. Multisimnenc: A network representation learning-based module identification method by network embedding and clustering. Comput. Biol. Med. 2023, 156, 106703. [Google Scholar] [CrossRef]
  34. Tian, Y.; Mceachin, R.C.; Santos, C.; States, D.J.; Patel, J.M. Saga: A subgraph matching tool for biological graphs. Bioinformatics 2007, 23, 232–239. [Google Scholar] [CrossRef]
  35. Fortunato, S. Community detection in graphs. Phys. Rep. 2010, 486, 75–174. [Google Scholar] [CrossRef]
  36. Girvan, M.; Newman, M.E. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 2002, 99, 7821–7826. [Google Scholar] [CrossRef]
  37. Blondel, V.D.; Guillaume, J.-L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008. [Google Scholar] [CrossRef]
  38. Bedi, P.; Sharma, C. Community detection in social networks. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2016, 6, 115–135. [Google Scholar]
  39. Inokuchi, A.; Washio, T.; Motoda, H. An apriori-based algorithm for mining frequent substructures from graph data. In Principles of Data Mining and Knowledge Discovery: 4th European Conference, Lyon, France, 13–16 September 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 13–23. [Google Scholar]
  40. Borgelt, C.; Berthold, M.R. Mining molecular fragments: Finding relevant substructures of molecules. In Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, 9–12 December 2002; pp. 51–58. [Google Scholar]
  41. Bronstein, M.M.; Bruna, J.; Cohen, T.; Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv 2021, arXiv:2104.13478. [Google Scholar]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  43. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. 2018. [Google Scholar]
  44. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  45. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; et al. Gpt-4 technical report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  46. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  47. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv 2020, arXiv:1906.08237. [Google Scholar]
  48. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
  49. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  50. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. arXiv 2020, arXiv:2002.05709. [Google Scholar]
  51. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
  52. McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013; pp. 165–172. [Google Scholar]
  53. Granovetter, M.S. The strength of weak ties. Am. J. Sociol. 1973, 78, 1360–1380. [Google Scholar]
  54. Yao, L.; Mao, C.; Luo, Y. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 7370–7377. [Google Scholar]
  55. Price, D.J.D.S. Networks of scientific papers: The pattern of bibliographic references indicates the nature of the scientific research front. Science 1965, 149, 510–515. [Google Scholar] [PubMed]
  56. Yang, Z.; Cohen, W.W.; Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. arXiv 2016, arXiv:1603.08861. [Google Scholar]
  57. Li, C.; Pang, B.; Liu, Y.; Sun, H.; Liu, Z.; Xie, X.; Yang, T.; Cui, Y.; Zhang, L.; Zhang, Q. Adsgnn: Behavior-graph augmented relevance modeling in sponsored search. arXiv 2021, arXiv:2104.12080. [Google Scholar]
  58. Zhu, J.; Cui, Y.; Liu, Y.; Sun, H.; Li, X.; Pelger, M.; Yang, T.; Zhang, L.; Zhang, R.; Zhao, H. Textgnn: Improving text encoder via graph neural network in sponsored search. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021. [Google Scholar] [CrossRef]
  59. He, X.; Bresson, X.; Laurent, T.; Perold, A.; LeCun, Y.; Hooi, B. Harnessing explanations: Llm-to-lm interpreter for enhanced text-attributed graph representation learning. arXiv 2024, arXiv:2305.19523. [Google Scholar]
  60. Wei, W.; Ren, X.; Tang, J.; Wang, Q.; Su, L.; Cheng, S.; Wang, J.; Yin, D.; Huang, C. Llmrec: Large language models with graph augmentation for recommendation. arXiv 2024, arXiv:2311.00423. [Google Scholar]
  61. Ren, X.; Wei, W.; Xia, L.; Su, L.; Cheng, S.; Wang, J.; Yin, D.; Huang, C. Representation learning with large language models for recommendation. In Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024. [Google Scholar] [CrossRef]
  62. Huang, Q.; Ren, H.; Chen, P.; Kržmanc, G.; Zeng, D.; Liang, P.; Leskovec, J. Prodigy: Enabling in-context learning over graphs. arXiv 2023, arXiv:2305.12600. [Google Scholar]
  63. Sun, X.; Cheng, H.; Li, J.; Liu, B.; Guan, J. All in one: Multi-task prompting for graph neural networks. arXiv 2023, arXiv:2307.01504. [Google Scholar]
  64. Liu, H.; Feng, J.; Kong, L.; Liang, N.; Tao, D.; Chen, Y.; Zhang, M. One for all: Towards training one graph model for all classification tasks. arXiv 2024, arXiv:2310.00149. [Google Scholar]
  65. Xie, H.; Zheng, D.; Ma, J.; Zhang, H.; Ioannidis, V.N.; Song, X.; Ping, Q.; Wang, S.; Yang, C.; Xu, Y.; et al. Graph-aware language model pre-training on a large graph corpus can help multiple graph applications. arXiv 2023, arXiv:2306.02592. [Google Scholar]
  66. Xue, R.; Shen, X.; Yu, R.; Liu, X. Efficient large language models fine-tuning on graphs. arXiv 2023, arXiv:2312.04737. [Google Scholar]
  67. Guo, Z.; Xia, L.; Yu, Y.; Wang, Y.; Yang, Z.; Wei, W.; Pang, L.; Chua, T.-S.; Huang, C. Graphedit: Large language models for graph structure learning. arXiv 2024, arXiv:2402.15183. [Google Scholar]
  68. Chen, Z.; Mao, H.; Wen, H.; Han, H.; Jin, W.; Zhang, H.; Liu, H.; Tang, J. Label-free node classification on graphs with large language models (llms). arXiv 2024, arXiv:2310.04668. [Google Scholar]
  69. Yu, J.; Ren, Y.; Gong, C.; Tan, J.; Li, X.; Zhang, X. Leveraging large language models for node generation in few-shot learning on text-attributed graphs. arXiv 2024, arXiv:2310.09872. [Google Scholar]
  70. Xia, L.; Kao, B.; Huang, C. Opengraph: Towards open graph foundation models. arXiv 2024, arXiv:2403.01121. [Google Scholar]
  71. Liu, J.; Liu, C.; Zhou, P.; Lv, R.; Zhou, K.; Zhang, Y. Is chatgpt a good recommender? A preliminary study. arXiv 2023, arXiv:2304.10149. [Google Scholar]
  72. Creswell, A.; Shanahan, M.; Higgins, I. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv 2022, arXiv:2205.09712. [Google Scholar]
  73. Ji, Y.; Gong, Y.; Peng, Y.; Ni, C.; Sun, P.; Pan, D.; Ma, B.; Li, X. Exploring chatgpt’s ability to rank content: A preliminary study on consistency with human preferences. arXiv 2023, arXiv:2303.07610. [Google Scholar]
  74. Wang, H.; Feng, S.; He, T.; Tan, Z.; Han, X.; Tsvetkov, Y. Can language models solve graph problems in natural language? arXiv 2024, arXiv:2305.10037. [Google Scholar]
  75. Huang, J.; Zhang, X.; Mei, Q.; Ma, J. Can llms effectively leverage graph structural information through prompts, and why? arXiv 2024, arXiv:2309.16595. [Google Scholar]
  76. Hu, Y.; Zhang, Z.; Zhao, L. Beyond text: A deep dive into large language models’ ability on understanding graph data. arXiv 2023, arXiv:2310.04944. [Google Scholar]
  77. Fatemi, B.; Halcrow, J.; Perozzi, B. Talk like a graph: Encoding graphs for large language models. arXiv 2023, arXiv:2310.04560. [Google Scholar]
  78. Guo, J.; Du, L.; Liu, H.; Zhou, M.; He, X.; Han, S. Gpt4graph: Can large language models understand graph structured data ? an empirical evaluation and benchmarking. arXiv 2023, arXiv:2305.15066. [Google Scholar]
  79. Brandes, U.; Eiglsperger, M.; Lerner, J.; Pich, C. Graph markup language (GraphML). 2013. [Google Scholar]
  80. Zhao, J.; Zhuo, L.; Shen, Y.; Qu, M.; Liu, K.; Bronstein, M.; Zhu, Z.; Tang, J. Graphtext: Graph reasoning in text space. arXiv 2023, arXiv:2310.01089. [Google Scholar]
  81. Liu, C.; Wu, B. Evaluating large language models on graphs: Performance insights and comparative analysis. arXiv 2023, arXiv:2308.11224. [Google Scholar]
  82. Tang, J.; Yang, Y.; Wei, W.; Shi, L.; Su, L.; Cheng, S.; Yin, D.; Huang, C. Graphgpt: Graph instruction tuning for large language models. arXiv 2024, arXiv:2310.13023. [Google Scholar]
  83. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  84. Cao, H.; Liu, Z.; Lu, X.; Yao, Y.; Li, Y. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv 2023, arXiv:2311.16208. [Google Scholar]
  85. Qin, Y.; Wang, X.; Zhang, Z.; Zhu, W. Disentangled representation learning with large language models for text-attributed graphs. arXiv 2024, arXiv:2310.18152. [Google Scholar]
  86. Zhang, M.; Sun, M.; Wang, P.; Fan, S.; Mo, Y.; Xu, X.; Liu, H.; Yang, C.; Shi, C. Graphtranslator: Aligning graph model to large language model for open-ended tasks. arXiv 2024, arXiv:2402.07197. [Google Scholar]
  87. Tang, J.; Yang, Y.; Wei, W.; Shi, L.; Xia, L.; Yin, D.; Huang, C. Higpt: Heterogeneous graph language model. arXiv 2024, arXiv:2402.16024. [Google Scholar]
  88. Zhao, H.; Liu, S.; Ma, C.; Xu, H.; Fu, J.; Deng, Z.-H.; Kong, L.; Liu, Q. Gimlet: A unified graph-text model for instruction-based molecule zero-shot learning. arXiv 2023, arXiv:2306.13089. [Google Scholar]
  89. Tan, Y.; Lv, H.; Huang, X.; Zhang, J.; Wang, S.; Yang, C. Musegraph: Graph-oriented instruction tuning of large language models for generic graph mining. arXiv 2024, arXiv:2403.04780. [Google Scholar]
  90. Ye, R.; Zhang, C.; Wang, R.; Xu, S.; Zhang, Y. Language is all a graph needs. arXiv 2024, arXiv:2308.07134. [Google Scholar]
  91. Chai, Z.; Zhang, T.; Wu, L.; Han, K.; Hu, X.; Huang, X.; Yang, Y. Graphllm: Boosting graph reasoning ability of large language model. arXiv 2023, arXiv:2310.05845. [Google Scholar]
  92. Zhang, J. Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt. arXiv 2023, arXiv:2304.11116. [Google Scholar]
  93. Wang, B.; Komatsuzaki, A. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. Available online: https://github.com/kingoflolz/mesh-transformer-jax (accessed on 5 October 2024).
  94. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  95. Wang, H.; Gao, Y.; Zheng, X.; Zhang, P.; Chen, H.; Bu, J.; Yu, P.S. Graph neural architecture search with gpt-4. arXiv 2024, arXiv:2310.01436. [Google Scholar]
  96. Luo, L.; Ju, J.; Xiong, B.; Li, Y.-F.; Haffari, G.; Pan, S. Chatrule: Mining logical rules with large language models for knowledge graph reasoning. arXiv 2024, arXiv:2309.01538. [Google Scholar]
  97. Tian, Y.; Song, H.; Wang, Z.; Wang, H.; Hu, Z.; Wang, F.; Chawla, N.V.; Xu, P. Graph neural prompting with large language models. arXiv 2023, arXiv:2309.15427. [Google Scholar]
  98. Yang, J.; Liu, Z.; Xiao, S.; Li, C.; Lian, D.; Agrawal, S.; Singh, A.; Sun, G.; Xie, X. Graphformers: Gnn-nested transformers for representation learning on textual graph. arXiv 2023, arXiv:2105.02605. [Google Scholar]
  99. Jin, B.; Zhang, W.; Zhang, Y.; Meng, Y.; Zhang, X.; Zhu, Q.; Han, J. Patton: Language model pretraining on text-rich networks. arXiv 2023, arXiv:2305.12268. [Google Scholar]
  100. Zhao, J.; Qu, M.; Li, C.; Yan, H.; Liu, Q.; Li, R.; Xie, X.; Tang, J. Learning on large-scale text-attributed graphs via variational inference. arXiv 2023, arXiv:2210.14709. [Google Scholar]
  101. Zhang, X.; Bosselut, A.; Yasunaga, M.; Ren, H.; Liang, P.; Manning, C.D.; Leskovec, J. Greaselm: Graph reasoning enhanced language models for question answering. arXiv 2022, arXiv:2201.08860. [Google Scholar]
  102. Edwards, C.; Zhai, C.; Ji, H. Text2Mol: Cross-modal molecule retrieval with natural language queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual Event, 7–11 November 2021; pp. 595–607. [Google Scholar]
  103. Beltagy, I.; Lo, K.; Cohan, A. Scibert: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
  104. Liu, S.; Nie, W.; Wang, C.; Lu, J.; Qiao, Z.; Liu, L.; Tang, J.; Xiao, C.; Anandkumar, A. Multi-modal molecule structure-text model for text-based retrieval and editing. arXiv 2024, arXiv:2212.10789. [Google Scholar]
  105. Seidl, P.; Vall, A.; Hochreiter, S.; Klambauer, G. Enhancing activity prediction models in drug discovery with the ability to understand human language. arXiv 2023, arXiv:2303.03363. [Google Scholar]
  106. Brannon, W.; Kang, W.; Fulay, S.; Jiang, H.; Roy, B.; Roy, D.; Kabbara, J. Congrat: Self-supervised contrastive pretraining for joint graph and text embeddings. arXiv 2024, arXiv:2305.14321. [Google Scholar]
  107. Wen, Z.; Fang, Y. Prompt tuning on graph-augmented low-resource text classification. arXiv 2024, arXiv:2307.10230. [Google Scholar]
  108. Li, Y.; Ding, K.; Lee, K. Grenade: Graph-centric language model for self-supervised representation learning on text-attributed graphs. arXiv 2023, arXiv:2310.15109. [Google Scholar]
  109. You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; Shen, Y. Graph contrastive learning with augmentations. arXiv 2021, arXiv:2010.13902. [Google Scholar]
  110. Mavromatis, C.; Ioannidis, V.N.; Wang, S.; Zheng, D.; Adeshina, S.; Ma, J.; Zhao, H.; Faloutsos, C.; Karypis, G. Train your own gnn teacher: Graph-aware distillation on textual graphs. arXiv 2023, arXiv:2304.10668. [Google Scholar]
  111. Zou, T.; Yu, L.; Huang, Y.; Sun, L.; Du, B. Pretraining language models with text-attributed heterogeneous graphs. arXiv 2023, arXiv:2310.12580. [Google Scholar]
  112. Yu, L.; Sun, L.; Du, B.; Liu, C.; Lv, W.; Xiong, H. Heterogeneous graph representation learning with relation awareness. IEEE Trans. Knowl. Data Eng. 2022, 35, 5935–5947. [Google Scholar] [CrossRef]
  113. Huang, X.; Han, K.; Yang, Y.; Bao, D.; Tao, Q.; Chai, Z.; Zhu, Q. Can gnn be good adapter for llms? arXiv 2024, arXiv:2402.12984. [Google Scholar]
  114. Zhu, Y.; Wang, Y.; Shi, H.; Tang, S. Efficient tuning and inference for large language models on textual graphs. arXiv 2024, arXiv:2401.15569. [Google Scholar]
  115. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  116. Levenshtein, V.I. Binary codes capable of correcting spurious insertions and deletions of ones. Probl. Inf. Transm. 1965, 1, 8–17. [Google Scholar]
  117. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  118. Shen, C.; Cheng, L.; Nguyen, X.-P.; You, Y.; Bing, L. Large language models are not yet human-level evaluators for abstractive summarization. arXiv 2023, arXiv:2305.13091. [Google Scholar]
  119. Li, W.; Xiao, X.; Liu, J.; Wu, H.; Wang, H.; Du, J. Leveraging graph to improve abstractive multi-document summarization. arXiv 2020, arXiv:2005.10043. [Google Scholar]
  120. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv 2023, arXiv:2303.16634. [Google Scholar]
  121. Wang, P.; Li, L.; Chen, L.; Cai, Z.; Zhu, D.; Lin, B.; Cao, Y.; Liu, Q.; Liu, T.; Sui, Z. Large language models are not fair evaluators. arXiv 2023, arXiv:2305.17926. [Google Scholar]
  122. Vasilyev, O.; Dharnidharka, V.; Bohannon, J. Fill in the blanc: Human-free quality estimation of document summaries. arXiv 2020, arXiv:2002.09836. [Google Scholar]
  123. Laban, P.; Schnabel, T.; Bennett, P.N.; Hearst, M.A. Summac: Re-visiting nli-based models for inconsistency detection in summarization. Trans. Assoc. Comput. Linguist. 2022, 10, 163–177. [Google Scholar]
  124. Goyal, T.; Durrett, G. Evaluating factuality in generation with dependency-level entailment. arXiv 2020, arXiv:2010.05478. [Google Scholar]
  125. Wu, S.; Fei, H.; Qu, L.; Ji, W.; Chua, T.-S. Next-gpt: Any-to-any multimodal llm. arXiv 2024, arXiv:2309.05519. [Google Scholar]
  126. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
  127. Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv 2023, arXiv:2309.01219. [Google Scholar]
  128. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv 2021, arXiv:2005.11401. [Google Scholar]
  129. Yao, Y.; Wang, X.; Zhang, Z.; Qin, Y.; Zhang, Z.; Chu, X.; Yang, Y.; Zhu, W.; Mei, H. Exploring the potential of large language models in graph generation. arXiv 2024, arXiv:2403.14358. [Google Scholar]
  130. Huang, X.; Zhang, J.; Li, D.; Li, P. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, VIC, Australia, 11–15 February 2019; pp. 105–113. [Google Scholar]
  131. Peng, C.; Xia, F.; Naseriparsa, M.; Osborne, F. Knowledge graphs: Opportunities and challenges. arXiv 2023, arXiv:2303.13948. [Google Scholar]
Figure 1. Based on the relationship between LLMs and GNNs, we categorize the application scenarios of LLMs in graph mining into three groups: GNN-driving-LLM, LLM-driving-GNN, and GNN-LLM-co-driving. These models are capable of handling a variety of datasets and achieving success in various downstream tasks.
Figure 2. The proposed taxonomy on LLM-GNN-combined techniques.
Figure 3. GNN-driving-LLM: (a) LLMs generate additional information for the textual attributes of the nodes. These features extracted by LLMs are fed into the next layer of the language model to generate enhanced node embeddings, which are finally processed by GNNs for downstream tasks; (b) the textual embeddings exported by LLMs can be directly used as the initial node embeddings for GNNs.
Figure 4. LLM-driving-GNN: Models use spreading functions or GNNs to transform graph data into sequence text so that LLMs can directly understand the graph data and accomplish downstream tasks.
Figure 5. (a) Text encoding and graph aggregation are iteratively executed through hierarchical graph neural networks (GNNs) and transformers (TRMs). (b) Data from these two different modalities are projected into an aligned semantic embedding space, where attention mechanisms are used to learn the correlation rules between molecular substructures and textual keywords.
Figure 6. The citation network uses directed edges to represent the citation relationships between articles. Green, blue, and red nodes represent studies categorized as “GdL”, “LdG”, and “GLcd”, respectively. The size of the nodes reflects the number of citations of each paper.
Table 1. The following table summarizes the significance of embeddings for both LLMs and GNNs in the reviewed works. It highlights their impact on model performance and the interaction mechanisms used to align textual and structural representations.
Method Category | LLM Embedding Role | GNN Embedding Role | Performance Impact | Interoperability Mechanism
GNN-driving-LLM | Generates semantically enhanced text embeddings, such as node descriptions and knowledge entities, to supplement the semantic information missing from the original features. | Aggregates structural information through message passing and preserves topological characteristics. | There are varying degrees of improvement in different downstream tasks. For instance, the accuracy of node classification tasks has increased by up to 32.1% [66], while the ROC-AUC of link prediction has seen an increase of up to 4.9% [65]. | Achieves dimensional compatibility between text embeddings and structural embeddings by jointly aligning the embedding space, for example through cross-modal pooling.
LLM-driving-GNN | Directly outputs task prediction results (such as node labels and relationship reasoning), avoiding the limitations of traditional GNN classifiers. | Acts only as a structural feature extractor and does not participate in the final decision. | Leads to a more comprehensive understanding and enhanced performance. For example, in the graph-to-text task, the performance gains of MuseGraph [89] over GNN-based baselines range from 1.02% (METEOR on WebNLG) to 57.89% (BLEU-4 on AGENDA). | Encodes the graph structure into a serialized input that LLMs can understand through prompt engineering, such as Graph Neural Prompting [97].
GNN-LLM-co-driving | Enhances semantic understanding and context information processing during graph structure learning by providing interpretable semantic representations and deep semantic features for graph data. | Generates constrained topological embeddings that provide rich structural information through deep modeling, and aggregates graph structures to help LLMs generate semantically deep feature representations for complex graphs. | Encoding and alignment become much more efficient. For example, G2P2 achieves 2.1∼18.8x speedups against baselines [107]. | Designs domain projectors to unify the embedding distributions, or lets GNNs and LLMs interact; in the latter approach, nodes exchange information at each layer to enable deep integration of structured and textual data.
Table 2. A summary of LLMs used in the field of graph mining, highlighting the model architecture, which includes the LLM model, the GNN model, whether the parameters of the LLM are fine-tuned, the predictor of the architecture, the types of datasets, and the downstream task types. In the “Task” column, “Node” denotes node-level tasks, “Link” denotes edge-level tasks, and “Graph” denotes graph-level tasks.
Category | Model | LLMs | GNNs | Finetune | Predictor | Dataset Type | Task
GNN-driving-LLMLLMRec [60]ChatGPTLightGCN×GNNGeneralRecommendation
RLMRec [61]ChatGPT, text-embedding-ada-002LightGCNGNNGeneralRecommendation
TAPE [59]ChatGPT, Llama2RevGAT×GNNTAGsNode
PRODIGY [62]RoBERTa, MPNetGCN, GATGNNTAGsNode
ALL-in-one [63]GPT-3.5, DeBERTaRevGATGNN/LLMTAGsNode
OFA [64]llama2, e5-large-v2R-GCN×GNNTAGsNode, Link, Graph
GaLM [65]BERTRGCN, RGATGNNheterogeneousNode, Link, edge classification
LEADING [66]BERT, DeBERTaGCN, GATGNNTAGsNode
AdsGNN [57]BERTGATGNNHeterogeneousRelevance Prediction, AD recommendation
TextGNN [58]BERTGATGNNHeterogeneousRelevance Prediction, AD recommendation
GraphEdit [67]Vicuna-v1.5GCNEdge PredictorTAGsNode, Graph, edge classification
LLM4NG [69]ChatGPTGCN, GAT×GNNTAGsNode
LLM-GNN [68]gpt-3.5-turbo-0613GCN×GNNTAGsNode
OpenGraph [70]GPT-4Scalable Graph Transformer×Graph TransformerTAGs, HeterogeneousNode, Link
LLM-driving-GNNFatemi et al. [77]PaLM/PaLM 2-×LLMSyntheticGraph
GPT4Graph [78]GPT-3-×LLMSynthetic, KGStructural and Semantic Understanding Tasks
GraphText [80]GPT-4-×LLMGeneral, TAGsNode
Liu et al. [81]GPT, CalderaAI/30B-Lazarus, et al.-×LLMSyntheticGraph Reasoning
GraphGPT [82]VicunaGraph TransformerLLMTAGsNode, Link, Graph Match
MoleculeSTM [84]Vicuna-7BGINLLMTAGsGraph
DGTL [85]LLama-2GCN, RGATLLMTAGsNode
graphtranslator [86]ChatGLM2-6BGraphSAGE×LLMTAGsNode, KG Question Answering
HiGPT [87]GPT-3.5HetGNN, HAN, HGT×LLMHeterogeneousNode, Graph, Relation Prediction
GIMLET [88]T5-LLMSyntheticMolecular Property Prediction
MuseGraph [89]GPT-4-LLMGeneralNode, Link
InstructGLM [90]T5, Llama-7b-LLMTAGsNode, Link
GraphLLM [91]LLaMA 2Graph TransformerLLMKGGraph Reasoning
Graph-ToolFormer [92]GPT-JGraph-Bert, SEG-BertLLMGeneral, Synthetic, TAGsGraph Reasoning
GPT4GNAS [95]GPT-4GCN, GAT et al.×LLMGeneralGraph Neural Architecture Search
ChatRule [96]ChatGPT-×LLMKGKG Reasoning
GNP [97]FLAN-T5GATLLMKGCommonsense and Biomedical Reasoning
GNN-LLM-co-drivingGraphFormers [98]UniLM-baseGNN ComponentsGNN, LLMTAGsLink
PATTON [99]BERT, SciBERTGraphFormersLLMHeterogeneousNode, Retrieval, Re-ranking for Link Prediction
GLEM [100]DeBERTaRevGATGNN, LLMTAGsNode
GREASELM [101]RoBERTa-Large, AristoRoBERTa, SapBERTGATGNN, LLMKGMultiple Choice Question Answering
Text2Mol [102]SciBERTGCNGNN, LLMGeneralCross-modal Information Retrieval
MoleculeSTM [104]SciBERTGINGNN, LLMTAGsRetrieval, Molecular Editing
CLAMP [105]BioBERT, Sentence-T5, KV-PLM, etc.GCN, GINGNN, LLMTAGsBioactivity Prediction
ConGraT [106]DistilGPT2, all-mpnet-base-v2GATGNN, LLMGeneralRepresentation Learning
G2P2 [107]BERTGCNGNN, LLMGeneralText Classification
GRENADE [108]BERTSAGE, RevGAT-KD, etc.GNN, PLMTAGsNode, Link
GraD [110]BERT, SciBERTGraphSAGELLMTAGs, HeterogeneousNode
THLM [111]BERTR-HGNNLLMTAHGsNode, Link
GraphAdapter [113]Llama 2, RoBERTa, GPT-2GraphSAGE×GNN, LLMTAGsNode
ENGINE [114]LLaMA2-7B, e5-largeGCN, SAGE, GAT×GNN, LLMTAGsNode
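Most GNN-driving-LLM entries in Table 2 share the same prediction pattern: an LLM (or text encoder) produces per-node features offline, and a GNN serves as the predictor over those features. The sketch below is a minimal illustration of that pattern, assuming PyTorch Geometric is available; the class `GNNPredictor`, the dimensions, and the random inputs are hypothetical and not drawn from any listed model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv   # assumes PyTorch Geometric is installed

class GNNPredictor(nn.Module):
    """Two-layer GCN classifier over LLM-derived node features (illustrative)."""
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)      # per-node class logits

# x: node features produced offline by an LLM/text encoder (one vector per node's
# text attribute); edge_index: graph connectivity in COO format.
x = torch.randn(100, 768)
edge_index = torch.randint(0, 100, (2, 400))
logits = GNNPredictor(in_dim=768, hidden_dim=128, num_classes=7)(x, edge_index)
```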
Table 3. Summary of mathematical information on selected models.
| Category | Models | Input | Output | Parameters | Loss Function |
| --- | --- | --- | --- | --- | --- |
| GdL | TAPE [59] | $\mathcal{G} = (\mathcal{V}, A, \{s_n\}_{n \in \mathcal{V}})$ | Labels $y$ | LM: 129 M; GNN: 1.8 M | $\mathcal{L} = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log(\hat{y}_{i,c})$ |
| GdL | LLMRec [60] | Interaction edges $\mathcal{E}^{+} \subseteq \mathcal{U} \times \mathcal{I}$; node information $F$ | $\hat{y}_{u,i} = e_u \cdot e_i$ | $64 \times (\lvert\mathcal{U}\rvert + \lvert\mathcal{I}\rvert)$ | $\mathcal{L}_{\mathrm{BPR}} = \sum_{(u,i^{+},i^{-}) \in \mathcal{E}^{+} \cup \mathcal{E}_{A}} -\log\sigma\big(\hat{y}_{u,i^{+}} - \hat{y}_{u,i^{-}}\big) + \frac{\omega}{2}\lVert\Theta\rVert^{2}$ |
| GdL | RLMRec [61] | User collection $\mathcal{U}$; text information $\mathcal{P}_u$; item collection $\mathcal{V}$; text information $\mathcal{P}_v$ | $\hat{y}_{u,v} = e_u \cdot e_v$ | $32 \times (\lvert\mathcal{U}\rvert + \lvert\mathcal{I}\rvert)$ | $\mathcal{L} = \sum_{(u,i^{+},i^{-}) \in \mathcal{T}} -\ln\sigma\big(e_u \cdot e_{i^{+}} - e_u \cdot e_{i^{-}}\big) + \lambda\lVert\Theta\rVert^{2}$; alignment term: $-\mathbb{E}_{(s_i,e_i)}\ln\frac{\exp\big(\mathrm{sim}(\sigma(s_i), e_i)\big)}{\sum_{s_j \in S}\exp\big(\mathrm{sim}(\sigma(s_j), e_i)\big)}$ |
| GdL | ALL-in-one [63] | $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, $\mathcal{G}_p = (\mathcal{P}, \mathcal{S})$ | $\hat{Y} = f_{\pi}(\psi(\mathcal{G}, \mathcal{G}_p))$ | Prompt graph: $\lvert\mathcal{P}\rvert \times d$ | $\mathcal{L}_{\mathrm{task}} = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log\hat{y}_{i,c}$; $\mathcal{L}_{\mathrm{Prompt}} = \sum_{\tau \in T} \mathcal{L}_{D_q^{\tau}}\big(f_{\theta^{\tau},\phi^{\tau}} \mid \pi\big)$ |
| GdL | TextGNN [58] | Query text $Q$; keyword text $K$; click graph $h_Q, h_K$ | $\hat{y} = \mathrm{sigmoid}\big(F([h_Q; h_K])\big)$ | Tens of millions | $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\big(y_i^{\mathrm{stu}} - y_i^{\mathrm{teacher}}\big)^{2}$ |
| LdG | Fatemi et al. [77] | $G = (V, E)$, query $Q$ | $A = f\big(g(G), q(Q)\big)$ | Thousands | $\mathcal{L} = \mathbb{E}_{(G,Q,S)\sim\mathcal{D}}\big[-\log\,\mathrm{score}\big(f(g(G), q(Q)), S\big)\big]$ |
| LdG | GPT4Graph [78] | $G = (V, E)$, query text $Q$ | Answer $A$ | – | – |
| LdG | GraphGPT [82] | $G(V, E, A, X)$ | Text sequence $X_O = [x_1, x_2, \ldots, x_L]$ | 1.3 B | $\mathcal{L}_{\mathrm{align}} = \sum_{i=1}^{3}\frac{1}{2}\lambda_i\big(\mathrm{CE}(\Gamma_i, y) + \mathrm{CE}(\Gamma_i^{\top}, y)\big)$; $\mathcal{L}_{\mathrm{gen}} = -\log p(X_O \mid X_G, X_I)$ |
| LdG | InstructGLM [90] | Task instruction $P$; graph structure description $I$; query $Q$ | Labels $y$ | Flan-T5-base: 220 M; Flan-T5-large: 770 M; Llama-7b: 7 B | $\mathcal{L}_{\theta} = -\sum_{j=1}^{\lvert y\rvert}\log P_{\theta}\big(y_j \mid x, y_{<j}\big)$ |
| GLcd | GREASELM [101] | Natural language context $c$; question $q$; candidate answers $\mathcal{A}$; knowledge graph $\mathcal{G}_{\mathrm{sub}}$ | $p(a \mid q, c)$ | LM: 355 M; GNN: 4 M; MInt: 0.3 M | $\mathcal{L} = -\sum_{a \in \mathcal{A}} y_a \log p(a \mid q, c)$ |
| GLcd | GLEM [100] | $\mathcal{G}_S = (\mathcal{V}, A, s_{\mathcal{V}})$ | LM: $q_{\theta}(y_n \mid s_n)$; GNN: $p_{\phi}(y_n \mid s_{\mathcal{V}}, A)$ | LM: 138 M, 405 M; GNN: 2.2 M | E-step: $\mathcal{O}(q) = \alpha\sum_{n \in U}\mathbb{E}_{p}[\log q] + (1-\alpha)\sum_{n \in L}\log q$; M-step: $\mathcal{O}(\phi) = \beta\sum_{n \in U}\log p + (1-\beta)\sum_{n \in L}\log p$ |
| GLcd | GraphFormers [98] | Node set $x$ and its text sequence; adjacency relationship $N_x$ | Central node embedding $h_x = H_x^{L}[0]$ | 125 M | $\mathcal{L} = -\log\frac{\exp(\langle h_q, h_k\rangle)}{\exp(\langle h_q, h_k\rangle) + \sum_{r \in R}\exp(\langle h_q, h_r\rangle)}$ |
| GLcd | Text2Mol [102] | Natural language description $T$; molecular structure | Similarity score $S(t, m) = \cos(t, m)$ | MLP: 110 M; GCN: 112 M; cross-modal attention model: 129 M | $\mathcal{L}(t, m) = \mathrm{CCE}\big(e^{\tau} t m^{\top}, I\big) + \mathrm{CCE}\big(e^{\tau} m t^{\top}, I\big)$ |
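As a concrete reading of Table 3, the sketch below implements the BPR-style ranking objective summarized for LLMRec: the score of an observed user–item pair is pushed above that of a sampled negative, with an L2 penalty on the parameters. This is a minimal PyTorch illustration with hypothetical names and hyperparameters, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, pos_emb, neg_emb, params, weight_decay=1e-4):
    """BPR objective as summarized in Table 3:
    sum of -log sigma(y_{u,i+} - y_{u,i-}) plus an L2 penalty on the parameters."""
    pos_scores = (user_emb * pos_emb).sum(dim=-1)   # y_{u,i+} = e_u . e_{i+}
    neg_scores = (user_emb * neg_emb).sum(dim=-1)   # y_{u,i-} = e_u . e_{i-}
    rank_loss = -F.logsigmoid(pos_scores - neg_scores).sum()
    reg = 0.5 * weight_decay * sum(p.pow(2).sum() for p in params)
    return rank_loss + reg

# Toy usage with 64-dimensional embeddings, matching the 64 x (|U| + |I|)
# parameter count reported for LLMRec in Table 3.
u = torch.randn(8, 64, requires_grad=True)
i_pos = torch.randn(8, 64, requires_grad=True)
i_neg = torch.randn(8, 64, requires_grad=True)
loss = bpr_loss(u, i_pos, i_neg, params=[u, i_pos, i_neg])
```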
Table 4. Summary of experimental setups on selected models.
| Task | Models | LLMs | GNNs | Environment | Datasets | Ratio * | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Node Classification | TAPE-GNN-$h_{\mathrm{TAPE}}$ [59] | DeBERTa-base | RevGAT | 4× Nvidia RTX A5000 24 GB GPUs | ogbn-arxiv | 77.50 ± 0.12 | https://github.com/XiaoxinHe/TAPE (accessed on 21 November 2024) |
| Node Classification | GLEM-GNN [100] | DeBERTa-base | RevGAT | 4× Nvidia RTX A5000 24 GB GPUs | ogbn-arxiv | 76.97 ± 0.19 | https://github.com/AndyJZhao/GLEM (accessed on 21 November 2024) |
| Node Classification | OFA-llama2-13b [64] | Llama2-13b | R-GCN | – | ogbn-arxiv | 77.51 ± 0.17 | https://github.com/LechengKong/OneForAll (accessed on 21 November 2024) |
| Node Classification | InstructGLM [90] | Llama-7b | – | 4× 40 GB Nvidia A100 GPUs | ogbn-arxiv | 75.70 ± 0.12 | https://github.com/Graphlet-AI/llm-graph-ai (accessed on 21 November 2024) |
| Node Classification | GraphGPT-7B-v1.5-stage2 [82] | Vicuna | Graph Transformer | 4× 40 GB Nvidia A100 GPUs | ogbn-arxiv | 75.11 | https://github.com/HKUDS/GraphGPT (accessed on 21 November 2024) |
| Node Classification | GRENADE [108] | BERT | RevGAT-KD | – | ogbn-arxiv | 76.21 ± 0.17 | https://github.com/bigheiniu/GRENADE (accessed on 21 November 2024) |
| Node Classification | GraDBERT [110] | BERT | GraphSAGE | – | ogbn-arxiv | 75.05 ± 0.11 | https://github.com/cmavro/GRAD (accessed on 21 November 2024) |
| Node Classification | GraphAdapter [113] | Llama 2 | GraphSAGE | Nvidia A800 80 GB GPU | ogbn-arxiv | 77.07 ± 0.15 | https://github.com/hxttkl/GraphAdapter (accessed on 21 November 2024) |
| Node Classification | ENGINE [114] | LLaMA2-7B | GCN, SAGE, GAT | 6× Nvidia RTX 3090 GPUs | ogbn-arxiv | 76.02 ± 0.29 | https://github.com/ZhuYun97/ENGINE (accessed on 21 November 2024) |
| Node Classification | PATTON [99] | BERT, SciBERT | GraphFormers | 4× Nvidia A6000 GPUs | Amazon-Sports | 78.60 ± 0.15 | https://github.com/PeterGriffinJin/Patton (accessed on 21 November 2024) |
| Node Classification | THLM [111] | BERT | R-HGNN | 4× Nvidia RTX 3090 GPUs | GoodReads | 81.57 | https://github.com/Hope-Rita/THLM (accessed on 21 November 2024) |
| Recommendation | LLMRec [60] | gpt-3.5-turbo, text-embedding-ada-002 | LightGCN | Nvidia RTX 3090 GPU | MovieLens | 36.43 | https://github.com/HKUDS/LLMRec (accessed on 23 November 2024) |
| Recommendation | RLMRec [61] | gpt-3.5-turbo, text-embedding-ada-002 | LightGCN | Nvidia RTX 3090 GPU | Amazon-book | 14.83 | https://github.com/HKUDS/RLMRec (accessed on 23 November 2024) |

* In the Ratio column, node classification tasks are evaluated by accuracy, while recommendation tasks are evaluated by recall.