Article

Robust Text-to-Cypher Using Combination of BERT, GraphSAGE, and Transformer (CoBGT) Model

Department of Intelligent Systems, Soongsil University, Seoul 06978, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7881; https://doi.org/10.3390/app14177881
Submission received: 2 August 2024 / Revised: 2 September 2024 / Accepted: 3 September 2024 / Published: 4 September 2024

Abstract

Graph databases have become essential for managing and analyzing complex data relationships, with Neo4j emerging as a leading player in this domain. Neo4j, a high-performance NoSQL graph database, excels at handling highly connected data and offers powerful querying capabilities through its Cypher query language. However, Cypher's complexity limits its accessibility, so making graph databases usable for nonexpert users requires translating natural language queries into Cypher. In this paper, we propose a text-to-Cypher model that effectively translates natural language queries into Cypher. The proposed model combines several methods to enable nonexpert users to interact with graph databases in English. Our approach comprises three modules: key-value extraction, relation–properties prediction, and Cypher query generation. For key-value extraction and relation–properties prediction, we leverage BERT and GraphSAGE to extract features from the natural language input. Finally, a Transformer model generates the Cypher query from these features. Additionally, because of the lack of text-to-Cypher datasets, we introduce a new dataset containing English questions that query information in a graph database, paired with ground-truth Cypher queries. This dataset supports future model training, validation, and comparison on the text-to-Cypher task. Through experiments and evaluations, we demonstrate that our model achieves high accuracy and efficiency compared with well-known seq2seq models such as T5 and GPT-2, reaching an 87.1% exact match score on the dataset.

1. Introduction

With the exponential growth of digital data, social networks, and the Internet of Things (IoT), many databases have been introduced to meet the demand for handling vast amounts of data in industry and academia [1]. Among them, Neo4j has emerged as one of the most prominent graph databases [2]. In Neo4j, the Cypher query language (QLC) is recommended as the primary language for accessing the database [3]. However, Cypher's syntax and operations are complex and impose a considerable learning curve on users. Text-to-Cypher research therefore aims to enhance accessibility by allowing nontechnical users to interact with graph databases using natural language, making data retrieval more intuitive and user-friendly.
A text-to-Cypher model is useful in practical applications. For example, in customer management systems, users can ask questions about customer interactions and behaviors in natural language, such as "Find customers who purchased both products X and Y". This makes it easier for marketing professionals to obtain customer information without learning Cypher.
Recently, with the development of artificial intelligence (AI), deep learning (DL) and machine learning (ML) models have been applied to many fields, such as speech recognition [4,5], image processing [6,7], and network security [8,9]. In particular, DL and ML models have also been widely used to design and build language translators [10,11], and translating natural language into a query language is a sub-task of language translation. Thus, many sequence-to-sequence (Seq2Seq) models have been used to convert natural language into queries, such as long short-term memory (LSTM) [12], ChatGPT [13], and the Transformer [14]. In [15], the authors used neural attention to convert language into logical form with program descriptions predefined in [16]. The T5-SR model was proposed in [17] to improve the performance of converting natural language queries into SQL. As these studies show, DL and ML models have achieved high performance on text-to-query tasks. However, previous text-to-query studies have not provided methods or models for converting natural language into Cypher queries.
Like SQL, Cypher (QLC) is a textual query language, but with a specific focus on graphs. Cypher includes similar statements, expressions, and keywords, such as WHERE, ORDER BY, SKIP, LIMIT, AND, etc. However, unlike SQL, which handles tabular data, Cypher operates on graph-structured data consisting of nodes and edges. As noted in [18], text-to-Cypher is more challenging because of more complex scenarios, the difficulty of leveraging database contents, and the absence of elements such as a schema, primary keys, and foreign keys. Thus, existing text-to-SQL and text-to-SPARQL models [19,20,21,22,23] are not suitable for translating natural language into Cypher.
Seq2Seq models were among the first to be applied to the text-to-SQL task. These models convert a natural language input into an SQL query. However, Seq2Seq models often struggle with complex queries, especially those involving nested structures or multiple tables, or when the database schema is complex. Moreover, experiments show that the Seq2Seq framework yields unsatisfactory results on the text-to-Cypher task, calling for subsequent research that accounts for the characteristics of this task [18]. Text-to-Cypher research must therefore find a strategy suited to the task.
Schema-aware models such as [24] incorporate knowledge of the database schema into the query generation process, allowing them to better understand and utilize the structure of the database. However, schema-aware models rely heavily on the quality of the schema representation and can struggle with databases that have complex, poorly defined, or information-rich schemas. A schema-linking approach is also used in various text-to-SQL models [25,26]. This method tries to discover which parts of the natural language question (words or phrases) refer to which database elements (tables and columns). It is a popular and effective approach in text-to-SQL because SQL targets relational databases, which organize data in tables with predefined schemas and clear relationships defined by foreign keys. Cypher queries, in contrast, involve pattern matching, where the query looks for specific structures or paths within the graph. This adds another layer of complexity to schema linking, as the model must not only identify the correct schema elements but also understand how they relate to each other within the graph's structure.
To address these issues, we propose a novel model that translates natural language questions into Cypher queries, making it easier for users to interact with graph databases without needing to learn the query language. The model combines bidirectional encoder representations from Transformers (BERT), Graph SAmple and aggreGatE (GraphSAGE), and a Transformer, and is referred to as the CoBGT model. First, because BERT achieves high performance in contextual word embedding, it is used to extract the key values from the natural language question. Since Cypher is a query language for graph databases, GraphSAGE, which operates on graph data, is well suited to exploiting the relations and properties needed in the Cypher query. Finally, the features from the BERT and GraphSAGE models are supplied to a Transformer to generate the Cypher query (text-to-Cypher).
In addition, although a text-to-Cypher dataset was introduced in [18], it is in Chinese. We have therefore created a comprehensive text-to-Cypher dataset in English, which serves as a critical resource for training and evaluating text-to-Cypher models. This dataset is not only instrumental in developing our model but also provides a valuable benchmark for future research in this domain.
Overall, the contributions of this paper can be summarized as follows:
  • Provided a novel method to translate natural language into Cypher queries (text-to-Cypher).
  • Proposed a new dataset for designing text-to-Cypher models.
  • Analyzed the compatibility of BERT, GraphSAGE, and Transformer in text-to-Cypher translation.
  • Conducted experiments to verify the performance of the proposed model.
The rest of this paper is organized as follows. In Section 2, we provide some background about Cypher query language, BERT, GraphSAGE, and Transformers. Section 3 describes our proposed model in detail. The simulation results and datasets are presented in Section 4 to evaluate the proposed model. Finally, in Section 5, we present the conclusions and suggestions for future studies.

2. Background

In this paper, our proposed model is designed on the basis of the BERT, GraphSAGE, and Transformer models to generate Cypher queries. Thus, in this section, we introduce the background of the Cypher query syntax and of the BERT, GraphSAGE, and Transformer models.

2.1. Cypher

In Neo4j, Cypher is a declarative query language used to access the database and specify schema definitions [27]. The following outlines the main properties of Cypher queries.
Linear queries: The input to a Cypher query is a property graph, and the output is a table containing the patterns found in the graph. Queries are structured linearly, progressing from start to finish, which allows users to think of query processing as starting from the beginning of the query text and moving linearly to the end. Each clause in a query is a function that takes a table and outputs a table, which can both expand the number of fields and add new tuples [3]. At the end of the query, the RETURN clause specifies the patterns with the desired properties that the user wants to retrieve; this differs from SQL, where the projection is declared at the beginning of the query using the SELECT clause.
Pattern matching: Graph pattern matching is the central concept of Cypher queries. It is the mechanism for navigating, describing, and extracting data from a graph by applying a declarative pattern. Inside a MATCH clause, you can use graph patterns to define the data you are searching for and the data to return. Patterns in Cypher are expressed in a visual form as “ASCII art”, such as (a)-[r]->(b). In this notation, a and b are the node types and r represents the relationship type. Influenced by XPath and SPARQL, Cypher patterns express regular path queries and support matching and returning paths.
Data Modification: Cypher provides clauses for creating (CREATE), deleting (DELETE), and updating (SET) graph entities. The MERGE clause matches patterns or creates them if none are found, ensuring uniqueness in the database.
The basic Cypher query includes MATCH clause, WHERE clause, and RETURN clause. The MATCH clause is used to specify patterns of nodes and relationships to be found in the graph. The WHERE clause filters the matched patterns based on certain criteria and the RETURN clause specifies what to return from the query.
Example: below is a Cypher query that retrieves the list of all actors in the movie "The Matrix", the number of those actors, and the producer of the movie from the graph in Figure 1.
MATCH (a:Actor)-[:ACTED_IN]->(m:Movie), (p:Producer)-[:PRODUCED]->(m)
WHERE m.title = 'The Matrix'
RETURN a.name, COUNT(a), p.name
First, in the MATCH clause, the query searches the graph for patterns in which a node labeled Actor (alias a) has a relationship of type ACTED_IN to a node labeled Movie (alias m). The relationship (:Actor)-[:ACTED_IN]->(:Movie) is defined in the schema. The second pattern, (p:Producer)-[:PRODUCED]->(m), finds Producer nodes connected to the same Movie node (alias m) as the Actor node (alias a). Next, the WHERE clause specifies the criterion that the movie's title must be "The Matrix". Finally, the RETURN clause yields the list of actors (a.name), the number of actors (COUNT(a)), and the name of the producer (p.name).
From the above properties, we can see that the key factors in converting natural language questions into Cypher queries are the following:
Extract Key Values: Identify the essential entities or values (e.g., “The Matrix” in the example).
Match Relevant Relationship Patterns: Determine the relevant relationship patterns that define how entities are connected in the graph (e.g., (:Actor)-[:ACTED_IN]->(:Movie)).
Select Property Nodes for Comparison: Identify which properties to compare using the key values (e.g., comparing the title property of the Movie node with "The Matrix").
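As a concrete illustration, the three factors extracted from the example above could be collected in a simple intermediate structure such as the following (a hypothetical representation shown in Python; the field names are illustrative and not part of the proposed model):
# Hypothetical intermediate representation of the three key factors for the
# example query on the Movie graph; field names are illustrative only.
parsed_question = {
    "key_values": ["The Matrix"],                        # extracted key value
    "relationship_patterns": [                           # relevant relationship patterns
        "(:Actor)-[:ACTED_IN]->(:Movie)",
        "(:Producer)-[:PRODUCED]->(:Movie)",
    ],
    "property_comparisons": [("Movie", "title", "The Matrix")],  # property to compare
}
print(parsed_question["key_values"])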

2.2. BERT Model

Bidirectional encoder representations from Transformers (BERT) [28] is a state-of-the-art pretrained language model developed by Google. It leverages a transformer architecture that reads entire sequences of words at once, allowing it to capture context from both directions (left-to-right and right-to-left) simultaneously. This bidirectional approach enables BERT to understand the nuanced meaning of words based on their surrounding context, making it highly effective for various NLP tasks.
BERT is also known as an effective model for extracting key information, such as entity names and values, from text. It excels in this task because its bidirectional training lets it understand the context of words in a sentence. The advantages of BERT for key information extraction are as follows:
  • Contextual Understanding: BERT can capture the nuanced meaning of words based on their context. This is crucial for key value extraction where the meaning of a word may vary depending on its surroundings.
  • Pretraining and Fine-tuning: BERT is pretrained on a large corpus and can be fine-tuned on specific tasks, making it highly adaptable and accurate for extracting specific types of information from text.
  • Token Classification: BERT can be used with token classification tasks, such as Named Entity Recognition (NER), where each token (word) in the sentence is classified into predefined categories (e.g., entities, dates, and values).

2.3. GraphSAGE

GraphSAGE [29] is a powerful inductive framework for generating low-dimensional embeddings for nodes in a graph.
GraphSAGE is effective for learning and predicting relationships within graph-structured data, even when new nodes appear. This capability is primarily due to its inductive learning approach, which generalizes to unseen nodes (Figure 2).
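As a minimal sketch of the inductive sample-and-aggregate idea, a two-layer GraphSAGE encoder can be written with the PyTorch Geometric library (an assumed implementation choice; layer sizes are arbitrary):
import torch
from torch_geometric.nn import SAGEConv

class SageEncoder(torch.nn.Module):
    # Two-layer GraphSAGE encoder: each layer aggregates sampled neighbor
    # features and combines them with the node's own representation.
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)  # one embedding vector per node

# x: [num_nodes, in_dim] node features; edge_index: [2, num_edges] connectivity
encoder = SageEncoder(in_dim=300, hidden_dim=128, out_dim=64)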

2.4. Transformers Model

The Transformer model is a neural network architecture that has revolutionized natural language processing (NLP) due to its ability to capture long-range dependencies and its parallel processing capability. Introduced by Vaswani et al. [30], the Transformer relies entirely on self-attention mechanisms to process input sequences, making it highly efficient and effective for various sequence-to-sequence tasks and well suited to handling large contexts and complex dependencies in text.
In our proposed system, the Transformer model is used to generate the final Cypher query from the extracted key information and the schema relationships. The key features of the Transformer model in our context are:
  • Encoder–Decoder Architecture: The Transformer consists of an encoder that processes the input question and additional information such as key value and relational context, while the decoder generates the corresponding Cypher query. The encoder captures the context and semantics of the input, while the decoder uses this information to generate the output query.
  • Output Generation: The Transformer predicts the Cypher query by learning to map the input representations to the correct sequence of Cypher query tokens.

3. Proposed Model

Our proposed model includes three modules: “Key value extraction”, “Relation and properties prediction”, and “Cypher query prediction” (Figure 3).

3.1. Key Value Extraction Module

We utilize a BERT-based model for key information extraction. BERT is a powerful pretrained language model known for its effectiveness in various NLP tasks. This module aims to identify the key values in the question; these often include entity names, numeric values, or property values relevant to nodes in the Neo4j graph.
Method: We fine-tune a pretrained BERT model using a Begin, Inside, Outside (BIO) tagging strategy. This strategy labels each token in the input text as the beginning of a key entity (B), inside a key entity (I), or outside any key entity (O). This fine-tuning process enables the model to accurately extract useful information from input questions.
We created a training dataset in which each token in the input questions is annotated with BIO tags corresponding to the entities and attributes that are likely to appear in the WHERE clause of the corresponding Cypher query.
For example, for the question "Can you provide a list of actors who appeared in movies directed by Frank Darabont?", the Cypher query is "MATCH (a:Actor)-[:ACTED_IN]->(:Movie), (:Movie)<-[:DIRECTED]-(d:Director) WHERE d.name = 'Frank Darabont' RETURN DISTINCT a.name". In the WHERE clause, "Frank Darabont" is the value used to search for the director's name, so the tokens "Frank Darabont" are tagged as key information (Figure 4).
The detailed process of key-value extraction is illustrated in Figure 5. Firstly, the input question undergoes tokenization, where the question Q is split into individual word terms. These terms are then mapped to unique token ID numbers (a necessary step because the BERT model requires a numeric input). Next, the sequence of token IDs is fed into the BERT model, which generates corresponding embedded vectors (E1, E2, …, En) for each token. Finally, to classify which tokens represent key values, these embedded vectors are fed into a classification layer (a fully connected neural network). Tokens labeled O indicate non-key values, B indicates the beginning of a key value, and I indicates tokens inside a key value.
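A minimal sketch of this token-classification step with the Hugging Face transformers library is given below (an assumed implementation; in practice, the bert-base-uncased checkpoint shown here would first be fine-tuned on the BIO-annotated questions, so the untrained classification head only illustrates the interface):
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B", "I"]  # BIO tag set used by the key-value extraction module
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased",
                                                        num_labels=len(labels))

question = ("Can you provide a list of actors who appeared in movies "
            "directed by Frank Darabont?")
inputs = tokenizer(question, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: [1, seq_len, num_labels]
pred_ids = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(token, labels[int(pred)])        # tokens tagged B/I form the key value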

3.2. Relation–Properties Prediction Module

This module predicts the matching between the question and the relevant relations or properties in the schema. We use the GraphSAGE [29] neural network; this model represents the connections between the tokens in the question and the relevant relationship properties in the Cypher query.
We chose GraphSAGE because, unlike traditional graph embedding methods that require complete retraining on the entire graph for any new node, GraphSAGE leverages node features and a neighborhood sampling strategy to generate embeddings in an inductive manner. This makes it highly scalable and suitable for dynamic graphs whose nodes and edges can change frequently.
Question–schema relationship graph modeling
Before using the GraphSAGE neural network model, we need to construct a graph that captures the relationship between the question and the schema's internal structure. The graph G = (V, E) consists of a set of nodes V and a set of edges E. In our Question–Schema Relationship Graph, we define two types of nodes, V1 and V2: V1 nodes represent the word terms appearing in the input question, while V2 nodes represent the relations and properties appearing in the schema.
Word-term node (V1) building: First, we preprocess the input question by removing unnecessary words, such as stop words. The remaining text is then chunked and Part-of-Speech (POS) tagged. These chunks and their corresponding POS tags are added as V1 nodes in the graph.
Relation and property node (V2) building: Next, for each Cypher query corresponding to a question used for V1 node building, we extract the relations and properties from the query. Each extracted relation or property is added as a separate V2 node, so each V2 node contains exactly one relation or property.
Establishing connections between V1 and V2: Finally, we establish connections between the V1 and V2 nodes. Let Q = {q1, q2, q3, …, qn} represent the set of input questions and C = {c1, c2, c3, …, cn} represent the corresponding set of Cypher queries. For each question qn, we have a list of word term nodes T = {t1, t2, …, tn}. Similarly, from each Cypher query cn, we have a list of relation–property nodes P = {p1, p2, …, pn}.
Connections are then established between each term node in T from qn and the corresponding relation–property nodes in P from cn, yielding a comprehensive representation of the relationships between the question and the schema. An example of building the question–schema relationship graph is shown in Figure 6.
After creating the connections between V1 and V2 nodes, we also establish connections among the V1 nodes. Two V1 nodes are connected if they belong to questions about the same schema and question template and their text similarity score is higher than 0.65 (Figure 7). We employ the Sentence-BERT model [31] to calculate the similarity score.
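The similarity check used when connecting word-term nodes can be sketched with the sentence-transformers library (the all-MiniLM-L6-v2 checkpoint is shown only as an example Sentence-BERT model [31]):
from sentence_transformers import SentenceTransformer, util

sim_model = SentenceTransformer("all-MiniLM-L6-v2")  # example Sentence-BERT checkpoint

def should_connect(term_a: str, term_b: str, threshold: float = 0.65) -> bool:
    # Connect two V1 (word-term) nodes when their cosine similarity reaches the threshold.
    emb = sim_model.encode([term_a, term_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

print(should_connect("movies", "films"))  # expected to be True for closely related terms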
After building the question–schema relationship graph and establishing the connections between V1 nodes, we use GraphSAGE to learn the node representations (node embeddings). The matching score between a V1 node and a V2 node is calculated as the dot product p(v1, v2) = v1 · v2, where v1 and v2 are the embedding vectors of the V1 and V2 nodes. We use the binary cross-entropy loss function:
$$\mathrm{Loss}_{bce} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p(u_i, v_i) + (1 - y_i)\log\bigl(1 - p(u_i, v_i)\bigr)\right]$$
Relation–properties prediction process:
The pseudo-code:
Let the input question be Q and the schema be S, with relation–properties R = {r1, r2, …, rm}.
  • Chunk question Q into tokens T = {t1, t2, …, tn} and perform POS tagging.
  • If a token does not exist in the training data, compare its similarity with the word-term nodes in the graph that are related to the current schema; if the similarity score is greater than or equal to the threshold (0.65), connect the new token to that node.
  • Feed the tokens {t1, t2, …, tn} and the current schema's relation–properties {r1, r2, …, rm} into the module.
  • Obtain the matching scores between the tokens T and the relation–properties R.
  • Calculate the matching score between each relation–property in R and the question according to the formula below; if the score is greater than 0, that relation–property is selected for feeding into the Cypher generation module.
$$\mathrm{Score}_i = \frac{1}{N}\sum_{j=1}^{N} s_{ij}$$
where:
sij: the matching score between relation–property ri and token tj;
N: number of tokens in the question.
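A small sketch of this scoring and selection step, assuming the token and relation–property embeddings have already been produced by GraphSAGE (the random tensors below are placeholders for illustration):
import torch

def select_relation_properties(token_emb, relation_emb, relation_names):
    # token_emb: [num_tokens, dim]; relation_emb: [num_relations, dim]
    s = relation_emb @ token_emb.T        # s[i, j]: matching score between r_i and t_j
    scores = s.mean(dim=1)                # Score_i: average over the question tokens
    return [name for name, score in zip(relation_names, scores) if score > 0]

token_emb = torch.randn(6, 64)            # placeholder embeddings for 6 question tokens
relation_emb = torch.randn(4, 64)         # placeholder embeddings for 4 relation-properties
print(select_relation_properties(token_emb, relation_emb,
                                 ["ACTED_IN", "DIRECTED", "name", "title"]))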

3.3. Cypher Query Prediction

The values extracted by the Key Information Extraction module, together with the relevant relations and properties identified by the Question–Schema Relationship Modeling module, are fed into the Cypher query prediction module to generate the corresponding Cypher query. In our proposed system, we employ a small version of the T5 Transformer model [32] to generate the final Cypher query from the extracted key information and the schema relationships. The key features of the Transformer model in our context are:
Input sequence: The contextual information from the Key Information Extraction and Question–Schema Relationship Modeling modules is fed into the Transformer model.
Output Generation: The Transformer predicts the Cypher query by learning to map the input representations to the correct sequence of query tokens.
The equation used to construct the input sequence is represented as follows:
$$X = [\mathrm{CLS}],\ D_1,\ R,\ [\mathrm{SEP}],\ D_2,\ K,\ [\mathrm{SEP}],\ D_3,\ Q,\ \langle\mathrm{EOS}\rangle$$
where:
X: input sequence; D1, D2, D3: guideline (direct) sequences 1–3; R: relation–properties feature; K: key-value feature; Q: question.
To guide the Cypher prediction module in identifying the information and question components, we define three guideline sequences: D1, D2, and D3. A guideline is a fixed sequence that helps the model understand which type of information follows it: D1 precedes the relation–properties features, D2 precedes the key-value features, and D3 precedes the question. To build the input sequence for the Cypher prediction module, a special [CLS] token is first inserted at the beginning to mark the start of the information. Next, the D1 guideline is added, followed by the relation–properties information. After that, the key-value pair (D2, K) and the question pair (D3, Q) are appended, with a [SEP] token separating the pairs. At the end of the sequence, an <EOS> token indicates the end of the information. In our experiments, we define D1 = "With this information", D2 = "Key value:", and D3 = "Convert this to Cypher query:". The detailed input is illustrated in Figure 8.
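A minimal sketch of assembling this input sequence and passing it to a T5-small model with the transformers library is shown below (the guideline strings follow the definitions above; the checkpoint would be fine-tuned on our text-to-Cypher dataset in practice, and the relation–properties and key value in the example are illustrative):
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")  # fine-tuned in practice

D1 = "With this information"           # guideline for the relation-properties features
D2 = "Key value:"                      # guideline for the key-value features
D3 = "Convert this to Cypher query:"   # guideline for the question

def build_input(relations, key_values, question):
    # [CLS] D1 R [SEP] D2 K [SEP] D3 Q <EOS>, with the markers inserted as literal text
    return (f"[CLS] {D1} {' '.join(relations)} [SEP] "
            f"{D2} {' '.join(key_values)} [SEP] {D3} {question} <EOS>")

text = build_input(
    ["(:Actor)-[:ACTED_IN]->(:Movie)", "(:Director)-[:DIRECTED]->(:Movie)", "d.name"],
    ["Frank Darabont"],
    "Who are the actors who worked in films directed by Frank Darabont?")
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=96)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))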
By leveraging BERT, GraphSAGE, and Transformer models, our system can generate syntactically and semantically accurate Cypher queries, ensuring precise retrieval of information from the Neo4j graph database.

4. Results

4.1. Dataset

To evaluate our model, we built a question-to-Cypher dataset. The dataset includes questions in English, the corresponding Cypher queries, and related schema information. It consists of 4921 instances in the training set and 1240 instances in the test set. The dataset was created using three Neo4j graphs: the Movie graph, the Northwind graph, and a modified Movie graph.
The Movie graph is a common example dataset provided by Neo4j to demonstrate its capabilities and features. It represents a simplified movie database containing information about movies, actors, directors, and their relationships. Here is an overview of its structure:
Nodes:
  • Movie: represents movies, with properties like title, released (year), and tagline.
  • Person: represents people (actors and directors), with properties like name and born (year).
Relationships:
  • ACTED_IN: connects a Person node to a Movie node, indicating that the person acted in the movie.
  • DIRECTED: connects a Person node to a Movie node, indicating that the person directed the movie.
  • FOLLOWS: connects two Person nodes, indicating that one person follows the other (e.g., on social media).
The modified Movie graph is a clearer version of the Movie graph in which the Person node is split into more specific node types. In this modified version, there are four node types: Movie, Actor, Reviewer, and Producer.
The Northwind graph is another example dataset, inspired by the classic Northwind database often used in relational database tutorials. It represents a simplified version of a business database containing information about customers, orders, products, and suppliers.
Nodes:
  • Customer: represents customers, with properties like customerID, companyName, and contactName.
  • Order: represents orders, with properties like orderID, orderDate, and shipCity.
  • Product: represents products, with properties like productID, productName, and unitPrice.
  • Supplier: represents suppliers, with properties like supplierID, companyName, and contactName.
Relationships:
  • PLACED: connects a Customer node to an Order node, indicating that the customer placed the order.
  • CONTAINS: connects an Order node to a Product node, indicating that the order contains the product.
  • SUPPLIES: connects a Supplier node to a Product node, indicating that the supplier supplies the product.

4.1.1. Dataset Creation

To create the text-to-Cypher dataset, we first compiled a list of template questions that inquire about the information contained in the database. These template questions are structured as follows: “What movies feature the performance of [actor]?”, “Can you provide a list of movies in which both [actor1] and [actor2] appear?”, and so on.
Next, we generated the corresponding Cypher query for each template. For example, the query for the first template might be "MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor) WHERE a.name = [actor] RETURN m.title".
After creating the templates and queries, we extracted relevant information from the database to fill in the placeholders in the template questions. This process allows us to generate multiple Cypher queries from a single template.
We then executed the Cypher queries to verify their correctness and to ensure there were no syntax errors.
Finally, we added paraphrased versions of the questions to introduce variety into the dataset.
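The template-filling step can be sketched as follows (a simplified illustration; in practice the placeholder values are retrieved from the Neo4j database rather than hard-coded):
# One question template and its Cypher template from the Movie graph.
question_template = "What movies feature the performance of {actor}?"
cypher_template = ("MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor) "
                   "WHERE a.name = '{actor}' RETURN m.title")

# Placeholder values; in practice these are extracted from the database.
actors = ["Keanu Reeves", "Tom Hanks", "Meg Ryan"]

dataset = [{"question": question_template.format(actor=actor),
            "cypher": cypher_template.format(actor=actor)}
           for actor in actors]

for item in dataset:
    print(item["question"], "->", item["cypher"])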
Overall, our dataset consists of 4921 instances in the training set and 1240 instances in the test set, with 21 question templates for the Movie graph, 34 for the Northwind graph, and 21 for the modified Movie graph.

4.1.2. Dataset Structure

The dataset contains a list of instances for training the text-to-Cypher task. Each instance includes the following fields:
question: a question asking for specific information in the database.
cypher: the Cypher query corresponding to the question.
schema: all the relations and properties present in the graph database.
key_data: the key values extracted from the question; these were identified during dataset generation.
relation_properties: the relations and properties relevant to the question, also extracted during dataset generation.
An example dataset instance is shown in Figure 9.
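For reference, a single dataset instance has roughly the following shape (the values are abbreviated and illustrative; Figure 9 shows an actual instance):
instance = {
    "question": "Who are the actors who worked in films directed by Frank Darabont?",
    "cypher": ("MATCH (a:Actor)-[:ACTED_IN]->(:Movie), (:Movie)<-[:DIRECTED]-(d:Director) "
               "WHERE d.name = 'Frank Darabont' RETURN DISTINCT a.name"),
    "schema": "(:Actor)-[:ACTED_IN]->(:Movie), (:Director)-[:DIRECTED]->(:Movie), ...",
    "key_data": ["Frank Darabont"],
    "relation_properties": ["(:Actor)-[:ACTED_IN]->(:Movie)",
                            "(:Director)-[:DIRECTED]->(:Movie)", "d.name"],
}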

4.2. Evaluation

We compare our method with other popular sequence-to-sequence models, namely GPT-2 [33] and T5 [32]. To evaluate these models on our dataset, we set up the following experiment: the input to each model consists of the schema information (schema field) and the question, while the output is a Cypher query. Thus, the input to those models has the form "With this schema: [schema]. Convert to Cypher query: [question]". After training them on the training set, we evaluate their performance by measuring accuracy on the test set.
We evaluate the performance of the proposed model using the Logical Form (LF) metric, also known as Exact Set Match accuracy (EM). This metric compares the predicted Cypher query to the ground-truth Cypher query and is calculated as follows:
$$\mathrm{Score}_{LF}(\hat{Y}, Y) = \begin{cases} 1, & \hat{Y} = Y \\ 0, & \hat{Y} \neq Y \end{cases}$$
where Ŷ is the predicted Cypher query and Y is the ground-truth Cypher query.
$$\mathrm{EM} = \frac{1}{N}\sum_{n=1}^{N}\mathrm{Score}_{LF}(\hat{Y}_n, Y_n)$$
where N is the number of instances in the dataset.
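The EM computation itself is straightforward; a minimal sketch is shown below (whitespace normalization before comparison is our assumption about the matching procedure):
def exact_match(predictions, ground_truths):
    # EM: fraction of predicted queries that are identical to their ground truth.
    def normalize(query: str) -> str:
        return " ".join(query.split())   # collapse whitespace before comparing (assumed)
    matches = sum(normalize(p) == normalize(g)
                  for p, g in zip(predictions, ground_truths))
    return matches / len(ground_truths)

print(exact_match(["MATCH (m:Movie) RETURN m.title"],
                  ["MATCH (m:Movie) RETURN m.title"]))   # 1.0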
The flowchart illustrating our evaluation process is shown in Figure 10.
First, the input natural language question passes through the key-value extraction module, which extracts key values (such as entity names or numbers) from the question. Next, the question is tokenized into individual words or terms and POS-tagged (we use the spaCy library). These tokens are then fed into the GraphSAGE model along with the schema. GraphSAGE processes the schema to identify relevant relationships and properties, and the matching scores defined in Section 3.2 are calculated to retain the most relevant relations and properties.
Finally, we use the Exact Match Calculation (EM) to compare the generated Cypher query against the ground-truth query, calculating the exact match accuracy.
We ran our experiments on the following hardware:
  • Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz;
  • NVIDIA GeForce RTX 4090 24 GB;
  • Memory 128 GB.
In Table 1, we present the results of the proposed model and compare it with Text-to-Text Transfer Transformer (T5) and Generative Pretrained Transformer 2 (GPT2) models.
In Table 2, we present the results of our proposed model in several ablation cases: without the key-value extraction module (GraphSAGE–Transformer), without the relation–properties extraction module (BERT–Transformer), and without both modules (Transformer only).
In Table 3, we present some prediction outputs from the proposed model, where the question is the model input and the ground truth comes from the test dataset.
From Table 4, we can observe that our proposed model achieves a faster processing time than T5 and GPT-2, indicating that our model also has lower complexity or a smaller structure.

5. Discussion

From Table 2, we observe that both the relation–properties and key-value features enhance the model's performance, although the relation–properties feature plays the more significant role. The key-value extraction module allows the model to focus on the most relevant parts of the question and reduces noise in the input data, allowing the Transformer module to process the core components more effectively. Similarly, relation–properties extraction helps the model focus on the important relationships and properties in the schema. However, while the key-value feature carries information about only one or a few entities, the relation–properties feature captures information about many entities and relationships, so it provides more useful information.
Besides using exact match calculation for evaluation, we could also apply execution accuracy, which compares the outputs of the generated queries. However, with our dataset, there are cases where executing the ground truth Cypher query does not return a value, even though the query is entirely correct. This occurs when the information being queried does not exist in the database. As a result, comparing the outputs of these queries might lead to inaccurate evaluations. Therefore, in this paper, we rely on exact match evaluation as our evaluation method.
Currently, our dataset is limited in size and includes only questions about node information, with no coverage of edge-related queries. We plan to expand this coverage in future work. Additionally, there are few models specifically designed for text-to-Cypher tasks, which restricts our options for performance comparison. We hope our contribution will help advance future research in this area.
Additionally, our dataset focuses solely on converting questions into Cypher queries for retrieving information from graph databases. However, future work could expand this dataset to include commands for modifying the graph database, such as creating, updating, or deleting nodes and relationships. With an appropriately designed training model, this enhanced dataset could enable the development of models capable of not only querying but also managing graph database content. Such advancements could integrate seamlessly into graph database management tools, offering users comprehensive natural language interaction capabilities for both querying and database modification.

6. Conclusions

In this paper, we propose a model for the text-to-Cypher task. In this model, we use BERT to extract key values (key-value extraction module) and GraphSAGE to exploit the relations and properties of the database (relation–properties extraction module). These features are then fed to a Transformer to generate the Cypher query (Cypher prediction module). Moreover, we provide a small text-to-Cypher dataset for evaluation and comparison on the text-to-Cypher task. The simulation results show that our proposed model achieves higher performance than the seq2seq baselines (T5 and GPT-2). Specifically, it achieved an 87.1% EM score, outperforming T5 by 39.04 percentage points and GPT-2 by 35.81 percentage points. In future work, we plan to create a larger dataset and improve our model to achieve better performance.

Author Contributions

Conceptualization, Q.-B.-H.T. and S.-T.C.; methodology, Q.-B.-H.T. and S.-T.C.; software, Q.-B.-H.T. and A.A.W.; validation, Q.-B.-H.T. and A.A.W.; investigation, Q.-B.-H.T., A.A.W. and S.-T.C.; writing—original draft preparation, Q.-B.-H.T.; writing—review and editing, Q.-B.-H.T. and S.-T.C.; supervision, S.-T.C.; project administration, S.-T.C.; funding acquisition, S.-T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW (2024-0-00071) supervised by the IITP (Institute of Information & communications Technology Planning & Evaluation).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Larriba-Pey, J.; Martinez-Bazan, N.; Dominguez-Sal, D. Introduction to graph databases. In Reasoning Web International Summer School; Springer: Cham, Switzerland, 2014; Volume 8714, pp. 171–194. [Google Scholar]
  2. Cao, R.; Chen, L.; Chen, Z.; Zhao, Y.; Zhu, S.; Yu, K. LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar]
  3. Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; ACM: New York, NY, USA, 2018. [Google Scholar]
  4. Nguyen-Vu, L.; Doan, T.-P.; Bui, M.; Hong, K.; Jung, S. On the Defense of Spoofing Countermeasures against Adversarial Attacks. IEEE Access 2023, 11, 94563–94574. [Google Scholar] [CrossRef]
  5. Cisse, M.; Adi, Y.; Neverova, N.; Keshet, J. Houdini: Fooling Deep Structured Visual and Speech Recognition Models with Adversarial Examples. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Long Beach, CA, USA, 2017. [Google Scholar]
  6. Duong, M.-T.; Lee, S.; Hong, M.-C. DMT-Net: Deep Multiple Networks for Low-Light Image Enhancement Based on Retinex Model. IEEE Access 2023, 11, 132147–132161. [Google Scholar] [CrossRef]
  7. Duong, M.-T.; Nguyen, T.-T.; Lee, S.; Hong, M.-C. Multi-Branch Network for Color Image Denoising Using Dilated Convolution and Attention Mechanisms. Sensors 2024, 24, 3608. [Google Scholar] [CrossRef] [PubMed]
  8. Le, H.-D.; Park, M. Enhancing Multi-Class Attack Detection in Graph Neural Network through Feature Rearrangement. Electronics 2024, 13, 2404. [Google Scholar] [CrossRef]
  9. Tran, D.-H.; Park, M. Graph Embedding for Graph Neural Network in Intrusion Detection System. In Proceedings of the International Conference on Information Networking (ICOIN), Ho Chi Minh City, Vietnam, 17–19 January 2024. [Google Scholar]
  10. Vijaya, J.; Mittal, C.; Singh, C.; Lekhana, M. An Efficient System for Audio-Based Sign Language Translator through MFCC Feature Extraction. In Proceedings of the 2023 International Conference on Sustainable Communication Networks and Application (ICSCNA), Theni, India, 11–13 December 2023. [Google Scholar]
  11. Lavanya, R.; Gautam, A.; Anand, A. Real-Time Translator with Added Features for Cross-Language Communication. In Proceedings of the 10th International Conference on Communication and Signal Processing (ICCSP), Melmaruvathur, India, 12–14 April 2024. [Google Scholar]
  12. Sak, H.; Senior, A.W.; Beaufays, F. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. In Proceedings of the INTERSPEECH, Singapore, 14–18 September 2014. [Google Scholar]
  13. Floridi, L.; Chiriatti, M. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds Mach. 2020, 30, 681–694. [Google Scholar] [CrossRef]
  14. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020. [Google Scholar]
  15. Dong, L.; Lapata, M. Language to Logical Form with Neural Attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Berlin, Germany, 2016. [Google Scholar]
  16. Li, T.; Zhang, S.; Li, Z. SP-NLG: A Semantic-Parsing-Guided Natural Language Generation Framework. Electronics 2023, 12, 1772. [Google Scholar] [CrossRef]
  17. Li, Y.; Su, Z.; Li, H.; Zhang, S.; Wang, S.; Wu, W.; Zhang, Y. T5-SR: A Unified Seq-to-Seq Decoding Strategy for Semantic Parsing. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Islands, 4–10 June 2023; IEEE: Rhodes Island, Greece, 2023. [Google Scholar]
  18. Guo, A.; Li, X.; Xiao, G.; Tan, Z.; Zhao, X. SpCQL: A Semantic Parsing Dataset for Converting Natural Language into Cypher. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; ACM: New York, NY, USA, 2022. [Google Scholar]
  19. Li, J.; Hui, B.; Cheng, R.; Qin, B.; Ma, C.; Huo, N.; Huang, F.; Du, W.; Si, L.; Li, Y. Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Online, 7–14 February 2023; AAAI Press: Washington, DC, USA, 2023. [Google Scholar]
  20. Jeong, G.; Han, M.; Kim, S.; Lee, Y.; Lee, J.; Park, S.; Kim, H. Improving Text-to-SQL with a Hybrid Decoding Method. Entropy 2023, 25, 513. [Google Scholar] [CrossRef] [PubMed]
  21. Scholak, T.; Schucher, N.; Bahdanau, D. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar]
  22. Ochieng, P. PAROT: Translating Natural Language to SPARQL. Expert Syst. Appl. X 2020, 5, 100024. [Google Scholar] [CrossRef]
  23. Rony, M.R.A.H.; Kumar, U.; Teucher, R.; Kovriguina, L.; Lehmann, J. SGPT: A Generative Approach for SPARQL Query Generation from Natural Language Questions. IEEE Access 2022, 10, 70712–70723. [Google Scholar] [CrossRef]
  24. Bogin, B.; Gardner, M.; Berant, J. Representing Schema Structure with Graph Neural Networks for Text to SQL Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019. [Google Scholar]
  25. Guo, J.; Zhan, Z.; Gao, Y.; Xiao, Y.; Lou, J.G.; Liu, T.; Zhang, D. Towards Complex Text to SQL in Cross-Domain Database. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019. [Google Scholar]
  26. Wang, B.; Shin, R.; Liu, X.; Polozov, O.; Richardson, M. RAT-SQL: Relation Aware Schema Encoding and Linking for Text to SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020. [Google Scholar]
  27. Webster, G.; Lancaster, A. SCRIBL: A System for the Semantic Capture of Relationships in Biological Literature. J. Open Source Softw. 2024, 9, 6645. [Google Scholar] [CrossRef]
  28. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1. [Google Scholar]
  29. Hamilton, W.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Long Beach, CA, USA, 2017. [Google Scholar]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Long Beach, CA, USA, 2017. [Google Scholar]
  31. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Hong Kong, China, 2019. [Google Scholar]
  32. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  33. Radford, A.; Wu, J.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. Available online: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed on 3 September 2019).
Figure 1. Example data graph showing the relationships between movie, actor, and producer.
Figure 2. Visual illustration of the GraphSAGE sample and aggregate approach [29].
Figure 3. Diagram of the proposed model.
Figure 4. Example BIO tagger for value extraction module.
Figure 5. Key value extraction module process.
Figure 6. Graph building result for question “Can you provide a list of movies that have actors who have also worked in films directed by Mike Nichols?” V1 nodes on the left represent the terms that appear in the question. V2 nodes on the right represent the relation and properties appearing in the Cypher query.
Figure 7. Building the connection between token nodes (V1) based on similarity matching score.
Figure 8. Flow process in Cypher query prediction module.
Figure 9. A sample instance of text-to-Cypher dataset.
Figure 10. Evaluation workflow.
Table 1. Comparison of our method with T5 and GPT2.
Method | EM
Proposed model | 87.1%
T5 (base) | 48.06%
GPT2 (standard) | 51.29%
Table 2. Comparison of our method with and without key value extraction and relation–properties extraction.
Method | EM
Proposed model | 87.1%
Without key-value extraction | 72.74%
Without relation–properties extraction | 52.26%
Only Transformer | 47.43%
Table 3. Some output prediction from our model.
Question: Can you provide a list of actors who performed in films released in 1992?
Model output: MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released = 1992 RETURN DISTINCT a.name
Ground truth: MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released = 1992 RETURN DISTINCT a.name

Question: Who are the employees responsible for handling orders placed by customers from Venezuela?
Model output: MATCH (c:Customer)-[:PLACED]->(:Order), (:Order)<-[:SOLD]-(e:Employee) WHERE c.Country = "Venezuela" RETURN e.FirstName
Ground truth: MATCH (c:Customer)-[:PLACED]->(:Order), (:Order)<-[:SOLD]-(e:Employee) WHERE c.Country = "Venezuela" RETURN distinct e.FirstName

Question: Who are the actors who worked in films directed by Frank Darabont?
Model output: MATCH (a:Actor)-[:ACTED_IN]->(m:Movie), (m:Movie)<-[:DIRECTED]-(d:Director) WHERE d.name = "Frank Darabont" RETURN DISTINCT a.name
Ground truth: MATCH (a:Actor)-[:ACTED_IN]->(:Movie), (:Movie)<-[:DIRECTED]-(d:Director) WHERE d.name = "Frank Darabont" RETURN DISTINCT a.name

Question: Which movies include Wil Wheaton as part of the cast?
Model output: MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor) WHERE a.name = "Wil Wheaton" RETURN m.title
Ground truth: MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor) WHERE a.name = "Wil Wheaton" RETURN m.title

Question: What are some films helmed by Danny DeVito that premiered before 1993?
Model output: MATCH (:Movie)<-[:DIRECTED]-(d:Director) WHERE d.name = "DannyDeVito" RETURN m.released <= 1993 RETURN m.title
Ground truth: MATCH (m:Movie)<-[:DIRECTED]-(d:Director) WHERE d.name = "DannyDeVito" AND m.released < 1993 RETURN m.title
Table 4. Execution time on training and prediction process.
Method | Training Time (s/epoch) | Prediction Time (s)
Proposed model | 59.1921 | 0.7194
T5 (base) | 158.2713 | 1.16944
GPT2 (standard) | 223.2184 | 1.7131
