Next Article in Journal
A Blockchain-Assisted Federated Learning Framework for Secure and Self-Optimizing Digital Twins in Industrial IoT
Previous Article in Journal
Cache Aging with Learning (CAL): A Freshness-Based Data Caching Method for Information-Centric Networking on the Internet of Things (IoT)
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

FI-NL2PY2SQL: Financial Industry NL2SQL Innovation Model Based on Python and Large Language Model

1
School of Computer Science, Fudan University, Shanghai 200438, China
2
School of Artificial Intelligence, Shanghai Normal University Tianhua College, No. 1661 Shengxin North Road, Shanghai 201815, China
3
Business Analysis BU, GienTech Technology Co., Ltd., Shanghai 200232, China
4
School of Information and Communication Technology, Hanoi University of Science and Technology, No. 1 Dai Co Viet, Hai Ba Trung, Hanoi 100000, Vietnam
*
Author to whom correspondence should be addressed.
Future Internet 2025, 17(1), 12; https://doi.org/10.3390/fi17010012
Submission received: 30 October 2024 / Revised: 24 December 2024 / Accepted: 31 December 2024 / Published: 2 January 2025

Abstract

:
With the rapid development of prominent models, NL2SQL has made many breakthroughs, but customers still hope that the accuracy of NL2SQL can be continuously improved through optimization. The method based on large models has brought revolutionary changes to NL2SQL. This paper innovatively proposes a new NL2SQL method based on a large language model (LLM), which could be adapted to an edge-cloud computing platform. First, natural language is converted into Python language, and then SQL is generated through Python. At the same time, considering the traceability characteristics of financial industry regulatory requirements, this paper uses the open-source big model DeepSeek. After testing on the BIRD dataset, compared with most NL2SQL models based on large language models, EX is at least 2.73% higher than the original method, F1 is at least 3.72 higher than the original method, and VES is 6.34% higher than the original method. Through this innovative algorithm, the accuracy of NL2SQL in the financial industry is greatly improved, which can provide business personnel with a robust database access mode.

1. Introduction

Structured Query Language (SQL) has become the standard language for database querying, but its complexity poses a barrier for non-technical users. The Natural Language to Structured Query Language (NL2SQL) task converts natural language questions into SQL, providing non-specialist users with a way to interact with databases [1]. The NL2SQL problem can be defined as follows: given a natural language query and a corresponding database table, the NL2SQL model generates an SQL statement. Early NL2SQL tasks were primarily based on rule-based or pipeline approaches and used domain-specific, simple datasets, making replication challenging. In recent years, with the release of large-scale annotated datasets and the rapid development of deep learning and natural language processing (NLP) technologies, research in the NL2SQL field has garnered significant attention. Researchers have proposed a range of new methods, continuously advancing this field. With breakthroughs in NLP due to deep learning, applying these methods to NL2SQL tasks has yielded positive results [2].
In 2017, Salesforce introduced the large NL2SQL dataset WikiSQL [3], containing 80,654 query questions and their related SQL statements across 24,241 database tables. Many studies followed, aiming to improve the accuracy of the WikiSQL dataset. The SQL statements in WikiSQL are relatively simple, mainly involving single-table queries.
In 2018, Yale University released the Spider dataset [4], a new NL2SQL dataset presenting a higher difficulty level than WikiSQL. Spider, annotated by 11 Yale students, is a large-scale, complex cross-domain semantic parsing and text-to-SQL dataset. The Spider challenge aims to develop natural language interfaces for cross-domain databases. It contains 10,181 questions and 5693 unique complex SQL queries spanning 200 databases and multiple tables across 138 different domains. In Spider1.0, distinct complex SQL queries and databases appear in both training and testing sets. To perform well on this dataset, systems must generalize effectively to new SQL queries and database schemas.
On 28 August 2024, an early-access version of Spider2.0 (a more realistic and challenging text-to-SQL task) became available. Spider2.0 includes 600 real-world text-to-SQL workflow questions derived from enterprise-level database use cases. The databases in Spider 2.0 come from real data applications, often containing over 1000 columns, stored in cloud-based or on-premise database systems like BigQuery, Snowflake, or PostgreSQL. Solving Spider2.0 tasks typically requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge requires models to interact with complex SQL workflow environments, handle extensive context, perform intricate reasoning, and generate SQL queries with multiple operations, often spanning over 100 lines, far exceeding traditional text-to-SQL challenges. Progress on Spider 2.0 represents a critical step toward developing LLM-based code agents that are more intelligent and autonomous in real enterprise environments. With continuous improvements in datasets and the rapid development of large model technologies, many scholars are using large models to enhance NL2SQL accuracy.
Data management in the financial industry has a history of over a decade. Many enterprises in the financial sector have established their own data platforms, applying data extensively in risk, finance, regulatory, and marketing domains, with data playing an increasingly important role in real business scenarios [5]. As data becomes a production factor, financial enterprises place increasing value on data assets. Reports have traditionally been an important way to demonstrate data’s intrinsic value, yet fixed reports are increasingly insufficient for business needs. Business personnel require the ability to explore valuable data from certain datasets to support decision-making directly and wish to convert data insights into business value quickly. Traditionally, business users would submit data requests to the IT department. Technique experts develop the data requirements and return the data results to the business users. An edge-cloud computing platform could help to finish it. The process can take 1–2 weeks at best and up to 1–2 months at worst, severely impacting business decision-making and subsequent strategies. Business data scientists must update data extraction rules based on query results, adding further complexity. This creates a need for an edge-cloud computing platform system where business personnel can directly use data through natural language, as shown in Figure 1. Here, business personnel express data requirements in natural language, and the NL2SQL engine converts it into SQL to access the database. The resulting data is then returned to the business personnel. This significantly reduces communication costs, and if the data is unsuitable, business users can quickly adjust the input themselves to obtain desired results, thus improving data-driven decision-making efficiency.
The NL2SQL model is the core component of the entire business process and is highly anticipated by users in the financial sector. The financial industry is knowledge-intensive, with business users having more stringent requirements for NL2SQL accuracy and a lower error tolerance than in other sectors. Traditional NL2SQL models are largely general-purpose and cross-industry, with few models tailored specifically for the financial industry. In this context, this paper proposes a novel NL2SQL construction approach for the financial industry, highlighting three key innovations:
  • A financial semantic data model and a knowledge base for the financial industry were developed to enhance the big model’s understanding of the sector through database recall and table linking.
  • An innovative approach was introduced first to convert natural language into Python code and then convert that code into SQL. The model’s accuracy was improved through continuous optimization and iteration during this intermediate process.
  • We tested this innovative method on the BIRD dataset and found that it exceeded most currently popular open-source models.
This paper consists of five sections. The Section 1 mainly describes the research background of NL2SQL; the Section 2 mainly describes the research status of NL2SQL and mainly describes the main methods used in the research process of this paper; the Section 3 mainly describes the method proposed in this paper and the implementation process; the Section 4 mainly describes the experimental process and comparative experimental results of the method proposed in this paper; the Section 5 summarizes the content of this paper and describes future research.

2. Related Work

2.1. Research Status

After significant contributions from numerous experts and scholars and a long period of evolution, the NL2SQL field has seen considerable progress. From an algorithmic perspective, NL2SQL methods have evolved through rule-based approaches, neural network-based approaches, pre-trained models, and large language models (LLMs), each playing a unique role in its respective era.
Rule-based methods rely on predefined rules or semantic parsers [6]. For example, NaLIR [7] translates natural language queries into SQL with a special parser and predefined rules. However, there are a lot of limitations in adaptability, scalability, and generalization, making them difficult to apply broadly. Researchers overcome these limitations with neural networks. Advances in deep neural networks have driven progress in text-to-SQL [8], automatically learning mappings from user questions to corresponding SQL. Some models employ an encoder and encode mechanism to finish NL2SQL. With Transformer [9] and the Spider dataset coming, neural network-based methods win in the field [10]. Pre-trained language models (PLMs), like BERT and T5 [11], became a new paradigm for text-to-SQL systems, demonstrating state-of-the-art results on benchmark datasets.
The rise of ChatGPT, GPT-4, and GPT-4o has profoundly transformed NL2SQL solutions, with LLM-based methods significantly advancing the field [11]. Generally, LLM-based approaches involve either in-context learning (ICL) [12] or fine-tuning (FT) [13], achieving high accuracy through carefully designed frameworks and stronger comprehension than PLMs. For example, DAIL-SQL [14] obtained nice results on the Spider dataset based on GPT-4. CHASE-SQL [15], which uses Gemini and prompt engineering, also achieved high scores on the BIRD dataset. From NL2SQL’s evolution, we can expect LLM- and PLM-based methods to keep on winning the next decade in the area, with LLM-based methods particularly favored by researchers. Therefore, understanding the characteristics of these approaches is essential.
Applying LLMs to NL2SQL typically involves two main methods: in-context learning and fine-tuning LLMs. In-context learning is a technique in which the LLM performs tasks by providing contextual examples [16]. In this approach, the model does not require explicit parameter updates or retraining but is instead guided by directly supplying contextual information for the task. This method leverages the latent abilities of large language models learned from large-scale data, enabling them to infer the appropriate task based on context [17]. A well-designed prompt could help LLMs to work efficiently. Representative prompting methods include RAG [18], few-shot learning [19], and reasoning [20]. RAG involves using inputs to retrieve relevant and up-to-date documents and then constructing prompts with the connected documents and inputs [21]. With RAG, LLMs can acquire recent knowledge, improve response interpretability, and partially avoid hallucinations. Few-shot learning is a machine learning method that aims to learn new tasks or classes with only a few training samples. Compared to traditional machine learning methods, which typically require large amounts of labeled data, few-shot learning significantly reduces data needs and maintains good performance even with limited data. Finally, researchers have discovered LLM’s reasoning capabilities [22], where reasoning refers to guiding an LLM to perform multi-step reasoning through prompting. This approach uses language models to generate coherent, logical steps, helping complete complex tasks like mathematical problems, logical reasoning, and common-sense questions [23]. The core of prompt-based reasoning is designing effective prompts to guide LLM to think in steps and derive correct answers rather than relying solely on direct output. This allows the LLM to “think” through certain problems with sequential and logical reasoning.
Additionally, new techniques worth noting include CoT [24], Self-Consistency [25], ToT [26], RAP [27], ReAct [28], Self-Refine [29], Reflexion [30], and LATS [31].
Another approach to applying LLMs is fine-tuning. Fine-tuning allows the model to achieve greater accuracy and effectiveness in specific domains. This process leverages the general language knowledge acquired by the pre-trained model and then adjusts it on a smaller, domain-specific dataset. In NL2SQL, using an in-context learning strategy does not involve optimizing LLM parameters, treating LLMs as ready-made tools. However, if users have lots of corpora or computing power to adjust LLM parameters, fine-tuning for specific NL2SQL tasks can improve model performance and accuracy. Fine-tuning tasks involve two key stages, pre-training and fine-tuning, expressed as follows:
LLM-New = FT (F-Ptrain (LLM, DS), SDS)
In the pre-training phase, the LLM is adjusted on many datasets (DS) to develop strong comprehension abilities. Then, the pre-trained model is further updated to a more specific dataset (SDS) closely related to the task. The process aims to define the model’s abilities to interpret and generate SQL more effectively according to questions. Fine-tuning addresses privacy challenges, although it is costly due to the high computational requirements for adjusting LLM parameters. Several fine-tuning methods are popular, such as SFT, RLHF, and PEFT [32]. These methods have different characteristics and usage scenarios. Prompt tuning focuses on designing the model reasoning process rather than modifying the model’s internal weight structure [33].
Due to hardware constraints and cost limitations, most current NL2SQL methods employ in-context methods, with few directly fine-tuning large models. The table below lists the top five models in accuracy on the BIRD and Spider datasets, which are mainly based on in-context learning, particularly achieving good results with GPT-4, such as Table 1.
Previously, most NL2SQL methods involved linking patterns or semantic optimization in closed-source models, which poses minimal issues for other industries but significant risks for the financial sector. Financial regulations require traceable results and prohibit black-box situations, necessitating open-source models. Furthermore, the complexity of financial business makes it challenging for standard NL2SQL models to interpret specialized financial terminology accurately, complicating natural language-to-SQL conversion. This paper arose in response to this need.

2.2. Preliminaries

To better understand the content of this paper, a brief introduction to some key technologies used in this study is provided below.

2.2.1. DeepSeek

Considering the unique requirements of the financial industry, this paper uses the open-source large language model DeepSeek. DeepSeek is a powerful open-source large language model (LLM) that excels in natural language processing (NLP), particularly in natural language-to-SQL tasks. The core architecture of DeepSeek draws from the LLaMA model, employing an autoregressive Transformer decoder architecture. It utilizes Multi-Head Attention (MHA) and Grouped Query Attention (GQA) techniques, effectively enhancing the model’s performance and efficiency. DeepSeek is trained from scratch on an extensive dataset containing two trillion Chinese and English tokens, giving it strong bilingual processing capabilities. The DeepSeek model can be applied to various scenarios, including but not limited to real-time data queries, the capability enhancement of traditional BI tools, and basic data mining and insights.
DeepSeek and LLaMA are currently popular open-source large language models with distinct strengths and characteristics. DeepSeek innovates in its model architecture, introducing a new Multi-Head Latent Attention (MLA) architecture, which significantly reduces memory usage and employs a self-developed sparse MoE technology to reduce computational load further and improve inference efficiency. DeepSeek-V2 is fully open-source and commercially available, following the MIT open-source license. DeepSeek V2.5 demonstrates strong performance in Chinese comprehension (AlignBench), competing alongside several closed-source models. In summary, both DeepSeek and LLaMA are powerful language models with their respective advantages. DeepSeek shows remarkable efficiency in cost, architectural innovation, and Chinese language processing, while LLaMA offers versatility in model scalability and value for open-source research. Developers and researchers can choose the appropriate model based on their needs. This paper primarily focuses on the financial Chinese domain, so the DeepSeek V2.5 model was selected.
DeepSeek V2.5, developed by DeepSeek-AI, is an advanced AI model focused on general language tasks and programming applications. This version improves upon its predecessors, combining natural language understanding and generation capabilities with strong programming functionality. DeepSeek V2.5 features exceptional writing skills, enhanced instruction execution, and programming integration. It supports both local deployment and cloud-based usage, ensuring flexibility for different users. This paper is based on DeepSeek V2.5.

2.2.2. Prompt Engineering

This paper leverages DeepSeek and prompt engineering to enhance the accuracy of NL2SQL, utilizing prompt engineering in both the NL2PY (Natural Language to Python) and PY2SQL (Python to SQL) components. The logical flowchart of prompt engineering used in this paper is shown in Figure 2. It includes the foundational architecture, supplementary knowledge, sample selection, and reasoning approach. Considering the specifics of the financial industry, we have incorporated financial knowledge and financial data models into the supplementary knowledge section. This greatly improves the model’s understanding of the financial industry’s characteristics and enhances the matching between database tables and Chinese language inputs.
Compared with directly converting natural language to SQL (NL2SQL), the method of converting natural language to Python code (NL2PY) and then converting Python code to SQL (PY2SQL) has the following advantages: First, as a general programming language, Python has mighty expressive power and can more flexibly implement complex business logic and algorithms, including advanced features such as loops, conditional judgments, and functions. Second, Python has a wealth of third-party libraries and modules, which facilitates data processing and analysis and improves the reusability and maintainability of the code. In addition, Python’s debugging and testing tools are more complete, which is convenient for developers to verify the code and troubleshoot errors. The limitations of NL2SQL are mainly reflected in the following aspects: First, it is difficult to directly express complex logic and algorithms, especially when it involves multi-table associations, nested queries, or dynamic calculations. Second, the generated SQL statements may be too complex and need better structure and readability, which challenges the maintenance and understanding of the code. Third, SQL lacks advanced data processing capabilities and flexible error handling mechanisms and cannot integrate various data sources and processing processes as quickly as Python. Finally, SQL debugging tools are relatively scarce, and it is not easy to debug complex queries. In summary, the method of NL2PY and then PY2SQL can fully utilize the advantages of Python, make up for the shortcomings of NL2SQL, and better meet complex business needs.

2.2.3. Pandas

Pandas is an open-source Python data analysis library that offers high-performance, user-friendly data structures and analysis tools. The main data structures in Pandas are the Series (a one-dimensional array) and the DataFrame (a two-dimensional table), both designed to make data manipulation simple, intuitive, and efficient. The Series is a one-dimensional data structure containing any data type, such as integers, strings, floats, or Python objects, and resembles a single column of data. The DataFrame is a two-dimensional data structure, essentially a table made up of multiple Series, similar to an Excel sheet or SQL table. Pandas provides rich data manipulation functionalities, including filtering, sorting, merging, grouping, and aggregating. It supports complex data operations, such as time series analysis, window functions, and rolling statistics.
Pandas also includes various tools for handling missing data, outliers, and duplicates—common tasks in data cleaning. It can import data from multiple sources, including CSV, Excel, SQL databases, and JSON, and export data to these formats. As one of the core libraries in the Python data science ecosystem, Pandas works in conjunction with libraries like NumPy, SciPy, Matplotlib, and Scikit-learn to offer a robust toolkit for data analysis, cleaning, visualization, and machine learning. In this paper, we employ Pandas methods for data manipulation, continuously optimizing them based on different scenarios.

2.2.4. Financial Industry Data Model and Corpus

A financial data model is a structured representation of common tables and fields used in the financial sector, designed through business abstraction to capture operational realities. Using entity-relationship diagrams, this model reflects various financial business scenarios. Drawing on years of financial sector development, this paper builds on the Teradata FS-LDM to enhance a comprehensive, enterprise-level data model tailored for the financial industry, covering approximately 80% of banking business data.
The model comprises ten themes: Party, Product, Agreement, Event, Asset, Finance, Institution, Region, Marketing, and Channel. Key themes like Customer, Agreement, Event, Asset, and Finance are central to the model and are supported by rich data sources and references in the source systems. The goal is to ensure completeness and richness, adding custom data elements unique to banks where necessary. Non-core themes such as Applications, Marketing Activities, Channels, Institutions, and Products generally have minimal data sources or references. The objective here is to maintain structural integrity and scalability, with adjustments as needed over time.
Most financial institutions’ data platforms are designed following enterprise-level data models, as providing a predefined data model can enhance a large model’s understanding of financial industry database tables and fields, ultimately improving NL2SQL’s accuracy. The financial data model includes tables, columns, foreign key relationships, and code tables. Tables represent business entities and their relationships, with each table having both Chinese and English names and interconnected relationships. Column names and data types are clearly defined, allowing customization to fit specific needs. The primary and foreign keys determine the relationship between tables.
Additionally, the model includes code tables to store commonly used codes in the financial industry. For example, gender may include male and female, while birthplace lists all provinces in the country.
Considering the financial sector’s unique characteristics, this paper establishes a financial industry corpus consisting of 12 types of financial sector-specific language resources, as illustrated. These resources help large models better understand financial natural language’s linguistic characteristics and meanings, supporting improved comprehension of business users’ requirements, such as Figure 3

3. Model Design and Implementation

This section mainly describes the proposed method structure of NL2PY2SQL, Pre-Processing, NL2PY and Revise Python, NL2SQL and Revise SQL, and Post-Processing.

3.1. Overview of FI-NL2PY2SQL

The overall workflow of this model is illustrated in the diagram, which includes Pre-Processing, NL2PY (Natural Language to Python), PY2SQL (Python to SQL), and Post-Processing. First, we preprocess the query questions, evidence, database information, financial data models, and financial corpora to identify potential database tables. Next, we convert natural language to Python code using prompt-based methods and DeepSeek. The Python code is then iteratively optimized based on database and sample information. With the Python-generated code and database knowledge, SQL is generated using Prompt and DeepSeek, and the SQL queries are continually executed and refined by troubleshooting issues. Finally, the entire SQL is validated against checkpoints and then outputted. As shown in Figure 4.
As shown in Figure 5, the input data mainly includes natural language queries and database schema information. Natural language queries are encoded as vector representations, and database schema information (table and column names, etc.) are encoded as embeddings as auxiliary context input. The model combines encoded information and generates corresponding SQL queries using a deep learning architecture. The model can understand the query intent and generate corresponding SQL statements by learning the mapping relationship between natural language and SQL. The encoding and fusion of input data help the model accurately understand and operate the database structure when generating SQL.
The sample CSV converts each data table of all databases directly into a CSV file. This allows Python to read each data table file directly and use the Pandas data frame to represent each data table. In the future, the large model can generate Python code and directly calculate the data to check whether the final data meets the natural language question.
The output of the large model is the SQL statement generated for the natural language question. Since there may be many variants of SQL for the same result, we cannot directly judge whether the SQL statement is precisely the same. Therefore, the final method to evaluate the SQL accuracy is to compare whether the query result is consistent with the result of the accurate SQL statement by executing the SQL statement. If the result set r0 generated by the generated SQL statement is precisely the same as the result set r1 generated by the correct SQL, the result generated by the SQL is considered proper.

3.2. Pre-Processing

In the NL2SQL (Natural Language to SQL) model, the data preprocessing and prompt generation steps are key steps to ensure the model can understand and correctly generate SQL queries.
Data preprocessing aims to convert natural language queries and database structure information into a format that the model can understand and process. The steps are as follows:
  • Natural language cleaning and standardization;
Denoising: remove irrelevant characters, punctuation, extra spaces, etc.
Standardization: unify the format of dates, numbers, etc.
  • Word segmentation and part-of-speech tagging;
Word segmentation of natural language queries and splitting sentences into words or subwords helps the model identify keywords, entities, and operations.
Part-of-speech tagging (POS Tagging) helps the model understand the grammatical role of each word.
  • Named Entity Recognition (NER);
Entity recognition: identify entities such as table and column names in the query.
  • Syntactic analysis and dependency analysis;
Through syntactic analysis, understand the grammatical relationship between words in the query, such as the subject-verb-object relationship.
Dependency analysis helps identify the dependency relationship between words.
  • Database schema mapping;
Obtain the database structure information, such as table names, column names, and their data types. This step is key to matching the entities in the natural language query with the actual tables and columns in the database.
  • Semantic understanding and constraint analysis;
Identify the conditions, constraints, and goals in the query to help the model understand the business logic of the query.
After completing data preprocessing, generating appropriate prompts is key to passing natural language queries and database schema information to the model. The purpose of this stage is to guide the model in generating correct SQL statements by constructing effective prompts. The steps are as follows:
  • Constructing prompt templates;
In order to help the model understand the task, prompts are usually constructed using templates. Common template forms include basic templates and templates combined with database structure.
  • Combined with database schema information;
When generating prompts, not only natural language queries but also database table structure information should be passed in so that the model understands the context of the query.
  • Context-enhanced prompts;
Use additional contextual information to enhance prompts, especially in complex query scenarios.
Guide the model in understanding the query intent.
The prompt should clearly express the query’s intent, such as specifying operations (SELECT, WHERE) or constraints.
  • Prompt optimization;
Improve the model’s performance through different prompt structures or diversified SQL query templates.
  • Adjustment and optimization.
Through experiments and model tuning, adjust the format or content of prompts to improve the accuracy of generated SQL. For example, try to use more detailed table structure information or more specific business context to enhance the effectiveness of prompts. The data preprocessing and prompt generation algorithms are shown in Algorithms 1 and 2.
Algorithm 1: Algorithms for Data Preprocessing
Input: query and schema
Output: processed_query and processed_schema
1function preprocess (query, schema):
2 cleaned_query = clean_and_standardize(query)
3 tokenized_query = tokenize_and_pos_tag(cleaned_query)
4 entities = named_entity_recognition(tokenized_query)
5 syntactic_structure = syntax_and_dependency_analysis(tokenized_query)
6 mapped_schema = map_to_schema(entities, schema)
7 constraints = semantic_analysis(entities, mapped_schema)
8 return tokenized_query, mapped_schema, constraints
In Algorithm 1, starting from the input natural language query and database schema information, the information is finally converted into a format suitable for the model after a series of processing such as cleaning, word segmentation, entity recognition, syntactic analysis, schema mapping, and semantic analysis.
Algorithm 2: Algorithms for prompt generation
Input: processed_query, processed_schema and constraints
Output: prompt
1function generate_prompt(processed_query, processed_schema, constraints):
2 prompt_template = “Translate the following question into SQL: {query}”
3 prompt_with_schema = “Given the following table schema: {schema}, “ + prompt_template
4 prompt_with_schema = prompt_with_schema.format (query = processed_query, schema = processed_schema)
5 if constraints:
prompt_with_constraints = prompt_with_schema + “ Apply the following constraints: {constraints}”
prompt_with_constraints = prompt_with_constraints.format (constraints = constraints)
6 else:
7 prompt_with_constraints = prompt_with_schema
8 return prompt_with_constraints
In Algorithm 2, we combine the preprocessed query and database schema information to construct hints with table structure and query intent to help the model understand the query and generate correct SQL query statements.

3.3. NL2PY and Iteration

Based on the generated JSON file, we designed a prompt for DeepSeek to generate Python code. When generating code, ensure that Python’s Pandas module is imported, as Pandas will be used for data manipulation. Using the problem description, prompt information, and extracted field values from the table, create a prompt that guides the model to generate the Pandas query code for answering the question. The LLM should follow a chain-of-thought prompting method, generating Python code that adheres to a specific style, including both reasoning steps and the final Python code.
The generated Python code is then executed to verify and optimize its functionality. During execution, the code undergoes checks by the large model to identify potential syntax errors, confirm that a result is generated, and ensure the result satisfies the question requirements. If any errors occur or the results do not meet expectations, the error message and related context are sent back to the LLM, requesting a corrected version of the Python code. This iterative process can run up to a maximum of N cycles, defaulting to 5 in this case. If the question is successfully answered, the process exits early.
In this process, we decompose complex natural language queries into multiple subtasks and generate intermediate code for each subtask. Then, we gradually deduce the logic of each subtask and verify the validity of the code. Finally, we integrate and optimize the various intermediate code fragments into complete Python. This step-by-step reasoning and verification method effectively improves code generation’s accuracy, readability, and maintainability. The chain of thought (CoT) reasoning process is shown in Algorithm 3.
Algorithm 3: Algorithms for CoT reasoning process
Input: processed_query, entities and syntactic_structure
Output: code_fragments
1function generate_python_code_with_CoT (query, entities, syntactic_structure):
2 sub_tasks = decompose_query (query)
3 code_fragments = []
4 for task in sub_tasks:
intermediate_code = generate_intermediate_code(task, entities)
code_fragments.append (intermediate_code)
7 validate_code (intermediate_code)
8 final_code = integrate_code_fragments (code_fragments)
optimized_code = optimize_code (final_code)
return optimized_code
In Algorithm 3, Python code is generated step by step by decomposing a complex task into multiple subtasks. First, the query is decomposed into independent subtasks, and intermediate code is generated and verified for each subtask. Then, all intermediate code fragments are integrated into the complete code, and its simplicity and readability are optimized. Through step-by-step reasoning and verification, the code of each subtask is carefully checked to ensure that the final output code is accurate, efficient, and meets user requirements.

3.4. NL2SQL and Iteration

Using the generated Python code, we provide the question, relevant information, and Python code as inputs to the large model DeepSeek. With a Chain of Thought (CoT) approach, craft prompts to guide DeepSeek in generating an SQL statement based on database and contextual information. The output includes both the reasoning process and the generated SQL query. Various factors must be considered in designing the prompt to ensure the model fully understands the goal.
As shown in Figure 6, the large language model combines the context and prompt information to generate SQL statements according to relevant requirements. These requirements are mainly for some common problems in NL2SQL, and special requirements and specifications are made. For example, order by + limit one is preferred instead of max/min, ID is selected, and necessary fields are displayed cautiously.
After generating the initial SQL, the code is executed in the database. If issues arise, they are returned to LLM, which may include errors in execution, successful execution with empty results, or results that do not satisfy the query requirements. For execution errors, the system captures the error message and sends it back to the model, prompting it to refine the code and generate a new SQL statement. This iterative process can run up to N cycles (defaulting to 5 in this case) and can be adjusted as needed. The process stops immediately if the query successfully answers the question.

3.5. Post-Processing

After SQL generation, further refinements are necessary to handle common errors and improve accuracy. Typical issues to address include displaying an extra column, selecting incorrect fields, omitting specified fields, handling null values in calculations, duplicate values, unclear metric definitions, decimal conversions, required aggregate calculations, and order-based computations. Based on these checkpoints, a final prompt is crafted to guide the model in performing effective validation and adjustments, ultimately returning a more accurate SQL query, as shown in Figure 7.
In Figure 7, first, a syntax check is performed to ensure that the generated SQL query complies with the SQL syntax specification. Next, query optimization is performed to simplify redundant expressions or subqueries to improve query efficiency. Finally, database-specific adjustments are performed to ensure the generated SQL can be correctly executed in a specific database to avoid potential errors or performance issues. These steps ensure that the final SQL query can accurately and efficiently meet user needs. This final prompt ensures that DeepSeek evaluates the SQL query against each of these criteria, making targeted adjustments as needed. The revised SQL should then better align with the problem requirements and provide a more reliable answer.
When using the NL2PY followed by the PY2SQL method, the iterative processing steps may result in significant computational overhead, especially in scenarios with large-scale data sets. Each transformation step (from natural language to Python code to SQL query) consumes computational resources, and as the complexity of the query increases, so do the model’s processing time and memory requirements. This overhead is particularly prominent in systems with high real-time requirements and may affect the overall query efficiency and system response time. To solve this problem, we first reduce unnecessary intermediate steps by optimizing the conversion process. Secondly, parallel processing and distributed computing are used to distribute tasks to multiple nodes to improve processing efficiency. For common query patterns, caching and precomputation can avoid repeated calculations and reduce the computational cost of each query. Finally, according to the complexity of the query, an adaptive simplification strategy is adopted to save computing resources further. Through these methods, the computational overhead is balanced while improving accuracy, and the system’s efficiency is ensured.

3.6. Application Cases

The NL2PY2SQL method we proposed in this article has been effectively used in application scenarios, demonstrating its significant advantages in processing complex business logic, improving query efficiency, and reducing maintenance costs. The details are shown in Figure 8.
As shown in Figure 8, by converting natural language into Python code and then generating SQL from the Python code, this method makes full use of the flexibility and efficiency of Python to achieve accurate expression of business needs and in-depth optimization of data queries. It is especially suitable for finance; fields such as medical and legal fields that require complex data processing have further verified its broad applicability and innovative value.
The NL2PY2SQL method proposed in this paper and NL2SQL have characteristics in applications outside of finance (such as education and supply chain). NL2PY2SQL first converts natural language into Python code, using Python’s flexibility and rich library support to achieve complex logic expression and data analysis. For example, it can be used in education to design dynamic learning paths and evaluate student performance; in the supply chain, it can handle complex inventory management algorithms, transportation optimization problems, etc. Python code is easy to debug, expand, and modularize, can more intuitively reflect business logic, and convert logic into efficient database operations through PY2SQL. In contrast, NL2SQL directly generates SQL from natural language, which is suitable for quickly executing simple queries. Still, dealing with complex multi-table associations, nested logic, and dynamic calculation tasks is difficult. In addition, the SQL statements generated by NL2SQL need more precise readability and have high maintenance costs. Through two-step processing, NL2PY2SQL ensures support for complex logic and considers performance and scalability, providing a more flexible and reliable solution for diversified scenarios such as education and supply chain.

4. Experimental Results and Analysis

4.1. Datasets and Evaluation Metrics

Considering that the BIRD dataset is more complex and resembles real-world scenarios more closely, including financial contexts, this paper uses the BIRD development dataset as the test dataset, which aligns with the benchmarks chosen by similar models.
The BIRD dataset was released on 21 September 2023. It represents a pioneering cross-domain dataset to study the impact of extensive database content on text-to-SQL parsing. BIRD comprises over 12,751 unique question-SQL pairs, 95 large databases totaling 33.4 GB, and 37 specialized domains such as blockchain, hockey, healthcare, and education. The BIRD dataset is broader, not a specialized dataset for finance. It covers games, sports, coding, finance, biology, etc.
Compared with the Spider 2.0 dataset, the BIRD dataset has apparent advantages in NL2SQL experiments in the financial field. First, the BIRD dataset focuses on the complex table structure and business logic unique to the financial industry, covering many financial statements, transaction records, and customer data, which can more realistically reflect the characteristics of financial data. Secondly, the queries in the BIRD dataset often contain more specific operations related to the financial field, such as financial calculations, risk assessments, and dynamic queries, which enables the model to better adapt to the needs of practical applications. In contrast, although the Spider 2.0 dataset covers multiple fields, it lacks dedicated data and scenarios for the financial industry, so its performance in financial tasks may not be as good as the BIRD dataset. The following are the databases used for the NL2SQL question test and the number of questions in each database.
In Table 2, the input data is the BIRD development dataset dev (https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip, (accessed on 8 October 2024)). This dataset has 1533 natural language query questions. We select EX (Execution Accuracy), F1 (Partial Match Score), and VES (Value Execution Score) metrics to facilitate better comparative testing and conduct experiments on the dataset’s Simple, Moderate, and Challenging types, comparing the results with 11 typical NL2SQL methods.
1.
EX (Execution Accuracy)
EX refers to Execution Accuracy, which indicates whether the results returned by the new SQL queries are the same as those of the target queries when executed on the database. If the results match those of target SQL queries, they are counted as correct. EX mainly focuses on the accuracy of the query execution results rather than whether the specific structure of the SQL queries matches exactly.
Total EX (Total Execution Accuracy) is usually used to measure the accuracy of SQL queries generated by the NL2SQL model when executed. It indicates the proportion of SQL queries generated by the model in all input samples that are correctly executed and return expected results. The calculation formula is as follows:
T o t a l   E X = N u m b e r   o f   C o r r e c t   E x e c u t i o n s T o t a l   N u m b e r   o f   S a m p l e s
In Formula (2), the Number of Correct Executions refers to the number of samples in which the generated SQL query can correctly return the expected results during actual execution. The total Number of Samples refers to the total number of samples of all-natural language queries in the test set. The total EX value is between 0 and 1, usually expressed as a percentage. It is used to evaluate the model’s overall performance in generating SQL queries in the test set and correctly returning results.
2.
F1 (Partial Match Score)
The F1 score is used to measure the similarity between the SQL generated by the model and the target SQL. By comparing the overlap in syntax and structure between the generated SQL and the target SQL, the F1 metric provides a finer-grained assessment of the SQL generation effectiveness. Typically, it calculates the precision and recall between the generated SQL and the target SQL and then computes the F1 value as a comprehensive evaluation.
3.
VES (Value Execution Score)
VES refers to Value Execution Score, a new metric in NL2SQL that focuses on whether the generated SQL queries can correctly retrieve specific values while satisfying the query conditions. For example, if the query results contain specific numbers, strings, or dates, VES evaluates the accuracy of the generated SQL queries in terms of condition expression and value matching, further examining the execution performance of the generated SQL in detail.
These three metrics assess the performance of the NL2SQL model from different perspectives: EX focuses on the consistency of execution results, F1 emphasizes structural similarity, and VES concerns the precision of value matching. This multi-dimensional evaluation system helps researchers gain a more comprehensive understanding of the model’s strengths and weaknesses.
Ablation studies are used to analyze the contribution and importance of various components within a model or algorithm. Specifically, in fields like machine learning and deep learning, ablation studies involve gradually removing certain model components and observing performance changes to determine each part’s impact on overall performance. To better identify the critical steps in the entire process, this paper designs ablation experiments, including the removal of Revise PY, the exclusion of NL2PY, and the non-use of CoT (Chain of Thought), among other aspects.

4.2. Prediction Results

4.2.1. EX Prediction Results

To deeply analyze the performance of different models in NL2SQL tasks, we used SuperSQL, SFT_CodeS_15B_EK, SFT_CodeS_7B_EK, DAILSQL_SC, SFT_CodeS_3B_EK, DAILSQL, SFT_CodeS_1B_EK, C3_SQL, RESDSQL_3B_EK, RESDSQL_Large_EK, RESDSQL_Base_EK, etc. in the experiment. Comparative experiments were conducted on the model. These models have significant differences in architecture, training data, parameter scale, etc., which can help us comprehensively evaluate the model’s accuracy, efficiency, and adaptability. Large-scale models such as SFT_CodeS_15B_EK and RESDSQL_Large_EK can handle more complex queries but may bring higher computational overhead. In contrast, small-scale models such as SFT_CodeS_1B_EK can provide faster inference speed and are suitable for real-time query scenarios. Models such as DAILSQL and RESDSQL can better adapt to the data characteristics of the financial field and handle complex queries such as financial statements and transaction data by optimizing specific database structures. By comparing the performance of different models, we can better understand their advantages and disadvantages in financial applications and ultimately help select the most suitable model to improve the accuracy and efficiency of the NL2SQL system.
As shown in Table 3, in the Simple examples, the model improved by at least 1.83% and by as much as 27.24% compared to subsequent models. In the Moderate examples, it achieved a minimum improvement of 4.31% and a maximum improvement of 30.82% over other models. In the Challenging examples, the model improved by at least 3.45% and up to 32.42% compared to other models. Overall, the model showed an improvement of at least 2.73% and a maximum of 28.81% compared to others.

4.2.2. F1 Prediction Results

From the F1 experimental results, in the Simple examples, the model improved by at least 2.28% and by as much as 28.35% compared to subsequent models. In the Moderate examples, it achieved a minimum improvement of 6.17% and a maximum improvement of 33.64% over other models. In the Challenging examples, the model improved by at least 5.13% and up to 36.36% compared to other models. Overall, the model showed an improvement of at least 3.72% and a maximum of 30.7% compared to others. As shown in Table 4.

4.2.3. VES Prediction Results

From the VES experimental results, in the Simple examples, the model improved by at least 5.87% and by as much as 33.49% compared to subsequent models.
As shown in Table 5, in the Moderate examples, it achieved a minimum improvement of 7.6% and a maximum improvement of 35.93% over other models. In the Challenging examples, the model improved by at least 5.37% and up to 38.07% compared to other models. Overall, the model showed an improvement of at least 6.34% and a maximum of 34.66% compared to others.

4.2.4. Statistical Significance Test

Statistical significance testing provides a scientific basis for decision-making, optimizes resource allocation, and improves the credibility of results by verifying assumptions, assessing causality, and reducing subjective bias. It is widely used in multi-field data analysis and verification. To understand the effect of our proposed NL2PY2SQL method, we conducted a significance test, and the results are shown in Table 6.
As can be seen from Table 6, the EX, F1, and VES indicators of NL2PY2SQL and other compared models, P value of Ex, P Value of F1, and P Value of VES. P values for pairwise statistical significance tests of NL2PY2SQL relative to the following models. Except that the P value of EX between NL2PY2SQL and SUPERSQL did not reach below 0.05, other indicators all reached below 0.05, indicating that NL2PY2SQL exceeds the indicators of different models under specific indicators and is statistically significant.

4.2.5. Error Analysis

In the NL2SQL experiment using the BIRD dataset, the errors mainly focused on entity recognition, join operations, and processing complex conditions. First, entity recognition errors occurred frequently, especially when the query involved multiple tables and columns. Many column names and table names in the BIRD dataset have similarities or synonyms, and the model sometimes fails to correctly identify these entities, resulting in references to the wrong table or column, thereby generating invalid SQL queries. For example, the user query may mention “total sales”, but the model may incorrectly map it to “sales quantity”, resulting in logical misunderstanding.
Second, joint operation errors are also a common problem. There are complex relationships between the tables in the BIRD dataset. When the model generates SQL queries for multi-table joins, it sometimes fails to correctly understand the association conditions between the tables, resulting in the generated query omitting the necessary join conditions, leading to Cartesian products or incorrect dataset results. In addition, the model’s processing of complex nested queries is also insufficient. When the user’s natural language query contains nested conditions, the subquery structure generated by the model sometimes has errors, resulting in the inability to execute the SQL query or the results not being different.
Finally, the errors of entity recognition and complex joins were solved using the NL2PY and then the PY2SQL method. First, the natural language was converted into highly readable Python code to clarify the logic, and then the Python code was converted into SQL queries. This gradual conversion reduced ambiguity and ensured that the model had clear context and logical relationships when understanding and generating SQL, especially in processing multi-table joins and nested conditions.

4.3. Ablation Experiment Prediction Results

From the ablation comparison results of EX, removing CoT has the greatest impact on the results, causing an accuracy decrease of 11.27%. In fact, removing NL2PY leads to a decrease in accuracy of 1.95%. From the ablation comparison results of F1, removing CoT also has the greatest impact, resulting in an F1 score decrease of 11.94%. The next largest impact comes from removing Revise SQL, which decreases F1 by 1.47%. From the ablation comparison results of VES, removing CoT again has the greatest impact, causing a VES decrease of 12.01%. In contrast, removing NL2PY results in a decrease of 2.6% in VES. The analysis above shows that CoT and NL2PY significantly affect the entire process, suggesting that further research can be conducted on these two components.
As shown in Figure 9, the Chain Thinking (CoT) and NL2PY each have their unique contributions to results in NL2SQL, and their synergy can significantly improve the accuracy and logic of query generation. CoT (Chain of Thought) uses step-by-step reasoning to help the model handle complex logic and conditions more accurately when generating SQL queries. It breaks down the problem and gradually expands thinking to avoid jumping directly to the final answer, thereby reducing the possibility of errors. For example, when processing complex multi-table joins or nested queries, CoT helps the model clarify the reasoning process at each step, making the final SQL query more interpretable and accurate. NL2PY further clarifies the query logic by converting natural language into Python code. As an intermediate expression form, Python code makes the execution of query steps more controllable and precise and provides a more accurate basis for subsequent SQL conversion. Through this step-by-step transformation, NL2PY reduces the complexity of natural language understanding, allowing the model to generate SQL queries more accurately. Combining the two, CoT is responsible for the refined reasoning process, while NL2PY ensures the logical operability of queries, thus improving the overall performance and accuracy of the NL2SQL system.
The method proposed in this article first converts natural language to Python and then to SQL, which can flexibly implement complex transaction logic and risk models in the financial field and use Python data analysis and modeling to process large-scale data efficiently, thus significantly improving the accuracy and reliability of queries. Interpretability and decision-making efficiency significantly reduce maintenance costs. In the medical and legal fields, the method of NL2PY combined with PY2SQL has demonstrated strong adaptability. Using Python to express complex logic, such as medical diagnosis paths, legal case reasoning, etc., and then converting it to SQL to achieve efficient queries and data operations can simplify multi-source data integration and support personalized analysis and decision-making. At the same time, the rich Python library can meet the industry’s specific needs and promote intelligent upgrades.
Financial applications based on LLM may face bias risks and compliance issues. For example, the model may inherit biases in the training data, leading to unfair decisions in credit approval, risk assessment, etc. In addition, the generated content may conflict with financial regulations or fail to meet regulatory requirements. To address these challenges, it is necessary to strengthen data review, introduce bias correction mechanisms, and combine strict model output monitoring and compliance audits to ensure fairness and legality while protecting customer rights and reducing legal and reputational risks. In order to comply with GDPR or financial data privacy laws, LLM should embed privacy protection mechanisms such as data minimization, de-identification, and encryption technologies to ensure that user data is not abused during training and inference. Model development must follow privacy design principles and conduct regular compliance audits and risk assessments. In addition, access control and output filtering can be set to prevent the generation of sensitive information. Transparent explanations of model behavior and user data permission management are crucial to meeting regulatory requirements and enhancing user trust.

5. Conclusions

This paper thoroughly analyzes previous advancements and challenges in NL2SQL, considers the unique characteristics of the financial industry, and proposes a tailored NL2SQL approach specifically designed for this sector on the edge-cloud computing platform. Based on the inherent structure of large models, an innovative two-step method is introduced: first, converting natural language to Python, then translating Python to SQL. Accuracy is further enhanced by iteratively optimizing database semantics and supplementing knowledge. Utilizing the open-source DeepSeek model, along with the unique features of Python and SQL, and making full use of prompt engineering, various modes were designed and iteratively refined, achieving significant improvements in NL2SQL accuracy. Specifically, EX improved by at least 2.73%, F1 by at least 3.72%, and VES by 6.34% over existing methods, ultimately achieving strong business performance. However, given the high demands of the financial industry, current accuracy levels are still slightly insufficient, indicating the need for further optimization. During experimentation, it was also observed that the industry’s high level of expertise and extensive corpus could significantly benefit NL2SQL accuracy through continued corpus expansion and the creation of industry-specific datasets, areas for future work to further refine and enhance performance.
Although the NL2PY2SQL method performs well in natural language to SQL conversion, it still has some limitations, especially regarding scalability and applicability. First, the NL2PY2SQL method relies on Python intermediate representation, which may make it face performance bottlenecks when processing complex or diverse queries. For large databases and highly dynamic query scenarios, the intermediate representation generated by Python may not fully capture complex grammatical structures or data dependencies, resulting in limited optimization space. In addition, the current method may have difficulty adapting to unstructured natural language input, especially queries containing fuzzy and ambiguous content. Future research can focus on improving the method’s scalability, such as by optimizing Python intermediate representation generation and more accurate SQL syntax mapping, especially in complex SQL scenarios such as multi-table joins and nested queries. In addition, enhancing the cross-domain adaptability of the method, especially when facing different database systems and application scenarios, can improve the model’s generalization ability through broader corpus and pre-trained models. At the same time, further combining reinforcement learning and automatic tuning mechanisms is expected to make the method more robust and flexible in practical applications.

Author Contributions

X.D., S.H. and F.Z. wrote the main manuscript text. C.W. and B.M.N. provided the idea and prepared the data and figures. All authors reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

The work of this paper is supported by the National Key Research and Development Program of China (2019YFB1405000) and the National Natural Science Foundation of China under Grant (No. 61873309, 92046024, 92146002).

Data Availability Statement

The dataset is from https://bird-bench.github.io (accessed 15 July 2024).

Conflicts of Interest

Author Cheng Wang was employed by the company GienTech Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Liu, X.; Shen, S.; Li, B.; Ma, P.; Jiang, R.; Zhang, Y.; Fan, J.; Li, G.; Tang, N.; Luo, Y.; et al. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? arXiv 2024, arXiv:2408.05109. [Google Scholar]
  2. Shi, L.; Tang, Z.; Zhang, N.; Zhang, X.; Yang, Z. A Survey on Employing Large Language Models for Text-to-SQL Tasks. arXiv 2024, arXiv:2407.15186. [Google Scholar]
  3. Yavuz, S.; Gür, I.; Su, Y.; Yan, X. What it takes to achieve 100% condition accuracy on WikiSQL. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1702–1711. [Google Scholar]
  4. Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv 2018, arXiv:1809.08887. [Google Scholar]
  5. Adewusi, A.O.; Okoli, U.I.; Adaga, E.; Olorunsogo, T.; Asuzu, O.F.; Daraojimba, D.O. Business intelligence in the era of big data: A review of analytical tools and competitive advantage. Comput. Sci. IT Res. J. 2024, 5, 415–431. [Google Scholar] [CrossRef]
  6. Katsogiannis-Meimarakis, G.; Koutrika, G. A survey on deep learning approaches for text-to-SQL. VLDB J. 2023, 32, 905–936. [Google Scholar] [CrossRef]
  7. Li, F.; Jagadish, H.V. NaLIR: An interactive natural language interface for querying relational databases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2014; pp. 709–712. [Google Scholar]
  8. Zhao, Y.; Jiang, J.; Hu, Y.; Lan, W.; Zhu, H.; Chauhan, A.; Li, A.; Pan, L.; Wang, J.; Hang, C.-W.; et al. Importance of synthesizing high-quality data for text-to-sql parsing. arXiv 2022, arXiv:2212.08785. [Google Scholar]
  9. Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  10. Talaei, S.; Pourreza, M.; Chang, Y.C.; Mirhoseini, A.; Saberi, A. Chess: Contextual harnessing for efficient sql synthesis. arXiv 2024, arXiv:2405.16755. [Google Scholar]
  11. Li, H.; Zhang, J.; Liu, H.; Fan, J.; Zhang, X.; Zhu, J.; Wei, R.; Pan, H.; Li, C.; Chen, H. Codes: Towards building open-source language models for text-to-sql. Proc. ACM Manag. Data 2024, 2, 1–28. [Google Scholar] [CrossRef]
  12. Pourreza, M.; Rafiei, D. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Volume 36. [Google Scholar]
  13. Caferoğlu, H.A.; Ulusoy, Ö. E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL. arXiv 2024, arXiv:2409.16751. [Google Scholar]
  14. Gu, Z.; Fan, J.; Tang, N.; Cao, L.; Jia, B.; Madden, S.; Du, X. Few-shot text-to-sql translation using structure and content prompt learning. Proc. ACM Manag. Data 2023, 1, 1–28. [Google Scholar] [CrossRef]
  15. Pourreza, M.; Li, H.; Sun, R.; Chung, Y.; Talaei, S.; Kakkar, G.T.; Gan, Y.; Saberi, A.; Ozcan, F.; Arik, S.O. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. arXiv 2024, arXiv:2410.01943. [Google Scholar]
  16. Mao, W.; Wang, R.; Guo, J.; Zeng, J.; Gao, C.; Han, P.; Liu, C. Enhancing Text-to-SQL Parsing through Question Rewriting and Execution-Guided Refinement. In Findings of the Association for Computational Linguistics ACL 2024; Association for Computational Linguistics: New Brunswick, NJ, USA, 2024; pp. 2009–2024. [Google Scholar]
  17. Hong, Z.; Yuan, Z.; Zhang, Q.; Chen, H.; Dong, J.; Huang, F.; Huang, X. Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv 2024, arXiv:2406.08426. [Google Scholar]
  18. Yang, J.; Hui, B.; Yang, M.; Yang, J.; Lin, J.; Zhou, C. Synthesizing text-to-sql data from weak and strong llms. arXiv 2024, arXiv:2408.03256. [Google Scholar]
  19. Gao, D.; Wang, H.; Li, Y.; Sun, X.; Qian, Y.; Ding, B.; Zhou, J. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv 2023, arXiv:2308.15363. [Google Scholar] [CrossRef]
  20. Yang, R.; Tan, T.F.; Lu, W.; Thirunavukarasu, A.J.; Ting, D.S.W.; Liu, N. Large language models in health care: Development, applications, and challenges. Health Care Sci. 2023, 2, 255–263. [Google Scholar] [CrossRef]
  21. Pourreza, M.; Rafiei, D. Dts-sql: Decomposed text-to-sql with small large language models. arXiv 2024, arXiv:2402.01117. [Google Scholar]
  22. Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A Survey on Large Language Models for Code Generation. arXiv 2024, arXiv:2406.00515. [Google Scholar]
  23. DeepSeek-AI, Q.Z.; Zhu, Q.; Guo, D.; Shao, Z.; Yang, D.; Wang, P.; Xu, R.; Wu, Y.; Li, Y.; Gao, H.; et al. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. arXiv 2024, arXiv:2406.11931. [Google Scholar]
  24. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 24824–24837. [Google Scholar]
  25. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
  26. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Volume 36. [Google Scholar]
  27. Hao, S.; Gu, Y.; Ma, H.; Hong, J.J.; Wang, Z.; Wang, D.Z.; Hu, Z. Reasoning with language model is planning with world model. arXiv 2023, arXiv:2305.14992. [Google Scholar]
  28. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. React: Synergizing reasoning and acting in language models. arXiv 2022, arXiv:2210.03629. [Google Scholar]
  29. Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-refine: Iterative refinement with self-feedback. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Volume 36. [Google Scholar]
  30. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Volume 36. [Google Scholar]
  31. Zhou, A.; Yan, K.; Shlapentokh-Rothman, M.; Wang, H.; Wang, Y.X. Language agent tree search unifies reasoning acting and planning in language models. arXiv 2023, arXiv:2310.04406. [Google Scholar]
  32. Ghaemi, H.; Alizadehsani, Z.; Shahraki, A.; Corchado, J.M. Transformers in source code generation: A comprehensive survey. J. Syst. Archit. 2024, 153, 103193. [Google Scholar] [CrossRef]
  33. Fu, Y.; Panda, R.; Niu, X.; Yue, X.; Hajishirzi, H.; Kim, Y.; Peng, H. Data engineering for scaling language models to 128k context. arXiv 2024, arXiv:2402.10171. [Google Scholar]
Figure 1. Typical workflow diagram of NL2SQL applications on the edge-cloud computing platform. Business users give the query. NL2SQL engine translates to SQL according to semantic lab, financial data model, and database.
Figure 1. Typical workflow diagram of NL2SQL applications on the edge-cloud computing platform. Business users give the query. NL2SQL engine translates to SQL according to semantic lab, financial data model, and database.
Futureinternet 17 00012 g001
Figure 2. NL2SQL prompt in the finance industry. We consider basic structure, supplementary knowledge, example selection, and reasoning. These prompts provide valuable hints for LLMs.
Figure 2. NL2SQL prompt in the finance industry. We consider basic structure, supplementary knowledge, example selection, and reasoning. These prompts provide valuable hints for LLMs.
Futureinternet 17 00012 g002
Figure 3. Financial industry corpus diagram. It includes almost all types of financial corpus, such as regulations, financial news, market analysis, etc.
Figure 3. Financial industry corpus diagram. It includes almost all types of financial corpus, such as regulations, financial news, market analysis, etc.
Futureinternet 17 00012 g003
Figure 4. Overview pipeline of FI-NL2PY2SQL. It shows the model process, from data to final SQL output.
Figure 4. Overview pipeline of FI-NL2PY2SQL. It shows the model process, from data to final SQL output.
Futureinternet 17 00012 g004
Figure 5. Overview pipeline of NL2PY2SQL.
Figure 5. Overview pipeline of NL2PY2SQL.
Futureinternet 17 00012 g005
Figure 6. Generating SQL prompts diagram. We design the prompts on the experiences and financial rules.
Figure 6. Generating SQL prompts diagram. We design the prompts on the experiences and financial rules.
Futureinternet 17 00012 g006
Figure 7. The checkpoints of Post-Process. We must check the generated SQL in columns, fields, functions, numbers, etc.
Figure 7. The checkpoints of Post-Process. We must check the generated SQL in columns, fields, functions, numbers, etc.
Futureinternet 17 00012 g007
Figure 8. Application scenario.
Figure 8. Application scenario.
Futureinternet 17 00012 g008
Figure 9. Comparison of ablation experiment prediction results on BIRD dataset. Multiple ablation experiments were conducted, during which COT, NL2PY, SQL Iteration, and PY Iteration were removed.
Figure 9. Comparison of ablation experiment prediction results on BIRD dataset. Multiple ablation experiments were conducted, during which COT, NL2PY, SQL Iteration, and PY Iteration were removed.
Futureinternet 17 00012 g009
Table 1. Top five models in accuracy on the BIRD and Spider datasets. They are almost based on LLMs such as GPT-4 or GTP-4o. The scores on Spider are higher than on BIRD.
Table 1. Top five models in accuracy on the BIRD and Spider datasets. They are almost based on LLMs such as GPT-4 or GTP-4o. The scores on Spider are higher than on BIRD.
DatasetModelEX Execution with Values or Execution Accuracy (Test)Organizaiton
SpiderMiniSeek91.2Anonymous
SpiderDAL-SQL + GPT-4 + Consistencey86.6Alibaba Group
SpiderDAL-SQL + GPT-486.2Alibaba Group
SpiderDPG-SQL + GPT-4 + Self85.6Anonymous
SpiderDin-SQL + GPT-485.3University of Aliberta
BirdCHASE-SQL + Gemini73Google Cloud
BirdAskData + GPT-4o72.39AT&T-CDO
BirdOpenSearch-SQL + GPT-4o72.28Alibaba Cloud
BirdDistillery + GPT-4o71.83Distyl AI Research
BirdExSL + granite-34b-code70.37IBM Research AI
Table 2. Databases and number of questions per database.
Table 2. Databases and number of questions per database.
DatabaseNumber of Questions
card_games191
codebase_community186
formula_1174
thrombosis_prediction163
student_club158
toxicology145
superhero129
european_football_2129
financial106
california_schools89
debit_card_specializing64
Table 3. Comparison of EX prediction results on BIRD dataset, compared with 11 NL2SQL models on Challenging, Moderate, and Simple datasets. Overall, the FI-NL2PY2SQL EX comparison results are significantly better than those of other models.
Table 3. Comparison of EX prediction results on BIRD dataset, compared with 11 NL2SQL models on Challenging, Moderate, and Simple datasets. Overall, the FI-NL2PY2SQL EX comparison results are significantly better than those of other models.
NL2SQL ModelSimple EXModerate EXChallenging EX Total EX
NL2PY2SQL67.78%48.49%46.90%59.97%
SuperSQL65.95%44.18%43.45%57.24%
SFT_CodeS_15B_EK64.97%44.83%40.00%56.52%
SFT_CodeS_7B_EK63.68%42.46%38.62%54.89%
DAILSQL_SC62.49%42.67%42.76%54.63%
SFT_CodeS_3B_EK61.84%42.24%35.86%53.46%
DAILSQL62.05%39.87%35.86%52.87%
SFT_CodeS_1B_EK58.05%34.48%34.48%48.70%
C3_SQL58.05%34.91%29.66%48.37%
RESDSQL_3B_EK52.43%30.60%15.86%42.37%
RESDSQL_Large_EK45.95%23.92%20.00%36.83%
RESDSQL_Base_EK40.54%17.67%14.48%31.16%
Table 4. Comparison of F1 prediction results on BIRD dataset, compared with 11 NL2SQL models on Challenging, Moderate, and Simple datasets. Overall, the FI-NL2PY2SQL F1 comparison results are significantly better than those of other models.
Table 4. Comparison of F1 prediction results on BIRD dataset, compared with 11 NL2SQL models on Challenging, Moderate, and Simple datasets. Overall, the FI-NL2PY2SQL F1 comparison results are significantly better than those of other models.
NL2SQL ModelSimple F1Moderate F1 Challenging F1 Total F1
NL2PY2SQL70.79%53.52%52.33%63.82%
SuperSQL68.51%47.35%47.20%60.10%
SFT_CodeS_15B_EK67.26%48.06%44.19%59.27%
DAILSQL_SC65.26%45.23%44.84%57.27%
SFT_CodeS_7B_EK65.66%44.62%40.24%56.89%
SFT_CodeS_3B_EK64.34%45.16%39.73%56.21%
DAILSQL64.11%42.62%38.60%55.20%
C3_SQL60.38%38.07%32.95%51.04%
SFT_CodeS_1B_EK60.29%36.78%37.06%50.98%
RESDSQL_3B_EK54.70%32.90%17.79%44.62%
RESDSQL_Large_EK48.62%25.29%21.61%39.01%
RESDSQL_Base_EK42.44%19.88%15.97%33.12%
Table 5. Comparison of VES prediction results, compared with 11 NL2SQL models on Challenging, Moderate, and Simple datasets. Overall, the FI-NL2PY2SQL VES comparison results are significantly better than those of other models.
Table 5. Comparison of VES prediction results, compared with 11 NL2SQL models on Challenging, Moderate, and Simple datasets. Overall, the FI-NL2PY2SQL VES comparison results are significantly better than those of other models.
NL2SQL ModelSimple VESModerate VESChallenging VESTotal VES
NL2PY2SQL73.00%54.84%52.23%65.54%
SuperSQL67.13%47.24%46.86%59.20%
SFT_CodeS_15B_EK66.17%46.38%47.97%58.46%
SFT_CodeS_7B_EK64.70%44.78%55.83%57.84%
SFT_CodeS_3B_EK63.45%47.57%36.05%56.05%
DAILSQL_SC64.31%42.13%44.55%55.73%
DAILSQL63.63%39.32%37.73%53.83%
SFT_CodeS_1B_EK58.94%36.28%33.91%49.72%
C3_SQL58.13%36.11%28.83%48.70%
RESDSQL_3B_EK50.91%32.12%27.35%43.00%
RESDSQL_Large_EK44.82%25.10%29.73%37.43%
RESDSQL_Base_EK39.51%18.91%14.16%30.88%
Table 6. Significance test results.
Table 6. Significance test results.
Model NameTotal EXTotal F1 ScoreTotal VESP Value of EXP Value of F1P Value of VES
NL2PY2SQL59.9763.5765.54---
SuperSQL57.2460.159.20.1249 0.0480 0.0003
SFT_CodeS_15B_EK56.5259.2758.460.0528 0.0145 0.0001
SFT_CodeS_7B_EK54.8956.8957.840.0044 0.0002 0.0000
DAILSQL_SC54.6357.2755.730.0028 0.0004 0.0000
SFT_CodeS_3B_EK53.4656.2156.050.0003 0.0000 0.0000
DAILSQL52.8755.253.830.0001 0.0000 0.0000
SFT_CodeS_1B_EK48.750.9849.720.0000 0.0000 0.0000
C3_SQL48.3751.0448.70.0000 0.0000 0.0000
RESDSQL_3B_EK42.3744.62430.0000 0.0000 0.0000
RESDSQL_Large_EK36.8339.0137.430.0000 0.0000 0.0000
RESDSQL_Base_EK31.1633.1230.880.0000 0.0000 0.0000
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Du, X.; Hu, S.; Zhou, F.; Wang, C.; Nguyen, B.M. FI-NL2PY2SQL: Financial Industry NL2SQL Innovation Model Based on Python and Large Language Model. Future Internet 2025, 17, 12. https://doi.org/10.3390/fi17010012

AMA Style

Du X, Hu S, Zhou F, Wang C, Nguyen BM. FI-NL2PY2SQL: Financial Industry NL2SQL Innovation Model Based on Python and Large Language Model. Future Internet. 2025; 17(1):12. https://doi.org/10.3390/fi17010012

Chicago/Turabian Style

Du, Xiaozheng, Shijing Hu, Feng Zhou, Cheng Wang, and Binh Minh Nguyen. 2025. "FI-NL2PY2SQL: Financial Industry NL2SQL Innovation Model Based on Python and Large Language Model" Future Internet 17, no. 1: 12. https://doi.org/10.3390/fi17010012

APA Style

Du, X., Hu, S., Zhou, F., Wang, C., & Nguyen, B. M. (2025). FI-NL2PY2SQL: Financial Industry NL2SQL Innovation Model Based on Python and Large Language Model. Future Internet, 17(1), 12. https://doi.org/10.3390/fi17010012

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop