1. Introduction
Structured Query Language (SQL) has become the standard language for database querying, but its complexity poses a barrier for non-technical users. The Natural Language to Structured Query Language (NL2SQL) task converts natural language questions into SQL, providing non-specialist users with a way to interact with databases [
1]. The NL2SQL problem can be defined as follows: given a natural language query and a corresponding database table, the NL2SQL model generates an SQL statement. Early NL2SQL tasks were primarily based on rule-based or pipeline approaches and used domain-specific, simple datasets, making replication challenging. In recent years, with the release of large-scale annotated datasets and the rapid development of deep learning and natural language processing (NLP) technologies, research in the NL2SQL field has garnered significant attention. Researchers have proposed a range of new methods, continuously advancing this field. With breakthroughs in NLP due to deep learning, applying these methods to NL2SQL tasks has yielded positive results [
2].
In 2017, Salesforce introduced the large NL2SQL dataset WikiSQL [
3], containing 80,654 query questions and their related SQL statements across 24,241 database tables. Many studies followed, aiming to improve the accuracy of the WikiSQL dataset. The SQL statements in WikiSQL are relatively simple, mainly involving single-table queries.
In 2018, Yale University released the Spider dataset [
4], a new NL2SQL dataset presenting a higher difficulty level than WikiSQL. Spider, annotated by 11 Yale students, is a large-scale, complex cross-domain semantic parsing and text-to-SQL dataset. The Spider challenge aims to develop natural language interfaces for cross-domain databases. It contains 10,181 questions and 5693 unique complex SQL queries spanning 200 databases and multiple tables across 138 different domains. In Spider1.0, distinct complex SQL queries and databases appear in both training and testing sets. To perform well on this dataset, systems must generalize effectively to new SQL queries and database schemas.
On 28 August 2024, an early-access version of Spider2.0 (a more realistic and challenging text-to-SQL task) became available. Spider2.0 includes 600 real-world text-to-SQL workflow questions derived from enterprise-level database use cases. The databases in Spider 2.0 come from real data applications, often containing over 1000 columns, stored in cloud-based or on-premise database systems like BigQuery, Snowflake, or PostgreSQL. Solving Spider2.0 tasks typically requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge requires models to interact with complex SQL workflow environments, handle extensive context, perform intricate reasoning, and generate SQL queries with multiple operations, often spanning over 100 lines, far exceeding traditional text-to-SQL challenges. Progress on Spider 2.0 represents a critical step toward developing LLM-based code agents that are more intelligent and autonomous in real enterprise environments. With continuous improvements in datasets and the rapid development of large model technologies, many scholars are using large models to enhance NL2SQL accuracy.
Data management in the financial industry has a history of over a decade. Many enterprises in the financial sector have established their own data platforms, applying data extensively in risk, finance, regulatory, and marketing domains, with data playing an increasingly important role in real business scenarios [
5]. As data becomes a production factor, financial enterprises place increasing value on data assets. Reports have traditionally been an important way to demonstrate data’s intrinsic value, yet fixed reports are increasingly insufficient for business needs. Business personnel require the ability to explore valuable data from certain datasets to support decision-making directly and wish to convert data insights into business value quickly. Traditionally, business users would submit data requests to the IT department. Technique experts develop the data requirements and return the data results to the business users. An edge-cloud computing platform could help to finish it. The process can take 1–2 weeks at best and up to 1–2 months at worst, severely impacting business decision-making and subsequent strategies. Business data scientists must update data extraction rules based on query results, adding further complexity. This creates a need for an edge-cloud computing platform system where business personnel can directly use data through natural language, as shown in
Figure 1. Here, business personnel express data requirements in natural language, and the NL2SQL engine converts it into SQL to access the database. The resulting data is then returned to the business personnel. This significantly reduces communication costs, and if the data is unsuitable, business users can quickly adjust the input themselves to obtain desired results, thus improving data-driven decision-making efficiency.
The NL2SQL model is the core component of the entire business process and is highly anticipated by users in the financial sector. The financial industry is knowledge-intensive, with business users having more stringent requirements for NL2SQL accuracy and a lower error tolerance than in other sectors. Traditional NL2SQL models are largely general-purpose and cross-industry, with few models tailored specifically for the financial industry. In this context, this paper proposes a novel NL2SQL construction approach for the financial industry, highlighting three key innovations:
A financial semantic data model and a knowledge base for the financial industry were developed to enhance the big model’s understanding of the sector through database recall and table linking.
An innovative approach was introduced first to convert natural language into Python code and then convert that code into SQL. The model’s accuracy was improved through continuous optimization and iteration during this intermediate process.
We tested this innovative method on the BIRD dataset and found that it exceeded most currently popular open-source models.
This paper consists of five sections. The
Section 1 mainly describes the research background of NL2SQL; the
Section 2 mainly describes the research status of NL2SQL and mainly describes the main methods used in the research process of this paper; the
Section 3 mainly describes the method proposed in this paper and the implementation process; the
Section 4 mainly describes the experimental process and comparative experimental results of the method proposed in this paper; the
Section 5 summarizes the content of this paper and describes future research.
3. Model Design and Implementation
This section mainly describes the proposed method structure of NL2PY2SQL, Pre-Processing, NL2PY and Revise Python, NL2SQL and Revise SQL, and Post-Processing.
3.1. Overview of FI-NL2PY2SQL
The overall workflow of this model is illustrated in the diagram, which includes Pre-Processing, NL2PY (Natural Language to Python), PY2SQL (Python to SQL), and Post-Processing. First, we preprocess the query questions, evidence, database information, financial data models, and financial corpora to identify potential database tables. Next, we convert natural language to Python code using prompt-based methods and DeepSeek. The Python code is then iteratively optimized based on database and sample information. With the Python-generated code and database knowledge, SQL is generated using Prompt and DeepSeek, and the SQL queries are continually executed and refined by troubleshooting issues. Finally, the entire SQL is validated against checkpoints and then outputted. As shown in
Figure 4.
As shown in
Figure 5, the input data mainly includes natural language queries and database schema information. Natural language queries are encoded as vector representations, and database schema information (table and column names, etc.) are encoded as embeddings as auxiliary context input. The model combines encoded information and generates corresponding SQL queries using a deep learning architecture. The model can understand the query intent and generate corresponding SQL statements by learning the mapping relationship between natural language and SQL. The encoding and fusion of input data help the model accurately understand and operate the database structure when generating SQL.
The sample CSV converts each data table of all databases directly into a CSV file. This allows Python to read each data table file directly and use the Pandas data frame to represent each data table. In the future, the large model can generate Python code and directly calculate the data to check whether the final data meets the natural language question.
The output of the large model is the SQL statement generated for the natural language question. Since there may be many variants of SQL for the same result, we cannot directly judge whether the SQL statement is precisely the same. Therefore, the final method to evaluate the SQL accuracy is to compare whether the query result is consistent with the result of the accurate SQL statement by executing the SQL statement. If the result set r0 generated by the generated SQL statement is precisely the same as the result set r1 generated by the correct SQL, the result generated by the SQL is considered proper.
3.2. Pre-Processing
In the NL2SQL (Natural Language to SQL) model, the data preprocessing and prompt generation steps are key steps to ensure the model can understand and correctly generate SQL queries.
Data preprocessing aims to convert natural language queries and database structure information into a format that the model can understand and process. The steps are as follows:
Denoising: remove irrelevant characters, punctuation, extra spaces, etc.
Standardization: unify the format of dates, numbers, etc.
Word segmentation of natural language queries and splitting sentences into words or subwords helps the model identify keywords, entities, and operations.
Part-of-speech tagging (POS Tagging) helps the model understand the grammatical role of each word.
Entity recognition: identify entities such as table and column names in the query.
Through syntactic analysis, understand the grammatical relationship between words in the query, such as the subject-verb-object relationship.
Dependency analysis helps identify the dependency relationship between words.
Obtain the database structure information, such as table names, column names, and their data types. This step is key to matching the entities in the natural language query with the actual tables and columns in the database.
Identify the conditions, constraints, and goals in the query to help the model understand the business logic of the query.
After completing data preprocessing, generating appropriate prompts is key to passing natural language queries and database schema information to the model. The purpose of this stage is to guide the model in generating correct SQL statements by constructing effective prompts. The steps are as follows:
In order to help the model understand the task, prompts are usually constructed using templates. Common template forms include basic templates and templates combined with database structure.
When generating prompts, not only natural language queries but also database table structure information should be passed in so that the model understands the context of the query.
Use additional contextual information to enhance prompts, especially in complex query scenarios.
Guide the model in understanding the query intent.
The prompt should clearly express the query’s intent, such as specifying operations (SELECT, WHERE) or constraints.
Improve the model’s performance through different prompt structures or diversified SQL query templates.
Through experiments and model tuning, adjust the format or content of prompts to improve the accuracy of generated SQL. For example, try to use more detailed table structure information or more specific business context to enhance the effectiveness of prompts. The data preprocessing and prompt generation algorithms are shown in Algorithms 1 and 2.
Algorithm 1: Algorithms for Data Preprocessing |
Input: query and schema |
Output: processed_query and processed_schema |
1 | function preprocess (query, schema): |
2 | | cleaned_query = clean_and_standardize(query) |
3 | | tokenized_query = tokenize_and_pos_tag(cleaned_query) |
4 | | entities = named_entity_recognition(tokenized_query) |
5 | | syntactic_structure = syntax_and_dependency_analysis(tokenized_query) |
6 | | mapped_schema = map_to_schema(entities, schema) |
7 | | constraints = semantic_analysis(entities, mapped_schema) |
8 | | return tokenized_query, mapped_schema, constraints |
In Algorithm 1, starting from the input natural language query and database schema information, the information is finally converted into a format suitable for the model after a series of processing such as cleaning, word segmentation, entity recognition, syntactic analysis, schema mapping, and semantic analysis.
Algorithm 2: Algorithms for prompt generation |
Input: processed_query, processed_schema and constraints |
Output: prompt |
1 | function generate_prompt(processed_query, processed_schema, constraints): |
2 | | prompt_template = “Translate the following question into SQL: {query}” |
3 | | prompt_with_schema = “Given the following table schema: {schema}, “ + prompt_template |
4 | | prompt_with_schema = prompt_with_schema.format (query = processed_query, schema = processed_schema) |
5 | | if constraints: |
| | | prompt_with_constraints = prompt_with_schema + “ Apply the following constraints: {constraints}” |
| | | prompt_with_constraints = prompt_with_constraints.format (constraints = constraints) |
6 | | else: |
7 | | | prompt_with_constraints = prompt_with_schema |
8 | | return prompt_with_constraints |
In Algorithm 2, we combine the preprocessed query and database schema information to construct hints with table structure and query intent to help the model understand the query and generate correct SQL query statements.
3.3. NL2PY and Iteration
Based on the generated JSON file, we designed a prompt for DeepSeek to generate Python code. When generating code, ensure that Python’s Pandas module is imported, as Pandas will be used for data manipulation. Using the problem description, prompt information, and extracted field values from the table, create a prompt that guides the model to generate the Pandas query code for answering the question. The LLM should follow a chain-of-thought prompting method, generating Python code that adheres to a specific style, including both reasoning steps and the final Python code.
The generated Python code is then executed to verify and optimize its functionality. During execution, the code undergoes checks by the large model to identify potential syntax errors, confirm that a result is generated, and ensure the result satisfies the question requirements. If any errors occur or the results do not meet expectations, the error message and related context are sent back to the LLM, requesting a corrected version of the Python code. This iterative process can run up to a maximum of N cycles, defaulting to 5 in this case. If the question is successfully answered, the process exits early.
In this process, we decompose complex natural language queries into multiple subtasks and generate intermediate code for each subtask. Then, we gradually deduce the logic of each subtask and verify the validity of the code. Finally, we integrate and optimize the various intermediate code fragments into complete Python. This step-by-step reasoning and verification method effectively improves code generation’s accuracy, readability, and maintainability. The chain of thought (CoT) reasoning process is shown in Algorithm 3.
Algorithm 3: Algorithms for CoT reasoning process |
Input: processed_query, entities and syntactic_structure |
Output: code_fragments |
1 | function generate_python_code_with_CoT (query, entities, syntactic_structure): |
2 | | sub_tasks = decompose_query (query) |
3 | | code_fragments = [] |
4 | | for task in sub_tasks: |
| | | intermediate_code = generate_intermediate_code(task, entities) |
| | | code_fragments.append (intermediate_code) |
7 | | | validate_code (intermediate_code) |
8 | | final_code = integrate_code_fragments (code_fragments) |
| | optimized_code = optimize_code (final_code) |
| | return optimized_code |
In Algorithm 3, Python code is generated step by step by decomposing a complex task into multiple subtasks. First, the query is decomposed into independent subtasks, and intermediate code is generated and verified for each subtask. Then, all intermediate code fragments are integrated into the complete code, and its simplicity and readability are optimized. Through step-by-step reasoning and verification, the code of each subtask is carefully checked to ensure that the final output code is accurate, efficient, and meets user requirements.
3.4. NL2SQL and Iteration
Using the generated Python code, we provide the question, relevant information, and Python code as inputs to the large model DeepSeek. With a Chain of Thought (CoT) approach, craft prompts to guide DeepSeek in generating an SQL statement based on database and contextual information. The output includes both the reasoning process and the generated SQL query. Various factors must be considered in designing the prompt to ensure the model fully understands the goal.
As shown in
Figure 6, the large language model combines the context and prompt information to generate SQL statements according to relevant requirements. These requirements are mainly for some common problems in NL2SQL, and special requirements and specifications are made. For example, order by + limit one is preferred instead of max/min, ID is selected, and necessary fields are displayed cautiously.
After generating the initial SQL, the code is executed in the database. If issues arise, they are returned to LLM, which may include errors in execution, successful execution with empty results, or results that do not satisfy the query requirements. For execution errors, the system captures the error message and sends it back to the model, prompting it to refine the code and generate a new SQL statement. This iterative process can run up to N cycles (defaulting to 5 in this case) and can be adjusted as needed. The process stops immediately if the query successfully answers the question.
3.5. Post-Processing
After SQL generation, further refinements are necessary to handle common errors and improve accuracy. Typical issues to address include displaying an extra column, selecting incorrect fields, omitting specified fields, handling null values in calculations, duplicate values, unclear metric definitions, decimal conversions, required aggregate calculations, and order-based computations. Based on these checkpoints, a final prompt is crafted to guide the model in performing effective validation and adjustments, ultimately returning a more accurate SQL query, as shown in
Figure 7.
In
Figure 7, first, a syntax check is performed to ensure that the generated SQL query complies with the SQL syntax specification. Next, query optimization is performed to simplify redundant expressions or subqueries to improve query efficiency. Finally, database-specific adjustments are performed to ensure the generated SQL can be correctly executed in a specific database to avoid potential errors or performance issues. These steps ensure that the final SQL query can accurately and efficiently meet user needs. This final prompt ensures that DeepSeek evaluates the SQL query against each of these criteria, making targeted adjustments as needed. The revised SQL should then better align with the problem requirements and provide a more reliable answer.
When using the NL2PY followed by the PY2SQL method, the iterative processing steps may result in significant computational overhead, especially in scenarios with large-scale data sets. Each transformation step (from natural language to Python code to SQL query) consumes computational resources, and as the complexity of the query increases, so do the model’s processing time and memory requirements. This overhead is particularly prominent in systems with high real-time requirements and may affect the overall query efficiency and system response time. To solve this problem, we first reduce unnecessary intermediate steps by optimizing the conversion process. Secondly, parallel processing and distributed computing are used to distribute tasks to multiple nodes to improve processing efficiency. For common query patterns, caching and precomputation can avoid repeated calculations and reduce the computational cost of each query. Finally, according to the complexity of the query, an adaptive simplification strategy is adopted to save computing resources further. Through these methods, the computational overhead is balanced while improving accuracy, and the system’s efficiency is ensured.
3.6. Application Cases
The NL2PY2SQL method we proposed in this article has been effectively used in application scenarios, demonstrating its significant advantages in processing complex business logic, improving query efficiency, and reducing maintenance costs. The details are shown in
Figure 8.
As shown in
Figure 8, by converting natural language into Python code and then generating SQL from the Python code, this method makes full use of the flexibility and efficiency of Python to achieve accurate expression of business needs and in-depth optimization of data queries. It is especially suitable for finance; fields such as medical and legal fields that require complex data processing have further verified its broad applicability and innovative value.
The NL2PY2SQL method proposed in this paper and NL2SQL have characteristics in applications outside of finance (such as education and supply chain). NL2PY2SQL first converts natural language into Python code, using Python’s flexibility and rich library support to achieve complex logic expression and data analysis. For example, it can be used in education to design dynamic learning paths and evaluate student performance; in the supply chain, it can handle complex inventory management algorithms, transportation optimization problems, etc. Python code is easy to debug, expand, and modularize, can more intuitively reflect business logic, and convert logic into efficient database operations through PY2SQL. In contrast, NL2SQL directly generates SQL from natural language, which is suitable for quickly executing simple queries. Still, dealing with complex multi-table associations, nested logic, and dynamic calculation tasks is difficult. In addition, the SQL statements generated by NL2SQL need more precise readability and have high maintenance costs. Through two-step processing, NL2PY2SQL ensures support for complex logic and considers performance and scalability, providing a more flexible and reliable solution for diversified scenarios such as education and supply chain.
4. Experimental Results and Analysis
4.1. Datasets and Evaluation Metrics
Considering that the BIRD dataset is more complex and resembles real-world scenarios more closely, including financial contexts, this paper uses the BIRD development dataset as the test dataset, which aligns with the benchmarks chosen by similar models.
The BIRD dataset was released on 21 September 2023. It represents a pioneering cross-domain dataset to study the impact of extensive database content on text-to-SQL parsing. BIRD comprises over 12,751 unique question-SQL pairs, 95 large databases totaling 33.4 GB, and 37 specialized domains such as blockchain, hockey, healthcare, and education. The BIRD dataset is broader, not a specialized dataset for finance. It covers games, sports, coding, finance, biology, etc.
Compared with the Spider 2.0 dataset, the BIRD dataset has apparent advantages in NL2SQL experiments in the financial field. First, the BIRD dataset focuses on the complex table structure and business logic unique to the financial industry, covering many financial statements, transaction records, and customer data, which can more realistically reflect the characteristics of financial data. Secondly, the queries in the BIRD dataset often contain more specific operations related to the financial field, such as financial calculations, risk assessments, and dynamic queries, which enables the model to better adapt to the needs of practical applications. In contrast, although the Spider 2.0 dataset covers multiple fields, it lacks dedicated data and scenarios for the financial industry, so its performance in financial tasks may not be as good as the BIRD dataset. The following are the databases used for the NL2SQL question test and the number of questions in each database.
In
Table 2, the input data is the BIRD development dataset dev (
https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip, (accessed on 8 October 2024)). This dataset has 1533 natural language query questions. We select EX (Execution Accuracy), F1 (Partial Match Score), and VES (Value Execution Score) metrics to facilitate better comparative testing and conduct experiments on the dataset’s Simple, Moderate, and Challenging types, comparing the results with 11 typical NL2SQL methods.
- 1.
EX (Execution Accuracy)
EX refers to Execution Accuracy, which indicates whether the results returned by the new SQL queries are the same as those of the target queries when executed on the database. If the results match those of target SQL queries, they are counted as correct. EX mainly focuses on the accuracy of the query execution results rather than whether the specific structure of the SQL queries matches exactly.
Total EX (Total Execution Accuracy) is usually used to measure the accuracy of SQL queries generated by the NL2SQL model when executed. It indicates the proportion of SQL queries generated by the model in all input samples that are correctly executed and return expected results. The calculation formula is as follows:
In Formula (2), the Number of Correct Executions refers to the number of samples in which the generated SQL query can correctly return the expected results during actual execution. The total Number of Samples refers to the total number of samples of all-natural language queries in the test set. The total EX value is between 0 and 1, usually expressed as a percentage. It is used to evaluate the model’s overall performance in generating SQL queries in the test set and correctly returning results.
- 2.
F1 (Partial Match Score)
The F1 score is used to measure the similarity between the SQL generated by the model and the target SQL. By comparing the overlap in syntax and structure between the generated SQL and the target SQL, the F1 metric provides a finer-grained assessment of the SQL generation effectiveness. Typically, it calculates the precision and recall between the generated SQL and the target SQL and then computes the F1 value as a comprehensive evaluation.
- 3.
VES (Value Execution Score)
VES refers to Value Execution Score, a new metric in NL2SQL that focuses on whether the generated SQL queries can correctly retrieve specific values while satisfying the query conditions. For example, if the query results contain specific numbers, strings, or dates, VES evaluates the accuracy of the generated SQL queries in terms of condition expression and value matching, further examining the execution performance of the generated SQL in detail.
These three metrics assess the performance of the NL2SQL model from different perspectives: EX focuses on the consistency of execution results, F1 emphasizes structural similarity, and VES concerns the precision of value matching. This multi-dimensional evaluation system helps researchers gain a more comprehensive understanding of the model’s strengths and weaknesses.
Ablation studies are used to analyze the contribution and importance of various components within a model or algorithm. Specifically, in fields like machine learning and deep learning, ablation studies involve gradually removing certain model components and observing performance changes to determine each part’s impact on overall performance. To better identify the critical steps in the entire process, this paper designs ablation experiments, including the removal of Revise PY, the exclusion of NL2PY, and the non-use of CoT (Chain of Thought), among other aspects.
4.2. Prediction Results
4.2.1. EX Prediction Results
To deeply analyze the performance of different models in NL2SQL tasks, we used SuperSQL, SFT_CodeS_15B_EK, SFT_CodeS_7B_EK, DAILSQL_SC, SFT_CodeS_3B_EK, DAILSQL, SFT_CodeS_1B_EK, C3_SQL, RESDSQL_3B_EK, RESDSQL_Large_EK, RESDSQL_Base_EK, etc. in the experiment. Comparative experiments were conducted on the model. These models have significant differences in architecture, training data, parameter scale, etc., which can help us comprehensively evaluate the model’s accuracy, efficiency, and adaptability. Large-scale models such as SFT_CodeS_15B_EK and RESDSQL_Large_EK can handle more complex queries but may bring higher computational overhead. In contrast, small-scale models such as SFT_CodeS_1B_EK can provide faster inference speed and are suitable for real-time query scenarios. Models such as DAILSQL and RESDSQL can better adapt to the data characteristics of the financial field and handle complex queries such as financial statements and transaction data by optimizing specific database structures. By comparing the performance of different models, we can better understand their advantages and disadvantages in financial applications and ultimately help select the most suitable model to improve the accuracy and efficiency of the NL2SQL system.
As shown in
Table 3, in the Simple examples, the model improved by at least 1.83% and by as much as 27.24% compared to subsequent models. In the Moderate examples, it achieved a minimum improvement of 4.31% and a maximum improvement of 30.82% over other models. In the Challenging examples, the model improved by at least 3.45% and up to 32.42% compared to other models. Overall, the model showed an improvement of at least 2.73% and a maximum of 28.81% compared to others.
4.2.2. F1 Prediction Results
From the F1 experimental results, in the Simple examples, the model improved by at least 2.28% and by as much as 28.35% compared to subsequent models. In the Moderate examples, it achieved a minimum improvement of 6.17% and a maximum improvement of 33.64% over other models. In the Challenging examples, the model improved by at least 5.13% and up to 36.36% compared to other models. Overall, the model showed an improvement of at least 3.72% and a maximum of 30.7% compared to others. As shown in
Table 4.
4.2.3. VES Prediction Results
From the VES experimental results, in the Simple examples, the model improved by at least 5.87% and by as much as 33.49% compared to subsequent models.
As shown in
Table 5, in the Moderate examples, it achieved a minimum improvement of 7.6% and a maximum improvement of 35.93% over other models. In the Challenging examples, the model improved by at least 5.37% and up to 38.07% compared to other models. Overall, the model showed an improvement of at least 6.34% and a maximum of 34.66% compared to others.
4.2.4. Statistical Significance Test
Statistical significance testing provides a scientific basis for decision-making, optimizes resource allocation, and improves the credibility of results by verifying assumptions, assessing causality, and reducing subjective bias. It is widely used in multi-field data analysis and verification. To understand the effect of our proposed NL2PY2SQL method, we conducted a significance test, and the results are shown in
Table 6.
As can be seen from
Table 6, the EX, F1, and VES indicators of NL2PY2SQL and other compared models, P value of Ex, P Value of F1, and P Value of VES. P values for pairwise statistical significance tests of NL2PY2SQL relative to the following models. Except that the P value of EX between NL2PY2SQL and SUPERSQL did not reach below 0.05, other indicators all reached below 0.05, indicating that NL2PY2SQL exceeds the indicators of different models under specific indicators and is statistically significant.
4.2.5. Error Analysis
In the NL2SQL experiment using the BIRD dataset, the errors mainly focused on entity recognition, join operations, and processing complex conditions. First, entity recognition errors occurred frequently, especially when the query involved multiple tables and columns. Many column names and table names in the BIRD dataset have similarities or synonyms, and the model sometimes fails to correctly identify these entities, resulting in references to the wrong table or column, thereby generating invalid SQL queries. For example, the user query may mention “total sales”, but the model may incorrectly map it to “sales quantity”, resulting in logical misunderstanding.
Second, joint operation errors are also a common problem. There are complex relationships between the tables in the BIRD dataset. When the model generates SQL queries for multi-table joins, it sometimes fails to correctly understand the association conditions between the tables, resulting in the generated query omitting the necessary join conditions, leading to Cartesian products or incorrect dataset results. In addition, the model’s processing of complex nested queries is also insufficient. When the user’s natural language query contains nested conditions, the subquery structure generated by the model sometimes has errors, resulting in the inability to execute the SQL query or the results not being different.
Finally, the errors of entity recognition and complex joins were solved using the NL2PY and then the PY2SQL method. First, the natural language was converted into highly readable Python code to clarify the logic, and then the Python code was converted into SQL queries. This gradual conversion reduced ambiguity and ensured that the model had clear context and logical relationships when understanding and generating SQL, especially in processing multi-table joins and nested conditions.
4.3. Ablation Experiment Prediction Results
From the ablation comparison results of EX, removing CoT has the greatest impact on the results, causing an accuracy decrease of 11.27%. In fact, removing NL2PY leads to a decrease in accuracy of 1.95%. From the ablation comparison results of F1, removing CoT also has the greatest impact, resulting in an F1 score decrease of 11.94%. The next largest impact comes from removing Revise SQL, which decreases F1 by 1.47%. From the ablation comparison results of VES, removing CoT again has the greatest impact, causing a VES decrease of 12.01%. In contrast, removing NL2PY results in a decrease of 2.6% in VES. The analysis above shows that CoT and NL2PY significantly affect the entire process, suggesting that further research can be conducted on these two components.
As shown in
Figure 9, the Chain Thinking (CoT) and NL2PY each have their unique contributions to results in NL2SQL, and their synergy can significantly improve the accuracy and logic of query generation. CoT (Chain of Thought) uses step-by-step reasoning to help the model handle complex logic and conditions more accurately when generating SQL queries. It breaks down the problem and gradually expands thinking to avoid jumping directly to the final answer, thereby reducing the possibility of errors. For example, when processing complex multi-table joins or nested queries, CoT helps the model clarify the reasoning process at each step, making the final SQL query more interpretable and accurate. NL2PY further clarifies the query logic by converting natural language into Python code. As an intermediate expression form, Python code makes the execution of query steps more controllable and precise and provides a more accurate basis for subsequent SQL conversion. Through this step-by-step transformation, NL2PY reduces the complexity of natural language understanding, allowing the model to generate SQL queries more accurately. Combining the two, CoT is responsible for the refined reasoning process, while NL2PY ensures the logical operability of queries, thus improving the overall performance and accuracy of the NL2SQL system.
The method proposed in this article first converts natural language to Python and then to SQL, which can flexibly implement complex transaction logic and risk models in the financial field and use Python data analysis and modeling to process large-scale data efficiently, thus significantly improving the accuracy and reliability of queries. Interpretability and decision-making efficiency significantly reduce maintenance costs. In the medical and legal fields, the method of NL2PY combined with PY2SQL has demonstrated strong adaptability. Using Python to express complex logic, such as medical diagnosis paths, legal case reasoning, etc., and then converting it to SQL to achieve efficient queries and data operations can simplify multi-source data integration and support personalized analysis and decision-making. At the same time, the rich Python library can meet the industry’s specific needs and promote intelligent upgrades.
Financial applications based on LLM may face bias risks and compliance issues. For example, the model may inherit biases in the training data, leading to unfair decisions in credit approval, risk assessment, etc. In addition, the generated content may conflict with financial regulations or fail to meet regulatory requirements. To address these challenges, it is necessary to strengthen data review, introduce bias correction mechanisms, and combine strict model output monitoring and compliance audits to ensure fairness and legality while protecting customer rights and reducing legal and reputational risks. In order to comply with GDPR or financial data privacy laws, LLM should embed privacy protection mechanisms such as data minimization, de-identification, and encryption technologies to ensure that user data is not abused during training and inference. Model development must follow privacy design principles and conduct regular compliance audits and risk assessments. In addition, access control and output filtering can be set to prevent the generation of sensitive information. Transparent explanations of model behavior and user data permission management are crucial to meeting regulatory requirements and enhancing user trust.
5. Conclusions
This paper thoroughly analyzes previous advancements and challenges in NL2SQL, considers the unique characteristics of the financial industry, and proposes a tailored NL2SQL approach specifically designed for this sector on the edge-cloud computing platform. Based on the inherent structure of large models, an innovative two-step method is introduced: first, converting natural language to Python, then translating Python to SQL. Accuracy is further enhanced by iteratively optimizing database semantics and supplementing knowledge. Utilizing the open-source DeepSeek model, along with the unique features of Python and SQL, and making full use of prompt engineering, various modes were designed and iteratively refined, achieving significant improvements in NL2SQL accuracy. Specifically, EX improved by at least 2.73%, F1 by at least 3.72%, and VES by 6.34% over existing methods, ultimately achieving strong business performance. However, given the high demands of the financial industry, current accuracy levels are still slightly insufficient, indicating the need for further optimization. During experimentation, it was also observed that the industry’s high level of expertise and extensive corpus could significantly benefit NL2SQL accuracy through continued corpus expansion and the creation of industry-specific datasets, areas for future work to further refine and enhance performance.
Although the NL2PY2SQL method performs well in natural language to SQL conversion, it still has some limitations, especially regarding scalability and applicability. First, the NL2PY2SQL method relies on Python intermediate representation, which may make it face performance bottlenecks when processing complex or diverse queries. For large databases and highly dynamic query scenarios, the intermediate representation generated by Python may not fully capture complex grammatical structures or data dependencies, resulting in limited optimization space. In addition, the current method may have difficulty adapting to unstructured natural language input, especially queries containing fuzzy and ambiguous content. Future research can focus on improving the method’s scalability, such as by optimizing Python intermediate representation generation and more accurate SQL syntax mapping, especially in complex SQL scenarios such as multi-table joins and nested queries. In addition, enhancing the cross-domain adaptability of the method, especially when facing different database systems and application scenarios, can improve the model’s generalization ability through broader corpus and pre-trained models. At the same time, further combining reinforcement learning and automatic tuning mechanisms is expected to make the method more robust and flexible in practical applications.