1. Introduction
Text-to-SQL parsing [1] involves translating user queries in natural language into executable SQL queries that can be run on databases. Text-to-SQL allows users, even those without knowledge of SQL or database techniques, to interact with databases using natural language, and it has attracted increasing attention from both the database and natural language processing communities. With the advent of large language models (LLMs), numerous methods [2,3,4] have demonstrated that LLMs can significantly improve the accuracy of text-to-SQL parsers. However, all of these methods are based on closed-source models, such as GPT-4 [5], raising concerns about data leakage in private scenarios.
Recently, the emergence of a large number of open-source LLMs [6,7] has attracted widespread attention because these models show capabilities comparable to those of closed-source models across a wide range of natural language processing (NLP) tasks. Recent studies [8,9,10] have demonstrated that open-source LLMs with fewer parameters, when fine-tuned on human-annotated data, can achieve SQL generation accuracy comparable to that of large closed-source LLMs. This makes fine-tuning open-source LLMs a viable solution for private scenarios. While effective, these methods are costly because they rely on manually annotated data. To address this, researchers are exploring alternative fine-tuning approaches that reduce the dependence on human annotation. This motivates us to study the fine-tuning of LLMs without extra human-annotated data, converting weak models into strong models on text-to-SQL tasks.
In response to the challenge of data scarcity, numerous studies [11,12,13,14] have explored self-play mechanisms that iteratively improve models by engaging in “competitive interactions” with their own generated instances. This approach has proven successful in systems such as AlphaGo Zero [11] and AlphaZero [12]. To enhance the performance of weak LLMs, SPIN [14] proposes improving models by letting them play against themselves, without requiring direct supervision. This approach autonomously generates data, progressively enhancing the model’s capabilities while maximizing the utility of the correctly labeled examples used in supervised fine-tuning (SFT).
Although SPIN enhances data diversity, its application to text-to-SQL tasks remains unexplored. As shown in Figure 1, directly applying SPIN to text-to-SQL models leads to a significant performance drop. In some cases, the performance falls below that of both the original model and the model fine-tuned via SFT. This degradation arises from two key issues. First, in the text-to-SQL task, SQL queries have well-defined execution results, but SPIN does not leverage this clear feedback. Instead, it uses the semantic similarity between SQL queries to distinguish positive from negative samples. In practice, however, two semantically similar SQL queries may yield different execution results, making it difficult for the model to learn to generate correct SQL queries. Second, the data synthesized by SPIN neglect the implicit link between natural language questions and database schemas [15], preventing the model from learning the true relationships within the data.
To address these challenges, we propose a novel framework, ExSPIN, which incorporates explicit information—specifically, SQL execution feedback and schema information—into self-play fine-tuning. First, before each round of iterative training, we introduce an explicit schema integration method that incorporates schema information relevant to the natural language question. This method constructs explicit prompts that bridge the mismatch between natural language queries and database schemas for the LLM. In each training round, we also propose an execution feedback fine-tuning method. This method changes the strategy for distinguishing between positive and negative samples: instead of relying on semantic similarity, it uses the execution results of the SQL queries predicted by the opponent model. This explicit feedback improves the evaluation of the synthetic data’s quality, enhancing the primary model’s performance. We evaluate ExSPIN on two real-world datasets: SPIDER and BIRD. The experimental results show that LLMs fine-tuned with ExSPIN outperform state-of-the-art (SOTA) methods.
Our main contributions are as follows.
- 1. We propose a SPIN framework that explicitly integrates SQL execution feedback and schema information into the training process, enhancing the model’s understanding of the data and, consequently, improving its accuracy.
- 2. We present a method for explicit schema integration in self-play training, which accurately extracts target schema information and mitigates the influence of noise.
- 3. We introduce an execution feedback fine-tuning method that incorporates SQL execution results into the model’s parameter update process, thereby improving the performance of self-play fine-tuning.
- 4. We conduct experiments on two real-world datasets, and the results demonstrate that ExSPIN effectively enhances model performance, surpassing the state-of-the-art (SOTA) methods.
3. Problem Setting and Preliminaries
In the SPIN framework, the primary player model aims to distinguish between responses generated by the LLM and those generated by humans, while the adversary model strives to generate responses that are indistinguishable from human responses. The core of SPIN lies in its self-adversarial mechanism, where both the primary player and the adversary are instances of the same LLM but from different iterations. Specifically, the adversary is the version of the LLM from the previous iteration, while the primary player is the LLM being trained in the current iteration. In iteration $t+1$, the adversary LLM from the previous iteration, denoted as $p_{\theta_t}$, generates a response $y'$ from the prompt $x$ according to $p_{\theta_t}(\cdot \mid x)$. Therefore, the optimization objective of SPIN in iteration $t+1$ is
$$
f_{t+1} = \arg\max_{f \in \mathcal{F}_t} \; \mathbb{E}_{x \sim q(\cdot),\; y \sim p_{\mathrm{data}}(\cdot \mid x),\; y' \sim p_{\theta_t}(\cdot \mid x)} \big[ f(x, y) - f(x, y') \big]. \tag{1}
$$
The objective of the primary player $f_{t+1}$ is to maximize the expected difference between the scores assigned to the human responses $y$ and the adversary responses $y'$. SPIN defines the closed-form solution for this optimization objective as follows:
$$
p_{\theta_{t+1}}(y \mid x) \;\propto\; p_{\theta_t}(y \mid x)\,\exp\!\big(\lambda^{-1} f_{t+1}(x, y)\big), \qquad \theta_{t+1} \in \Theta, \tag{2}
$$
where $\Theta$ is the parameter space of the considered LLM. Given $p_{\theta_{t+1}}$ in Equation (2), we obtain the parameterized function for $f$ in SPIN:
$$
f_{\theta}(x, y) = \lambda \log \frac{p_{\theta}(y \mid x)}{p_{\theta_t}(y \mid x)}. \tag{3}
$$
Here, $p_{\theta}(y \mid x)$ and $p_{\theta_t}(y \mid x)$ represent the similarity between the responses generated by the model and the human responses $y$. The capability of the main model is evaluated by calculating the logarithmic ratio of these two similarities. If the main model can generate responses that align more closely with the human responses than the adversary model does, the value of $f_{\theta}(x, y)$ will be higher.
Substituting this parameterized function into the optimization objective, we derive an end-to-end training objective and the update rule for $\theta_{t+1}$:
$$
\theta_{t+1} = \arg\min_{\theta \in \Theta} \; \mathbb{E}\left[ \ell\!\left( \lambda \log \frac{p_\theta(y \mid x)}{p_{\theta_t}(y \mid x)} - \lambda \log \frac{p_\theta(y' \mid x)}{p_{\theta_t}(y' \mid x)} \right) \right], \tag{4}
$$
where the expectation is computed over the distributions $x \sim q(\cdot)$, $y \sim p_{\mathrm{data}}(\cdot \mid x)$, and $y' \sim p_{\theta_t}(\cdot \mid x)$.
Formula (4) represents the final training objective of SPIN. It uses the loss function $\ell$ to measure the difference between the output of the primary player $f_{t+1}$ and the target. Specifically, it calculates the difference in the log-probability ratios between the preferences of the two models for the human responses $y$ and the adversary responses $y'$. By optimizing this difference through $\ell$, the objective function guides the primary model to generate responses that are more similar to human responses and reduces the probability of generating responses similar to those from the adversary model.
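To make Formula (4) concrete, the following PyTorch-style sketch shows one way the loss could be computed from token log-probabilities. The helper names and the logistic choice of $\ell(t) = \log(1 + e^{-t})$ are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    """Sum of token log-probabilities of `response_ids` given `prompt_ids` (illustrative helper)."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits[:, :-1, :]              # position i predicts token i+1
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_ids.shape[-1] - 1:].sum(dim=-1)   # keep only the response positions

def spin_loss(model, ref_model, prompt_ids, human_ids, adversary_ids, lam=0.1):
    """Formula (4) with the logistic loss l(t) = log(1 + exp(-t))."""
    with torch.no_grad():                                     # the opponent p_{theta_t} is frozen
        ref_human = sequence_logprob(ref_model, prompt_ids, human_ids)
        ref_adv = sequence_logprob(ref_model, prompt_ids, adversary_ids)
    ratio_human = sequence_logprob(model, prompt_ids, human_ids) - ref_human
    ratio_adv = sequence_logprob(model, prompt_ids, adversary_ids) - ref_adv
    margin = lam * (ratio_human - ratio_adv)
    return F.softplus(-margin).mean()                         # log(1 + exp(-margin))
```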
However, in the text-to-SQL task, the SPIN framework suffers from two main issues. First, the SPIN adversarial setup ignores the implicit link between natural language questions and database schemas. Specifically, in the SPIN optimization function, the model only increases the reward for the correct SQL $y$ and imposes a penalty $-\lambda \log \frac{p_\theta(y' \mid x)}{p_{\theta_t}(y' \mid x)}$ on the responses $y'$ generated by the adversary model $p_{\theta_t}$ to enhance the primary model. However, in the text-to-SQL task, the accuracy of SQL generation often depends on the model’s ability to understand the relationship between the question and the database. The traditional SPIN framework can only implicitly learn the correspondence between the question and the database by favoring correct SQL queries, making it difficult for the model to deepen its understanding of schema-linking features. As a result, performance on SQL generation tasks is suboptimal.
Second, SPIN overemphasizes exact SQL matching. Unlike text generation tasks, which often focus on human preferences, the text-to-SQL task typically requires the model to generate SQL statements that align with the query intent, rather than producing SQL statements that are identical to human-annotated ones. Achieving exact matches with annotated SQL is not only difficult but also undermines the model’s robustness. Instead, we prefer the model to generate diverse SQL statements that still retrieve the correct data, which would significantly enhance its generalization ability. However, in the traditional SPIN framework, the model is trained to generate SQL statements identical to the annotated ones, while dismissing its own generated statements that are correct. Specifically, for a correctly executable SQL statement $y'$ generated by the adversary model, the update for the primary player model becomes
$$
\theta_{t+1} = \arg\min_{\theta \in \Theta} \; \mathbb{E}\left[ \ell\!\left( \lambda \log \frac{p_\theta(y \mid x)}{p_{\theta_t}(y \mid x)} - \lambda \log \frac{p_\theta(y' \mid x)}{p_{\theta_t}(y' \mid x)} \right) \right], \quad \text{even though } y' \text{ executes correctly}.
$$
This optimization function penalizes $\lambda \log \frac{p_\theta(y' \mid x)}{p_{\theta_t}(y' \mid x)}$, causing the model to modify its output. However, when the output is correct, this penalty harms the model’s robustness, leading to a decline in the accuracy of the generated SQL.
4. ExSPIN
To address the challenges faced by traditional SPIN methods in text-to-SQL parsing tasks, we propose an explicit feedback-based self-play fine-tuning (ExSPIN) framework. The framework consists of three stages: supervised fine-tuning (SFT), explicit schema integration, and execution feedback self-play fine-tuning. As shown in Figure 2, we first fine-tune a text-to-SQL model using SFT, enabling it to generate SQL queries from natural language instructions. Next, we apply explicit schema integration to construct training prompts from natural language questions, injecting relevant database schema information to capture the intent of the query. Finally, execution feedback fine-tuning incorporates the execution results of the SQL queries generated by the adversarial model into the parameter update process. The following sections describe each stage in detail.
4.1. Supervised Fine-Tuning
In this subsection, we introduce the first step of ExSPIN. The goal of SFT is to equip an LLM with preliminary SQL generation capabilities given a natural language question. During fine-tuning, the model learns to map natural language queries to the corresponding SQL statements through supervised learning. Given a labeled dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, we apply the standard supervised fine-tuning (SFT) objective to the base model $p$ with parameters $\theta$:
$$
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{i=1}^{N} \log p_{\theta}(y_i \mid x_i),
$$
where $x_i$ denotes the $i$-th input, consisting of a concatenated instruction and user query.
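A minimal sketch of this SFT objective for a single example is shown below, assuming a Hugging Face-style causal LM in which prompt tokens are masked out of the loss with the label value -100; the argument names are placeholders, not the paper's implementation.

```python
import torch

def sft_loss(model, tokenizer, prompt, gold_sql, device="cuda"):
    """Negative log-likelihood of the gold SQL given the prompt; prompt tokens are masked out."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    target_ids = tokenizer(gold_sql + tokenizer.eos_token, return_tensors="pt").input_ids.to(device)
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[-1]] = -100   # positions set to -100 are ignored by the loss
    return model(input_ids=input_ids, labels=labels).loss
```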
4.2. Explicit Schema Integration
Explicit schema integration aims to embed schema-linking features between natural language questions and database schemas into the model’s input during training. These well-structured prompts help the model to better understand the relationships between database elements. This significantly improves its ability to generate accurate SQL queries. Our method focuses on three key components: schema filtering, value retrieval, and database metadata integration.
Figure 3 provides a practical example output of explicit schema integration. Given the query “What is the average number of Mubi users who love movies directed by Stanley Kubrick?”, through schema filtering, relevant tables and columns are identified; value retrieval extracts the specific query-related value “Stanley Kubrick”; and database metadata integration incorporates column types, representative values, and key relationships.
Schema Filtering. To ensure that the database schema provided to the model is closely aligned with the input query, we retrieve the tables and columns that are most relevant to generating the target SQL query. To achieve this, we design a schema filter $f$ that evaluates the relevance of database tables $T$ and columns $C$ with respect to a given query $Q$. The filter assigns a relevance score as follows:
$$
s_{T_i} = f(Q, T_i), \qquad s_{C_j} = f(Q, C_j).
$$
We include the top $k_1$ tables and top $k_2$ columns with the highest relevance scores in the training prompt. In cases where fewer than $k_1$ tables are deemed relevant, we supplement the prompt with randomly selected tables from the database. This strategy minimizes token usage while preserving the essential schema information needed to generate accurate SQL queries.
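The schema filter in the paper is a trained schema item classifier; as a stand-in, the sketch below scores tables and columns with sentence-transformers embedding similarity, which is purely an assumption made to illustrate the top-$k_1$/top-$k_2$ selection.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder relevance model, not the paper's classifier

def filter_schema(question: str, tables: dict[str, list[str]], k1: int = 6, k2: int = 10):
    """Return the k1 most relevant tables and k2 most relevant columns for the question."""
    q_emb = encoder.encode(question, convert_to_tensor=True)

    def score(text: str) -> float:
        return util.cos_sim(q_emb, encoder.encode(text, convert_to_tensor=True)).item()

    table_scores = {t: score(t + " " + " ".join(cols)) for t, cols in tables.items()}
    top_tables = sorted(table_scores, key=table_scores.get, reverse=True)[:k1]
    column_scores = {f"{t}.{c}": score(f"{t} {c}") for t in top_tables for c in tables[t]}
    top_columns = sorted(column_scores, key=column_scores.get, reverse=True)[:k2]
    return top_tables, top_columns
```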
Value Retrieval. Incorporating query-specific values from the database into the prompt is crucial for accurate SQL generation. For example, given the query “How many people live in Shanghai?”, the database column “city.name” containing the value “Shanghai” should result in the condition “city.name = ‘Shanghai’” being included in the prompt. However, directly retrieving values from a large database can be computationally expensive. To address this challenge, we employ a two-stage approach. First, we construct a BM25 index [33] to perform a coarse search, narrowing down the potential matches. Then, we apply the longest common substring (LCS) algorithm to identify the specific values corresponding to the query. This coarse-to-fine strategy effectively reduces the computational overhead while maintaining high retrieval accuracy.
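One possible realization of this coarse-to-fine value retrieval is sketched below, using the rank_bm25 package for the coarse search and a dynamic-programming longest common substring for re-ranking; the index granularity (one entry per cell value) and the thresholds are assumptions.

```python
from rank_bm25 import BM25Okapi

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest common substring between a and b (O(len(a)*len(b)) DP)."""
    best, prev = 0, [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def retrieve_values(question: str, cell_values: list[str], coarse_k: int = 50, fine_k: int = 3):
    """Coarse BM25 search over all cell values, then LCS re-ranking of the shortlist."""
    bm25 = BM25Okapi([v.lower().split() for v in cell_values])
    candidates = bm25.get_top_n(question.lower().split(), cell_values, n=coarse_k)
    scored = [(longest_common_substring(question.lower(), v.lower()), v) for v in candidates]
    return [v for s, v in sorted(scored, reverse=True)[:fine_k] if s > 3]
```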
Database Metadata Integration. To further enrich schema-linking, we incorporate key metadata elements into the prompt as follows.
Primary and Foreign Keys: Incorporating primary and foreign key information can help the model to deduce the appropriate join path. We extract this information to establish relationships between tables and guide the model in constructing accurate JOIN operations.
Representative Column Values: By including representative values for each column (e.g., human.gender = F|M), we enhance the model’s understanding of the column content and format.
Column Data Types: Column data types dictate the validation rules and permissible operations, ensuring accurate SQL query formulation. For instance, numeric values stored in string columns require casting before arithmetic operations can be performed.
Comments: Database comments help to clarify ambiguities in schema elements and enable the LLM to perform precise schema-linking. For example, by adding the comment “time consumed per training round”, the model can understand the true meaning of the column name “tcpr”.
Algorithm 1 summarizes the procedures of our explicit schema integration process. By building these high-quality prompts, we significantly improve the model’s capacity to generate precise SQL queries during SPIN.
Algorithm 1: Explicit Schema Integration
Input: user question $Q$, schema item classifier model $f$, database schema $S$, database metadata $M$, database index $I$, maximum table and column numbers $k_1$, $k_2$
Output: database prompt $P$
// Schema Filtering
  score the tables and columns in $S$ with $f$ against $Q$; keep the top $k_1$ tables and top $k_2$ columns;
// Value Retrieval
  use the BM25 index $I$ to retrieve candidate cell values for $Q$; refine the candidates with the LCS algorithm;
// Metadata Integration and Prompt Concatenation
  collect primary/foreign keys, representative column values, column data types, and comments from $M$ for the retained schema items;
  concatenate the filtered schema, retrieved values, and metadata into the prompt $P$;
return $P$
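To make Algorithm 1 concrete, the sketch below assembles a prompt in plain Python by chaining the filter_schema and retrieve_values sketches above; the data structures and the exact prompt layout are assumptions, not the paper's prompt format.

```python
def build_database_prompt(question, tables, metadata, cell_values, k1=6, k2=10):
    """Assemble the database prompt following Algorithm 1: schema, values, then metadata."""
    # Schema filtering (see the filter_schema sketch above).
    top_tables, top_columns = filter_schema(question, tables, k1, k2)

    # Value retrieval (see the retrieve_values sketch above).
    values = retrieve_values(question, cell_values)

    # Metadata integration and prompt concatenation.
    lines = []
    for t in top_tables:
        cols = [c.split(".", 1)[1] for c in top_columns if c.startswith(t + ".")]
        lines.append(f"table {t}({', '.join(cols)})")
    lines += [f"matched value: {v}" for v in values]
    lines += [f"foreign key: {fk}" for fk in metadata.get("foreign_keys", [])]
    lines += [f"{c}: type={metadata.get('types', {}).get(c, '?')}, "
              f"comment={metadata.get('comments', {}).get(c, '')}" for c in top_columns]
    return "\n".join(lines)
```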
4.3. Execution Feedback Fine-Tuning
After explicit schema integration, we obtain the training input prompt for SPIN. In the $t$-th iteration, we concatenate the database prompt text $P$ with the query input $x$ and denote the result as $\hat{x}$, which is represented as
$$
\hat{x} = P \oplus x.
$$
With this enhanced input, we train the main model and the adversary model. In the SPIN training process, the goal of the main model is to distinguish between the outputs of the adversary model and the outputs in the original dataset. After enhancing the input with the database prompt text, the main model $p_{\theta_{t+1}}$ maximizes the expected gap between the target data distribution $p_{\mathrm{data}}$ and the adversary model distribution $p_{\theta_t}$:
$$
f_{t+1} = \arg\max_{f \in \mathcal{F}_t} \; \mathbb{E}\big[ f(\hat{x}, y) - f(\hat{x}, y') \big],
$$
where the expectation is computed over the distributions $\hat{x} \sim q(\cdot)$, $y \sim p_{\mathrm{data}}(\cdot \mid \hat{x})$, and $y' \sim p_{\theta_t}(\cdot \mid \hat{x})$.
Subsequently, we use an executor to filter the responses generated by the adversary model in each round by running them on the database:
$$
\big( y'_{\mathrm{correct}},\; y'_{\mathrm{wrong}} \big) = \mathrm{Execute}(y', \mathrm{DB}).
$$
By discarding the correct SQL statements $y'_{\mathrm{correct}}$ generated by the adversary model, we retain only the incorrect SQL statements $y'_{\mathrm{wrong}}$ and use them to update the objective function of the main model:
$$
f_{t+1} = \arg\max_{f \in \mathcal{F}_t} \; \mathbb{E}\big[ f(\hat{x}, y) - f(\hat{x}, y'_{\mathrm{wrong}}) \big].
$$
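A minimal sketch of how this execution-feedback filter could work is given below, assuming SQLite databases and an order-insensitive comparison of result sets against the gold query's result; timeouts and value normalization are omitted.

```python
import sqlite3

def execute_sql(db_path: str, sql: str):
    """Run a query and return its result rows as an order-insensitive sorted list, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
        return sorted(map(tuple, rows))
    except sqlite3.Error:
        return None

def filter_adversary_outputs(db_path: str, gold_sql: str, adversary_sqls: list[str]):
    """Keep only adversary SQL whose execution result differs from the gold result."""
    gold_result = execute_sql(db_path, gold_sql)
    wrong = [sql for sql in adversary_sqls
             if execute_sql(db_path, sql) != gold_result]
    return wrong   # y'_wrong used in the updated objective; correct SQL is discarded
```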
For a given $\hat{x}$ and a response $y$ to $\hat{x}$, the value $f_{t+1}(\hat{x}, y)$ reflects the main model’s ability to distinguish the outputs of the adversary model. Ideally, when $y \sim p_{\mathrm{data}}(\cdot \mid \hat{x})$, $f_{t+1}(\hat{x}, y)$ should output a higher value, whereas, when $y \sim p_{\theta_t}(\cdot \mid \hat{x})$, it should output a lower value. After obtaining the discrimination results from the main model, we update the parameters $\theta_{t+1}$ of the adversary model. The objective function for updating the adversary model is as follows:
$$
p_{\theta_{t+1}} = \arg\max_{p_{\theta},\, \theta \in \Theta} \; \mathbb{E}_{y \sim p_{\theta}(\cdot \mid \hat{x})}\big[ f_{t+1}(\hat{x}, y) \big] \;-\; \lambda\, \mathbb{E}\big[ \mathrm{KL}\big( p_{\theta}(\cdot \mid \hat{x}) \,\|\, p_{\theta_t}(\cdot \mid \hat{x}) \big) \big].
$$
Finally, we integrate the two aforementioned steps into an end-to-end training objective and derive our final loss function:
$$
\mathcal{L}_{\mathrm{ExSPIN}}(\theta, \theta_t) = \mathbb{E}\left[ \ell\!\left( \lambda \log \frac{p_\theta(y \mid \hat{x})}{p_{\theta_t}(y \mid \hat{x})} - \lambda \log \frac{p_\theta(y'_{\mathrm{wrong}} \mid \hat{x})}{p_{\theta_t}(y'_{\mathrm{wrong}} \mid \hat{x})} \right) \right],
$$
where the expectation is taken over $\hat{x} \sim q(\cdot)$, $y \sim p_{\mathrm{data}}(\cdot \mid \hat{x})$, and the adversary outputs $y'_{\mathrm{wrong}}$ whose execution results are incorrect. This formula represents the ultimate training objective of ExSPIN. Unlike SPIN, we use the execution feedback mechanism to filter the training samples, keeping only those on which the adversary model generates incorrect SQL. By calculating the difference in the log-probability ratios between the correctly human-annotated SQL $y$ and the incorrect SQL $y'_{\mathrm{wrong}}$, we guide the main model to adjust its outputs to align more closely with $y$, thereby correcting the errors of the adversary model.
We then iteratively repeat this self-play process, allowing the model to progressively deepen its understanding of the database schema links, ultimately generating high-quality SQL queries.
Our theoretical framework is analogous to adversarial games, where the model continuously improves through competition with its past versions. Specifically, in ExSPIN, under the execution feedback mechanism, the model applies a penalty to incorrect SQL queries generated by its opponent model:
$$
-\lambda \log \frac{p_\theta(y'_{\mathrm{wrong}} \mid \hat{x})}{p_{\theta_t}(y'_{\mathrm{wrong}} \mid \hat{x})}.
$$
This penalty term guides the model to identify and correct errors in the opponent model’s generated SQL queries, helping it to recognize its own weaknesses and produce more accurate and executable SQL queries. At the same time, the model applies a reward to correctly annotated SQL queries:
$$
\lambda \log \frac{p_\theta(y \mid \hat{x})}{p_{\theta_t}(y \mid \hat{x})}.
$$
This reward term encourages the model to modify its outputs to better align with correct SQL queries, effectively guiding the model to update itself in the right direction. Through this iterative self-play process, the model progressively enhances its text-to-SQL performance.
Using the executor, we separate the correct and incorrect SQL statements generated by the adversary model. To maintain model robustness, we discard the correctly executed results and use only the incorrect results for SPIN. This approach ensures that the main player model negates only incorrect SQL statements while acknowledging correct ones. As a result, the SQL generation accuracy improves during self-play. The detailed procedure is presented in Algorithm 2. Given a query $x$ and an initial model $p_{\theta_0}$, we first use explicit schema integration to obtain the database prompt $P$. By concatenating $P$ with $x$, we construct the model input $\hat{x}$. The model is then divided into a main player model and an opponent model. The opponent model generates SQL queries, and an executor is used to retain only the incorrect SQL queries. These retained data are used to train the main player model, resulting in the next iteration of the model. Finally, the opponent model is updated to the newly trained model. In this process, the opponent model in each round is the model produced by the training in the previous round.
Algorithm 2: ExSPIN
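The pseudocode of Algorithm 2 is provided as a figure in the paper; as an illustration only, the sketch below condenses the loop described above in Python, reusing the earlier filter_adversary_outputs and spin_loss sketches. The dataset fields and sampling settings are assumptions.

```python
import copy
import torch

def exspin_iteration(model, tokenizer, optimizer, dataset, lam=0.1, device="cuda"):
    """One ExSPIN self-play round: the opponent is a frozen copy of the current model."""
    opponent = copy.deepcopy(model).eval()
    for ex in dataset:
        prompt = ex["db_prompt"] + "\n" + ex["question"]          # x_hat = P concatenated with x
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        # The opponent proposes a SQL query; sampling settings are illustrative.
        with torch.no_grad():
            out = opponent.generate(prompt_ids, do_sample=True, max_new_tokens=256)
        pred_sql = tokenizer.decode(out[0, prompt_ids.shape[-1]:], skip_special_tokens=True)
        # Execution feedback: skip samples the opponent already answers correctly.
        if not filter_adversary_outputs(ex["db_path"], ex["gold_sql"], [pred_sql]):
            continue
        gold_ids = tokenizer(ex["gold_sql"], return_tensors="pt").input_ids.to(device)
        wrong_ids = tokenizer(pred_sql, return_tensors="pt").input_ids.to(device)
        loss = spin_loss(model, opponent, prompt_ids, gold_ids, wrong_ids, lam)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return model
```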
5. Experiments
5.1. Experimental Setup
5.1.1. Datasets
We perform testing on the following two real-world benchmarks.
SPIDER. The SPIDER dataset [16] is divided into a training set (8659 samples), a development set (1034 samples), and a test set. Of the 8659 training samples, 7000 are newly annotated and 1659 are sourced from six previously released text-to-SQL datasets, including Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp. The SPIDER dataset comprises 200 databases spanning 138 different domains. We evaluate our method using the test set provided by SPIDER, which includes 2147 test queries and 40 test databases.
BIRD. The BIRD dataset [34] consists of 12,751 question–SQL pairs across 37 domains, including finance, healthcare, education, and more. These queries are carefully crafted to reflect the complexities of real-world scenarios, such as handling large databases (some exceeding 10,000 rows), implicit domain knowledge, and noisy or ambiguous natural language queries. Additionally, BIRD prioritizes execution accuracy over syntactic correctness, ensuring that models are evaluated not only on their ability to generate valid SQL but also on their capacity to produce queries that yield correct and meaningful results when executed on real databases.
5.1.2. Models
We evaluated our approach across four distinct open-source LLMs for code generation, with parameter sizes ranging from 1.5B to 14B.
DeepSeek-Coder-Instruct 6.7B: The DeepSeek-Coder [6] series is a comprehensive suite of code LLMs, each meticulously trained from the ground up on an extensive dataset comprising 2 trillion tokens. This dataset consists of 87% code and 13% natural language, spanning both English and Chinese. The series includes models of varying sizes, ranging from 6.7 billion to 33 billion parameters, designed to support a wide range of application scenarios. All models are pre-trained on a project-level code corpus, utilizing a 16,000-token context window and incorporating a fill-in-the-blank task to enhance their proficiency in project-level code completion and infilling. In terms of coding capabilities, DeepSeek-Coder sets new benchmarks among open-source code models across multiple programming languages and evaluation benchmarks, demonstrating its exceptional proficiency in code comprehension and generation.
Qwen2.5-Coder-Instruct 1.5B/7B/14B: Qwen2.5-Coder [35] represents the evolution of open-source coding models, succeeding CodeQwen1.5. Available in 1.5 billion, 7 billion, 14 billion, and 32 billion parameter versions, it is trained on an enormous 5.5 trillion tokens, which include code, text-code grounding, and synthetic data. This extensive training regimen significantly enhances its coding capabilities while preserving its strong performance in mathematical and general tasks. Supporting up to 128,000 tokens and 92 programming languages, Qwen2.5-Coder excels in code generation, completion, and repair. Its instruction-tuned variant, Qwen2.5-Coder-Instruct, further refines task performance and generalization, demonstrating exceptional skill in multi-programming, code reasoning, and mathematical tasks, while maintaining robust general capabilities.
5.1.3. Baselines
To evaluate the performance of our ExSPIN, we use two SOTA fine-tuning methods, SPIN [14] and SFT [36], as baselines.
5.1.4. Metrics
Following [2], we evaluate the performance of our text-to-SQL model using execution accuracy (EX) as the primary metric. This metric measures whether the SQL query generated by the model, when executed on the database, produces the same result as the execution of the ground truth (gold) SQL query.
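For reference, EX can be computed with the same kind of executor sketched in Section 4.3; the snippet below is an illustrative simplification that treats result sets as order-insensitive, and the example field names are placeholders.

```python
def execution_accuracy(examples):
    """EX metric: share of predictions whose execution result matches the gold result."""
    hits = 0
    for ex in examples:
        pred = execute_sql(ex["db_path"], ex["pred_sql"])   # see the executor sketch in Section 4.3
        gold = execute_sql(ex["db_path"], ex["gold_sql"])
        hits += int(pred is not None and pred == gold)
    return hits / max(len(examples), 1)
```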
5.2. Overall Comparison
Figure 4 and Figure 5 show the execution accuracy of the SQL queries generated on the SPIDER and BIRD datasets when applying SFT, SPIN, and ExSPIN to four different models. For both SPIN and ExSPIN, we report the best results across four iterations. The experimental results reveal that the SPIN method did not positively impact the SFT models. On the SPIDER dataset, SPIN achieved accuracy of only 64.2% on the DeepSeek-Coder 6.7B model, representing a 12% decrease compared to SFT. On the BIRD dataset, SPIN reached accuracy of just 30.4% on the Qwen2.5-Coder 7B model, a reduction of 1% compared to SFT. These findings suggest that SPIN failed to improve the models through self-play on top of SFT, instead leading to incorrect updates and degraded performance. This failure can be attributed to the SPIN training process, where the main player model is trained to distinguish between SQL queries generated by the opponent model across the entire dataset and adjust its parameters to avoid generating queries similar to the opponent’s. However, in the text-to-SQL task, the opponent model may generate correct SQL queries. As a result, SPIN inadvertently guides the main player model toward producing incorrect answers. This issue is less pronounced when the opponent model generates SQL queries with low accuracy, such as with the Qwen2.5-Coder 1.5B model.
In nearly all cases, ExSPIN achieves superior performance. For example, on the SPIDER dataset with the DeepSeek-Coder 6.7B model, ExSPIN reached accuracy of 83.2%, representing a 19.0% improvement over SPIN. Similarly, on the BIRD dataset, ExSPIN achieved accuracy of 52.1%, a 25.0% increase compared to SPIN. This performance boost can be attributed to the integration of the execution feedback mechanism, which addresses SPIN’s shortcomings, while explicit schema integration enhances the model’s understanding of the relationships between the query and the database. By embedding schema information into the self-play fine-tuning process through techniques such as schema filtering, value retrieval, and metadata integration, ExSPIN overcomes the inherent limitations of traditional SPIN. The inclusion of schema-linking allows the model to more effectively align natural language queries with the underlying database structure, especially in complex scenarios involving multi-table joins, precise value matching, and a deeper understanding of schema relationships. This capability significantly reduces the semantic gap between natural language and SQL, resulting in more accurate queries.
Table 1 and Table 2 show the accuracy of different types of queries on the SPIDER and BIRD datasets using Qwen2.5-Coder 14B, respectively. In almost all cases, our ExSPIN model achieves the best performance. The experimental results indicate that both the SFT and SPIN methods struggle with three specific types of SQL queries: join queries without aggregates, aggregate queries with join and group by, and nested subqueries. On the SPIDER dataset, the accuracy of both the SFT and SPIN methods for these query types did not exceed 80%, with SPIN achieving only 73.1% accuracy on nested subqueries. On the BIRD dataset, the accuracy for nested subqueries dropped even further, with SFT and SPIN achieving only 28% and 27.7%, respectively. This performance gap arises because the handling of join operations and nested subqueries requires a clear understanding of the relationships between the database tables. However, the SPIN method relies solely on implicit associations between natural language queries and the database schema, which limits the model’s ability to accurately capture the underlying data relationships. In contrast, ExSPIN explicitly integrates schema and execution feedback information, providing the model with direct access to database structural features. This enhancement significantly improves the model’s ability to capture relationships between tables.
Finally, we conducted comparative experiments involving GPT-4o and the DPO [24] method, where SFT, SPIN, DPO, and ExSPIN were implemented on the DeepSeek-Coder 6.7B model. The experimental results are shown in Table 3. The experimental findings indicate that, while the DPO method still occasionally misguides the model into altering its correct outputs, its impact on the model’s accuracy is relatively minor compared to the SPIN method, due to the smaller number of parameters involved in fine-tuning. The ExSPIN method continues to demonstrate significant advantages over DPO, achieving improvements of 15.9% on the SPIDER dataset and 24.3% on the BIRD dataset. Compared to GPT-4o, by fine-tuning the DeepSeek 6.7B model, ExSPIN achieved a 7.1% lead on the SPIDER dataset and a 6% lead on the BIRD dataset. Since some of the databases and question–answer pairs in the SPIDER test set differ from those in the training set, they can be considered as unseen data for the model. The experimental results show that ExSPIN outperforms other methods on the SPIDER test set, demonstrating the generalization ability of our approach to unseen databases.
5.3. Parameter Study
In this subsection, we evaluate the impact of the regularization parameter $\lambda$ on ExSPIN training with DeepSeek-Coder 6.7B. The value of $\lambda$ is varied between 0.1 and 1.2. As shown in Figure 6 and Figure 7, the experimental results indicate that a low $\lambda$ limits the model’s ability to update its parameters effectively. As a result, the model fails to leverage errors in the generated SQL during training, leading to only marginal improvements in accuracy. On the other hand, an excessively high $\lambda$ leads to the over-penalization of both the log probabilities of the gold SQL and the generated SQL, as the model attempts to maximize the gap between their log probabilities. This, in turn, reduces the similarity between the generated SQL and the gold SQL, making it more difficult for the model to produce high-quality queries.
5.4. Execution Feedback Mechanism Analysis
To better illustrate the improvement of the execution feedback mechanism in self-play fine-tuning, we have selected two examples to demonstrate the impact of execution feedback on SQL generation.
Table 4 presents the first example, where we observe that the model successfully generates the correct SQL under the guidance of explicit schema integration after SFT (Section 4.2). We then conducted self-play experiments both with and without the execution feedback mechanism. As shown in Table 4, the SQL generated without execution feedback is incorrect, whereas correct results are generated when execution feedback is included. This example illustrates that, while the SFT model can generate correct SQL statements under explicit schema integration, there is a notable difference between the generated SQL and the gold SQL in terms of grammatical structure. In the absence of the execution feedback mechanism, the model alters its originally correct output during self-play, ultimately failing to generate the correct SQL. Furthermore, this process reduces the diversity of the SQL outputs and weakens the model’s generalization ability. In contrast, when the feedback mechanism is introduced, misleading samples are filtered out during training. This prevents the model from being confused by its own outputs and ensures that it consistently generates correct SQL statements.
Table 5 shows the second example, where the model still fails to generate the correct SQL even after SFT. Consequently, this sample is retained for self-play, where the model corrects the erroneous outputs of its opponent model and learns from the gold SQL. As a result, the model eventually produces the correct SQL, thereby improving its ability to handle the text-to-SQL task.
These two cases demonstrate that the execution feedback mechanism plays a crucial role in self-play. It enables the model to correctly revise its mistakes while retaining the correct SQL outputs that it has already learned, ultimately improving its overall performance on text-to-SQL tasks.
5.5. Ablation Study
Finally, we conducted ablation experiments on ExSPIN with DeepSeek-Coder 6.7B to answer the following two questions:
- 1. How does explicitly incorporating database features into the model affect its ability to understand the relationship between the database and the query?
- 2. What is the impact of introducing an execution feedback mechanism during the SPIN process on the model’s SQL generation capabilities in self-play?
Table 6 and Table 7 present the results of our ablation experiments on the two datasets. The experimental results show that explicit schema integration contributes an 8% improvement in the SQL generation accuracy, as it helps the model to better identify relevant tables, columns, and query structures from the database in relation to the input question. Additionally, we observe that the execution feedback mechanism plays a crucial role in guiding multi-round self-play training, enabling the model to correct its own erroneous SQL queries and avoid modifying correct SQL statements. As a result, the SQL generation accuracy was improved from 81.6% to 83.2% on SPIDER and from 39.6% to 52.1% on BIRD.
5.6. Resource Consumption
Table 8 and Table 9 present a comparison of the resource consumption during the training processes of SPIN, ExSPIN, and SFT on the SPIDER and BIRD datasets. The experimental results demonstrate that ExSPIN, when fine-tuning the DeepSeek-Coder-6.7B-Instruct model on the SPIDER dataset, utilizes four A800 GPUs and consumes 51.25 min per epoch. In contrast, the SPIN method requires 203.5 min per epoch. This significant difference arises because ExSPIN employs an execution feedback mechanism to filter out correct examples that the current model can already generate before training, retaining only a small subset of incorrect examples for fine-tuning. On the other hand, SPIN trains on the entire dataset in each iteration, resulting in substantially higher time consumption compared to ExSPIN. Additionally, we observe that ExSPIN consumes only 31.27 min per epoch when fine-tuning the Qwen2.5-Coder-14B model. This is attributed to the stronger baseline capabilities of Qwen2.5-Coder-14B compared to DeepSeek-Coder-6.7B-Instruct, which lead to fewer remaining bad cases after the execution feedback mechanism filters the dataset, thereby reducing the training time. This suggests that our approach is more scalable to larger and more complex datasets compared to SPIN.
5.7. Bad Case Analysis
The performance on different SQL query types in Section 5.2 reveals that ExSPIN still has limitations in handling nested subqueries, as overly complex SQL query patterns remain challenging for the current model. Table 10 presents examples of the errors encountered by ExSPIN when processing nested subqueries. These examples demonstrate that, when SQL statements contain structurally complex nested subqueries, ExSPIN struggles to generate properly structured subqueries, often producing oversimplified subqueries or failing to generate them altogether. Additionally, the model faces difficulties in managing UNION relationships between subqueries. These findings indicate that ExSPIN’s capability in handling complex nested subqueries requires further improvement. Nevertheless, our method has achieved a significant improvement in accuracy for nested subqueries, outperforming SFT and SPIN by 5.3% to 14.2%.
5.8. Limitations
Figure 8 and Figure 9 show the SQL execution accuracy at each iteration for SPIN and ExSPIN across multiple rounds of training. The experimental results reveal a downward trend in the SQL execution accuracy as the number of training iterations increases. This decline is attributed to the fact that both SPIN and ExSPIN perform iterative training on the same dataset, which increases the likelihood of overfitting to the training set after several iterations. In future work, we plan to investigate methods such as dataset partitioning to mitigate the risk of overfitting during iterative training.
Secondly, regarding the impact of database biases on model performance: since our method assumes that the training and test sets share the same distribution, during multi-turn self-play fine-tuning the model learns the distribution of the databases and SQL in the training set. If the dataset contains biases, this inherently limits the potential of the current self-play approach. Surprisingly, the experiments on the SPIDER test set shown in Table 3 of Section 5.2 indicate that our method has some resistance to dataset bias. Despite the schema and SQL-type biases between the SPIDER test set and training set, our method still improves performance by 6.9% to 19% over the SFT, SPIN, GPT-4o, and DPO methods on the SPIDER test set.
Finally, compared to SFT, the ExSPIN method consumes more memory and requires longer training times, which is indeed a limitation of our approach. We plan to explore ways to reduce the computational overhead in the future.
6. Conclusions
In this work, we have introduced ExSPIN, a novel explicit feedback-based self-play fine-tuning framework designed to address the limitations of conventional SPIN methods in text-to-SQL parsing. By explicitly incorporating schema-linking features and SQL execution results into the training process, our method bridges the gap between natural language queries and database schemas, enabling the model to generate more accurate SQL queries. Through schema filtering, value retrieval, and metadata integration, ExSPIN demonstrates a significant improvement over existing methods, as evidenced by its superior performance on the SPIDER and BIRD datasets. These results highlight the importance of leveraging explicit schema-linking in self-play fine-tuning, paving the way for more robust and efficient text-to-SQL parsers. We believe that our approach sets a strong foundation for future research in combining schema-aware techniques with large language models and offers promising opportunities for enhanced database interaction systems.