TQAgent: Enhancing Table-Based Question Answering with Knowledge Graphs and Tree-Structured Reasoning
Abstract
1. Introduction
2. Related Works
3. Materials and Methods
3.1. Question Answering Method
Algorithm 1 Planner Algorithm

Require: T, the current state of the table; Q, the task goal; n, the number of samples; k, the number of operations to consider in each iteration; τ, the probability threshold.
Ensure: C, the final sequence of candidate actions.

1: Initialize an empty list C for the final sequence of actions.
2: repeat
3:   Initialize an empty list A for all candidate actions.
4:   for i = 1 to n do ▹ Sample n times
5:     Generate a candidate action a_i with probability p_i given (T, Q, C). ▹ Generate
6:     Append (a_i, p_i) to A.
7:   end for
8:   Sort A by p_i in descending order.
9:   Keep the top k highest-probability actions in A. ▹ Select the top k highest probability actions.
10:  Remove from A every action with p_i below τ. ▹ Filter actions based on the threshold.
11:  Let a* be the highest-probability action remaining in A. ▹ Select the next action.
12:  if a* exists then
13:    Append a* to C.
14:    Execute a* and update T. ▹ Update the table state.
15:  end if
16: until the task goal Q is reached or no action passes the filter
17: return C
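The sample-sort-filter loop of Algorithm 1 can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `propose_action` is a hypothetical stand-in for the LLM call (here it scores candidates randomly), `apply_action` stands in for the Worker, and the fixed step budget stands in for the algorithm's termination test.

```python
import random

CANDIDATES = ["select_col", "select_where_row", "group_by",
              "sort_by", "add_col", "retrieve_from_repo"]

def propose_action(table, goal, chain):
    # Hypothetical stand-in for the LLM call: returns a candidate action
    # and the model's probability for it (random here, for illustration).
    return random.choice(CANDIDATES), random.random()

def apply_action(table, action):
    # Placeholder: the real Worker would transform the table here.
    return table

def plan(table, goal, n=6, k=3, threshold=0.5, max_steps=5):
    chain = []                               # final action sequence C
    for _ in range(max_steps):               # stands in for the 'until' test
        samples = [propose_action(table, goal, chain) for _ in range(n)]
        samples.sort(key=lambda ap: ap[1], reverse=True)  # sort by probability
        viable = [(a, p) for a, p in samples[:k] if p >= threshold]
        if not viable:                       # no action clears the threshold
            break
        action, _ = viable[0]                # select the next action
        chain.append(action)
        table = apply_action(table, action)  # update the table state
    return chain

print(plan(table=None, goal="example question"))
```

The key design point the sketch preserves is that filtering happens after top-k selection, so low-confidence actions are discarded even when they survive the ranking.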
- Planner Reads the Initial Table and the Question and Plans: As a dynamic planner, the Planner generates and manages a series of action chains. Based on the current state of the table and the task goal, the Planner performs n samplings, uses the probability threshold as the confidence level, and infers the operation tree. By continuously optimizing the action chain, the Planner improves the decision-making efficiency and accuracy of the system.
- Worker Executes Specific Operations: As the execution component, the Worker carries out each operation formulated by the Planner step by step. Following the Planner’s instructions, the Worker dynamically adjusts the table data (such as adding a column with add_col or performing conditional selection with select_where_row) and, when necessary, retrieves information from the external knowledge base or from the knowledge embedded in the model through operations like retrieve_from_repo or query. The Worker feeds the result of each execution step back to the Planner for further adjustment and optimization of the action chain. All table operations are managed with pandas DataFrames (https://pandas.pydata.org, accessed on 7 March 2025) to ensure the efficiency and consistency of data processing.
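Two of the Worker operations named above, add_col and select_where_row, might be sketched over a pandas DataFrame as follows. The function bodies and the toy table are assumptions for illustration; the paper only names the operations, not their implementations.

```python
import pandas as pd

# Toy working table; the Worker holds the table state as a DataFrame.
df = pd.DataFrame({
    "country": ["France", "Japan", "Brazil"],
    "population_m": [68, 125, 216],
})

def add_col(table, name, values):
    # add_col: append a derived column to the working table.
    out = table.copy()
    out[name] = values
    return out

def select_where_row(table, column, predicate):
    # select_where_row: keep only rows whose value satisfies the predicate.
    return table[table[column].apply(predicate)].reset_index(drop=True)

df = add_col(df, "over_100m", df["population_m"] > 100)
big = select_where_row(df, "over_100m", bool)
print(big["country"].tolist())  # ['Japan', 'Brazil']
```

Returning a fresh DataFrame from every operation keeps each step of the action chain side-effect-free, which makes it cheap for the Planner to discard a branch of the operation tree.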
3.2. Knowledge Injection
3.3. Knowledge-Augmented Fine-Tuning
4. Experiments
4.1. Datasets
4.2. Baselines
- Generic Reasoning
- Text-to-SQL [23]: Converts natural language questions into SQL queries for table-based reasoning.
- Dater [19]: Employs sub-table decomposition strategy with LLaMA2-13b for end-to-end table reasoning.
- Chain-of-Thought [39]: Guides GPT-3.5 through step-by-step reasoning via explicit reasoning chains.
- Small Language Model Reasoning. This category focuses on the reasoning performance of 7B-8B scale models, comparing pre-finetuning and post-finetuning effectiveness:
- Zero-Shot Reasoning [40]: Direct inference using base versions of Qwen2-7b and Llama3-8b.
- Instruction-Finetuned Reasoning: Uses instruction-fine-tuned variants Qwen2-7b-Instruct and Llama3-8b-Instruct.
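The Text-to-SQL baseline can be illustrated with a toy example over the Jeep Grand Cherokee table used later in the prompts. In the actual baseline an LLM translates the natural-language question into SQL; here the query is hard-coded for illustration, and the numeric displacement values are an assumption.

```python
import sqlite3

# In-memory toy table. The baseline would have an LLM translate
# "Which engine has the lowest displacement?" into the SQL below.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jeep (years TEXT, displacement REAL, engine TEXT)")
conn.executemany("INSERT INTO jeep VALUES (?, ?, ?)", [
    ("1999-2004", 4.0, "power tech i6"),
    ("1999-2004", 4.7, "powertech v8"),
    ("2002-2004", 4.7, "high output powertech v8"),
    ("1999-2001", 3.1, "531 ohv diesel i5"),
    ("2002-2004", 2.7, "om647 diesel i5"),
])

sql = "SELECT engine FROM jeep ORDER BY displacement ASC LIMIT 1"
answer = conn.execute(sql).fetchone()[0]
print(answer)  # om647 diesel i5
```

The appeal of this baseline is that execution is exact once the SQL is correct; its weakness, noted in the comparison above, is that a single mistranslated query fails with no opportunity for stepwise correction.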
4.3. Results
- Impact of Sample Count: This section investigates the impact of varying the sample count n on model performance under a fixed probability threshold. As shown in Table 4:
- Low-sample regime (n ≤ 3): Limited sampling efficiently captures critical decision paths, with each new sample providing novel information (e.g., WikiTQ accuracy increases from 46.23% to 50.78% as n grows from 1 to 3).
- High-sample regime (n ≥ 4): Redundant samples emerge due to overlapping information, reducing marginal benefits (e.g., the FeTaQA BLEU score plateaus at 28.62 as n increases from 4 to 6).
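The diminishing returns can be illustrated with a simple coverage model: assume each sample independently hits a useful decision path with probability p, so the chance that at least one of n samples succeeds is 1 − (1 − p)^n. This model and the value p = 0.4 are illustrative assumptions, not the paper's analysis.

```python
# Coverage model: probability that at least one of n independent samples
# finds a useful decision path, when each succeeds with probability p.
p = 0.4
prev = 0.0
for n in range(1, 7):
    coverage = 1 - (1 - p) ** n
    gain = coverage - prev          # marginal benefit of the n-th sample
    print(f"n={n}: coverage={coverage:.3f}, marginal gain={gain:.3f}")
    prev = coverage
```

The marginal gain shrinks geometrically with each extra sample, which matches the observed plateau between n = 4 and n = 6.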
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Fang, X.; Xu, W.; Tan, F.A.; Zhang, J.; Hu, Z.; Qi, Y.; Nickleach, S.; Socolinsky, D.; Sengamedu, S.; Faloutsos, C. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding—A Survey. arXiv 2024, arXiv:2402.17944. [Google Scholar]
- Ritze, D.; Lehmberg, O.; Bizer, C. Matching html tables to dbpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, Larnaca, Cyprus, 13–15 July 2015; pp. 1–6. [Google Scholar]
- Zhang, S.; Balog, K. Entitables: Smart assistance for entity-focused tables. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 255–264. [Google Scholar]
- Cheng, Z.; Dong, H.; Wang, Z.; Jia, R.; Guo, J.; Gao, Y.; Han, S.; Lou, J.G.; Zhang, D. Hitab: A hierarchical table dataset for question answering and natural language generation. arXiv 2021, arXiv:2108.06712. [Google Scholar]
- Nan, L.; Hsieh, C.; Mao, Z.; Lin, X.V.; Verma, N.; Zhang, R.; Kryściński, W.; Schoelkopf, H.; Kong, R.; Tang, X.; et al. FeTaQA: Free-form table question answering. Trans. Assoc. Comput. Linguist. 2022, 10, 35–49. [Google Scholar]
- Liu, Q.; Chen, B.; Guo, J.; Ziyadi, M.; Lin, Z.; Chen, W.; Lou, J.G. TAPEX: Table pre-training via learning a neural SQL executor. arXiv 2021, arXiv:2107.07653. [Google Scholar]
- Zhao, J.; Huang, S.; Cole, J.M. OpticalBERT and OpticalTable-SQA: Text-and table-based language models for the optical-materials domain. J. Chem. Inf. Model. 2023, 63, 1961–1981. [Google Scholar]
- Deng, X.; Sun, H.; Lees, A.; Wu, Y.; Yu, C. Turl: Table understanding through representation learning. ACM SIGMOD Rec. 2022, 51, 33–40. [Google Scholar]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar]
- Lenat, D.; Marcus, G. Getting from generative ai to trustworthy ai: What llms might learn from cyc. arXiv 2023, arXiv:2308.04445. [Google Scholar]
- Mallen, A.; Asai, A.; Zhong, V.; Das, R.; Khashabi, D.; Hajishirzi, H. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv 2022, arXiv:2212.10511. [Google Scholar]
- Zhang, M.; Press, O.; Merrill, W.; Liu, A.; Smith, N.A. How language model hallucinations can snowball. arXiv 2023, arXiv:2305.13534. [Google Scholar]
- Mialon, G.; Dessì, R.; Lomeli, M.; Nalmpantis, C.; Pasunuru, R.; Raileanu, R.; Rozière, B.; Schick, T.; Dwivedi-Yu, J.; Celikyilmaz, A.; et al. Augmented language models: A survey. arXiv 2023, arXiv:2302.07842. [Google Scholar]
- Mruthyunjaya, V.; Pezeshkpour, P.; Hruschka, E.; Bhutani, N. Rethinking language models as symbolic knowledge graphs. arXiv 2023, arXiv:2308.13676. [Google Scholar]
- Strich, J. Adapt LLM for Multi-turn Reasoning QA using Tidy Data. In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), Abu Dhabi, United Arab Emirates, 19–20 January 2025; pp. 392–400. [Google Scholar]
- Martynova, A.; Tishin, V.; Semenova, N. Learn Together: Joint Multitask Finetuning of Pretrained KG-enhanced LLM for Downstream Tasks. In Proceedings of the Workshop on Generative AI and Knowledge Graphs (GenAIK), Abu Dhabi, United Arab Emirates, 19 January 2025; pp. 13–19. [Google Scholar]
- Zhang, W.; Wang, Y.; Song, Y.; Wei, V.J.; Tian, Y.; Qi, Y.; Chan, J.H.; Wong, R.C.W.; Yang, H. Natural Language Interfaces for Tabular Data Querying and Visualization: A Survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 6699–6718. [Google Scholar] [CrossRef]
- Zhao, Y.; Chen, L.; Cohan, A.; Zhao, C. TaPERA: Enhancing faithfulness and interpretability in long-form table QA by content planning and execution-based reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 12824–12840. [Google Scholar]
- Ye, Y.; Hui, B.; Yang, M.; Li, B.; Huang, F.; Li, Y. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, China, 23–27 July 2023; pp. 174–184. [Google Scholar]
- He, X.; Zhou, M.; Xu, X.; Ma, X.; Ding, R.; Du, L.; Gao, Y.; Jia, R.; Chen, X.; Han, S.; et al. Text2analysis: A benchmark of table question answering with advanced data analysis and unclear queries. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18206–18215. [Google Scholar]
- Li, P.; He, Y.; Yashar, D.; Cui, W.; Ge, S.; Zhang, H.; Rifinski Fainman, D.; Zhang, D.; Chaudhuri, S. Table-gpt: Table fine-tuned gpt for diverse table tasks. Proc. Acm Manag. Data 2024, 2, 1–28. [Google Scholar]
- Sui, Y.; Zhou, M.; Zhou, M.; Han, S.; Zhang, D. Table meets llm: Can large language models understand structured table data? A benchmark and empirical study. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Virtual, 4–8 March 2024; pp. 645–654. [Google Scholar]
- Rajkumar, N.; Li, R.; Bahdanau, D. Evaluating the Text-to-SQL Capabilities of Large Language Models. arXiv 2022, arXiv:2204.00498. [Google Scholar]
- Fan, J.; Gu, Z.; Zhang, S.; Zhang, Y.; Chen, Z.; Cao, L.; Li, G.; Madden, S.; Du, X.; Tang, N. Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL. Proc. VLDB Endow. 2024, 17, 2750–2763. [Google Scholar] [CrossRef]
- Zhang, T.; Yue, X.; Li, Y.; Sun, H. TableLlama: Towards Open Large Generalist Models for Tables. arXiv 2024, arXiv:2311.09206. [Google Scholar]
- Chen, Y.; Yuan, Y.; Zhang, Z.; Zheng, Y.; Liu, J.; Ni, F.; Hao, J.; Mao, H.; Zhang, F. SheetAgent: Towards A Generalist Agent for Spreadsheet Reasoning and Manipulation via Large Language Models. arXiv 2025, arXiv:2403.03636. [Google Scholar]
- Lavrinovics, E.; Biswas, R.; Bjerva, J.; Hose, K. Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective. J. Web Semant. 2025, 85, 100844. [Google Scholar]
- Guan, X.; Liu, Y.; Lin, H.; Lu, Y.; He, B.; Han, X.; Sun, L. Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18126–18134. [Google Scholar]
- Ledger, G.; Mancinni, R. Detecting llm hallucinations using monte carlo simulations on token probabilities. Authorea Prepr. 2024; preprint. [Google Scholar]
- Du, X.; Xiao, C.; Li, S. Haloscope: Harnessing unlabeled llm generations for hallucination detection. Adv. Neural Inf. Process. Syst. 2025, 37, 102948–102972. [Google Scholar]
- Fairburn, S.; Ainsworth, J. Mitigate large language model hallucinations with probabilistic inference in graph neural networks. Authorea Prepr. 2024; preprint. [Google Scholar]
- Taveekitworachai, P.; Abdullah, F.; Thawonmas, R. Null-shot prompting: Rethinking prompting large language models with hallucination. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 13321–13361. [Google Scholar]
- Qiu, Y. The Impact of LLM Hallucinations on Motor Skill Learning: A Case Study in Badminton. IEEE Access 2024, 12, 139669–139682. [Google Scholar] [CrossRef]
- Waldo, J.; Boussard, S. GPTs and Hallucination: Why do large language models hallucinate? Queue 2024, 22, 19–33. [Google Scholar]
- Li, S.; Ji, H.; Han, J. Document-Level Event Argument Extraction by Conditional Generation. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Online, 6–11 June 2021. [Google Scholar]
- Chen, W.; Wang, H.; Chen, J.; Zhang, Y.; Wang, H.; Li, S.; Zhou, X.; Wang, W.Y. Tabfact: A large-scale dataset for table-based fact verification. arXiv 2019, arXiv:1909.02164. [Google Scholar]
- Srivastava, P.; Ganu, T.; Guha, S. Towards Zero-Shot and Few-Shot Table Question Answering using GPT-3. arXiv 2022, arXiv:2210.17284. [Google Scholar]
- Ren, Y.; Yu, C.; Li, W.; Li, W.; Zhu, Z.; Zhang, T.; Qin, C.; Ji, W.; Zhang, J. TableGPT: A novel table understanding method based on table recognition and large language model collaborative enhancement. Appl. Intell. 2025, 55, 311. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903. [Google Scholar]
- Zhang, H.; Si, S.; Zhao, Y.; Xie, L.; Xu, Z.; Chen, L.; Nan, L.; Wang, P.; Tang, X.; Cohan, A. OpenT2T: An Open-Source Toolkit for Table-to-Text Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Miami, FL, USA, 12–16 November 2024; Hernandez Farias, D.I., Hope, T., Li, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 259–269. [Google Scholar] [CrossRef]
System Prompt

role: You are a table reasoning expert who strictly follows action definitions to process tabular data and selects the next action based on the established operation chain.
rule: Directly output the next action and check the progress status of the current action chain.
instruction: Using the Action Definition, analyze the table data and select the next action to generate from the Candidate Actions based on the established action chain.

Action Definition:
If the table only needs a few rows to tell whether the statement is True or False, we use select_row() to select these rows for it. For example: table caption : jeep grand cherokee. col : years | displacement | engine | row 1 : 1999-2004 | 4.0l (242cid) | power tech i6 | row 2 : 1999-2004 | 4.7l (287cid) | powertech v8 | row 3 : 2002-2004 | 4.7l (287cid) | high output powertech v8 | row 4 : 1999-2001 | 3.1l diesel | 531 ohv diesel i5 | row 5 : 2002-2004 | 2.7l diesel | om647 diesel i5 | Question: The Jeep Grand Cherokee with the OM647 Diesel I5 had the third lowest numbered displacement.
Function: select_row(1, 4, 5)
Explanation: The statement wants to check if the OM647 Diesel I5 had the third lowest numbered displacement. We need to know the first three low-numbered displacements and all rows where the engine is OM647 Diesel I5. We select rows 1, 4, and 5.
{Other Action Definition…}

Operation Chain Reasoning

table caption : jeep grand cherokee. col : years | displacement | engine | row 1 : 1999-2004 | 4.0l (242cid) | power tech i6 | row 2 : 1999-2004 | 4.7l (287cid) | powertech v8 | row 3 : 2002-2004 | 4.7l (287cid) | high output powertech v8 | row 4 : 1999-2001 | 3.1l diesel | 531 ohv diesel i5 | row 5 : 2002-2004 | 2.7l diesel | om647 diesel i5 | Question: The Jeep Grand Cherokee with the OM647 Diesel I5 had the third lowest numbered displacement.
Candidate Actions: add_col, select_col, select_where_row, group_by, sort_by, retrieve_from_repo
Chain: select_col(power, years) -> group_by(power) ->
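The select_row action from the prompt above can be grounded with a short pandas sketch. The DataFrame mirrors the example table; the one-line `select_row` implementation is an assumption, since the prompt only defines the action's intent.

```python
import pandas as pd

# The jeep grand cherokee table from the prompt, as a DataFrame.
table = pd.DataFrame({
    "years": ["1999-2004", "1999-2004", "2002-2004", "1999-2001", "2002-2004"],
    "displacement": ["4.0l (242cid)", "4.7l (287cid)", "4.7l (287cid)",
                     "3.1l diesel", "2.7l diesel"],
    "engine": ["power tech i6", "powertech v8", "high output powertech v8",
               "531 ohv diesel i5", "om647 diesel i5"],
})

def select_row(table, *rows):
    # The prompt numbers rows from 1; pandas positions are 0-indexed.
    return table.iloc[[r - 1 for r in rows]]

subtable = select_row(table, 1, 4, 5)
print(subtable["engine"].tolist())
# ['power tech i6', '531 ohv diesel i5', 'om647 diesel i5']
```

Executing the action as real code, rather than asking the model to reason over the full serialized table, is what lets the Worker hand the Planner a verifiably smaller table state at each step.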
Dataset | Train | Test | For Fine-Tuning |
---|---|---|---|
TabFact | 13,182 | 1695 | 13,182 |
WikiTQ | 11,321 | 4344 | 11,321 |
FeTaQA | 7033 | 2000 | 0 |
Method | TabFact (acc) | WikiTQ (acc) | FeTaQA (BLEU) | FeTaQA (ROUGE-1)
---|---|---|---|---
Generic Reasoning | | | |
gpt-3.5-16k-0613 0-shot | 70.45 | 51.84 | 27.67 | 0.61
gpt-3.5-16k-0613 2-shot | 71.24 | 52.32 | - | -
gpt-3.5-16k-0613 4-shot | 71.54 | 52.56 | - | -
Text-to-SQL gpt-3.5 | 64.71 | 52.90 | - | -
Dater LLaMA2-13b | 65.12 | 41.44 | 29.37 | 0.63
Chain-of-Thought gpt-3.5 | 65.37 | 52.48 | - | -
SLM Reasoning | | | |
Qwen2-7b Zero-Shot | 60.21 | 45.84 | - | -
Llama3-8b Zero-Shot | 61.34 | 46.78 | - | -
TQAgent Methods | | | |
Qwen2-7b + TQAgent | 64.43 | 46.23 | - | -
Qwen2-7b-Instruct + TQAgent | 68.42 | 49.72 | - | -
Llama3-8b + TQAgent | 69.71 | 48.81 | 27.47 | 0.61
Llama3-8b-Instruct + TQAgent | 73.60 | 52.94 | 28.62 | 0.61
Samples (n) | TabFact (acc) | WikiTQ (acc) | FeTaQA (BLEU) | FeTaQA (ROUGE-1)
---|---|---|---|---
n = 1 | 64.43 | 46.23 | 26.32 | 0.55
n = 2 | 66.25 | 47.98 | 27.15 | 0.57
n = 3 | 69.22 | 50.78 | 28.21 | 0.60
n = 4 | 73.54 | 52.89 | 28.58 | 0.61
n = 5 | 73.59 | 52.90 | 28.61 | 0.61
n = 6 | 73.60 | 52.94 | 28.62 | 0.61
Method | TabFact (acc) | WikiTQ (acc) | FeTaQA (BLEU) | FeTaQA (ROUGE-1)
---|---|---|---|---
TQAgent without knowledge injection | 69.71 | 48.81 | 27.47 | 0.61
TQAgent with knowledge injection | 73.60 | 52.94 | 28.62 | 0.61
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhao, J.; Zhang, P.; Wang, Y.; Xin, R.; Lu, X.; Li, R.; Lyu, S.; Ou, Z.; Song, M. TQAgent: Enhancing Table-Based Question Answering with Knowledge Graphs and Tree-Structured Reasoning. Appl. Sci. 2025, 15, 3788. https://doi.org/10.3390/app15073788