A Retrieval Augmentation Self-Distillation Method for Math Word Problem Solving
Abstract
1. Introduction
- We propose RASD, which augments math word problem training by retrieving and fusing similar samples.
- We introduce a self-distillation objective with a consistency constraint to ensure reasoning consistency between the original and augmented representations.
- Extensive experiments show the effectiveness and generalization of our RASD.
2. Related Work
2.1. Deep Learning-Based Math Word Problem Solving
2.2. Data Augmentation
3. Methodology
3.1. Preliminaries
3.2. Hidden Space-Based Retrieval Augmentation
3.3. Consistent Reasoning with Self-Distillation Learning
Algorithm 1 MWP solver training with RASD

Input: training dataset. Output: model parameters.
1: Initialize the model parameters.
2: while not converged do
3:   Randomly sample a training instance;
4:   Encode it to obtain its hidden representation;
5:   Retrieve the most similar sample from the candidate set according to Equation (2);
6:   Construct the augmented sample according to Equation (3);
7:   Decode the output distribution of the solution equation from the original representation;
8:   Decode the output distribution of the solution equation from the augmented representation;
9:   Calculate the self-distillation consistency loss according to Equation (4);
10:  Calculate the NLL objectives for the original and augmented representations according to Equations (5) and (6);
11:  Update the model parameters by minimizing the total loss in Equation (7).
12: end while
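The core operations of the training loop above can be sketched as follows. This is a minimal illustration, assuming cosine similarity for the hidden-space retrieval (Equation (2)), convex interpolation for constructing the augmented representation (Equation (3)), and a KL term for the consistency constraint (Equation (4)); the function names and the interpolation weight `lam` are illustrative, not from the paper.

```python
import numpy as np

def retrieve_most_similar(h, candidates):
    """Pick the candidate hidden vector closest to h under cosine similarity.
    A stand-in for the paper's Equation (2); the exact similarity is an assumption."""
    sims = candidates @ h / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(h) + 1e-9)
    return candidates[int(np.argmax(sims))]

def fuse_representations(h, h_retrieved, lam=0.7):
    """Build the augmented representation as a convex combination
    (an assumed form of Equation (3))."""
    return lam * h + (1.0 - lam) * h_retrieved

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between the two decoded output distributions,
    i.e. the self-distillation consistency term (Equation (4))."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))
```

In each iteration, the model would retrieve a neighbor for the sampled problem, fuse the two hidden representations, decode both the original and fused representations, and penalize divergence between the two output distributions alongside the NLL objectives.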
4. Experiments
4.1. Datasets, Baselines, and Metrics
4.2. Implementation Details
4.3. Overall Results
4.4. Ablation Study on Two Components of RASD
4.5. Ablation Study on Different Data Ratios
4.6. Ablation Study on Different Expression Lengths
4.7. Ablation Study on Different Loss Weight Coefficients
4.8. Case Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
MWP | Math Word Problem
AI | Artificial Intelligence
AGI | Artificial General Intelligence
GWP | Geometry Word Problem
NLP | Natural Language Processing
KL | Kullback–Leibler
NLL | Negative Log-Likelihood
References
- Wang, Y.; Liu, X.; Shi, S. Deep Neural Solver for Math Word Problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 845–854.
- Huang, D.; Shi, S.; Lin, C.Y.; Yin, J.; Ma, W.Y. How Well Do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 887–896.
- Koncel-Kedziorski, R.; Roy, S.; Amini, A.; Kushman, N.; Hajishirzi, H. MAWPS: A Math Word Problem Repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1152–1157.
- Qin, J.; Lin, L.; Liang, X.; Zhang, R.; Lin, L. Semantically-Aligned Universal Tree-Structured Solver for Math Word Problems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 3780–3789.
- Miao, S.Y.; Liang, C.C.; Su, K.Y. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 975–984.
- Patel, A.; Bhattamishra, S.; Goyal, N. Are NLP Models Really Able to Solve Simple Math Word Problems? arXiv 2021, arXiv:2103.07191.
- Qin, J.; Liang, X.; Hong, Y.; Tang, J.; Lin, L. Neural-Symbolic Solver for Math Word Problems with Auxiliary Tasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; Volume 1, pp. 5870–5881.
- Qin, J.; Huang, Z.; Zeng, Y.; Zhang, Q.; Lin, L. An Introspective Data Augmentation Method for Training Math Word Problem Solvers. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3113–3127.
- Xie, Z.; Sun, S. A Goal-Driven Tree-Structured Neural Model for Math Word Problems. In Proceedings of the International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019.
- Zhang, J.; Wang, L.; Lee, K.W.; Bin, Y.; Lim, E.P. Graph-to-Tree Learning for Solving Math Word Problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020.
- Tsai, S.H.; Liang, C.C.; Wang, H.M.; Su, K.Y. Sequence to General Tree: Knowledge-Guided Geometry Word Problem Solving. arXiv 2021, arXiv:2106.00990.
- Jie, Z.; Li, J.; Lu, W. Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction. arXiv 2022, arXiv:2203.10316.
- Jayasinghe, I.; Ranathunga, S. Two-Step Memory Networks for Deep Semantic Parsing of Geometry Word Problems. In Proceedings of the International Conference on Current Trends in Theory and Practice of Informatics, Limassol, Cyprus, 20–24 January 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 676–685.
- Zhang, X.; Zhao, J.; LeCun, Y. Character-Level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015.
- Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. arXiv 2019, arXiv:1901.11196.
- Liu, Q.; Guan, W.; Li, S.; Cheng, F.; Kawahara, D.; Kurohashi, S. RODA: Reverse Operation Based Data Augmentation for Solving Math Word Problems. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 1–11.
- Anaby-Tavor, A.; Carmeli, B.; Goldbraich, E.; Kantor, A.; Kour, G.; Shlomov, S.; Tepper, N.; Zwerdling, N. Do Not Have Enough Data? Deep Learning to the Rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7383–7390.
- Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. arXiv 2017, arXiv:1710.09412.
- Ramé, A.; Sun, R.; Cord, M. MixMo: Mixing Multiple Inputs for Multiple Outputs via Deep Subnetworks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 823–833.
- Guo, H.; Mao, Y.; Zhang, R. Augmenting Data with Mixup for Sentence Classification: An Empirical Study. arXiv 2019, arXiv:1905.08941.
- Chen, J.; Yang, Z.; Yang, D. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification. arXiv 2020, arXiv:2004.12239.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27.
- Chen, J.; Tang, J.; Qin, J.; Liang, X.; Liu, L.; Xing, E.P.; Lin, L. GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning. arXiv 2021, arXiv:2105.14517.
- Zhang, Y.; Zhou, G.; Xie, Z.; Huang, J.X. Number-Enhanced Representation with Hierarchical Recursive Tree Decoding for Math Word Problem Solving. Inf. Process. Manag. 2024, 61, 103585.
- Chen, J.; Li, T.; Qin, J.; Lu, P.; Lin, L.; Chen, C.; Liang, X. UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 3313–3323.
- Jain, N.; Chiang, P.Y.; Wen, Y.; Kirchenbauer, J.; Chu, H.M.; Somepalli, G.; Bartoldson, B.R.; Kailkhura, B.; Schwarzschild, A.; Saha, A.; et al. NEFTune: Noisy Embeddings Improve Instruction Finetuning. arXiv 2023, arXiv:2310.05914.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
Method | Key Characteristics
---|---
Seq2Seq models | Direct translation of problems into equation templates
Tree-structured models (GTS, Graph2Tree) | Better capture of expression semantics
Paraphrasing-based augmentation | Creates diverse samples via synonym replacement and sentence restructuring
Noising-based augmentation | Introduces noise to improve model robustness
Sampling-based augmentation (Mixup) | Generates new data by interpolating word/sentence embeddings
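The sampling-based row above can be made concrete with a short sketch of mixup-style interpolation over embeddings. The Beta(α, α) sampling of the mixing coefficient follows the original mixup recipe; the function name and default α are illustrative.

```python
import numpy as np

def mixup_embeddings(x1, x2, y1, y2, alpha=0.2, rng=None):
    """Interpolate two embeddings (and their labels) with lambda ~ Beta(alpha, alpha),
    as in mixup; for text this is applied to word/sentence embeddings."""
    rng = rng if rng is not None else np.random.default_rng(0)
    lam = float(rng.beta(alpha, alpha))
    x = lam * x1 + (1.0 - lam) * x2   # convex combination of inputs
    y = lam * y1 + (1.0 - lam) * y2   # matching combination of labels
    return x, y, lam
```

Small α concentrates lambda near 0 or 1, so most mixed samples stay close to one of the two originals, which is why mixup tends to act as a mild regularizer rather than producing wholly synthetic data.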
Model | MAWPS | Math23K | ASDiv-A | SVAMP |
---|---|---|---|---|
GTS + RoBERTa | ||||
+NEFTune | ||||
+Mixup | ||||
+IDAM | ||||
+RASD | 88.7(↑0.2)±0.2 | 76.4(↑0.7)±0.3 | 81.7(↑0.5)±0.8 | 41.9(↑0.9)±0.6 |
Graph2Tree + RoBERTa | ||||
+NEFTune | ||||
+Mixup | ||||
+IDAM | ||||
+RASD | 89.3(↑0.6)±0.3 | 78.5(↑1.1)±1.1 | 83.1(↑0.9)±0.3 | 46.3(↑2.5)±1.0 |
DeductReasoner + RoBERTa | ||||
+NEFTune | ||||
+Mixup | ||||
+IDAM | ||||
+RASD | 92.7(↑0.7)±0.2 | 86.8(↑0.7)±0.4 | 83.8(↑0.8)±0.5 | 46.7(↑1.7)±0.9 |
NERHRT + RoBERTa | | | - | -
+NEFTune | | | - | -
+Mixup | | | - | -
+IDAM | | | - | -
+RASD | 92.4(↑1.0)±0.5 | 87.7(↑0.5)±0.7 | - | - |
Model | NGS (All) | NGS (Angle) | NGS (Length) | Geoformer (All) | Geoformer (Angle) | Geoformer (Length)
---|---|---|---|---|---|---
NoAug | ||||||
+NEFTune | ||||||
+Mixup | ||||||
+IDAM | ||||||
+RASD | 62.0(↑1.3)±1.3 | 72.4(↑0.9)±0.9 | 52.4(↑3.6)±1.5 | 64.0(↑3.7)±1.1 | 76.5(↑5.0)±1.2 | 50.5(↑1.4)±0.9 |
Model | MAWPS | ASDiv-A | SVAMP | Math23K |
---|---|---|---|---|
NoAug | ||||
+RA | 92.5(↑0.5)±0.5 | 83.2(↑0.2)±0.8 | 45.5(↑0.5)±0.5 | 86.6(↑0.5)±0.9 |
+SD | 92.4(↑0.4)±1.1 | 83.6(↑0.6)±1.0 | 45.1(↑0.1)±1.3 | 86.2(↑0.1)±0.8 |
+RASD | 92.7(↑0.7)±0.2 | 83.8(↑0.8)±0.5 | 46.7(↑1.7)±0.9 | 86.8(↑0.7)±0.4
Data Ratio | 20% | 40% | 60% | 80% | 100% |
---|---|---|---|---|---|
w/o RASD | 68.7 | 77.0 | 82.5 | 84.7 | 86.1 |
w/RASD | 70.1(↑1.4) | 78.2(↑1.2) | 83.4(↑0.9) | 85.2(↑0.5) | 86.8(↑0.7) |
Equation Length | 3 | 5 | 7 | 9 |
---|---|---|---|---|
Train | 4397 | 11,001 | 4406 | 1349 |
Test | 173 | 522 | 191 | 66 |
Value accuracy (%) by equation length:

Equation Length | 3 | 5 | 7 | 9
---|---|---|---|---
w/o RASD | 93.64 | 91.95 | 78.53 | 58.90 |
w/RASD | 95.38(↑1.74) | 91.95 | 78.53 | 61.64(↑2.74) |
Value accuracy (%) by loss weight coefficient:

Weight | 0.1 | 0.3 | 0.5 | 1 | 3 | 5
---|---|---|---|---|---|---
w/o RASD | 91.4 | |||||
w/RASD | 91.3(↓0.1) | 91.9(↑0.5) | 92.4(↑1.0) | 91.6(↑0.2) | 91.7(↑0.3) | 91.6(↑0.2) |
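The weight ablation above varies the coefficient on the consistency term of the total objective. A minimal sketch of how such a weighted combination might look is given below; the additive decomposition is an assumption based on Algorithm 1, and the function name is hypothetical, since Equation (7)'s exact formulation is not reproduced here.

```python
def rasd_total_loss(nll_original, nll_augmented, kl_consistency, weight=0.5):
    """Combine the two NLL objectives with a weighted KL consistency term.
    `weight` corresponds to the coefficient varied in the ablation (0.1-5),
    where 0.5 gave the best value accuracy in the table above."""
    return nll_original + nll_augmented + weight * kl_consistency
```

A too-small weight under-enforces consistency between the original and augmented reasoning paths, while a too-large weight dominates the NLL terms, which is consistent with the peak at 0.5 in the ablation.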
Case 1: The radius of a circle is 3 cm. If its radius is extended by 2 cm, by how much will the area increase? |
Before augmentation: (✗) | After augmentation: (✓) |
Case 2: There are 9 apple trees and 7 pear trees in an orchard. About 160 kilograms of apples can be picked from each apple tree. Approximately how many kilograms of apples can be picked in this orchard? |
Before augmentation: (✗) | After augmentation: (✓) |
Case 3: Ticket sales for the various competitions of the 2008 Olympic Games are booming. The minimum ticket price for a handball competition is 30 yuan, and the minimum ticket price for a swimming competition is 20 yuan less than four times that of a handball competition. How much more expensive is the minimum ticket price for a swimming competition than for a handball competition? |
Before augmentation: (✗) | After augmentation: (✓) |
Case 1: A herder raises 450 sheep, 3/5 of which are goats. Now 10 more goats are bought. What fraction of the sheep are goats now? |
Pred: (✗) | True: |
Case 2: A 12,000 m highway is to be built. The original plan was to build 300 m per day, but the task was actually completed in 30 days. How many more meters were built per day than planned? |
Pred: (✗) | True: |
Case 3: A car travels from place A to place B, having covered 4/5 of the whole journey. Of the remaining distance, 70% is uphill and the rest is downhill. The downhill section is 3 kilometers long. How far is it from A to B? |
Pred: (✗) | True: |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wu, X.; Qin, J.; Yang, Z. A Retrieval Augmentation Self-Distillation Method for Math Word Problem Solving. Electronics 2025, 14, 3425. https://doi.org/10.3390/electronics14173425