A Retrieval Augmentation Self-Distillation Method for Math Word Problem Solving

Wu, Xiaoqi; Qin, Jinghui; Yang, Zhijing

doi:10.3390/electronics14173425


by Xiaoqi Wu, Jinghui Qin * and Zhijing Yang

School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(17), 3425; https://doi.org/10.3390/electronics14173425

Submission received: 23 July 2025 / Revised: 24 August 2025 / Accepted: 25 August 2025 / Published: 27 August 2025


Abstract

Solving math word problems automatically is a critical task in the field of natural language processing. Due to the insufficient size of existing MWP datasets, recent models have reached a performance bottleneck. Large-scale and high-quality training examples are crucial for training a robust math solver, but existing high-quality datasets have limited scale, and annotating or synthesizing vast MWPs explicitly is highly expensive. To address these issues, we propose a novel hidden space-based retrieval augmentation self-distillation method, named RASD, to improve the mathematical reasoning performance of MWP solvers with semantic representation augmentation and self-distillation learning. RASD enhances problem representations by retrieving and merging similar ones. It then inputs both the original and augmented representations into the decoder for solution reasoning. A self-distillation objective is used to maintain reasoning consistency between them. Extensive experiments on five popular math word problem-solving benchmarks, including MAWPS, Math23K, ASDiv-A, SVAMP, and GeoQA, show the effectiveness and universality of our RASD on improving the math reasoning ability of multiple popular baseline solvers.

Keywords: word problem solving; data augmentation; neural network reasoning; semantic understanding

1. Introduction

Solving math word problems (MWPs) automatically is a crucial task for evaluating the reasoning ability of an artificial intelligence (AI) model. Recently, developing an MWP solver that can derive mathematical equations automatically and soundly has attracted increasing attention in both industry and academia. This focus is significant for two reasons: it probes a machine’s ability to understand and reason through complex math problems like a human, and it poses the major challenge of deriving logical conclusions from limited textual information without extra commonsense or professional knowledge. As Figure 1 shows, humans solve these problems by first comprehending the semantics of the text, then deducing the solution expression, and finally arriving at the answer. In this procedure, people may use commonsense or professional knowledge, such as a theorem about parallel lines, to help with solution deduction. This deduction procedure poses a great challenge to AI models. Nevertheless, a robust MWP solver can advance progress toward artificial general intelligence (AGI) and enable AI applications such as AI-assisted math education and other tasks involving mathematical reasoning, including question answering, natural language inference, and visual reasoning.

To acquire these abilities, multiple math word problem-solving datasets have been proposed, such as Math23K [1], Dolphin18K [2], MAWPS [3], HMWP [4], ASDiv-A [5], SVAMP [6], CM17K [7], and APE210K. However, these datasets are not large enough to train a robust solver, since the performance of a data-driven solver depends mainly on data volume. Additionally, the pioneering work [8] pointed out that collecting a large-scale, high-quality MWP dataset is very expensive because it requires professional knowledge and massive accessible data are lacking. They also showed that traditional input-based data augmentation methods, which generate diverse examples by rewriting the original ones, are ill-suited to training an MWP solver: MWPs are concise yet comprehensive, so rewriting the input is prone to disturbing the problem semantics and making the problem confusing. Therefore, developing a cost-effective method that can augment the training of an MWP solver is both challenging and necessary.

To address these challenges, we propose a simple yet effective retrieval augmentation self-distillation (RASD) method. RASD enhances MWP training by retrieving and fusing similar samples in the latent space. During network training, RASD retrieves the most similar sample based on cosine distance among samples in the same batch using their latent feature representations. It then creates an augmented representation by fusing the most similar sample’s representation with the original sample’s representation. To ensure reasoning consistency, a new self-distillation objective with a consistency constraint is introduced. This objective maintains mathematical dependency consistency between the original word problem and the generated ones, enabling a more stable MWP solver capable of reasoning out correct solution expressions. The main contributions of this work are as follows:

We propose RASD, which augments math word problem training by retrieving and fusing similar samples.
We introduce a self-distillation objective with a consistency constraint to ensure the reasoning consistency.
Extensive experiments show the effectiveness and generalization of our RASD.

2. Related Work

2.1. Deep Learning-Based Math Word Problem Solving

In recent years, deep learning has made significant advancements in mathematical reasoning tasks, enhancing our understanding of logical thinking in machines. Numerous neural network architectures have been introduced to address mathematical reasoning. A concise overview of key model characteristics is provided in Table 1. Wang et al. [1] pioneered the application of deep learning to MWP solving by employing a Seq2seq model to directly translate MWPs into equation templates. Nevertheless, Seq2seq-based models fail to explicitly leverage the structural information inherent in expressions. Subsequently, more advanced networks emerged that explicitly generate tree-structured expressions through tree-like decoders. Xie et al. [9] proposed a goal-driven tree-structured neural model, GTS, to generate expression trees for MWPs. To further extract semantic structural information, Graph2Tree [10] combines a graph-based encoder with a tree-based decoder, utilizing structural semantics from problem texts to enhance expression reasoning. This approach captures the relationships and order information between entities in math problems. Furthermore, Tsai et al. [11] proposed a sequence-to-general-tree model that solves GWPs (geometry word problems) by learning to map the problem text into a cross-domain operation tree containing different formulas. Jie et al. [12] argued that most Seq2Seq-based or Seq2Tree-based methods lack reasoning interpretability for the generated expressions, so they proposed an MWP-solving method that constructs target expressions iteratively through interpretable deductive reasoning steps. Jayasinghe et al. [13] proposed a two-step memory neural network for parsing the deep semantics of GWPs.

Orthogonal to all the above methods that aim to design a novel neural network or introduce knowledge to improve the performance of an MWP solver, we develop a simple yet effective retrieval augmentation self-distillation method to improve the solving performance of an MWP solver by embedding it into these models with minor training process modifications.

2.2. Data Augmentation

Data augmentation enhances data efficiency in NLP (natural language processing) by modifying existing data using prior knowledge to generate new data. It can be mainly divided into three categories: paraphrasing-based methods, noising-based methods, and sampling-based methods. A brief description of the categories is shown in Table 1. From the perspective of paraphrasing, Zhang et al. [14] proposed applying thesauruses to paraphrase some words in the sentences to create augmented data. Wei et al. [15] proposed easy data augmentation techniques that boost performance via four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. Liu et al. [16] proposed a reverse operation-based data augmentation method, RODA, which augments math word problems by paraphrasing an MWP according to the translated expression. From the perspective of sampling, Anaby-Tavor et al. [17] proposed a sampling-based data augmentation method, LAMBADA, which generates new data by fine-tuning a state-of-the-art language generator and discriminator. From the perspective of noising, hidden space data augmentation is a common approach. Zhang et al. [18] proposed a simple learning principle named mixup, which trains a neural network on convex combinations of pairs of examples and their labels, regularizing the network to favor simple linear behavior. Rame et al. [19] proposed MixMo, a new framework for learning multi-input multi-output deep subnetworks that replaces the suboptimal summing operation hidden in previous approaches with a more appropriate mixing mechanism. Guo et al. [20] proposed two strategies for adapting Mixup to natural language processing: one performs interpolation on word embeddings, and the other on sentence embeddings. Chen et al. [21] proposed TMix, which creates a large number of augmented training samples by interpolating text in the hidden space. Different from these methods, we develop a simple yet effective hidden space data augmentation method based on retrieval augmentation self-distillation, which first creates new data by mixing the most similar samples in the same batch and then requires the outputs of the augmented sample and the original sample to be consistent.

3. Methodology

3.1. Preliminaries

In general, as shown in Figure 1, math word problem solving often takes a problem text as input and reasons out the intermediate solution equation to obtain the final answer. The commonly used architecture is a sequence-to-sequence architecture [22]. An MWP solver M can be modeled formally as follows:

$$M(I, \Theta) = \mathrm{Dec}(\mathrm{Enc}(P)), \tag{1}$$

where $I$ denotes the input to the MWP solver, $P$ corresponds to the problem text, and $\Theta$ represents the model parameters. The encoder is denoted by $\mathrm{Enc}$ and the decoder by $\mathrm{Dec}$. The decoder takes the feature output of the encoder as input and reasons out the grounded solving program or symbols.

3.2. Hidden Space-Based Retrieval Augmentation

Given an MWP dataset, we process the training data in batches. Before feeding the data into the model, we perform essential preprocessing steps. These include tokenization, which divides the problem text into smaller semantic units, and word embedding, which generates high-quality word vectors for better model understanding. Each batch $D = \{P_i\}_{i=1}^{B}$ contains $B$ samples. For the problem text $P_i = \{x_j\}_{j=1}^{n}$, each token $x_j$ is initially embedded into its corresponding word embedding. As shown in Figure 2, we feed the sequence of word embeddings into the encoder to obtain the sequence of hidden states. To effectively capture the key information of the input sequence for decoding, the final hidden state $h_i$ is selected as the feature representation of the input problem $P_i$. By applying the aforementioned procedure to the input problems within a single batch, we obtain a set of original MWP features $H = \{h_i\}_{i=1}^{B}$.

To simulate a range of potential perturbations and variations, and thus enhance the model’s stability in the face of such disturbances without introducing overly drastic changes, we perform retrieval augmentation within the latent space. For each MWP feature $h_i$, we perform a retrieval augmentation process with a set of candidate MWP features $H_c = \{h_c\}_{c=1}^{n}$ to construct an augmented problem feature $h_i'$. The candidate set $H_c$ can consist of data within the current batch, data across the current and all previous batches, the entire dataset, or a collection that continually incorporates newly generated augmented data throughout the training process. In this work, we use the data within the current batch as the candidate set $H_c$. As shown in Equation (2), for each MWP feature $h_i$, we adopt the cosine similarity measurement to compute its similarity with the other MWP features within the candidate set $H_c$ one by one, thereby obtaining one similarity value per candidate. This similarity measure calculates the cosine of the angle between two vectors, allowing us to effectively identify and retrieve features that are semantically related to the current feature $h_i$. Subsequently, we sort these values in descending order and retrieve the MWP feature $h_s$, which has the highest similarity to the current feature $h_i$, as the similar MWP feature.

$$\min_{j} \left(1 - \frac{h_i \cdot h_j}{\lVert h_i \rVert \, \lVert h_j \rVert}\right) \quad \text{subject to } j \in \{1, \dots, n\},\ j \neq i, \tag{2}$$

where $\frac{h_i \cdot h_j}{\lVert h_i \rVert \, \lVert h_j \rVert}$ denotes the cosine similarity between two vectors. The optimization problem aims to minimize the cosine distance between a given vector $h_i$ and another vector $h_j$ selected from the candidate set $H_c$. The goal is to find the vector $h_j$ that yields the smallest distance value, indicating the highest similarity to $h_i$. The constraint $j \in \{1, \dots, n\},\ j \neq i$ ensures that $h_j$ is chosen from the candidate set $H_c$ indexed from 1 to $n$, excluding the vector $h_i$ itself, as comparing a vector to itself would trivially result in zero distance.
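The retrieval step in Equation (2) can be illustrated with a minimal pure-Python sketch. It assumes that features are plain lists of floats with non-zero norm and that the candidate set is the current batch, as stated above; the function names `cosine_distance` and `retrieve_most_similar` and the toy values are illustrative and not taken from the authors' implementation.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def retrieve_most_similar(features, i):
    """Index s != i that minimizes the cosine distance to features[i]."""
    candidates = [j for j in range(len(features)) if j != i]
    return min(candidates, key=lambda j: cosine_distance(features[i], features[j]))

# Toy batch of three problem features (illustrative values only).
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
s = retrieve_most_similar(batch, 0)  # index of the nearest neighbour of batch[0]
```

Excluding index `i` from the candidates mirrors the constraint $j \neq i$; without it, every feature would retrieve itself at distance zero.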

To solve an MWP, humans often develop strategies by drawing analogies between new and known problems and identifying shared features. Inspired by this observation, we augment the original MWP feature $h_i$ by mixing it with the retrieved similar MWP feature $h_s$. The specific process involves two main steps. Firstly, for each token in the original MWP feature $h_i$ and its corresponding token in the similar MWP feature $h_s$, we determine whether to mix them. To do this, we generate a binary mask $M$, comprising randomly determined 0 and 1. In this mask $M$, 1 indicates that the corresponding token will be mixed, while 0 signifies no mixing. Concurrently, we create an inverse mask $M'$, which identifies tokens in the original feature $h_i$ that remain unchanged during the generation of the augmented MWP feature $h_i'$. Secondly, for each token selected for mixing, we establish a mixing ratio between the original and similar MWP features. This ratio is determined by a randomly generated coefficient $\alpha$, which ranges between 0 and 1. Specifically, $\alpha$ governs the contribution of the original and similar features in forming the augmented feature $h_i'$. The augmented MWP feature can be mathematically expressed as

$$h_i' = \alpha \cdot M \odot h_i + (1 - \alpha) \cdot M \odot h_s + M' \odot h_i, \tag{3}$$

where $\odot$ denotes element-wise multiplication.
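To make Equation (3) concrete, the following is a small sketch of the masked mixing in pure Python. The text specifies only that the mask is random and that $\alpha$ lies between 0 and 1, so the Bernoulli probability `p = 0.5` and the uniform sampling of `alpha` are assumptions, as are the helper names `sample_mask` and `mix_features`. Taking the inverse mask as $M' = 1 - M$, Equation (3) is equivalent to $h_i' = h_i - (1 - \alpha)\, M \odot (h_i - h_s)$, so positions with mask 0 are copied from $h_i$ unchanged.

```python
import random

def sample_mask(dim, rng, p=0.5):
    """Random binary mask; the mixing probability p is an assumed value."""
    return [1 if rng.random() < p else 0 for _ in range(dim)]

def mix_features(h_i, h_s, alpha, mask):
    """Blend masked positions of h_i with h_s and keep the rest (Equation (3))."""
    mixed = []
    for x, y, m in zip(h_i, h_s, mask):
        inv = 1 - m  # inverse mask M' marks positions left untouched
        mixed.append(alpha * m * x + (1 - alpha) * m * y + inv * x)
    return mixed

rng = random.Random(0)           # fixed seed only for reproducibility of the demo
h_i = [1.0, 2.0, 3.0, 4.0]       # original feature (toy values)
h_s = [0.0, 0.0, 0.0, 0.0]       # retrieved similar feature (toy values)
mask = sample_mask(len(h_i), rng)
alpha = rng.random()             # assumed: uniform in [0, 1)
h_aug = mix_features(h_i, h_s, alpha, mask)
```

With `alpha = 0.5` and mask `[1, 0]`, mixing `[1.0, 2.0]` with `[3.0, 4.0]` gives `[2.0, 2.0]`: the first position is averaged and the second is kept.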

By repeating the aforementioned operations for each MWP feature in the batch, we can obtain a set of augmented MWP features $H' = \{h_i'\}_{i=1}^{B}$. This method enables the model to not only learn the features of the original problem but also to absorb features from other related problems, thereby simulating a more diverse range of problem formulations. The success of this approach is heavily dependent on the quality of the retrieved features. The retrieval quality largely determines the effectiveness of enhanced features, thus affecting the robustness and generalization ability of the model. When the retrieval process yields high-quality, highly relevant features $h_s$, the resulting augmented features $h_i'$ better capture the true semantic information of the problem. This allows the model to learn more meaningful representations and improves its ability to adapt to a variety of problems. Conversely, if the retrieved features are only loosely related to the original problem, the mixed features may introduce excessive noise, which can negatively impact the model’s performance.

3.3. Consistent Reasoning with Self-Distillation Learning

Although retrieval augmentation enriches data diversity, it can still introduce varying degrees of noise interference. To address this noise and the resulting model instability, we propose a self-distillation objective with a consistency constraint. Our self-distillation leverages knowledge from the model’s own output probability distribution $P_\theta(Y_i \mid X_i)$ and integrates it into the optimization objective. This self-supervised approach enables the model to optimize its own performance by leveraging its own knowledge. We also construct a constraint objective centered around the KL (Kullback–Leibler) divergence, a measure of the difference between two probability distributions, to ensure the consistency of mathematical dependencies between the generated MWP and the original ones. This helps the model maintain stable performance and reduces the negative impact of noise or outliers introduced by the retrieval augmentation on the model’s learning process. Specifically, the original and augmented MWP features are processed by the decoder to obtain their respective predicted output probability distributions $P_\theta(Y_i \mid X_i)$ and $P_\theta'(Y_i \mid X_i)$. Although these two distributions originate from the same input data, the variations introduced during the retrieval augmentation process create differences in their probabilistic expressions. To control these differences, we minimize the symmetric KL divergence between the two distributions, forcing them to be consistent with each other, as shown in Equation (4). By employing this constrained objective $L_{con}$, we ensure that the model does not overfit to the augmented data, thereby maintaining semantic consistency between the original and augmented data.

$$L_{con} = D_{KL}\big(P_\theta(Y_i \mid X_i),\, P_\theta'(Y_i \mid X_i)\big) + D_{KL}\big(P_\theta'(Y_i \mid X_i),\, P_\theta(Y_i \mid X_i)\big). \tag{4}$$
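The symmetric KL term in Equation (4) can be sketched as below. The sketch assumes that each output distribution is available as a list of per-step token distributions over a shared vocabulary and that the sequence-level term is the sum over decoding steps; this per-step decomposition, the smoothing constant `eps`, and the function names are assumptions for illustration rather than details given in the text.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(p_steps, q_steps):
    """Symmetric KL, summed over decoding steps (assumed decomposition)."""
    return sum(kl_divergence(p, q) + kl_divergence(q, p)
               for p, q in zip(p_steps, q_steps))

# One decoding step, original vs. augmented prediction (toy values).
p_orig = [[0.9, 0.1]]
p_aug = [[0.5, 0.5]]
l_con = consistency_loss(p_orig, p_aug)
```

The term is zero when the two distributions agree and positive otherwise, and swapping the two arguments leaves it unchanged, which is what makes it a symmetric consistency penalty.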

Additionally, we employ the NLL (negative log-likelihood) learning objective. The NLL objectives for the two distributions $P_\theta(Y_i \mid X_i)$ and $P_\theta'(Y_i \mid X_i)$ can be modeled as

$$L_i = -\log P_\theta(Y_i \mid X_i), \tag{5}$$

and

$$L_{aug} = -\log P_\theta'(Y_i \mid X_i), \tag{6}$$

where $\theta$ represents the model’s parameters. The final training objective $L_{all}$ of our RASD is to minimize the weighted sum of $L_i$, $L_{aug}$, and $L_{con}$:

$$L_{all} = L_i + L_{aug} + \beta \cdot L_{con}, \tag{7}$$

where $\beta$ is the coefficient that controls the weight of $L_{con}$.
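As a rough illustration of Equations (5)–(7), the sketch below computes the two NLL terms from per-step token probabilities and combines them with a consistency term. Factorizing the sequence likelihood into per-step token probabilities is a standard autoregressive assumption that the text does not spell out, and the numeric value passed as `l_con` is a placeholder, not a reported result.

```python
import math

def nll(step_probs, target_ids):
    """Negative log-likelihood of the target tokens under per-step distributions."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

def rasd_loss(l_orig, l_aug, l_con, beta):
    """Total objective: L_all = L_i + L_aug + beta * L_con."""
    return l_orig + l_aug + beta * l_con

# Toy per-step distributions for the original and augmented representations.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
step_probs_aug = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
target = [0, 1]                      # gold token index at each step

l_i = nll(step_probs, target)
l_aug = nll(step_probs_aug, target)
l_all = rasd_loss(l_i, l_aug, l_con=0.05, beta=0.5)  # l_con is a placeholder value
```

Setting `beta` to zero reduces the objective to the two NLL terms alone, which corresponds to training with retrieval augmentation but without the consistency constraint.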

Algorithm 1 outlines the comprehensive training algorithm of our RASD. As introduced before, Lines 3–6 demonstrate retrieval augmentation, where we construct the augmented sample. Lines 7–8 show how to obtain the two output distributions over sequences, $P_\theta(Y_i \mid X_i)$ and $P_\theta'(Y_i \mid X_i)$; Line 9 then calculates the self-distillation objective with the consistency constraint between the two output distributions. Finally, the model parameters are updated based on the loss function of Equation (7). The training procedure continues over the data epochs until convergence.

Algorithm 1 MWP solver training with RASD

Input: Training dataset $D = \{(X_i, Y_i)\}_{i=1}^{n}$.

Output: model parameters $\theta$.

1: Initialize model with parameters $\theta$.

2: while not converged do

3: randomly sample data $(X_i, Y_i) \sim D$;

4: $h_i \leftarrow \mathrm{FinalHiddenState}(E_\theta(X_i))$;

5: retrieve the most similar sample $h_j$ from candidate set $C$ using $h_i$ according to Equation (2);

6: construct the augmented sample $h_i'$ according to Equation (3);

7: decode the output distribution of the solution equation with the original representation: $P_\theta(Y_i \mid X_i) \leftarrow D_\theta(h_i)$;

8: decode the output distribution of the solution equation with the augmented representation: $P_\theta'(Y_i \mid X_i) \leftarrow D_\theta(h_i')$;

9: calculate $L_{con}$ according to Equation (4);

10: calculate the NLL objectives $L_i$ and $L_{aug}$ according to Equations (5) and (6);

11: update the model parameters by minimizing the loss $L_{all}$ in Equation (7).

12: end while
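The forward computation of Lines 4–10 of Algorithm 1 can be sketched end to end in pure Python. This is only a forward-pass illustration under strong simplifying assumptions: the encoder outputs are given directly as `features`, the decoder is a hypothetical callable that emits a single token distribution (one decoding step), the mask is drawn with equal probability for 0 and 1, and the parameter update of Line 11 is omitted because it requires an automatic-differentiation framework.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def cos_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sym_kl(p, q):
    return sum(a * math.log(a / b) + b * math.log(b / a) for a, b in zip(p, q))

def rasd_batch_loss(features, targets, decode, beta, rng):
    """Mean L_all over one batch (forward pass only, Lines 4-10)."""
    total = 0.0
    for i, h_i in enumerate(features):
        # Line 5: nearest neighbour in the batch, excluding the sample itself.
        s = min((j for j in range(len(features)) if j != i),
                key=lambda j: cos_dist(h_i, features[j]))
        # Line 6: masked mixing with a random ratio, as in Equation (3).
        alpha = rng.random()
        mask = [rng.randint(0, 1) for _ in h_i]
        h_aug = [alpha * m * x + (1 - alpha) * m * y + (1 - m) * x
                 for x, y, m in zip(h_i, features[s], mask)]
        # Lines 7-8: decode the original and the augmented representation.
        p, p_aug = decode(h_i), decode(h_aug)
        # Lines 9-10: consistency and NLL terms, combined as in Equation (7).
        l_con = sym_kl(p, p_aug)
        l_i, l_aug = -math.log(p[targets[i]]), -math.log(p_aug[targets[i]])
        total += l_i + l_aug + beta * l_con
    return total / len(features)

# Toy decoder: a fixed 3x2 linear map followed by softmax (illustrative only).
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
decode = lambda h: softmax([sum(w * x for w, x in zip(row, h)) for row in W])

features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
loss = rasd_batch_loss(features, targets=[0, 0, 2], decode=decode,
                       beta=0.5, rng=random.Random(0))
```

In an actual training loop, `loss` would be computed on tensors and backpropagated through both decoder passes; the toy linear decoder here only stands in for a tree or sequence decoder.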

4. Experiments

4.1. Datasets, Baselines, and Metrics

Our experimental analysis encompasses two widely recognized MWP datasets, Math23K [1] and MAWPS [3], as well as extending to more complex and varied MWP datasets, ASDiv-A [5] and SVAMP [6], and a geometric problem dataset, GeoQA [23], to demonstrate the effectiveness of our RASD.

We benchmark our method on multiple classical backbone solvers against multiple augmentations. The solvers are GTS [9], Graph2Tree [10], DeductReasoner [12], NERHRT [24], NGS [23], and Geoformer [25]. The augmentations include Mixup [18], NEFTune [26], and IDAM [8]. GTS is a goal-driven neural model designed to directly predict the expression tree. Graph2Tree enhances GTS with a domain-specific graph encoder. DeductReasoner is an end-to-end deductive reasoning framework that incrementally constructs the target expression through interpretable deductive steps. NERHRT constructs numerical and semantic graphs to capture the relationships between numbers and words, and employs a hierarchical recursive tree decoder to generate mathematical expressions in a structured manner. NGS is a neural geometric solver that solves geometric problems by comprehensively analyzing multimodal information and generating interpretable programs. Geoformer is a unified multi-task geometric transformer framework capable of handling calculation and proof problems concurrently through sequence generation. Mixup generates a weighted combination of random MWP pairs from the training data, while NEFTune introduces noise to the embedding vectors during training. IDAM augments MWP examples by applying latent-space operations during training instead of perturbing the input data. A solver with the RoBERTa label employs pre-trained RoBERTa embeddings to enhance its understanding capabilities: for Math23K, we deploy RoBERTa-wwm-ext (https://github.com/ymcui/Chinese-BERT-wwm, accessed on 24 August 2025), while for MAWPS, ASDiv-A, and SVAMP, we deploy RoBERTa-base [27]. Models enhanced with our RASD carry the RASD label. The metric employed for evaluating our method’s performance is answer accuracy. We consider a prediction to be accurate if the solution derived from the predicted equation aligns precisely with the standard answer.
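The answer-accuracy metric described above can be sketched as follows. The sketch assumes that predicted equations are given as prefix-notation token lists, as tree decoders commonly produce, and that a small numerical tolerance `tol` stands in for exact agreement with the standard answer; both choices, and the function names, are assumptions rather than details from the text.

```python
def eval_prefix(tokens):
    """Evaluate a prefix (Polish-notation) arithmetic expression."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b,
           "^": lambda a, b: a ** b}
    stack = []
    for tok in reversed(tokens):
        if tok in ops:
            a, b = stack.pop(), stack.pop()  # a is the left operand
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]

def answer_accuracy(predictions, gold_answers, tol=1e-4):
    """Fraction of predicted expressions whose value matches the gold answer."""
    correct = 0
    for tokens, gold in zip(predictions, gold_answers):
        try:
            value = eval_prefix(tokens)
        except (ValueError, IndexError, ZeroDivisionError, OverflowError):
            continue  # invalid expressions count as wrong
        if abs(value - gold) < tol:
            correct += 1
    return correct / len(gold_answers)

preds = [["*", "3.14", "-", "^", "+", "3", "2", "2", "^", "3", "2"],  # 3.14*((3+2)^2-3^2)
         ["+", "1"]]                                                  # malformed
acc = answer_accuracy(preds, [50.24, 1.0])
```

Because the comparison is on the computed value, two different but equivalent expressions are both counted as correct, which is why the metric is answer accuracy and not equation accuracy.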

4.2. Implementation Details

Our approach is implemented with PyTorch (https://pytorch.org/, accessed on 24 August 2025). All experiments are performed with an NVIDIA GeForce RTX 3090 GPU. Except for the batch size, learning rate, epoch, and seed, the hyperparameter settings are consistent with those of existing MWP solvers that do not use our method. For hyperparameter tuning, the settings for our method vary somewhat across different models and datasets. We select five random seeds: 0, 1, 42, 113, and 6174. For the coefficient weight $\beta$ associated with the $L_{con}$ loss term, we choose the optimal value from the set [0.1, 0.3, 0.5, 1, 3, 5].
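One way to organize this protocol is sketched below: each candidate $\beta$ is run with the five seeds, and the value with the highest mean accuracy is kept along with its standard deviation. The text does not state the exact selection criterion, so choosing by mean accuracy over seeds is an assumption, and `train_and_evaluate` is a hypothetical callable standing in for a full training run.

```python
import statistics

SEEDS = [0, 1, 42, 113, 6174]
BETAS = [0.1, 0.3, 0.5, 1, 3, 5]

def select_beta(train_and_evaluate, betas=BETAS, seeds=SEEDS):
    """Return (best_beta, mean_accuracy, std_accuracy) by mean accuracy over seeds."""
    summary = {}
    for beta in betas:
        scores = [train_and_evaluate(beta, seed) for seed in seeds]
        summary[beta] = (statistics.mean(scores), statistics.stdev(scores))
    best = max(summary, key=lambda b: summary[b][0])
    return best, summary[best][0], summary[best][1]

# Placeholder scoring function used only to exercise the loop; it does not
# reproduce any reported number.
def fake_run(beta, seed):
    return 90.0 - abs(beta - 0.5) + 0.01 * (seed % 3)

best_beta, mean_acc, std_acc = select_beta(fake_run)
```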

4.3. Overall Results

We begin our analysis by evaluating the efficacy of our RASD on four MWP datasets; all results are reported in Table 2. On the widely used English dataset MAWPS, incorporating RASD raises the accuracy of NERHRT + RoBERTa from 91.4% to 92.4% and that of DeductReasoner + RoBERTa from 92% to 92.7%. On the popular Chinese dataset Math23K, NERHRT + RoBERTa improves from 87.2% to 87.7% and DeductReasoner + RoBERTa from 86.1% to 86.8%. On the English dataset ASDiv-A, GTS + RoBERTa with RASD shows a 0.5% improvement over the original GTS + RoBERTa, Graph2Tree + RoBERTa with RASD outperforms its baseline by up to 0.9%, and the accuracy of DeductReasoner + RoBERTa increases from 83.0% to 83.8%. On SVAMP, a more challenging dataset derived from ASDiv-A through the introduction of specific variations, DeductReasoner + RoBERTa with RASD achieves a 1.7% gain over DeductReasoner + RoBERTa, GTS + RoBERTa with RASD improves from 41% to 41.9%, and Graph2Tree + RoBERTa with RASD improves by up to 2.5% over its baseline.

A review of the results from Table 2 reveals that the GTS and Graph2Tree models, when combined with NEFTune, fail to demonstrate efficacy across the four MWP datasets. Similarly, the DeductReasoner model with NEFTune is ineffective on Math23K, ASDiv-A, and SVAMP. The Mixup augmentation method does not enhance the performance of GTS with RoBERTa across all datasets and fails to improve Graph2Tree with RoBERTa on ASDiv-A and Math23K. Notably, the DeductReasoner with NEFTune shows effectiveness solely on Math23K. While IDAM achieves marginal improvements for DeductReasoner on MAWPS and Math23K, its impact remains limited compared to RASD, particularly on complex datasets like SVAMP where RASD outperforms it by 1.7–2.5 points. Across different base models and datasets, RASD consistently delivers improvements in average accuracy. The advantage of RASD becomes more pronounced on challenging variations like SVAMP, where it achieves the most significant performance leaps among all compared augmentation methods. Additionally, the standard deviations in the results indicate that RASD’s improvements are relatively stable. In most cases, the standard deviations for RASD-augmented models are comparable to or smaller than those of the baseline models, implying that the performance improvements are consistent across different runs. For example, in the case of DeductReasoner + RoBERTa, the standard deviation for the MAWPS dataset decreases from 0.15 to 0.05 when RASD is applied. This stability strengthens the case for RASD’s reliability as a data augmentation technique. These findings underscore the effectiveness and versatility of our RASD across various MWP datasets, highlighting its potential to bolster the performance of different models in solving math word problems.

To further demonstrate the effectiveness of our RASD across a broader spectrum, we extend the experiments to two solvers, NGS and Geoformer, on the recently introduced and more extensive geometric problem dataset GeoQA. The GeoQA test set contains a total of 754 problems, including 417 angle calculation problems and 283 length calculation problems; the remaining 54 problems belong to neither category. The results on GeoQA are shown in Table 3. Integrating RASD with NGS yields a performance gain of up to 1.3% across the entire dataset, and incorporating RASD into Geoformer leads to an improvement of 3.7% on the entire calculation problem dataset. Notably, applying Mixup to NGS does not yield any benefit on GeoQA. These findings collectively underscore the effectiveness of RASD on geometric problem datasets and, more broadly, its versatility in enhancing various solvers for mathematical reasoning.

4.4. Ablation Study on Two Components of RASD

To rigorously assess the contributions of our approach, we undertake ablation studies using the solver DeductReasoner across various MWP datasets. Our primary focus is on evaluating the individual impacts of retrieval augmentation and self-distillation on the outcomes. The results are shown in Table 4. We observe that the implementation of either the retrieval augmentation or the self-distillation leads to a performance enhancement across all four MWP datasets. Specifically, DeductReasoner, when trained solely with the retrieval augmentation, exhibits superior performance compared to DeductReasoner trained only with the constraint objective of self-distillation, with the exception of the ASDiv-A dataset. The synergistic application of both modules results in the most pronounced improvements for DeductReasoner across all MWP datasets. These findings underscore the individual and combined contributions of retrieval augmentation and self-distillation to the efficacy of RASD on MWP datasets.

4.5. Ablation Study on Different Data Ratios

We conduct experiments with the solver DeductReasoner on the Math23K dataset with varying data ratios to investigate the effectiveness and robustness of our RASD. From the results in Table 5, we observe that the solver trained with our RASD outperforms the baseline across all data ratios. This indicates that our RASD remains effective as the proportion of data used varies, suggesting that its gains are robust and generalize across data scenarios of different scale.

4.6. Ablation Study on Different Expression Lengths

To further evaluate the efficacy and robustness of RASD, we perform a series of experiments using the DeductReasoner solver on the Math23K dataset, focusing on varying expression lengths. The data distribution across different equation lengths is presented in Table 6. Equations with a length of five are the most numerous, those with a length of seven are the second most common, and equations with a length of nine are the least common. The results in Table 7 reveal that DeductReasoner augmented with our RASD achieves superior performance compared to its baseline, particularly for expressions of lengths three and nine. Conversely, when the expression length is five or seven, the performance of DeductReasoner remains relatively stable, indicating no significant change with the incorporation of RASD. These results show how expression length affects the solver’s performance and underscore the beneficial effects of RASD in certain contexts.

4.7. Ablation Study on Different Loss Weight Coefficients

Table 8 presents the ablation study results of the consistency objective weight $\beta$ on the NERHRT+RoBERTa model using the MAWPS dataset. The experimental results demonstrate that all configurations with $\beta > 0.1$ outperform the baseline (91.4), validating the robustness of our RASD. The performance improvement shows insensitivity to weight selection, as stable gains are consistently achieved across a relatively wide range.

4.8. Case Study

We conduct a comparative case study to analyze the solution expressions generated by DeductReasoner on the Math23K dataset, both before and after the application of our RASD, as shown in Table 9.

In the first case, the solver trained with RASD accurately identifies the original radius of “3 cm” and the extension of “2 cm” for the circle. It correctly calculates the area increase using the formula $3.14 \times [{(3 + 2)}^{2} - 3^{2}]$, avoiding the baseline’s inappropriate combination of subtraction and addition operations. This shows RASD’s ability to understand and solve geometric problems effectively. In the second case, the problem mentions both “apple” and “pear” trees, which introduces a distractor. RASD precisely identifies the number of apple trees as “9” while ignoring irrelevant information, such as the count of pear trees, which is “7”. The third case features a longer problem statement. RASD focuses on the essence of the question, identifies the ultimate objective of the solution, and ensures that the critical final step (subtracting the handball ticket price) is not omitted.
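The difference between the expressions before and after augmentation is easy to verify numerically. The snippet below simply evaluates the expressions listed in Table 9; the dictionary layout is an illustrative choice.

```python
# (value of the baseline expression, value of the expression after RASD), from Table 9.
cases = {
    "Case 1": (3 * ((3 - 2) + 3) / 2, 3.14 * ((3 + 2) ** 2 - 3 ** 2)),
    "Case 2": ((9 + 7) * 160, 9 * 160),
    "Case 3": (30 * 4 - 20, 30 * 4 - 20 - 30),
}

for name, (before, after) in cases.items():
    print(f"{name}: before = {before:g}, after = {after:g}")
```

For Case 1, the corrected expression gives $3.14 \times (25 - 9) = 50.24$ square centimeters, while the baseline expression evaluates to 6. For Case 3, the baseline stops at the swimming ticket price (100 yuan) instead of the requested price difference (70 yuan).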

As shown in Table 10, our RASD exhibits certain limitations when solving math word problems in specific scenarios. In the first case, the model fails to update a proportion after a quantity is added to the initial amount, highlighting a limitation in handling problems whose initial conditions change. In the second case, the model miscalculates a cumulative total, which is critical for determining intermediate values in multi-step problems. Lastly, the third case reveals a weakness in reverse reasoning from partial information to the overall context, particularly when inferring the whole from a known part. These observations collectively indicate that, while our RASD is effective in many cases, it still struggles in harder scenarios that require multi-step or reverse reasoning.
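The size of these errors can be checked with exact rational arithmetic. The sketch below evaluates the predicted and reference expressions from Table 10 using Python's `fractions` module; the variable names are illustrative only.

```python
from fractions import Fraction as F

# Reference ("True") expressions from Table 10.
true_values = {
    1: (450 * F(3, 5) + 10) / (450 + 10),
    2: F(12000, 30) - 300,
    3: 3 / ((1 - F(4, 5)) * (1 - F(70, 100))),
}

# Predicted ("Pred") expressions from Table 10.
pred_values = {
    1: (1 - F(3, 5)) + 10,
    2: F(12000, 300 * 30) - 300,
    3: 3 / (3 / F(70, 100) * F(4, 5)),
}

for case in sorted(true_values):
    print(f"Case {case}: pred = {pred_values[case]}, true = {true_values[case]}")
```

The reference answers are 14/23, 100 m, and 50 km, whereas the predicted expressions evaluate to 52/5, a negative value, and 7/8, so none of the three predictions is close to the correct result.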

5. Conclusions

In this paper, we propose retrieval augmentation self-distillation (RASD) for math word problem solving. RASD conducts retrieval augmentation in the hidden space during network training to introduce small perturbations, thereby improving model robustness. We also establish a self-distillation objective with a consistency constraint to keep the predictions for the original and augmented problem representations consistent. Experiments on multiple benchmarks demonstrate that RASD effectively boosts the performance of existing MWP and geometric solvers, showcasing its strong generalization ability.

Author Contributions

Conceptualization, X.W. and J.Q.; Funding acquisition, J.Q. and Z.Y.; Investigation, X.W. and J.Q.; Methodology, X.W. and J.Q.; Project administration, Z.Y.; Supervision, J.Q. and Z.Y.; Validation, X.W.; Writing—original draft, X.W.; Writing—review and editing, J.Q. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 62206314, Guangdong Basic and Applied Basic Research Foundation under Grant No. 2022A1515011835, No. 2025A1515010454 and No. 2023A1515012561, and Science and Technology Projects in Guangzhou under Grant No. 2024A04J4388.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data of this study are available from the authors upon reasonable request.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MWP	Math Word Problem
AI	Artificial Intelligence
AGI	Artificial General Intelligence
GWP	Geometry Word Problem
NLP	Natural Language Processing
KL	Kullback–Leibler
NLL	Negative Log-likelihood

References

Wang, Y.; Liu, X.; Shi, S. Deep Neural Solver for Math Word Problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; Palmer, M., Hwa, R., Riedel, S., Eds.; pp. 845–854.
Huang, D.; Shi, S.; Lin, C.Y.; Yin, J.; Ma, W.Y. How well do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 887–896.
Koncel-Kedziorski, R.; Roy, S.; Amini, A.; Kushman, N.; Hajishirzi, H. MAWPS: A Math Word Problem Repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Knight, K., Nenkova, A., Rambow, O., Eds.; pp. 1152–1157.
Qin, J.; Lin, L.; Liang, X.; Zhang, R.; Lin, L. Semantically-Aligned Universal Tree-Structured Solver for Math Word Problems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 3780–3789.
Miao, S.Y.; Liang, C.C.; Su, K.Y. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 975–984.
Patel, A.; Bhattamishra, S.; Goyal, N. Are NLP models really able to solve simple math word problems? arXiv 2021, arXiv:2103.07191.
Qin, J.; Liang, X.; Hong, Y.; Tang, J.; Lin, L. Neural-Symbolic Solver for Math Word Problems with Auxiliary Tasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; Volume 1, pp. 5870–5881.
Qin, J.; Huang, Z.; Zeng, Y.; Zhang, Q.; Lin, L. An Introspective Data Augmentation Method for Training Math Word Problem Solvers. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3113–3127.
Xie, Z.; Sun, S. A Goal-Driven Tree-Structured Neural Model for Math Word Problems. In Proceedings of the International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019.
Zhang, J.; Wang, L.; Lee, K.W.; Bin, Y.; Lim, E.P. Graph-to-Tree Learning for Solving Math Word Problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020.
Tsai, S.h.; Liang, C.C.; Wang, H.M.; Su, K.Y. Sequence to general tree: Knowledge-guided geometry word problem solving. arXiv 2021, arXiv:2106.00990.
Jie, Z.; Li, J.; Lu, W. Learning to reason deductively: Math word problem solving as complex relation extraction. arXiv 2022, arXiv:2203.10316.
Jayasinghe, I.; Ranathunga, S. Two-step memory networks for deep semantic parsing of geometry word problems. In Proceedings of the International Conference on Current Trends in Theory and Practice of Informatics, Limassol, Cyprus, 20–24 January 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 676–685.
Zhang, X.; Zhao, J.; LeCun, Y. Character-Level Convolutional Networks for Text Classification; MIT Press: Cambridge, MA, USA, 2015.
Wei, J.; Zou, K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv 2019, arXiv:1901.11196.
Liu, Q.; Guan, W.; Li, S.; Cheng, F.; Kawahara, D.; Kurohashi, S. RODA: Reverse Operation based Data Augmentation for Solving Math Word Problems. Inst. Electr. Electron. Eng. 2021, 30, 1–11.
Anaby-Tavor, A.; Carmeli, B.; Goldbraich, E.; Kantor, A.; Kour, G.; Shlomov, S.; Tepper, N.; Zwerdling, N. Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7383–7390.
Zhang, H. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
Ramé, A.; Sun, R.; Cord, M. MixMo: Mixing multiple inputs for multiple outputs via deep subnetworks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 823–833.
Guo, H.; Mao, Y.; Zhang, R. Augmenting data with mixup for sentence classification: An empirical study. arXiv 2019, arXiv:1905.08941.
Chen, J.; Yang, Z.; Yang, D. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. arXiv 2020, arXiv:2004.12239.
Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K., Eds.; MIT Press: Cambridge, MA, USA, 2014; Volume 27.
Chen, J.; Tang, J.; Qin, J.; Liang, X.; Liu, L.; Xing, E.P.; Lin, L. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv 2021, arXiv:2105.14517.
Zhang, Y.; Zhou, G.; Xie, Z.; Huang, J.X. Number-enhanced representation with hierarchical recursive tree decoding for math word problem solving. Inf. Process. Manag. 2024, 61, 103585.
Chen, J.; Li, T.; Qin, J.; Lu, P.; Lin, L.; Chen, C.; Liang, X. UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; pp. 3313–3323.
Jain, N.; Chiang, P.y.; Wen, Y.; Kirchenbauer, J.; Chu, H.M.; Somepalli, G.; Bartoldson, B.R.; Kailkhura, B.; Schwarzschild, A.; Saha, A.; et al. NEFTune: Noisy embeddings improve instruction finetuning. arXiv 2023, arXiv:2310.05914.
Liu, Y. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.

Figure 1. An example for math word problem solving. An MWP example often only contains textual information like a problem text, a solution equation, and the answer.

Figure 2. The design of our retrieval augmentation self-distillation method. For each training data sample, the encoder first encodes the problem text into a hidden state sequence. We define the final hidden state of each data sample as the problem representation $h_{i}$. Then, we obtain the augmented problem representation $h_{i}'$ corresponding to each problem representation $h_{i}$ through the retrieval augmentation. Moreover, $h_{i}$ and $h_{i}'$ are fed into the decoder to obtain their respective output distributions. Finally, our RASD constrains the difference between these output distributions by a self-distillation objective with a consistency constraint.
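The retrieval augmentation step in Figure 2 can be illustrated with a small sketch. The details below are assumptions made for illustration and are not taken from the caption: similarity is measured with cosine similarity, the `k` most similar representations are averaged, and the result is mixed into the original representation with a hypothetical interpolation weight `alpha` close to 1 so that the perturbation stays small.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_and_merge(reps, i, k=2, alpha=0.9):
    """Return an augmented representation for reps[i].

    Assumed rule: h_i' = alpha * h_i + (1 - alpha) * mean(top-k neighbours),
    where neighbours are the other representations ranked by cosine similarity.
    """
    h_i = reps[i]
    others = [j for j in range(len(reps)) if j != i]
    top = sorted(others, key=lambda j: cosine(h_i, reps[j]), reverse=True)[:k]
    dim = len(h_i)
    neighbour_mean = [sum(reps[j][d] for j in top) / len(top) for d in range(dim)]
    return [alpha * h_i[d] + (1 - alpha) * neighbour_mean[d] for d in range(dim)]


# Toy 2-dimensional problem representations (made up for illustration).
reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
h_aug = retrieve_and_merge(reps, i=0, k=2, alpha=0.9)
print(h_aug)
```

In this toy example the dissimilar vector `[0.0, 1.0]` is never retrieved, so the augmented representation stays close to the original one. Both representations would then be passed to the decoder.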

Table 1. Key characteristics of the related works.

Method	Key Characteristics
Seq2Seq Models	Direct translation of problems into equation templates
Tree-structured Models (GTS, Graph2Tree)	Better capture of expression semantics
Paraphrasing-based	Creation of diverse samples via synonym replacement and sentence restructuring
Noising-based	Introduction of noise to improve model robustness
Sampling-based (Mixup)	Generation of new data by interpolating word/sentence embeddings

Table 2. Answer accuracy of solvers with our RASD and other data augmentation baselines on MAWPS, Math23K, ASDiv-A, and SVAMP. Dashes (-) indicate that the NERHRT + RoBERTa model was not tested on ASDiv-A and SVAMP datasets. Standard deviations are presented after the accuracy. Green arrows indicate improvement over the baseline, and bold indicates the best results in each column.

Model	MAWPS	Math23K	ASDiv-A	SVAMP
GTS + RoBERTa	${88.5}_{\pm 0.5}$	${75.7}_{\pm 1.0}$	${81.2}_{\pm 0.3}$	${41.0}_{\pm 0.5}$
+NEFTune	${87.6}_{\pm 1.7}$	${75.5}_{\pm 1.3}$	${80.4}_{\pm 1.5}$	${38.4}_{\pm 2.1}$
+Mixup	${88.0}_{\pm 0.4}$	${75.1}_{\pm 1.6}$	${78.6}_{\pm 1.2}$	${39.1}_{\pm 0.1}$
+IDAM	${88.0}_{\pm 0.5}$	${75.0}_{\pm 1.7}$	${79.5}_{\pm 0.5}$	${39.1}_{\pm 0.8}$
+RASD	${88.7}_{(\uparrow 0.2) \pm 0.2}$	${76.4}_{(\uparrow 0.7) \pm 0.3}$	${81.7}_{(\uparrow 0.5) \pm 0.8}$	${41.9}_{(\uparrow 0.9) \pm 0.6}$
Graph2Tree + RoBERTa	${88.7}_{\pm 0.4}$	${77.4}_{\pm 1.4}$	${82.2}_{\pm 0.2}$	${43.8}_{\pm 1.6}$
+NEFTune	${88.4}_{\pm 1.6}$	${75.7}_{\pm 1.9}$	${80.8}_{\pm 0.4}$	${42.2}_{\pm 0.1}$
+Mixup	${88.9}_{\pm 0.4}$	${76.5}_{\pm 0.7}$	${81.6}_{\pm 1.3}$	${44.1}_{\pm 1.6}$
+IDAM	${88.5}_{\pm 0.9}$	${76.6}_{\pm 2.1}$	${81.9}_{\pm 1.3}$	${43.7}_{\pm 0.8}$
+RASD	${89.3}_{(\uparrow 0.6) \pm 0.3}$	${78.5}_{(\uparrow 1.1) \pm 1.1}$	${83.1}_{(\uparrow 0.9) \pm 0.3}$	${46.3}_{(\uparrow 2.5) \pm 1.0}$
DeductReasoner + RoBERTa	${92.0}_{\pm 1.5}$	${86.1}_{\pm 0.8}$	${83.0}_{\pm 0.5}$	${45.0}_{\pm 0.7}$
+NEFTune	${92.3}_{\pm 2.2}$	${85.5}_{\pm 0.6}$	${82.1}_{\pm 1.4}$	${44.1}_{\pm 1.9}$
+Mixup	${92.0}_{\pm 1.2}$	${86.3}_{\pm 0.7}$	${82.3}_{\pm 0.4}$	${44.4}_{\pm 1.6}$
+IDAM	${92.5}_{\pm 0.5}$	${86.4}_{\pm 0.9}$	${82.8}_{\pm 1.3}$	${44.8}_{\pm 1.3}$
+RASD	${92.7}_{(\uparrow 0.7) \pm 0.2}$	${86.8}_{(\uparrow 0.7) \pm 0.4}$	${83.8}_{(\uparrow 0.8) \pm 0.5}$	${46.7}_{(\uparrow 1.7) \pm 0.9}$
NERHRT + RoBERTa	${91.4}_{\pm 0.8}$	${87.2}_{\pm 0.5}$	-	-
+NEFTune	${90.6}_{\pm 1.4}$	${85.4}_{\pm 0.6}$	-	-
+Mixup	${90.6}_{\pm 0.9}$	${86.1}_{\pm 1.1}$	-	-
+IDAM	${91.4}_{\pm 0.8}$	${86.2}_{\pm 0.8}$	-	-
+RASD	${92.4}_{(\uparrow 1.0) \pm 0.5}$	${87.7}_{(\uparrow 0.5) \pm 0.7}$	-	-

Table 3. Answer accuracy of solvers with our RASD and other baselines on different test subsets of GeoQA. “All” represents the overall performance across all types of geometric problems in the dataset. “Angle” and “Length” refer to the performance on angle calculation and length calculation problems, respectively. Standard deviations are presented after the accuracy. Green arrows indicate improvement over the baseline, and bold indicates the best results in each column.

Model	NGS (All)	NGS (Angle)	NGS (Length)	Geoformer (All)	Geoformer (Angle)	Geoformer (Length)
NoAug	${60.7}_{\pm 1.9}$	${71.5}_{\pm 2.1}$	${48.8}_{\pm 0.8}$	${60.3}_{\pm 0.9}$	${71.5}_{\pm 1.1}$	${49.1}_{\pm 0.9}$
+NEFTune	${61.3}_{\pm 1.5}$	${72.7}_{\pm 1.0}$	${49.1}_{\pm 1.8}$	${63.4}_{\pm 0.9}$	${73.6}_{\pm 0.6}$	${53.7}_{\pm 1.3}$
+Mixup	${59.3}_{\pm 2.2}$	${68.8}_{\pm 2.3}$	${49.1}_{\pm 0.9}$	${62.5}_{\pm 1.4}$	${74.6}_{\pm 1.1}$	${50.2}_{\pm 1.8}$
+IDAM	${59.4}_{\pm 2.3}$	${70.5}_{\pm 1.9}$	${49.5}_{\pm 2.5}$	${63.6}_{\pm 1.6}$	${74.7}_{\pm 1.4}$	${50.3}_{\pm 1.6}$
+RASD	${62.0}_{(\uparrow 1.3) \pm 1.3}$	${72.4}_{(\uparrow 0.9) \pm 0.9}$	${52.4}_{(\uparrow 3.6) \pm 1.5}$	${64.0}_{(\uparrow 3.7) \pm 1.1}$	${76.5}_{(\uparrow 5.0) \pm 1.2}$	${50.5}_{(\uparrow 1.4) \pm 0.9}$

Table 4. Ablation study on two modules of our RASD on four MWP datasets. We undertake an ablation study using the solver DeductReasoner + RoBERTa. “+RA”, “+SD”, and “+RASD” refer to the performance of utilizing the retrieval augmentation, the self-distillation, and RASD, respectively. Green arrows indicate improvement over the baseline, and bold indicates the best results in each column.

Model	MAWPS	ASDiv-A	SVAMP	Math23K
NoAug	${92.0}_{\pm 1.5}$	${83.0}_{\pm 0.5}$	${45.0}_{\pm 0.7}$	${86.1}_{\pm 0.8}$
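+RA	${92.5}_{(\uparrow 0.5) \pm 0.5}$	${83.2}_{(\uparrow 0.2) \pm 0.8}$	${45.5}_{(\uparrow 0.5) \pm 0.5}$	${86.6}_{(\uparrow 0.5) \pm 0.9}$
+SD	${92.4}_{(\uparrow 0.4) \pm 1.1}$	${83.6}_{(\uparrow 0.6) \pm 1.0}$	${45.1}_{(\uparrow 0.1) \pm 1.3}$	${86.2}_{(\uparrow 0.1) \pm 0.8}$
+RASD	${92.7}_{(\uparrow 0.7) \pm 0.2}$	${83.8}_{(\uparrow 0.8) \pm 0.5}$	${46.7}_{(\uparrow 1.7) \pm 0.9}$	${86.8}_{(\uparrow 0.7) \pm 0.4}$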

Table 5. Ablation study on different data ratios of Math23K based on DeductReasoner + RoBERTa. The data ratio refers to the percentage of the dataset used for training. The DeductReasoner trained with our RASD outperforms the baseline across all data ratios. Green arrows indicate improvement over the baseline.

Data Ratio	20%	40%	60%	80%	100%
w/o RASD	68.7	77.0	82.5	84.7	86.1
w/RASD	70.1(↑1.4)	78.2(↑1.2)	83.4(↑0.9)	85.2(↑0.5)	86.8(↑0.7)

Table 6. Data count statistics for different equation lengths in the Math23K dataset.

Equation Length	3	5	7	9
Train	4397	11,001	4406	1349
Test	173	522	191	66

Table 7. Ablation study on different expression lengths of Math23K based on DeductReasoner + RoBERTa. Green arrows indicate improvement over the baseline.

Equation Length	3	5	7	9
w/o RASD (value accuracy)	93.64	91.95	78.53	58.90
w/ RASD (value accuracy)	95.38 (↑1.74)	91.95 (-)	78.53 (-)	61.64 (↑2.74)

Table 8. Ablation study on different loss weight coefficients of constraint objective $L_{con}$ on MAWPS based on NERHRT + RoBERTa. Green arrows indicate improvement over the baseline, and red arrows indicate degradation.

$L_{con}$ Weight $\beta$	0.1	0.3	0.5	1	3	5
w/o RASD (value accuracy)	91.4
w/ RASD (value accuracy)	91.3 (↓0.1)	91.9 (↑0.5)	92.4 (↑1.0)	91.6 (↑0.2)	91.7 (↑0.3)	91.6 (↑0.2)

Table 9. Case studies. MWPs are correctly solved after applying our RASD. Red crosses indicate incorrect expressions, and green checkmarks indicate correct expressions.

Case 1: The radius of a circle is 3 cm. If its radius is extended by 2 cm, how much will the area increase? 一个圆的半径是3厘米，如果把它的半径延长2厘米，那么面积增加多少。
Before augmentation: $3 \times ((3 - 2) + 3) \div 2$ (✗)	After augmentation: $3.14 \times [{(3 + 2)}^{2} - 3^{2}]$ (✓)
Case 2: There are 9 apple trees and 7 pear trees in the orchard. Each apple tree can pick approximately 160 kilograms of apples. How many kilograms of apples can be picked in this orchard? 果园里有9棵苹果树，7棵梨树．每棵苹果树大约摘160千克苹果，这个果园大约摘多少千克苹果？
Before augmentation: $(9 + 7) \times 160$ (✗)	After augmentation: $9 \times 160$ (✓)
Case 3: The ticket sales for various competitions of the 2008 Olympic Games are currently booming. The minimum ticket price for a handball competition is 30 yuan, and the minimum ticket price for a swimming competition is 20 yuan less than four times that of a handball competition. How much is the minimum ticket price for a swimming competition more expensive than a handball competition? 2008年奥运会各项比赛门票销售正在火热进行中，一场手球比赛的最低票价为30元，一场游泳比赛的最低票价比手球比赛的4倍少20元，一场游泳比赛的最低票价比手球比赛贵多少元?
Before augmentation: $30 \times 4 - 20$ (✗)	After augmentation: $30 \times 4 - 20 - 30$ (✓)

Table 10. The analysis of our method’s limitations. These MWPs remain hard for the solver trained with our current RASD, which fails to solve them. Red crosses indicate incorrect expressions.

Case 1: A herder has 450 sheep, 3/5 of which are goats. Now, 10 more goats are bought. What fraction of the sheep are goats now? 某牧民养羊450只，其中(3/5)是山羊．现在又买回10只山羊，现在山羊占几分之几？
Pred: $(1 - \frac{3}{5}) + 10$ (✗)	True: $(450 \times (\frac{3}{5}) + 10) \div (450 + 10)$
Case 2: A 12,000 m long highway is to be built. The original plan was to build 300 m per day, but the task was completed in 30 days. How many more meters were built each day than planned? 修一条长12000米的公路，原计划每天修300米，结果30天完成了任务，实际比原计划每天多修多少米？
Pred: $12000 \div (300 \times 30) - 300$ (✗)	True: $12000 \div 30 - 300$
Case 3: A car travels from place A to place B, covering 4/5 of the entire journey. Of the remaining distance, 70% is uphill and the rest is downhill. It is known that the downhill distance is 3 kilometers. How far is it from A to B? 一辆汽车从甲地到乙地，行了全程的(4/5)，在剩下的路程中，70%是上坡路，其余是下坡路．已知下坡路长 3千米，甲、乙两地相距多远？
Pred: $3 \div (3 \div 70\% \times \frac{4}{5})$ (✗)	True: $3 \div [(1 - \frac{4}{5}) \times (1 - 70\%)]$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
