Solving Arithmetic Word Problems by Synergizing Large Language Model and Scene-Aware Syntax–Semantics Method

Peng, Rao; Huang, Litian; Yu, Xinguo

doi:10.3390/app14188184


by Rao Peng ¹, Litian Huang ¹ and Xinguo Yu ¹,²,*

¹ Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China

² Central China Normal University Wollongong Joint Institute, Central China Normal University, Wuhan 430079, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(18), 8184; https://doi.org/10.3390/app14188184

Submission received: 16 July 2024 / Revised: 3 September 2024 / Accepted: 9 September 2024 / Published: 11 September 2024

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Developing Arithmetic Word Problem (AWP)-solving algorithms has recently become one of the hottest research areas because it can simultaneously advance general artificial intelligence and the application of AI technology in education. This paper presents a novel algorithm for solving AWPs by synergizing Large Language Models (LLMs) with the Scene-Aware Syntax–Semantics ($S^{2}$) method. The innovation of this algorithm lies in leveraging the LLM to divide problems into multiple scenes, thereby enhancing the relation-flow approach in the processes of relation extraction and reasoning. Our algorithm consists of three components: scene decomposer, relation extractor, and symbolic solver. In the scene decomposer, we propose the Chain-Of-Scene (COS) method. It dynamically constructs prompts for the LLM using a retrieval-augmented strategy, thus enabling the chain-formed generation of scenes from the input problem. In the relation extractor, we introduce the Scene-Aware $S^{2}$ method, which utilizes syntax–semantics models to match the text within each scene and convert it into relations. This allows for the efficient and accurate extraction of explicit and implicit relations. Finally, a symbolic solver is employed to reason through the set of relations to derive the solution. Experimental results on six authoritative datasets demonstrate that the proposed algorithm achieves an average solving accuracy of 90.4%, outperforming the State-Of-The-Art (SOTA) algorithm by 1.1%. The case study further illustrates that it outputs more reliable solutions than baseline algorithms. These findings have significant implications for promoting smart education and developing personalized intelligent tutoring systems.

Keywords:

arithmetic word problem; problem solving; large language model; Chain-Of-Scene; syntax–semantics method; relation-flow approach

1. Introduction

Problem solving has become a research hotspot with the rapid advancement of artificial intelligence, in which solving Arithmetic Word Problems (AWPs) is one of the most active and challenging tasks [1]. An AWP typically describes a series of quantitative relations and one or more unknown quantities related to these relations in natural language [2]. Solvers for AWPs usually need to possess language comprehension, logical reasoning, and world knowledge, among other skills [3]. In recent years, with the rapid advancement of Natural Language Processing (NLP) and deep learning technologies [4,5], significant progress has been made in the research of AWP solving [6,7]. Nevertheless, the performance of existing solving algorithms still falls short of practical requirements [8,9]. Developing algorithms for solving AWPs focuses on two main tasks: first, improving the solving accuracy of the algorithm, as this directly determines its performance and application range; second, enhancing the reliability of the algorithm’s output, as this affects users’ willingness and experience in using machine solvers.

Large Language Models (LLMs) are natural language generation models achieved through unsupervised pre-training and reinforcement learning [5,10]. They have gained widespread popularity across various industries due to their strong adaptability to different tasks [4]. Recently, many studies have focused on the usage of LLMs in solving AWPs [11,12]. One of the most notable areas of research is Chain-Of-Thought (COT), which enhances the reasoning capabilities of an LLM by prompting it to process complex tasks step by step [6]. Language models employing COT-like methods achieve considerable progress in solving problems because they can continually leverage the extensive knowledge acquired during pre-training to analyze and understand the problem during the step generation [13,14]. However, the main issues with COT-like methods are twofold. First, as AWPs become more complex, LLMs employing COT may generate useless and incorrect steps in the chain (e.g., listing invalid equations) [15], leading to the incorrect solution; second, the precise reasoning for equations relies on the correct invocation of external code interpreters by LLMs [7]. The COT method, when generating long content containing multiple steps, requires repeated calls to the interpreter, which leads to an increase in errors. Therefore, although LLMs excel at leveraging pre-trained knowledge to understand natural language representations of AWPs, they are better suited for solving AWPs when combined with other precise and reliable methods.

The relation-flow approach developed in the past decade is a branch of problem-solving algorithms aiming at providing reliable, accurate, and interpretable solutions [16,17]. The commonality among these algorithms is that they all acquire relations from the problem text through patterns and derive solutions by reasoning about these relations. This characteristic makes the relation-flow approach suitable for educational applications because both the relations involved and the actions to manipulate these relations are aligned with user cognition [18]. Although recent studies have shown good performance on AWPs [17], these algorithms still lag behind State-Of-The-Art (SOTA) algorithms [14,19] in terms of solving accuracy. This is because extracting relations from complex AWPs requires rich external linguistic knowledge (e.g., co-reference resolution) and world knowledge (e.g., common sense), which are difficult to formalize into patterns or rules in relation-flow algorithms. Fortunately, the rapid development of LLMs in recent years has enabled access to extensive linguistic and world knowledge through pre-training [4,20,21], further providing new potential for the development of relation-flow approaches.

Based on these findings, this paper suggests that combining LLMs with the relation-flow approach can offer complementary advantages in solving AWPs. On the one hand, LLMs can provide the algorithm with external knowledge. On the other hand, reasoning through relations rather than natural language can achieve more accurate and reliable solutions.

This paper proposes a novel AWP-solving algorithm that integrates LLMs with relation-flow, consisting of three components: scene decomposer, relation extractor, and symbolic solver. In the scene decomposer, the paper proposes a Chain-Of-Scene (COS) method to prompt the LLM, which decomposes the text of the input AWP into multiple scenes rather than directly solving the original problem. In the relation extractor, a Scene-Aware $S^{2}$ method is proposed that uses model matching to extract a relation set from each scene and produce a list of relation sets. Finally, a symbolic solver is adopted to recurrently reason through the list of relation sets and output the solution for the AWP.

Consequently, the main contributions of this study are as follows:

It proposes a novel algorithm that leverages the LLM to divide the problem into multiple scenes, thereby enhancing the relation-flow approach in relation extraction and reasoning, achieving high accuracy and reliability in solving AWPs.
It proposes a COS prompting method, which constructs prompts through a retrieval-augmented strategy, enabling the LLM to accurately generate scenes from an AWP while avoiding the errors caused by direct solving.
It proposes a Scene-Aware $S^{2}$ method, which combines scene types with syntax–semantics to extract both explicit and implicit relations, thereby enhancing the efficiency and accuracy of relation acquisition.

The remainder of the paper is organized as follows. Section 2 presents a review of related work. Section 3 introduces the methodology of this paper, including the pipeline of our algorithm and the implementation details of the methods. Section 4 presents the experiments and analysis of the results. Section 5 presents the final remarks, including a discussion of experiment results, conclusions, and implications.

2. Related Work

2.1. Usage of LLM in Solving AWPs

In recent years, following the breakthroughs in NLP by LLMs such as ChatGPT, researchers have become interested in their reasoning capabilities for AWPs [4,11]. COT is a prompting-based method that optimizes input prompts to augment the outputs of LLMs [6]. This method demonstrates that prompting the model to leverage the knowledge obtained during pre-training has advantages over fine-tuning methods for mathematical reasoning tasks [22]. However, it has not achieved satisfactory accuracy in solving AWPs.

To enhance solving performance, a series of studies have developed novel prompting methods based on COT, referred to collectively as “COT-like” methods in this paper [14,23]. For instance, the Program-Of-Thought (POT) method prompts the LLM to convert input AWPs into executable program code [23]. The Equation-Of-Thought (EOT) method directly converts AWPs into systems of equations [14]. The X-Of-Thought (XOT) method employs a majority voting strategy to select answers from the outputs of multiple COT-like methods [14]. Nonetheless, these methods perform poorly on complex AWPs that involve more reasoning steps.

To address complex AWPs, Zhou et al. [24] proposed an improved COT method that enables LLMs to generate and solve simpler subproblems at each step of the reasoning chain. Shridhar et al. [25] proposed a dual-agent model method, where one agent specializes in decomposing the problem, and the other agent specializes in solving each subproblem. Although these studies indicate that linguistic and world knowledge of LLMs can be utilized to decompose complex AWPs, their correct solving depends on repeatedly invoking a code interpreter during generation [7], with the success rate of invocation limiting the final solving accuracy.

Based on the aforementioned studies, this paper concludes that while current LLMs can leverage external knowledge to understand and decompose complex AWPs, they are not proficient in precise reasoning and calculation. Combining the strengths of LLMs with other reliable methods may result in more effective algorithms for solving AWPs. Hence, this paper proposes a COS prompting method, which leverages the advantages of LLMs to generate scenes for AWPs rather than solving them directly.

2.2. Advance in Relation-Flow Approach

The relation-flow approach is a series of problem-solving methods that use relations as an intermediate state [17]. Yu et al. [26] first proposed the $S^{2}$ method to implement the relation-flow approach. This method constructs a model pool based on the syntactic and semantic associations between problem sentences and quantitative relations, and then uses this model pool to extract explicit relations from the problem text.

Building on the $S^{2}$ method, relation-flow algorithms were proposed for solving AWPs [17], plane geometric problems [16], function problems [18], and direct circuit problems [27]. Although these methods have achieved good performance, they have not considered the implicit relations in the problem text, which limits their further development. Yu et al. [17] first highlighted the importance of implicit relations for problem-solving and proposed a BERT-based method to extract implicit relations in AWPs. This method first classifies the problem text and then adds a template for implicit relations based on the problem type.

Despite the many advances made by the relation-flow approach, the current algorithms based on it still lag behind SOTA algorithms [14,19,28] in terms of solving accuracy for AWPs. This is because these algorithms only consider the ideal situation where an AWP can be solved by transforming it into a single relation set. In reality, a complex AWP often contains numerous explicit and implicit relations, some of which are relevant, while others are not. Treating these relations as a unified set not only reduces the efficiency of relation extraction but also leads to redundant or erroneous results during relation reasoning. Therefore, a potential way to tackle this issue is to leverage external knowledge to first partition the relevant parts of AWPs into different scenes, and then extract relation sets separately from these distinct scenes. This is precisely the origin of our idea and the focus of this research.

2.3. Progress of Symbolic Solver

In this paper, symbolic solvers refer to the procedures that reason over relation sets to obtain solutions. Previous studies on the relation-flow approach have investigated the development of symbolic solvers for relation sets [17,18].

In early research on problem solving [2,3], symbolic solvers were essentially equation solvers, primarily used for solving already converted linear equation systems. Since equation systems can be effectively solved by available tools such as SymPy, some studies have focused on designing methods or models for acquiring equation systems rather than developing symbolic solvers [19,29,30].

Subsequently, studies began developing algorithms based on the relation-flow approach [16,17,18,27], leading to the creation of symbolic solvers for inferring relation sets. These symbolic solvers primarily perform two functions: converting a relation set into a symbolic system and solving the system. Researchers have developed different symbolic solvers for various types of relations. Yu et al. [17] developed a symbolic solver for inferring explicit and implicit relations, Huang et al. [16] developed a symbolic solver for diagram relations, and Yu and Sun [18] developed a symbolic solver for solving functional relations.

However, these symbolic solvers are limited to situations where an AWP is converted into a single relation set. In the method proposed in this paper, an AWP may be decomposed into multiple scenes by an LLM, so that multiple relation sets are extracted from these scenes. To address this issue, we developed a symbolic solver capable of reasoning over a list of relation sets. This solver recurrently infers the relation set for each scene and passes the intermediate results forward, ultimately obtaining the complete solution for an AWP.

Table 1 provides a comparative summary of the characteristics of advanced algorithms for solving AWPs, highlighting the evolution of approaches and the incorporation of LLMs and symbolic solvers.

3. Methodology

3.1. The Pipeline of Proposed Algorithm

The algorithm proposed in this paper consists of three key components:

Scene Decomposer: The first component uses the COS prompting method to decompose the input problem into a list of scenes, each represented as a text closure. This method is detailed in Section 3.2.1 and constitutes one of the primary contributions of our algorithm.
Relation Extractor: The second component uses the Scene-Aware $S^{2}$ method to extract a relation set from each scene, producing a list of relation sets. This method is another major contribution and is discussed in Section 3.3.1.
Symbolic Solver: The third component derives the final solution by reasoning through the list of relation sets. This component primarily enhances the symbolic solver proposed by [17], adapting it for use in our context.

An overview of our algorithm is shown in Figure 1. In this figure, $S_{i}$ denotes the i-th scene decomposed by the LLM, and $R_{i}$ and $A_{i}$ denote the relation set and the solution acquired from $S_{i}$, respectively. The algorithm operates as described in Algorithm 1. To implement the tasks of the algorithm, three procedures are employed: Procedure 1 for the scene decomposer, Procedure 2 for the relation extractor, and Procedure 3 for the symbolic solver.

Algorithm 1: The algorithm for solving AWPs.

Input: A problem text T

Output: A solution A

1: Step 1: Scene Decomposer: Use Procedure 1 to implement the COS method to acquire a list of scenes S from T;
2: Step 2: Relation Extractor: Use Procedure 2 to implement the Scene-Aware $S^{2}$ method to acquire a list of relation sets $R$ from S;
3: Step 3: Symbolic Solver: Use Procedure 3 to reason over the list of relation sets $R$ and acquire a solution A.
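
The three steps of Algorithm 1 compose into a simple pipeline. The following is a minimal sketch of that composition only, not the authors' implementation: the function names (`solve_awp`, `toy_decompose`, `toy_extract`, `toy_reason`) and the representation of scenes and relations as strings are assumptions made for illustration, and the toy stand-ins do not call an LLM or match any models.

```python
from typing import Callable, Dict, List

Scene = str                # a scene is a piece of text (assumed representation)
RelationSet = List[str]    # a relation set is a list of relation strings (assumed)


def solve_awp(
    problem_text: str,
    decompose: Callable[[str], List[Scene]],
    extract: Callable[[Scene], RelationSet],
    reason: Callable[[List[RelationSet]], Dict[str, float]],
) -> Dict[str, float]:
    """Compose the three steps of Algorithm 1: T -> S -> R -> A."""
    scenes = decompose(problem_text)                       # Step 1: scene decomposer
    relation_sets = [extract(scene) for scene in scenes]   # Step 2: relation extractor
    return reason(relation_sets)                           # Step 3: symbolic solver


# Hypothetical toy stand-ins, used only to show how the pieces connect.
def toy_decompose(text: str) -> List[Scene]:
    return [part.strip() for part in text.split(".") if part.strip()]


def toy_extract(scene: Scene) -> RelationSet:
    return [f"relation_from({scene})"]


def toy_reason(relation_sets: List[RelationSet]) -> Dict[str, float]:
    return {"relation_set_count": float(len(relation_sets))}


result = solve_awp("Scene one. Scene two.", toy_decompose, toy_extract, toy_reason)
print(result)
```

Passing the components as callables keeps the pipeline independent of which LLM, model pool, or solver is plugged in.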

3.2. Scene Decomposer

3.2.1. COS Prompting Method

To introduce the proposed method, this section first defines arithmetic word problem and scene as follows:

Definition 1

(Arithmetic Word Problem). An arithmetic word problem is a set of known conditions and at least one final solution goal described using natural language and numerical symbols.

Definition 2

(Scene). A scene is a subset of conditions described in an AWP and a sub-goal that can be inferred using these conditions. The sub-goal is a necessary condition for the final solution goal of the AWP. Therefore, an AWP contains at least one scene.

Figure 2 illustrates some exemplars of AWPs containing different numbers of scenes. To identify the different scenes in an AWP, the algorithm requires external linguistic or world knowledge. For example, in the first exemplar AWP, identifying the scene “There are 15 kg of apples, 20 kg of bananas in the store. So the store has [A] kg fruits” requires the algorithm to know that “apples and bananas are sub-classes of fruit, but rice is not”. Although LLMs acquire such knowledge through pre-training [4,20,21], correctly applying this knowledge during reasoning remains an issue.

Inspired by COT [6], which models complex tasks as a chain of simpler tasks, we treat an AWP as a chain of scenes, where each scene includes a sub-goal and the minimal set of conditions required to infer this sub-goal, and the goals inferred from preceding scenes can serve as conditions for subsequent scenes.

Specifically, our method first prepares a pool containing a few exemplars, where each exemplar includes an original AWP and a series of annotated scenes, as shown in Figure 2. Each AWP in this pool is sourced from grade school textbooks and has been annotated by mathematics teachers and experts. Then, it uses MWP-BERT to encode the input AWP into a vector [31]. Finally, it employs a retrieval-augmented strategy [32] to retrieve exemplars and construct the few-shots in the prompt. This is carried out by retrieving 1 to 3 samples from the exemplar pool based on vector similarity and inserting them into the prompt.

3.2.2. Implementation of COS Prompting Method

This component employs MWP-BERT to vectorize the input problem text T, because it strengthens the representation of mathematical concepts through multiple math-related pre-training tasks based on BERT [31]. Through encoding, T is represented as a word vector sequence H, and mean pooling is then applied to H to obtain a vector $h(T)$ of T.

$$H = \text{MWP-BERT}(T) = \{h(w_{i})\}_{i=1}^{n} \tag{1}$$

$$h(T) = \bar{H} = \frac{1}{n}\sum_{i=1}^{n} h(w_{i}) \tag{2}$$
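
Equation (2) is a plain average over the word vectors. The sketch below shows that mean-pooling step on a tiny hand-written sequence; the vectors are made up for illustration and do not come from MWP-BERT, and the function name `mean_pool` is an assumption.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a sequence of word vectors H into one vector h(T)."""
    if not token_vectors:
        raise ValueError("cannot pool an empty sequence")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    # Average each dimension independently over the n word vectors.
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


# Illustrative word vectors (not real encoder output).
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_T = mean_pool(H)
print(h_T)  # [3.0, 4.0]
```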

An exemplar dataset $D = \{O_{i}, C_{i}\}_{i=1}^{m}$ is prepared for few-shot prompting, where each sample in this dataset consists of an original text $O_{i}$ and a list of annotated scenes $C_{i}$. The cosine similarity $sim(i)$ between $h(T)$ and $h(O_{i})$ is used to retrieve the K most similar samples from $D$ to construct the prompt. The initial prompt for the LLM includes the decomposing query Q, the exemplars $\{O_{x}, C_{x}\}^{D}$, and the input problem text T. The LLM generates the j-th scene $s_{j}$ in a recurrent way.

$$sim(i) = \cos(h(T), h(O_{i})), \quad O_{i} \in D \tag{3}$$

$$x = \underset{D}{\operatorname{argsort}}\,(sim(i))[:K] \tag{4}$$

$$LLM(\hat{s}_{1}, \dots, \hat{s}_{j-1} \mid Q, \{O_{x}, C_{x}\}^{D}, T) = \hat{s}_{j} \tag{5}$$

Overall, for each input problem text T, this decomposing component outputs a list of scenes S. The details of the procedure are presented in Procedure 1.

Procedure 1: Scene Decomposer using COS Prompting

1: Input: a problem text T, and a decomposing query Q;
2: Output: a list of scenes S;
3: Step 1: Problem Vectorizing
4: ▹ Use MWP-BERT to vectorize the problem text and obtain a vector $h (T)$ for T;
5: Step 2: Exemplar Retrieving
6: ▹ Load the exemplar dataset $D = {O_{i}, C_{i}}_{i = 1}^{m}$ ;
7: ▹ Retrieve the top-k exemplars from $D$ based on $s i m (i)$ ;
8: ▹ Record the top-k index x;
9: Step 3: Scene Generating
10: ▹ Build the few-shot prompt: $P T = (Q, {O_{x}, C_{x}}^{D}, T)$ ;
11: ▹ Input the prompt to LLM and generate scenes: $L L M (P T) \to S = {s_{j}}_{j = 1}^{n}$ ;
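
Steps 2 and 3 of Procedure 1 can be sketched as follows, under stated assumptions: vectors are plain Python lists rather than MWP-BERT outputs, the exemplar dictionary keys (`problem`, `scenes`) and the prompt layout are hypothetical, and the LLM call itself is omitted. The sort is in descending order of similarity so that the first K indices are the most similar exemplars.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors, as in Equation (3)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_indices(query_vec: Sequence[float],
                  exemplar_vecs: Sequence[Sequence[float]],
                  k: int) -> List[int]:
    """Indices of the K exemplars most similar to the query, as in Equation (4)."""
    sims = [cosine(query_vec, vec) for vec in exemplar_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]


def build_prompt(query: str, exemplars: List[Dict], indices: List[int],
                 problem_text: str) -> str:
    """Assemble a few-shot prompt from Q, the retrieved exemplars, and T."""
    parts = [query]
    for i in indices:
        scenes = "\n".join(exemplars[i]["scenes"])
        parts.append(f"Problem: {exemplars[i]['problem']}\nScenes:\n{scenes}")
    parts.append(f"Problem: {problem_text}\nScenes:")
    return "\n\n".join(parts)


# Illustrative data only; the vectors and texts are made up.
exemplars = [
    {"problem": "P0", "scenes": ["scene 0a", "scene 0b"]},
    {"problem": "P1", "scenes": ["scene 1a"]},
    {"problem": "P2", "scenes": ["scene 2a"]},
]
exemplar_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx = top_k_indices([1.0, 0.1], exemplar_vecs, k=2)
prompt = build_prompt("Decompose the problem into scenes.", exemplars, idx, "New problem")
print(idx)
```

Here the query vector is closest to exemplar 0 and then exemplar 2, so those two exemplars are inserted into the prompt ahead of the input problem.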

3.3. Relation Extractor

3.3.1. Scene-Aware $S^{2}$ Method

The Scene-Aware $S^{2}$ method proposed in this paper is an evolution of the $S^{2}$ method [17]. The evolution lies in the following facts. (a) It extracts the relation set from the text of each scene rather than from the original text of an AWP; (b) it first classifies the type of scene, then selects appropriate models from the pool to obtain the explicit relation; (c) it indexes different implicit templates for different scenes based on their scene types, then uses entity matching within the scene to instantiate the implicit relations. This evolution not only reduces the number of models involved in explicit relation matching, but also facilitates the precise extraction of multiple implicit relations, thereby significantly enhancing the performance of relation acquisition. To elucidate our method, we first define the core concept of “Relation” as follows:

Definition 3

(Relation). A relation is an equation that may include numbers, variables, and quantitative phrases. A relation is called an explicit relation if it is explicitly stated in the text. In contrast, it is called an implicit relation if it is required for the solving but is not explicitly stated in the text.

Given that “The height of Lei is 175 cm” is part of a scene narration, it states the explicit relation “Height (Lei) = 175 × cm”. Given “Find the volume of a cube”, the algorithm has to add the relation “volume (cube) = height × width × length”, which is not explicitly stated in the text and therefore constitutes an implicit relation. Explicit relations thus depend on the statements in the scene, while implicit relations depend on the type of scene. Relations in this paper do not refer to quantitative relations because they are all involved in solving AWPs.

Our method first classifies the text of each scene using the Quantity-to-Relation Attention Network (QRAN) proposed by [17], and sequentially extracts explicit and implicit relations based on the classification results. The QRAN reallocates attention weights to the contexts of quantity-related entities, which contain critical information for distinguishing scene types, thereby effectively accomplishing the scene classification task.

In order to extract explicit relations from each scene, this method first constructs a pool of Scene-Aware $S^{2}$ models. It is based on the $S^{2}$ model proposed by [17] and re-summarizes the explicit relation patterns in different scenes. It not only considers general explicit relations that may appear in all scenes, but also considers explicit relations that only appear in specific scenes. The definition of the Scene-Aware $S^{2}$ model is as follows:

Definition 4

(Scene-Aware $S^{2}$ Model). A Scene-Aware $S^{2}$ model is a quintuple $\delta = (L, P, K, Q; R)$, with $L$ denoting the scene types to which the model is applicable, $P$ denoting the structure of Part-Of-Speech (POS), $K$ representing keywords, $Q$ representing the matching rules, and $R$ representing the relation of the model. $\Delta = \{\delta_{i} = (L_{i}, P_{i}, K_{i}, Q_{i}; R_{i}) \mid i = 1, 2, 3, \dots\}$ is called a pool of Scene-Aware $S^{2}$ models.
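
One way to picture the quintuple of Definition 4 is as a small record type. The sketch below is an assumed encoding, not the authors' data structure: the field types, the example POS tags, keywords, rule text, and relation strings are all hypothetical. It also shows how a pool might be narrowed by scene type, treating models labelled "general" as applicable to every scene.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass
class SceneAwareS2Model:
    scene_types: FrozenSet[str]      # L: scene types the model applies to
    pos_structure: Tuple[str, ...]   # P: expected POS tag sequence
    keywords: Tuple[str, ...]        # K: keywords to be matched
    rules: Tuple[str, ...]           # Q: matching rules (free text in this sketch)
    relation: str                    # R: relation produced on a match


def applicable_models(pool: List[SceneAwareS2Model],
                      scene_type: str) -> List[SceneAwareS2Model]:
    """Keep models whose L contains the scene type or the label 'general'."""
    return [m for m in pool
            if "general" in m.scene_types or scene_type in m.scene_types]


# Hypothetical pool with made-up patterns, for illustration only.
pool = [
    SceneAwareS2Model(frozenset({"general"}), ("n", "v", "m", "n"),
                      ("has",), ("tags align",), "count(a, b) = m"),
    SceneAwareS2Model(frozenset({"price"}), ("n", "v", "m", "q"),
                      ("costs",), ("tags align",), "price(a) = m"),
    SceneAwareS2Model(frozenset({"circle_area"}), ("n", "v", "m", "q"),
                      ("radius",), ("tags align",), "radius(a) = m"),
]
candidates = applicable_models(pool, "price")
print([m.relation for m in candidates])
```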

Table 2 shows typical models in the Scene-Aware $S^{2}$ model pool, where each row represents a model. In the “model structure”, the first line represents the POS structure P (outside the “[ ]”) and the keywords K (inside the “[ ]”) to be matched; the second line represents the matching rules Q; the third line represents the relation R to be mapped. Models with scene type “general” are used to extract general explicit relations and are applicable to all scenes. A few models are applicable to multiple scene types because the explicit relations in these scenes are similar. It is worth noting that the original $S^{2}$ model [17] only considers the quantitative relations between entities, such as the quantitative relation between “Pen” and “Eraser” in the second row. Our method evolves this model to enable it to extract the relations between the attributes of entities, such as the relation between “durian” and “apple” in the “price” attribute in the third row.

In order to extract implicit relations, this method prepares an implicit template pool for non-general scenes. The form of the implicit template is similar to [17,33,34], which consists of several concepts and a template relation between them. For example, the implicit template for a “Circle Area” scene is $\{CC: [Area, Radius],\ TP: \text{“}Area = \pi \times Radius \times Radius\text{”}\}$, where “CC” is the concepts and “TP” is the template of relation. We use the concepts in the implicit template to match the entities or attributes in the explicit relation acquired by Scene-Aware $S^{2}$ models. By replacing the concepts with the matched entities or attributes, the relation template can be instantiated into an implicit relation.
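
The instantiation step can be illustrated with the “Circle Area” template from the text. This is a minimal sketch under assumptions: the template is stored as a Python dictionary, the concept-to-entity matches are supplied by hand (the entity names `area(pond)` and `radius(pond)` are hypothetical), and replacement is done by whole-word substitution in a single pass.

```python
import re
from typing import Dict, List, Optional


def instantiate(template: Dict, matches: Dict[str, str]) -> Optional[str]:
    """Replace each concept in the template relation with its matched entity.

    Returns None when some concept has no match, so that no partially
    instantiated relation is added to the relation set.
    """
    concepts: List[str] = template["CC"]
    if any(concept not in matches for concept in concepts):
        return None
    # One pass over the template avoids re-replacing text that was just inserted.
    pattern = re.compile(r"\b(" + "|".join(re.escape(c) for c in concepts) + r")\b")
    return pattern.sub(lambda m: matches[m.group(1)], template["TP"])


circle_area = {"CC": ["Area", "Radius"], "TP": "Area = π × Radius × Radius"}
relation = instantiate(circle_area, {"Area": "area(pond)", "Radius": "radius(pond)"})
print(relation)  # area(pond) = π × radius(pond) × radius(pond)
```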

Overall, this method extracts a relation set including all explicit and implicit relations for each scene, thereby producing a list of relation sets for an AWP.

3.3.2. Implementation of Scene-Aware $S^{2}$ Method

This section discusses the implementation of the Scene-Aware $S^{2}$ method for acquiring a relation set from a scene. It first utilizes QRAN to classify the type of scene, then employs Scene-Aware $S^{2}$ models to match the scene text and extract explicit relations; finally, it matches concepts in the implicit template with entities in the explicit relations to instantiate implicit relations. The process of this component is delineated by Procedure 2.

In QRAN, the algorithm first obtains the vector sequence $\{h(w_{i})\}^{s}$ of the scene text using MWP-BERT [31]. It then calculates the attention score a of the mean vector $h(s)$ for each quantity word vector $h(q)$, and finally, the vector v for classification is obtained by the attention-weighted sum of $h(q)$. This vector is used to predict the scene type $\hat{l}$ through a single-layer network. The training setup of the QRAN follows the description in [17].
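
The attention-weighted sum described above can be sketched as follows. The text does not specify the scoring function, so this sketch assumes dot-product scores normalized with a softmax; the single-layer classifier is reduced to a linear layer followed by an argmax, and all vectors, weights, and labels are made up. It should be read as one plausible reading of the description, not as the QRAN of [17].

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(scene_vec: Sequence[float],
                   quantity_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Weight each quantity vector h(q) by its attention score against h(s)."""
    weights = softmax([dot(scene_vec, q) for q in quantity_vecs])  # assumed scoring
    dim = len(scene_vec)
    return [sum(w * q[d] for w, q in zip(weights, quantity_vecs)) for d in range(dim)]


def predict_scene_type(v: Sequence[float], weight_rows: Sequence[Sequence[float]],
                       biases: Sequence[float], labels: Sequence[str]) -> str:
    """Single linear layer followed by argmax over scene-type labels."""
    logits = [dot(row, v) + b for row, b in zip(weight_rows, biases)]
    return labels[max(range(len(logits)), key=logits.__getitem__)]


# Illustrative values only.
v = attention_pool([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
label = predict_scene_type(v, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                           ["general", "circle_area"])
print(v, label)
```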

For explicit relations, the algorithm uses a three-step matching process. The first step is the matching of scene types, which matches the ‘L’ part of the Scene-Aware $S^{2}$ model to the predicted $\hat{l}$ from QRAN, thus filtering applicable models from the model pool. The second step is the matching of the POS structure, which uses the N-LTP tool [35] to parse the POS tags of the scene text and then matches them to the ‘P’ part of the candidate models. The third step is keyword matching, which uses vector matching to compare the keywords ‘K’ in the model with the words in the scene text that occupy the same positions in the POS structure.
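
The second and third matching steps can be sketched for a single candidate model that has already passed the scene-type filter. Several simplifications are assumptions of this sketch: the POS tags are given as input rather than produced by N-LTP, keywords are compared by exact string match instead of the vector matching used in the paper, keywords are encoded as a position-to-word-set mapping, and the relation template syntax is hypothetical.

```python
from typing import Dict, Optional, Sequence, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def match_model(tokens: Sequence[Token],
                pos_structure: Sequence[str],
                keywords: Dict[int, Set[str]],
                relation_template: str) -> Optional[str]:
    """Return the instantiated relation if the sentence matches, else None."""
    # Step 2: the POS tag sequence must equal the model's P part.
    if [tag for _, tag in tokens] != list(pos_structure):
        return None
    # Step 3: words at keyword positions must be allowed keywords
    # (exact match here; the paper uses vector matching).
    for position, allowed in keywords.items():
        if tokens[position][0].lower() not in allowed:
            return None
    # Fill the relation with the words at the referenced positions.
    return relation_template.format(*[word for word, _ in tokens])


# Hypothetical sentence and model.
tokens = [("Tom", "n"), ("has", "v"), ("5", "m"), ("apples", "n")]
relation = match_model(tokens, ("n", "v", "m", "n"),
                       {1: {"has", "owns"}}, "count({0}, {3}) = {2}")
print(relation)  # count(Tom, apples) = 5
```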

For implicit relations, the algorithm first uses text matching to align the concepts in implicit templates to the entities in explicit relations. Then, for those unmatched concepts, this algorithm concatenates the vectors of the two words and feeds them into a trained binary classification network $BCN(\cdot)$ to verify whether they match. Finally, it converts the implicit template into a relation by replacing concepts with entities and adds it to the relation set.

Procedure 2: Relation extractor using Scene-Aware S² method.

3.4. Symbolic Solver

In this algorithm, the symbolic solver is primarily responsible for reasoning over an input list of relation sets and outputting a solution that includes a series of answers. For each individual relation set, this component follows the same method as [17] to perform relation inference. First, it replaces entities within the relation set $R_{i}$ with algebraic symbols to form a symbolic equation system $E_{i}$ and an entity–symbol mapping table $F_{i}$. Then, it solves for the unknowns in the equation system $E_{i}$ using SymPy [36]. Finally, it restores the solutions to the relations using the entity–symbol mapping table $F_{i}$.

This component differs from previous research by employing an answer buffer $A^{*}$ to assist in reasoning across multiple relation sets. Since the order of the relation sets depends on the sequence in which the LLM generates scenes, the answer buffer sequentially records the inference results produced by reasoning each relation set in this order. It allows subsequent relation sets to share results from the answer buffer as conditional relations for their reasoning. The complete reasoning steps of the symbolic solver are shown in Procedure 3.
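
The symbolize, solve, and restore steps, together with the answer buffer, can be sketched with SymPy as follows. The relation format is an assumption of this sketch: each relation is a string of the form `lhs = rhs` whose entities are written as plain identifiers, which is not necessarily how the paper represents relations. The fruit example loosely follows the scene quoted in Section 3.2.1, with a made-up second scene; underdetermined systems are not handled here.

```python
import re
from typing import Dict, List

import sympy as sp


def solve_relation_sets(relation_sets: List[List[str]]) -> Dict[str, sp.Expr]:
    """Solve relation sets in order, sharing results through an answer buffer."""
    answer_buffer: Dict[str, sp.Expr] = {}  # A*: entity name -> solved value
    for relations in relation_sets:
        # F_i: map every entity name in this relation set to an algebraic symbol.
        names = sorted({n for r in relations for n in re.findall(r"[A-Za-z_]\w*", r)})
        mapping = {name: sp.Symbol(f"x{i}") for i, name in enumerate(names)}
        # E_i: build the symbolic equation system.
        equations = []
        for relation in relations:
            lhs, rhs = relation.split("=", 1)
            equations.append(sp.Eq(sp.sympify(lhs.strip(), locals=mapping),
                                   sp.sympify(rhs.strip(), locals=mapping)))
        # Earlier answers become conditional relations for this scene.
        for name, value in answer_buffer.items():
            if name in mapping:
                equations.append(sp.Eq(mapping[name], value))
        solutions = sp.solve(equations, list(mapping.values()), dict=True)
        if solutions:
            # Restore entity names from symbols and record the answers.
            inverse = {symbol: name for name, symbol in mapping.items()}
            for symbol, value in solutions[0].items():
                answer_buffer[inverse[symbol]] = value
    return answer_buffer


scene_relations = [
    ["apples = 15", "bananas = 20", "fruit = apples + bananas"],
    ["price = 3", "cost = fruit * price"],  # reuses `fruit` from the first scene
]
answers = solve_relation_sets(scene_relations)
print(answers["fruit"], answers["cost"])  # 35 105
```

The second relation set cannot be solved on its own, because `fruit` is defined only in the first scene; the buffer supplies it.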

Procedure 3: Symbolic Solver

4. Experiments

Our experiment evaluates the proposed algorithm using the accuracy of solving AWPs, which is the percentage of correctly solved AWPs in a given dataset. The solving accuracy has been widely used in the evaluation of algorithms in previous related studies [14,17,19,28,32]. Additionally, a case study is conducted to compare and analyze the reliability and error causes of the outputs from the proposed algorithm and the baseline algorithms. To assess solving accuracy, this paper prepared six datasets of AWPs, which are from authoritative sources, and five algorithms were chosen as baseline algorithms.
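
Solving accuracy as defined above is a simple percentage. The following sketch computes it for numeric answers; the relative tolerance used to compare a predicted answer with the reference answer, and the convention that `None` marks an unsolved problem, are assumptions of this sketch rather than details given in the paper.

```python
import math
from typing import Optional, Sequence


def solving_accuracy(predicted: Sequence[Optional[float]],
                     gold: Sequence[float],
                     rel_tol: float = 1e-6) -> float:
    """Percentage of problems whose predicted answer matches the reference."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(
        p is not None and math.isclose(p, g, rel_tol=rel_tol)
        for p, g in zip(predicted, gold)
    )
    return 100.0 * correct / len(gold)


# Two of four answers are correct; None marks an unsolved problem.
accuracy = solving_accuracy([35.0, 105.0, None, 7.0], [35.0, 100.0, 2.0, 7.0])
print(accuracy)  # 50.0
```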

4.1. Experimental Settings

4.1.1. Datasets

Table 3 lists the six prepared datasets, which span two languages. PBE is the union of three Chinese datasets (PEP, BNU, EEP) provided in [17]. The PEP dataset is collected from primary school textbooks published by People’s Education Press in 2018, and the BNU dataset is collected from textbooks published by Beijing Normal University in 2018, both in China. EEP is a dataset that includes problems from entrance exam papers for middle schools in 34 provinces of China from 2010 to 2019. PBE contains 3262 AWPs after the aggregation and cleaning of duplicate problems. Math23a and APECa are subsets of the math word problem datasets Math23k [37] and APE-clean [31], respectively, filtered to remove unsolvable and non-arithmetic types of problems, retaining only high-quality Chinese AWPs.

AGG is an English dataset proposed by [34], containing 1492 AWPs that cover several typical scenes. MAWPS is an English dataset proposed in [38] and widely used in problem-solving research, containing 2373 AWPs. GSM8K is a popular English dataset [22] and contains 1319 AWPs for testing, widely used to evaluate foundational mathematical abilities of large language models.

In addition to the aforementioned characteristics, this paper also calculated the average length of problem text (counted by number of words) and the average length of operation steps (counted by number of operators), represented as ‘AvgLen’ and ‘AvgOp’, respectively, in Table 3. These characteristics have been used in a previous study [1] to help assess the difficulty of problems within the dataset.
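
The two statistics can be computed as below for English problems. This is a rough sketch under assumptions: words are counted by whitespace splitting (Chinese text would need a word segmenter instead), operators are counted as occurrences of `+`, `-`, `*`, and `/` in an annotated expression, so a negative number would be miscounted, and the sample problems are made up.

```python
from typing import List, Tuple


def dataset_stats(problems: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (AvgLen, AvgOp) for a list of (problem text, expression) pairs."""
    n = len(problems)
    # AvgLen: average number of whitespace-separated words per problem text.
    avg_len = sum(len(text.split()) for text, _ in problems) / n
    # AvgOp: average number of arithmetic operators per annotated expression.
    avg_op = sum(sum(expr.count(op) for op in "+-*/") for _, expr in problems) / n
    return avg_len, avg_op


sample = [
    ("Tom has 3 apples and buys 2 more", "3 + 2"),
    ("A pen costs 4 dollars each", "4 * 2 - 1"),
]
avg_len, avg_op = dataset_stats(sample)
print(avg_len, avg_op)  # 7.0 1.5
```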

4.1.2. Baselines

The five baseline algorithms, chosen as the most closely related or the most recent in the literature, are described below.

YuAlg: A relation-flow algorithm presented in [17] in 2023. It is the most advanced relation-flow algorithm for solving AWPs so far. It synergizes the $S^{2}$ method for extracting explicit relations and a neural miner for extracting implicit relations, and it develops a symbolic solver to reason over the relation set.
ZhangAlg: A Sequence-to-Sequence (Seq2Seq) algorithm presented in [19] in 2020. It constructs a graph encoder with a tree decoder to generate an arithmetic expression for a problem text. It achieved the highest solving accuracy without the use of a pre-trained model.
JieAlg: A Seq2Seq algorithm presented in [28] in 2022. It uses BERT as the pre-trained encoder and proposes a deductive reasoner to construct the target expression iteratively. It achieves SOTA performance on Math23k and MAWPS.
YasuAlg: An LLM-based algorithm presented in [32] in 2023. It proposes an analogy-based prompting method, prompting the LLM to first recall relevant exemplars before solving the initial problem. This method significantly improves the performance of LLMs on mathematical reasoning tasks.
LiuAlg: An LLM-based algorithm presented in [14] in 2023. It introduces an XOT method, which integrates a problem-solving framework by prompting the LLM with diverse reasoning thoughts. This method achieves SOTA performance on the GSM8K dataset under the same LLM settings.

The implementation details of the algorithms compared in this paper are as follows. YuAlg, a relation-flow algorithm, employs the same model pool and classifiers as the original work to solve the datasets. ZhangAlg and JieAlg, both Seq2Seq-based algorithms, utilize the code and training set provided by the original work to train solvers, which are then used to solve the datasets in our experiments (excluding the AWPs involved in training). Because the original GSM8K lacks expression annotations, our experiments used the expressions converted by LILA [39] to enable training on GSM8K. YasuAlg and LiuAlg, as LLM-based algorithms, both use gpt-3.5-turbo as the base generation model and apply different prompting methods to solve the datasets. To allow a fair comparison with the baselines, the algorithm proposed in this paper (hereinafter referred to as PROP) also uses gpt-3.5-turbo as the base generation model and employs the same classifier as YuAlg for classifying scene types. In addition, our experiments tested the solving accuracy of PROP using different LLMs.

4.2. Experimental Results

Table 4 presents the solving accuracy of PROP and all baseline algorithms across six datasets. For each dataset, our experiments constructed both Chinese and English versions using Baidu Translate (available at: https://fanyi.baidu.com, accessed on 15 March 2024). PROP was compared with the baseline algorithms on both the Chinese and English versions of the datasets.

Table 4 discloses three key findings.

PROP achieves an average accuracy of 90.4% on the Chinese datasets and 90.3% on the English datasets, outperforming the best Seq2Seq-based algorithm (JieAlg, 81.4% and 79.1%) by 9.0 and 11.2 percentage points, and the best LLM-based algorithm (LiuAlg, 89.3% and 89.6%) by 1.1 and 0.7 percentage points. The differences between the Chinese and English versions are mainly due to translation errors, which affect the performance of all algorithms.
PROP improves on YuAlg by 31.9 percentage points on the Chinese datasets (90.4% vs. 58.5%) and by 34.6 percentage points on the English datasets (90.3% vs. 55.7%), even though both use a relation-flow approach to solve AWPs. This significant improvement indicates that PROP’s relation-flow approach, empowered by the LLM, can solve a broader range of AWPs.
PROP (90.3%) outperformed YasuAlg (86.1%) and LiuAlg (89.6%) on the English dataset by 4.2% and 0.7%, respectively. This demonstrates that symbolic reasoning based on relation sets effectively reduces errors compared to LLM directly reasoning based on natural language or invoking code interpreters.
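The averages quoted in these findings appear to be consistent with a size-weighted mean of the per-dataset accuracies rather than a plain mean. The sketch below checks this for two columns of the Chinese version, using the dataset sizes from Table 3; the weighting scheme is an inference from the reported numbers, not something the paper states, and the function name `weighted_average` is introduced here for illustration.

```python
# Dataset sizes from Table 3 and Chinese-version accuracies (%) from Table 4.
sizes = {"PBE": 3262, "Math23a": 20160, "APECa": 15245,
         "AGG": 1492, "MAWPS": 1987, "GSM8K": 1319}
prop_cn = {"PBE": 89.5, "Math23a": 91.2, "APECa": 90.0,
           "AGG": 88.2, "MAWPS": 94.2, "GSM8K": 83.2}
jie_cn = {"PBE": 52.8, "Math23a": 87.8, "APECa": 81.9,
          "AGG": 73.4, "MAWPS": 91.5, "GSM8K": 43.7}


def weighted_average(accuracy, sizes):
    """Average accuracy over all problems, weighting each dataset by its size."""
    total = sum(sizes.values())
    return sum(accuracy[name] * sizes[name] for name in sizes) / total


print(sum(sizes.values()))                          # 43465
print(round(weighted_average(prop_cn, sizes), 1))   # 90.4
print(round(weighted_average(jie_cn, sizes), 1))    # 81.4
```

Because Math23a and APECa together account for most of the 43,465 problems, they dominate the averages, which is worth keeping in mind when comparing algorithms on the smaller datasets.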

To further verify that the relation-flow approach can enhance the problem-solving performance of LLMs, this experiment compared LiuAlg (which uses the LLM directly for solving) with PROP (which uses relations for solving) under two different LLMs: gpt-3.5-turbo and qwen-Max-0428. To reduce cost, solving accuracy was tested on the Chinese versions of the three smallest datasets, and the results are presented in Table 5. The table shows that, for the same LLM, combining it with the relation-flow approach effectively improves solving accuracy. Additionally, a larger model scale and more advanced training methods give the LLM more comprehensive knowledge for decomposing complex scenes.

4.3. Case Study

To assess the reliability of the algorithm’s output and analyze the causes of solving errors, our experiment sampled typical cases of varying difficulty from the datasets. Specifically, the outputs of ZhangAlg and LiuAlg were compared with that of PROP. ZhangAlg outputs the arithmetic expression and the answer, LiuAlg provides a natural language description of the reasoning process and the answer, and PROP outputs the relation sets and the answer. The details of the case study are presented in Table 6.

From Table 6, the following can be seen. For Case 1, all algorithms produce the correct answer. This is because it is a simple AWP that only contains a single “general” scene, requiring only the information stated in the problem for the solution. For Case 2, ZhangAlg outputs an incorrect answer because it failed to obtain external knowledge from the limited training set, leading to the erroneous inclusion of the rice quantity “20 kg” in the fruit quantity calculation “(15 + 20 + 20) kg”. In contrast, both LiuAlg and PROP acquire external knowledge through LLMs, handling this problem effectively and obtaining the correct answer. For Case 3, both ZhangAlg and LiuAlg produce incorrect answers. ZhangAlg fails due to poor learning of long expressions, while for LiuAlg, errors arise from generating invalid equations such as “(5/9) * N/N = (2/5)” during excessive reasoning steps. PROP, using a relation-flow method, decomposes the problem into three different scenes, forming three relational sets. It achieves the correct answer through precise reasoning among these relations.
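The scene-by-scene reasoning described for Case 3 can be sketched in a few lines of Python. This is a simplified stand-in under stated assumptions, not the authors' symbolic solver: each scene is assumed to contribute one relation that is linear in a single new unknown, and the helper `solve_linear` and the variable names are hypothetical. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def solve_linear(residual, known, unknown):
    """Solve residual(values) == 0 for `unknown`, assuming the residual is linear in it."""
    r0 = residual({**known, unknown: Fraction(0)})
    r1 = residual({**known, unknown: Fraction(1)})
    slope = r1 - r0
    if slope == 0:
        raise ValueError(f"relation does not determine {unknown!r}")
    return -r0 / slope


# One (unknown, relation) pair per scene of Case 3; each relation is written as
# left-hand side minus right-hand side, so it equals zero when satisfied.
scenes = [
    # Scene 1 (General): girls = students(morning) * (5/9)
    ("girls", lambda v: v["girls"] - v["students_morning"] * Fraction(5, 9)),
    # Scene 2 (General): girls = students(now) * (2/5)
    ("students_now", lambda v: v["girls"] - v["students_now"] * Fraction(2, 5)),
    # Scene 3 (Change): Came(boys) = students(now) - students(morning)
    ("boys_came", lambda v: v["students_now"] - v["students_morning"] - v["boys_came"]),
]

values = {"students_morning": Fraction(36)}
for unknown, residual in scenes:
    values[unknown] = solve_linear(residual, values, unknown)

print(values["girls"], values["students_now"], values["boys_came"])  # 20 50 14
```

Each scene only consumes values fixed by earlier scenes, so an inconsistent relation such as the invalid equation reported for LiuAlg would surface as an error in `solve_linear` instead of propagating silently.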

5. Final Remarks

5.1. Discussion of Experiment Results

The proposed algorithm achieved competitive levels across all datasets since it combined the strengths of two advanced solving methods [14,17]. Additionally, the proposed algorithm showed significant improvements over the baseline relation-flow algorithm. This improvement is attributed to the incorporation of LLMs, enabling the algorithm to access external knowledge to better understand and decompose complex scenes in AWPs, thereby facilitating relation extraction and reasoning.

Further comparative experiments confirmed that the proposed algorithm outperforms LLM-based baseline algorithms under the condition of using the same LLM. This superiority is mainly due to the proposed algorithm converting scenes generated by LLMs into relation sets for reasoning, avoiding errors that may arise from LLMs directly using natural language to reason about mathematical logic. Furthermore, the proposed algorithm integrates the Scene-Aware $S^{2}$ method and symbolic solver, achieving high-performance relation extraction and reasoning, resulting in more accurate and stable solutions compared to using LLMs alone.

Additionally, Table 4 also demonstrates the improvements brought by pre-training. By incorporating pre-trained BERT as the encoder, JieAlg achieved an 8.8% improvement in solving accuracy compared to ZhangAlg. Furthermore, YasuAlg, which leverages LLMs, demonstrated an additional 7% improvement over JieAlg. These improvements indicate that the models can acquire external knowledge through pre-training. Such knowledge helps the solving algorithms understand more complex scenes in AWPs, and as the model size increases, it significantly enhances solving performance.

The case study results provide an alternative explanation for the improvement in solving accuracy. All baseline algorithms can correctly solve simple AWPs because they only require the algorithm to convert the semantic information of the problem statement into expressions or answers. However, complex AWPs require extensive world knowledge. For example, in Case 2, the algorithm must know that “bananas and apples are fruits, but rice is not” to solve the problem correctly. Acquiring such knowledge typically requires the model to undergo unsupervised and reinforcement learning on large world datasets, which is an advantage of LLMs. For more complex AWPs, besides world knowledge, longer reasoning steps and the generation of multiple equations are involved. The accumulation of errors makes LLMs more prone to generating invalid logic and equations. By introducing the relation set as a stable intermediate state, this paper reduces the impact of LLM-generated errors on problem-solving and thereby achieves a more reliable AWP-solving algorithm.

5.2. Conclusions and Future Work

This paper proposes a novel method for solving AWPs by combining the advantages of LLMs and relation-flow, achieving a highly accurate and reliable algorithm.

First, the paper identifies key weaknesses in LLM-based baseline algorithms, particularly their tendency to generate invalid equations and incorrect calculations due to error propagation in multi-step reasoning. These weaknesses often stem from the direct use of LLMs for problem-solving, which can accumulate errors when reasoning through complex multi-step problems. To address these issues, the proposed method introduces the COS prompting technique. Instead of directly solving AWPs, COS prompts the LLM to decompose the problem into scenes described in natural language, which significantly reduces the chances of error accumulation by enabling more accurate relation extraction and reasoning processes.

Secondly, the introduction of the Scene-Aware $S^{2}$ method plays a critical role in enhancing the robustness of the problem-solving process. By extracting relations from the decomposed scenes, this method provides a structured and reliable basis for further reasoning. This method contrasts with relying solely on LLM-generated logic and equations, which can be error-prone when handling complex AWPs. The relation sets serve as a buffer against errors by enabling precise reasoning through a symbolic solver, thereby reducing the overall error rate and improving the performance of the solving algorithm.

Experimental results demonstrate that this integrated method surpasses both advanced LLM-based and relation-flow algorithms in solving accuracy and output reliability, confirming its effectiveness in tackling complex AWPs.

For future research, this paper proposes three potential directions. First, distill the related knowledge from LLMs to smaller models, thereby reducing the cost of AWP-solving algorithms and facilitating their wider application in education. Second, vectorize the Scene-Aware $S^{2}$ model pool to reduce the number of models, thereby improving the efficiency of model matching and relation extraction. Lastly, explore the combination of LLM-based and relation-flow approaches for solving diagram-text problems, as existing LLM-based solving algorithms have not yet achieved significant progress on these challenging problems.

5.3. Implications

The implications of this research are profound, particularly in smart education and intelligent tutoring systems. By leveraging the complementary strengths of LLMs and relation-flow approaches, this method not only enhances the accuracy and reliability of AWP solvers but also aligns with the cognitive processes of learners. This dual focus on accuracy and interpretability is crucial for educational applications, where the trustworthiness and transparency of automated systems directly impact their adoption and effectiveness in classroom settings.

Furthermore, the proposed algorithm’s potential to provide precise and contextually relevant solutions to AWPs may significantly advance the field of intelligent tutoring systems. These systems could benefit from more sophisticated problem-solving capabilities, offering students not only correct answers but also step-by-step explanations that mirror human reasoning processes. This could lead to improved learning outcomes by helping students understand the underlying logic of mathematical problems, thus making intelligent tutoring systems more effective as tools for personalized learning.

Author Contributions

Conceptualization, R.P. and X.Y.; methodology, R.P.; software, R.P. and L.H.; validation, R.P., X.Y. and L.H.; formal analysis, X.Y.; investigation, R.P.; resources, R.P.; data curation, R.P. and L.H.; writing—original draft preparation, R.P.; writing—review and editing, X.Y.; visualization, L.H.; supervision, X.Y.; project administration, X.Y.; funding acquisition, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62277022) and the China Postdoctoral Science Foundation (2023M731245).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AWP	Arithmetic Word Problem
NLP	Natural Language Processing
LLM	Large Language Model
$S^{2}$	Syntax–Semantics
SOTA	State-Of-The-Art
COT	Chain-Of-Thought
COS	Chain-Of-Scene
POS	Part-Of-Speech
QRAN	Quantity-to-Relation Attention Network
Seq2Seq	Sequence-to-Sequence

References

1. Zhang, D.; Wang, L.; Zhang, L.; Dai, B.T.; Shen, H.T. The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2287–2305.
2. Bobrow, D.G. Natural Language Input for a Computer Problem Solving System; Massachusetts Institute of Technology: Cambridge, MA, USA, 1964.
3. Kintsch, W.; Greeno, J.G. Understanding and solving word arithmetic problems. Psychol. Rev. 1985, 92, 109–129.
4. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
5. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
6. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
7. Zhou, A.; Wang, K.; Lu, Z.; Shi, W.; Luo, S.; Qin, Z.; Lu, S.; Jia, A.; Song, L.; Zhan, M.; et al. Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 7–11 May 2024.
8. Lu, Y.; Pian, Y.; Chen, P.; Meng, Q.; Cao, Y. RadarMath: An Intelligent Tutoring System for Math Education. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 16087–16090.
9. Mousavinasab, E.; Zarifsanaiey, N.; Kalhori, S.R.N.; Rakhshan, M.; Keikha, L.; Saeedi, M.G. Intelligent Tutoring Systems: A Systematic Review Of Characteristics, Applications, And Evaluation Methods. Interact. Learn. Environ. 2021, 29, 142–163.
10. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
11. Zong, M.; Krishnamachari, B. Solving math word problems concerning systems of equations with gpt-3. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 15972–15979.
12. Zong, M.; Krishnamachari, B. Solving math word problems concerning systems of equations with GPT models. Mach. Learn. Appl. 2023, 14, 100506.
13. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
14. Liu, T.; Guo, Q.; Yang, Y.; Hu, X.; Zhang, Y.; Qiu, X.; Zhang, Z. Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023; pp. 2807–2822.
15. Huang, J.; Chen, X.; Mishra, S.; Zheng, H.S.; Yu, A.W.; Song, X.; Zhou, D. Large Language Models Cannot Self-Correct Reasoning Yet. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 7–11 May 2024.
16. Huang, L.; Yu, X.; Niu, L.; Feng, Z. Solving Algebraic Problems with Geometry Diagrams Using Syntax-Semantics Diagram Understanding. Comput. Mater. Contin. 2023, 77, 517–539.
17. Yu, X.; Lyu, X.; Peng, R.; Shen, J. Solving arithmetic word problems by synergizing syntax-semantics extractor for explicit relations and neural network miner for implicit relations. Complex Intell. Syst. 2023, 9, 697–717.
18. Yu, X.; Sun, H.; Sun, C. A relation-centric algorithm for solving text-diagram function problems. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8972–8984.
19. Zhang, W.; Shen, Y.; Ma, Y.; Cheng, X.; Tan, Z.; Nong, Q.; Lu, W. Multi-View Reasoning: Consistent Contrastive Learning for Math Word Problem. In Proceedings of the Findings of the Association for Computational Linguistics, EMNLP, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 1103–1116.
20. Roberts, A.; Raffel, C.; Shazeer, N. How Much Knowledge Can You Pack Into the Parameters of a Language Model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 5418–5426.
21. Kauf, C.; Ivanova, A.A.; Rambelli, G.; Chersoni, E.; She, J.S.; Chowdhury, Z.; Fedorenko, E.; Lenci, A. Event knowledge in large language models: The gap between the impossible and the unlikely. Cogn. Sci. 2023, 47, e13386.
22. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv 2021, arXiv:2110.14168.
23. Chen, W.; Ma, X.; Wang, X.; Cohen, W.W. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv 2023, arXiv:2211.12588.
24. Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q.V.; et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
25. Shridhar, K.; Stolfo, A.; Sachan, M. Distilling Reasoning Capabilities into Smaller Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 7059–7073.
26. Yu, X.; Gan, W.; Wang, M. Understanding explicit arithmetic word problems and explicit plane geometry problems using syntax-semantics models. In Proceedings of the 2017 International Conference on Asian Language Processing (IALP), Singapore, 5–7 December 2017; pp. 247–251.
27. Jian, P.; Sun, C.; Yu, X.; He, B.; Xia, M. An End-to-end Algorithm for Solving Circuit Problems. Int. J. Pattern Recognit. Artif. Intell. 2019, 33, 1940004:1–1940004:21.
28. Jie, Z.; Li, J.; Lu, W. Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; Volume 1, pp. 5944–5955.
29. Kushman, N.; Artzi, Y.; Zettlemoyer, L.; Barzilay, R. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Long Papers), ACL, Baltimore, MD, USA, 23–24 June 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; Volume 1, pp. 271–281.
30. Qin, J.; Liang, X.; Hong, Y.; Tang, J.; Lin, L. Neural-Symbolic Solver for Math Word Problems with Auxiliary Tasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; Volume 1, pp. 5870–5881.
31. Liang, Z.; Zhang, J.; Wang, L.; Qin, W.; Lan, Y.; Shao, J.; Zhang, X. MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 997–1009.
32. Yasunaga, M.; Chen, X.; Li, Y.; Pasupat, P.; Leskovec, J.; Liang, P.; Chi, E.H.; Zhou, D. Large Language Models as Analogical Reasoners. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 7–11 May 2024.
33. Mitra, A.; Baral, C. Learning To Use Formulas To Solve Simple Arithmetic Problems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Long Papers), ACL, Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; Volume 1, pp. 2144–2153.
34. Roy, S.; Roth, D. Mapping to Declarative Knowledge for Word Problem Solving. Trans. Assoc. Comput. Linguist. 2018, 6, 159–172.
35. Che, W.; Feng, Y.; Qin, L.; Liu, T. N-LTP: An Open-source Neural Language Technology Platform for Chinese. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Virtual, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 42–49.
36. Meurer, A.; Smith, C.P.; Paprocki, M.; Čertík, O.; Kirpichev, S.B.; Rocklin, M.; Kumar, A.; Ivanov, S.; Moore, J.K.; Singh, S.; et al. SymPy: Symbolic computing in Python. PeerJ Comput. Sci. 2017, 3, e103.
37. Wang, Y.; Liu, X.; Shi, S. Deep Neural Solver for Math Word Problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 845–854.
38. Koncel-Kedziorski, R.; Roy, S.; Amini, A.; Kushman, N.; Hajishirzi, H. MAWPS: A Math Word Problem Repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 1152–1157.
39. Mishra, S.; Finlayson, M.; Lu, P.; Tang, L.; Welleck, S.; Baral, C.; Rajpurohit, T.; Tafjord, O.; Sabharwal, A.; Clark, P.; et al. LILA: A Unified Benchmark for Mathematical Reasoning. In Proceedings of the Findings of the Association for Computational Linguistics, EMNLP, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5807–5832.

Figure 1. An overview of our algorithm.

Figure 2. The exemplars of AWPs containing different scenes.

Table 1. Comparison of the characteristics of related algorithms.

Name	Year	Approach	LLM	Symbolic Solver
ZhangAlg [19]	2022	Sequence-to-Sequence	No	No
JieAlg [28]	2022	Sequence-to-Sequence	No	No
YuAlg [17]	2023	Relation-Flow	No	Yes
LiuAlg [14]	2023	Prompting-LLM	Yes	No
YasuAlg [32]	2024	Prompting-LLM	Yes	No
Our Work	2024	Prompting-LLM + Relation-Flow	Yes	Yes

Table 2. The Scene-Aware $S^{2}$ models.

Scene Type	Model Structure	Matched Sentence	Explicit Relations
General	n m q a = n; b = m; c = q a = b * c	Ming had 15 stamps	Ming = 15 × stamp
General	n m [more than] n a = n; b = m; c = n a = b + c	Pens are 5 more than eraser	Pen = 5 + Eraser
General	n [of] n m [times] n a = n; b = n; c = m; d = n a (b) = c * a (d)	The price of durian is 5 times that of apple	Price (durian) = 5 × Price (apple)
Change	n m q [left] a = n; b = m; c = q Left (a) = b * c	The match is still 2 cm left.	Left (match) = 2 × cm
Rate and Journal	n [average] m q [per] q a = n; b = m; c = q; d = q Rate (a) = b * (c/d)	The factory binds an average of 90 books per day	Rate (factory) = 90 × (book/day)
Discount	n [at] m [discount] a = n; b = m Discount (a) = b	A product is sold at a 10% discount	Discount (product) = 10%
Solution	n [made of] n n [ratio of] q a = n; b = n; c = n; d = q Solution (a), Solute (b)/Solvent (c) = d	An alcohol is made of ethanol and water in a mass ratio of 1:5.	Solution(alcohol), Solute (ethanol) ÷ Solvent (water) = 1:5

Note: [*] indicates the keywords in the model; a(b) indicates the attribute a of entity b.
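To make the role of a model in Table 2 concrete, the sketch below imitates the "n m [more than] n" model with a regular expression: two entity slots, one quantity slot, and the keyword "more than" are matched, then bound to the relation template a = b + c. This is a deliberately simplified assumption for illustration; the regular expression, the function name `extract_more_than`, and the use of surface words instead of part-of-speech tags are not taken from the paper.

```python
import re

# Simplified stand-in for the "n m [more than] n" model:
# entity (a), quantity (b), keyword "more than", entity (c).
MORE_THAN = re.compile(
    r"^(?P<a>\w+) (?:is|are) (?P<b>\d+(?:\.\d+)?) more than (?P<c>\w+)\.?$",
    re.IGNORECASE,
)


def extract_more_than(sentence):
    """Return the explicit relation 'a = b + c' if the sentence matches, else None."""
    match = MORE_THAN.match(sentence.strip())
    if match is None:
        return None
    return f"{match['a']} = {match['b']} + {match['c']}"


print(extract_more_than("Pens are 5 more than eraser"))  # Pens = 5 + eraser
print(extract_more_than("Ming had 15 stamps"))           # None
```

A sentence that fails to match one model would simply be tried against the next model in the pool, which is why the second call returns no relation here.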

Table 3. Datasets for evaluating solving accuracy.

Dataset	Size	AvgLen	AvgOp	Source	Language
PBE [17]	3262	33.76	2.95	textbook, exam	CN
Math23a [37]	20,160	27.68	2.70	web	CN
APECa [31]	15,245	29.32	2.82	textbook, web	CN
AGG [34]	1492	26.96	1.30	web, exam	EN
MAWPS [38]	1987	30.28	1.47	web	EN
GSM8K [22]	1319	42.25	3.29	web	EN

Table 4. Comprehensive experiment results on solving accuracy.

		Algorithms Tested on Chinese Version					Algorithms Tested on English Version
		YuAlg	ZhangAlg	JieAlg	LiuAlg	PROP	YuAlg	JieAlg	YasuAlg	LiuAlg	PROP
Dataset	#	%	%	%	%	%	%	%	%	%	%
PBE	3262	55.2	49.2	52.8	85.2	89.5	54.0	49.0	75.7	82.8	88.2
Math23a	20,160	60.0	77.8	87.8	90.1	91.2	58.2	84.3	88.0	90.2	91.0
APECa	15,245	58.1	72.6	81.9	89.4	90.0	52.3	80.5	85.5	89.9	89.6
AGG	1492	54.2	66.7	73.4	86.2	88.2	55.3	72.5	85.1	90.4	90.9
MAWPS	1987	69.7	84.3	91.5	94.6	94.2	71.2	92.0	94.4	95.3	95.7
GSM8K	1319	35.0	38.5	43.7	81.3	83.2	37.8	46.5	77.8	83.2	84.5
Avg	43,465	58.5	72.6	81.4	89.3	90.4	55.7	79.1	86.1	89.6	90.3

“#” refers to the number of problems in dataset.

Table 5. Comparison results between PROP and LiuAlg.

Dataset	LiuAlg_gpt	PROP_gpt	LiuAlg_qwen	PROP_qwen
PBE	85.2	89.5	88.9	90.3
AGG	86.2	88.2	94.8	95.1
GSM8K	81.3	83.2	91.3	93.6
Avg	84.6	87.8	90.9	92.2

Table 6. The case study for analyzing the outputs.

AWP case 1	Ming has 15 stamps, and Li has 5 stamps less than Ming. How many stamps does Li have?
Output	ZhangAlg: Equation: ‘15−5’, Answer: 10. (Correct)
	LiuAlg: 1. Let Ming have x stamps; 2. According to the question, Ming has 15 stamps, that is, x = 15; 3. The question also mentions that Li has 5 fewer stamps than Ming. Therefore, the number of stamps Li has can be expressed as x − 5; 4. Now let’s substitute the number of stamps Ming has x = 15; 5. The number of stamps Li has is: x − 5 = 15 − 5 = 10; 6. Therefore, Li has 10 stamps. (Correct)
	PROP: 1. {Type: General, Relation Set: {Ming = 15 * stamps, Li = Ming − 5 * stamps}, Answer: {Li = 10 * stamps}}; Final Answer: Li has 10 stamps. (Correct)
AWP case 2	There are 15 kg of apples, 20 kg of bananas, and 20 kg of rice in the store. Divide all the fruits equally into 5 boxes. How many kg of fruits should be put in each box?
Output	ZhangAlg: Equation: ‘(15 + 20 + 20)/5’, Answer: 11. (Incorrect)
	LiuAlg: 1. Calculate the total number of kg of fruit: 15 kg of apples, 20 kg of bananas, total fruit = apples + bananas = 15 + 20 = 35 kg; 2. Determine the number of boxes and the average amount of fruit in each box: the number of boxes is 5 boxes, the average amount of fruit in each box = total amount of fruit/number of boxes; 3. Calculate how many kg of fruit should be put in each box: the amount of fruit that should be put in each box = total amount of fruit/number of boxes, the amount of fruit that should be put in each box = 35/5 = 7 kg. 4. Therefore, each box should have 7 kg of fruit, so that all the fruit can be divided equally into 5 boxes. (Correct)
	PROP: 1. {Type: Part-Whole, Relation Set: {apples = 15 * kg, bananas = 20 * kg}, Answer: {fruits = 35 * kg}}; 2. {Type: Rate, Relation Set: {Total (fruits) = 35 * kg, Amount (fruits) = 5 * boxes}, Answer: {Rate (fruits) = 7 * (kg/box)}}; Final Answer: 7 kg in each box. (Correct)
AWP case 3	There are 36 students in the morning, of which girls account for (5/9) of the total number of students. Later, a few boys came. Now girls account for (2/5) of the total number of students. Later, how many boys came?
Output	ZhangAlg: Equation: ‘36 * (5/9)/(1 − (2/5)) − 36’, Answer: −2.666. (Incorrect)
	LiuAlg: 1. Let the total number of people in the morning be N; 2. Initially, girls account for (5/9) of the total number of people in the morning, so the number of girls is (5/9) * N; 3. Later, several boys came, making girls account for (2/5) of the total number of people, so the number of girls is still (5/9) * N, and the total number is N + x, where x is the number of new boys; 4. According to the proportional relationship given in the question, we can establish the following equation: (5/9) * N/N = (2/5); 5. This is because in the second state, the number of girls is still (5/9) * N, and the total number becomes N + x; 6. Solving this equation, we can get: (5/9) = (5/2), which is obviously inconsistent, so we need to re-examine the problem. (Incorrect)
	PROP: 1. {Type: General, Relation Set: {students (morning) = 36, girls = students (morning) * (5/9)}, Answer: {girls = 20}}; 2. {Type: General, Relation Set: {girls = 20, girls = students (now) * (2/5)}, Answer: {students (now) = 50}}; 3. {Type: Change, Relation Set: {students (morning) = 36, students (now) = 50}, Answer: {Came (boys) = 14}}; Final Answer: 14 boys came. (Correct)

The first column labels the content of the second column. In the second column, each output begins with the name of the algorithm being compared and ends with a verdict in parentheses indicating whether the answer output by the algorithm is correct or incorrect.


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
