Mathematical Modelling Abilities of Artificial Intelligence Tools: The Case of ChatGPT

Spreitzer, Carina; Straser, Oliver; Zehetmeier, Stefan; Maaß, Katja

doi:10.3390/educsci14070698

¹ Institute of Instructional and School Development, University of Klagenfurt, Sterneckstraße 15, 9020 Klagenfurt, Austria

² International Centre for STEM Education (ICSE), University of Education Freiburg, Kunzenweg 21, 79117 Freiburg, Germany

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Educ. Sci. 2024, 14(7), 698; https://doi.org/10.3390/educsci14070698

Submission received: 2 May 2024 / Revised: 21 June 2024 / Accepted: 23 June 2024 / Published: 26 June 2024

(This article belongs to the Special Issue Fostering Mathematics Teachers for a New Era)

Abstract

This work explores the mathematical modelling capabilities of various iterations of ChatGPT, focusing on their performance across tasks of differing complexity and openness. The study examines the abilities of GPT-3.5, GPT-4.0, and a more instructed version, GPT-MM, in multiple scenarios. It is observed that all versions demonstrate basic mathematical problem-solving skills. However, their effectiveness varies with increasing task complexity. While GPT-4.0 and GPT-MM show marginal improvements in providing detailed solutions, significant challenges persist, especially in moderate to complex modelling contexts where comprehending the nuances of tasks becomes challenging. Additionally, the study suggests that the openness of modelling tasks has a limited impact on performance, highlighting that mathematical and contextual complexities play more critical roles. The implications of these observations are discussed in terms of potential enhancements to teaching methodologies and the integration of AI tools like GPT in educational settings. This reiterates the importance of further research to fully understand the capabilities and limitations of AI tools and ensure their effective use in education.

Keywords: artificial intelligence; math education; STEM education; mathematical modelling; real-life contexts

1. Introduction

Mathematical modelling is a central topic in mathematics and science education. It provides a bridge between mathematical concepts and their real-world applications. Moreover, it helps pupils understand how mathematical principles can be used to solve problems in various fields of the real world. Engaging in mathematical modelling fosters critical thinking and problem-solving skills [1]. Pupils learn to identify, analyse, and solve complex problems by breaking them down into manageable parts and applying mathematical techniques to find solutions. In particular, modelling can deepen pupils’ understanding of mathematical concepts by applying them to real-world situations. However, modelling as a mathematical activity is not a typical day-by-day routine in mathematics classrooms [2]. Many teachers do not regularly integrate modelling tasks into their lessons [3]. Implementing changes in daily teaching requires adequate concepts of mathematics lessons. Tasks are a fundamental part of mathematical lessons and, thus, have a central position within mathematics education [4], particularly in modelling. In the realm of artificial intelligence, large language models such as ChatGPT, which stands out due to its user-friendly interface and accessibility, demonstrate capabilities with respect to understanding human–AI interactions. These systems have proven to be able to solve elementary and moderately complex mathematical problems. However, preliminary studies indicate that ChatGPT faces significant challenges when addressing more advanced topics in mathematics education. In any case, such tools may influence how we teach and learn mathematics. We must enhance our knowledge of their mathematical capabilities to understand their impact on mathematics education. In this work, we investigate their mathematical modelling capabilities.
We provide a review of the theoretical aspects of mathematical modelling in mathematics education and the methods for assessing mathematical modelling competence. This is followed by an overview of the use of AI in mathematics and STEM education. Our study investigates how the modelling competencies of GPTs are influenced by the depth of mathematical content and to what extent mathematical modelling adds a distinct complexity beyond mathematical depth.

In particular, we focus on these research questions:

Q1: How do GPT-3.5 and GPT-4.0 compare in their ability to solve mathematical modelling tasks?

Q2: What are the specific strengths and weaknesses of GPT-3.5 and GPT-4.0 in mathematical modelling?

Q3: How have the mathematical modelling competencies of GPT evolved from earlier versions to the latest, particularly in terms of solving tasks with varying degrees of openness and complexity?

2. Theoretical Background of Mathematical Modelling

2.1. Notions and Concepts

Modelling refers to solving a problem from the real world. A model is a simplified representation of reality; it has a specific intention and only considers some aspects of reality [5]. Modelling means understanding a realistic problem, setting up a model of the problem, and finding a solution by working on the model mathematically. Whilst some perspectives [6,7] focus on pure mathematical contexts, other perspectives highlight the necessity to use realistic questions [8,9]. The modelling process [10] begins with a reality-based problem. When trying to understand the problem, a mental model of the situation is set up (situation model). In the next step, the problem is consciously simplified, structured, and idealised (real model). The mathematising of the real model leads to a mathematical model. By working mathematically within the model, a mathematical solution can be found. This solution has to be interpreted and validated. Finally, the whole process needs to be presented. If the solution or the chosen process proves inappropriate for reality, particular steps or the entire modelling process must be worked through again.

2.2. Aims of Modelling

There are various possible aims of modelling. Maaß [11] distinguishes the following: (1) methodological aims, focused on developing students’ competencies in applying mathematics to simple and complex real-world situations, (2) culture-related aims, focused on giving comprehensive insights into mathematics as a science and its value for our culture and society, (3) pragmatic aims, focused on understanding real-world situations and dealing with them, (4) psychological aims, focused on developing positive attitudes towards mathematics and supporting the memorisation and understanding of mathematical content, and (5) pedagogical aims, focused on developing problem-solving, reasoning, and creativity competencies.

Kaiser and Sriraman [12] distinguish (1) the realistic perspective, which emphasises the ability to solve real-world problems, an understanding of the world, and the development of modelling competencies, (2) the educational perspective, which aims to structure mathematical learning processes and the learning of mathematical concepts, and (3) the social–critical perspective, which emphasises the critical understanding of the surrounding world.

English [13] defines modelling as “a powerful vehicle for bringing features of 21st-century problems into the mathematics classroom” (p. 362). Maaß et al. [14,15] combine modelling with socially relevant problems that involve ethical, moral, social, or cultural aspects. The aim is to critically examine how mathematical modelling can affect decisions with respect to socially relevant problems, including mathematical activities’ respective possibilities and limitations.

2.3. Modelling in Mathematics Classrooms

There has been a demand for integrating modelling into mathematics lessons for quite some time [16]. This has led to a comprehensive list of objectives linked to modelling implementation. This list includes, among others, the following objectives [10,17,18]:

The students should be able to apply mathematics daily and professionally.
Mathematics is supposed to help students understand their world and critically view mathematical information in the sense of active citizenship.
The students are to develop problem-solving competencies. They are also expected to deal with situations utterly unfamiliar to them and communicate with the help of mathematics.
Students should have insights into the usefulness of mathematics.
Modelling will help students understand and memorise mathematical content more easily.
Modelling tasks are supposed to help students gain a more positive attitude towards mathematics.

2.4. Modelling Competencies

Modelling competencies can be regarded as a complex construct containing various sub-competencies. For example, Blomhøj and Jensen [19] differentiate between the parts of the modelling process to be carried out, the mathematics to be used, and the context in which students have to work.

Maaß [11] defines sub-competencies that do not belong to a specific modelling step but are needed throughout the process. She proposes the following definition (p. 173): modelling competencies include the abilities and skills required to conduct the processes adequately and in a goal-oriented way. In detail, modelling competencies contain the following: (1) competencies for carrying out the single steps of the modelling process, (2) metacognitive modelling competencies, (3) competencies for reasoning about the modelling process.

To acquire this kind of modelling competencies, it is essential to consider tasks that require the whole modelling process and tasks in which only single steps need to be carried out [19,20]. For example, word problems are texts in which a familiar situation is described and a quantitative question is posed that can be solved with the help of mathematics. The information needed to solve the task is given in the text or can be found elsewhere [21].

2.5. Classification of Tasks

There are various classifications of tasks, each focusing on different aspects. Burkhardt [22] distinguishes between illustrations of mathematical content and realistic situations to be worked on, as well as between situations where standard models can be used and situations requiring the development of new models. Kaiser [23] differentiates between embedded word problems, illustrations of mathematical concepts, applications of standard algorithms to solve reality-related problems, and modelling, which involves complex problem-solving processes. These classifications consider complexity, didactic intentions, and the relationship to reality. Tasks in the OECD/PISA framework [24] are distinguished based on the area from which the context is taken, such as personal, educational/occupational, public, and scientific. This classification considers the distance between the situation and the student’s world. Franke [25] distinguishes different ways in which tasks are represented, such as using text, pictures, text and pictures, materials, or real situations. This classification is particularly relevant in primary school settings. Büchter and Leuders [26] differentiate tasks for learning (exploring, systematising, practising) and tasks for assessment (formative assessment, summative assessment, experiencing competence), considering openness and the possibility to differentiate between different levels of student performance. Bruder [27] categorises tasks by analysing the initial conditions, the transformations required to reach a conclusion, and the final conditions. This framework evaluates tasks based on their degree of openness and complexity. The COACTIV project comprehensively classifies tasks based on a mathematical frame, cognitive demands, cognitive elements of the modelling cycle, and solutions [28]. These elements are essential for understanding the nature and complexity of tasks. 
Blomhøj and Jensen [29] and Brand [30] identify two types of tasks: holistic tasks, which require navigating through the entire modelling cycle, and atomistic tasks, which focus on specific modelling competencies. Holistic tasks necessitate learners exhibiting skills across all stages of modelling, providing a comprehensive evaluation of their ability to generate plausible solutions to real-world challenges. These possible classifications provide insights into the nature of tasks and can inform the development of a comprehensive classification scheme for modelling tasks.

2.6. Classification Scheme for Modelling

Maaß [31] proposes a classification scheme of modelling which is based on several central questions, for example, the following (p. 295):

Which modelling activities must be carried out?
Which data are provided?
What is the nature of the relationship between the context and reality?
Which type of representation is chosen?
How open is the task?

Within this classification scheme, the following characteristics of modelling are described (among others): modelling activity, data, relationship to reality, representation, and openness. This paper will use this classification scheme to analyse ChatGPT’s solutions.

2.7. Modelling Activity

Based on Blum and Leiß [32], a modelling cycle requires the following activities [33]:

(1) Understanding the task and the real situation: this activity includes thinking and reflecting on the given problem, thoroughly assessing the situation, and designing a mental model.

(2) Making assumptions and simplifying the situation model: here, the focus is on analysing the given situation to separate essential and relevant information from non-relevant information.

(3) Mathematising and developing the real model: this activity involves designing and formulating a mathematical model to analyse the situation.

(4) Working within the mathematical model and finding a solution: this step is about working mathematically. This activity can require mathematical operations and changes to a given mathematical model.

(5) Interpreting the solution: this activity focuses on interpreting the found mathematical results. A mathematical finding or representation needs to be interpreted to explain its meaning within the specific real context.

(6) Validating the solution: the given model or solution is validated here. This activity includes comparing different models and solutions, reflecting critically on the mathematical work used, and giving reasons for its appraisal.

(7) Communicating the solution: this activity comprises various ways of presenting and communicating the task’s solution. This allows for an external assessment and critical discussion of the proposed solution.

2.8. Data

The classification scheme distinguishes the following variants of data provided in a modelling task [31]:

Superfluous data: this kind of task contains more data than needed. Relevant and irrelevant data need to be distinguished by the pupils.
Missing data: this kind of task contains less data than needed. A possible solution needs additional information or estimated variables. This kind of task might have several different solutions.
Missing and superfluous data: this kind of task does not contain all the data needed to find a solution; simultaneously, it contains superfluous data.
Inconsistent data: this kind of task contains data that contradict one another.
Matching data: this kind of task contains exactly the information needed to find a solution.
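
The distinction between these variants can be illustrated with a minimal sketch that compares the quantities a task states with the quantities a solution needs. The function name `classify_data`, the set-based comparison, and the example quantities are assumptions made for illustration and are not part of the classification scheme in [31]; the inconsistent variant cannot be derived from such a comparison, so it is passed in as a flag set by the person coding the task.

```python
from enum import Enum


class DataVariant(Enum):
    SUPERFLUOUS = "superfluous data"
    MISSING = "missing data"
    MISSING_AND_SUPERFLUOUS = "missing and superfluous data"
    INCONSISTENT = "inconsistent data"
    MATCHING = "matching data"


def classify_data(provided, needed, inconsistent=False):
    """Return the data variant of a task.

    provided: set of quantities stated in the task text
    needed: set of quantities a solution requires
    inconsistent: set by the coder; it cannot be derived from the two sets
    """
    if inconsistent:
        return DataVariant.INCONSISTENT
    missing = set(needed) - set(provided)
    superfluous = set(provided) - set(needed)
    if missing and superfluous:
        return DataVariant.MISSING_AND_SUPERFLUOUS
    if missing:
        return DataVariant.MISSING
    if superfluous:
        return DataVariant.SUPERFLUOUS
    return DataVariant.MATCHING


# Hypothetical trip-cost task: fuel consumption must be assumed by the solver.
print(classify_data({"distance", "fuel price"},
                    {"distance", "fuel price", "fuel consumption"}))
```

In this hypothetical example the task is classified as having missing data, because one needed quantity has to be estimated.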

2.9. Relationship to Reality

Modelling tasks can vary with respect to their relationship to reality. For example [31]:

Authentic tasks: these tasks focus on context-related topics and pose relevant questions. A task can be considered to be authentic when the data themselves, the way in which the task is presented, and the question itself are authentic or if the task is simulated in a mathematics classroom situation.
Realistic tasks: in this case, tasks are close to reality, while the data or the question are not necessarily authentic. Even if the data may have a realistic meaning, they might have been artificially constructed. Vice versa, the data of such tasks can be authentic, while the question is not.
Embedded tasks: here, the proposed situation embeds the mathematical topics. Thus, it is not necessary to reflect on the particular context.
Artificial tasks: the proposed situation is intentionally artificial.
Fantasy tasks: in this case, the task context is inspired by a fantasy world; this might be highly appealing, particularly in primary or elementary classrooms.

2.10. Representation

There are various ways in which modelling tasks can be presented [31]:

Text: this kind of task is presented as pure text.
Pictures: this kind of task consists of images or photographs only.
Text and pictures: pictures and photographs can illustrate the provided text and support the pupils in making a connection between the task and the reality.
Materials: various artefacts, such as documents, newspaper articles, or radio broadcasts, can support the mental visualisation of a proposed problem or a situation.
Situations: this kind of task uses real situations to be explored mathematically.

2.11. Openness

Modelling tasks can be differentiated regarding their grade of openness [31]:

Solved example: the proposed task has already been solved and can serve as an example.
Ascertaining task: the initial situation is given in this task; the subsequent transformations must be performed.
Reversal task: the end situation and the transformation are given; the initial situation is missing and needs to be found.
Ascertaining problem: the initial situation is given; the transformation and end situation are unknown. This is a situation typical of modelling tasks.
Reversal problem: the end situation is given, while the rest is unknown. These are modelling tasks where an aim is given.
Finding a situation: a mathematical tool, or any transformation, is given; the situation where this tool can be used must be found.
Open problem: in this case, the initial situation, the transformation, and the end situation are not given.
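
The characteristics described in Sections 2.8 to 2.11 can be recorded per task in a small data structure. The following is a minimal sketch, assuming one value per characteristic; the class name `TaskProfile`, the enum names, and the example task are hypothetical and serve only to show how a task could be profiled with the scheme.

```python
from dataclasses import dataclass
from enum import Enum

# One enum per characteristic; member names follow the lists in Sections 2.8 to 2.11.
Data = Enum("Data", "SUPERFLUOUS MISSING MISSING_AND_SUPERFLUOUS INCONSISTENT MATCHING")
Reality = Enum("Reality", "AUTHENTIC REALISTIC EMBEDDED ARTIFICIAL FANTASY")
Representation = Enum("Representation", "TEXT PICTURES TEXT_AND_PICTURES MATERIALS SITUATIONS")
Openness = Enum(
    "Openness",
    "SOLVED_EXAMPLE ASCERTAINING_TASK REVERSAL_TASK ASCERTAINING_PROBLEM "
    "REVERSAL_PROBLEM FINDING_A_SITUATION OPEN_PROBLEM",
)


@dataclass(frozen=True)
class TaskProfile:
    name: str
    data: Data
    reality: Reality
    representation: Representation
    openness: Openness


# Hypothetical task, not one of the tasks analysed in the study.
example = TaskProfile(
    name="fuel cost of a car trip (hypothetical)",
    data=Data.MISSING,
    reality=Reality.REALISTIC,
    representation=Representation.TEXT,
    openness=Openness.ASCERTAINING_PROBLEM,
)
print(example)
```

Keeping each characteristic as a separate enum makes it explicit that the dimensions are independent: a text-only task can be open or closed, and a realistic task can have matching or missing data.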

3. Theoretical Background of AI

3.1. Background

AI tools are becoming increasingly prominent in education owing to the advancement of digitalization. AI’s potential to change how we learn and teach is significant, although claims of “unlimited” possibilities may be overstated [34]. Even before the rise of GPT, AI significantly impacted education, with various tools available for personalised learning, such as intelligent tutoring systems, automated assessment, and enhanced communication [35]. AI-supported mathematics education has the potential to improve the learning outcomes by addressing diversity in the classroom and reducing the workload of teachers by taking on administrative tasks, among other things. However, the limited proficiency of teachers when it comes to using AI tools may constrain its impact [34].

ChatGPT, which was introduced in 2022 with Version 3.5, has brought about a paradigm shift in the field of mathematics education [36]. It is a chatbot belonging to the class of large language models, i.e., artificial neural networks designed to understand and generate human-like text [37]. The acronym GPT stands for Generative Pre-trained Transformer, which refers to a semi-supervised pre-trained [38] artificial neural network [39] that is capable of understanding and generating new data, such as texts or images [37], and is built on the work by Vaswani et al. [40].

At the moment, there are two easily accessible versions of ChatGPT: 3.5 and 4.0. Version 3.5 is freely available and comparatively fast, although it is limited in its capacity to process data and only reacts to direct prompts. GPT-4.0 seems to be technologically more advanced than 3.5, but specific data are not openly available at the moment. GPT-4.0 is slower than 3.5, but it can process more extensive data and accepts not only direct prompts but also data in standard file formats (such as pdf, xlsx, etc.). GPT-4.0 can also be personalised: its sources may be restricted and its behaviour may be controlled by direct instructions.

The impact of GPT or other large language models on education is a very active field of research with several potential applications. These include, among others, personalised assessments, wherein models adapt the testing to individual students’ needs [41], writing assistants that aid students in enhancing their composition skills [42], and individualised learning, which tailors both educational content and pacing to each learner’s unique abilities and knowledge gaps [41].

Alongside these potential applications, several concerns have been raised, including cheating [41], the propagation of misconceptions, and plagiarism [43]. Critics also argue that these tools may reduce student autonomy, lack emotional engagement, fail to reflect on student interactions, diminish originality, and impede the development of competencies [43]. Discussions about banning GPT and similar tools from education and academia have already begun [44]. However, to determine whether GPT is an enrichment or a danger to education, we must fully understand its academic capabilities.

In this work, we have focused on ChatGPT instead of similar tools because it is easily accessible. Our experience indicates that other LLMs do not provide significantly better results, although we acknowledge that this estimation is not based on empirical data; more and detailed research on this matter seems to be highly promising and needed.

3.2. Mathematical Performance of GPT

From Version 3.5 on, GPT can compute integrals, perform linear algebra, and solve partial differential equations when given appropriate instructions [45]. However, this capability does not guarantee that GPT can independently solve mathematical problems. The scientific abilities of GPT are the subject of extensive research. OpenAI has evaluated GPT’s mathematical performance, among other aspects, by administering Math SAT tests. GPT-3.5 scored in the 70th percentile, while GPT-4.0 reached the 89th percentile [38]. This performance is slightly below the 25th percentile for first-year Ivy League students, who have an average score of 730. It is unknown whether SAT tests were included in GPT’s training data, so these results should be interpreted with caution. However, the study by Korkmaz Guler et al. [46] yielded similarly promising results, demonstrating that GPT could solve most tasks from national math tests, although not always on the first attempt and often requiring explicit instructions to do so.

Plevris et al. [47] assessed GPT’s and Google Bard’s performance in problem-solving across various difficulty levels. They reported that both produce reliable results only with respect to straightforward calculations and basic logical tasks. Detailed analyses conducted independently by Dao and Le [48] and Wardat et al. [49] investigated GPT’s problem-solving skills. While acknowledging GPT’s capabilities as a powerful tool, they also found that its performance significantly depends on the complexity of the task and the specific mathematical area, typically underperforming compared to Vietnamese students taking the same test [48].

At the higher education level, Frieder et al. [50] investigated GPT’s performance on tasks from prominent first-year college mathematics textbooks, its ability to identify errors in proofs, and its performance on exercises at the math Olympiad level. While their results were positively surprising, they concluded that GPT is “inconsistently bad at advanced mathematics” [50] (p. 9).

The performance of GPT seems to be highly dependent on the mathematical complexity of the exercises and the mathematical domain. Yet, the specific type of task has not been explicitly considered in these studies. When addressing word problems, both Shakarian et al. [51] and Zong and Krishnamachari [52] demonstrated that GPT can be highly successful in converting text-based tasks into mathematical problems and solve them accordingly, but the authors did not seem to vary the complexity of the task systematically.

It is also worth mentioning that LLMs’ and GPT’s responses may be biased by the dataset that was used for training. For instance, GPT is reported to have, at least in some cases, political left-leaning tendencies [53], and LLMs have gender stereotypes [54]. While these biases may or may not be directly relevant to mathematics education, the composition of the training set could directly influence GPT’s mathematical capabilities. The representation (or lack thereof) of certain mathematical areas such as algebra, analysis, topology, types of exercises, and certain levels of difficulty in the training set may impact its knowledge and proficiency. The training data could also affect the methods GPT employs to solve problems, such as relying on specific numerical methods for solving differential equations or biasing its suggestions towards specific digital tools for problem-solving. However, such biases are not known to the authors.

In summary, GPT is a tool that, to a certain extent, can solve mathematical problems across multiple levels. Those examining the differences between various GPT versions have usually observed that GPT-4.0 outperforms GPT-3.5; see, for example, [38]. If used appropriately, GPT can therefore be a powerful resource for students to enhance their understanding of mathematics. Regardless, GPT has made its way into the classroom and, therefore, should be systematically integrated into mathematics education. As pointed out before, perceptions of GPT’s mathematical performance vary depending on the mathematical level, type of exercise, and target groups. Both teachers and students must understand its mathematical limitations and capabilities with respect to different mathematical topics and levels.

3.3. Research Questions

Although large language models like GPT appear inherently suited to “understand” word problems, the level of contextual understanding required to solve these problems has not been thoroughly evaluated. Therefore, we pose the following research questions:

Q1: How do GPT-3.5 and GPT-4.0 compare in their ability to solve mathematical modelling tasks?

Q2: What are the specific strengths and weaknesses of GPT-3.5 and GPT-4.0 in mathematical modelling?

Q3: How have the mathematical modelling competencies of GPT evolved from earlier versions to the latest, particularly in terms of solving tasks with varying degrees of openness and complexity?

4. Method

4.1. Selection of Tasks

We selected five modelling tasks for analysis by GPT. The following section describes the reasons for the selection and, thus, the characteristics of the tasks. As part of the DISUM project, Schukajlow et al. [55] found that, although modelling skills were more strongly correlated with reading skills than internal mathematical skills, this correlation was weak and insignificant. Although no significant correlation has been found, it can be assumed that reading skills influence modelling skills [56]. Tasks with little text were, therefore, selected.

Moreover, this study intentionally confined its modelling tasks to those necessitating only basic arithmetic operations. This constraint was imposed not to assess the functional knowledge or formulaic proficiency of GPT, but to ensure that the focus remained squarely on evaluating modelling competencies without the confounding variable of factual knowledge recall.

The adoption of holistic modelling tasks was a strategic choice informed by the work of Blomhøj and Jensen [29] and Brand [30], both of whom differentiated between holistic tasks and atomistic tasks (see Section 2.5, “Classification of Tasks”, above). This approach aims to evaluate the overarching modelling competencies of participants despite the inherent challenges, such as validating real-world results and the necessity for context-specific knowledge.

To navigate the complex landscape of authentic real-world and modelling tasks and the plethora of existing classifications, this study drew upon the classification scheme proposed by Maaß [31] (see also Section 2.6, “Classification Scheme for Modelling”, above). This scheme aids the organization of the diverse characteristics of modelling tasks into a coherent framework, albeit with a selective focus on attributes most pertinent to our research aims.

The criteria for task selection, therefore, were based on several key considerations: the clarity and simplicity of the task language, the varying difficulty levels of tasks, the presentation of tasks within a credible, real-world context, the requirement for tasks to encompass the entire modelling cycle, and the diversity in the provision of information needed for task resolution. Table 1 presents the characteristics of the five chosen tasks for further analysis.

The characteristics in Table 1 indicate that tasks 1, 3, and 4 exhibited increasing complexity. In task 1, only the required data are available, solely involving the computation (transformation) of data into a final result within a contextual framework. Here, making assumptions or providing justifications for decisions in the modelling process is unnecessary. In contrast, tasks 3 and 4 require formulating assumptions and validations within the context to make well-founded statements. In these tasks, less specific data are available compared to what is needed (e.g., assumptions about fuel consumption per 100 km (task 3)). These tasks are also classified as ascertaining problems, as the transformation process (assumptions, calculation sequences) is not predetermined but developed during the modelling process. Consequently, the final outcome is not predetermined, but it depends on the assumptions made during the transformation. Task 4 is even more complex than task 3, as it necessitates a more significant number of assumptions and hence decisions.

Appendix A lists the five tasks; for three tasks, their translation in English is available (tasks 1, 2, and 3).

4.2. Process for ChatGPT Solutions

The process for generating the solutions to the tasks in the different GPT versions followed the steps presented below. The goal was to phrase the instructions so that GPT could directly address the question within the model, without leading prompts steering the resolution of the task towards the correct solution. To gauge how GPT responds to instructions in modelling tasks, various modelling assignments were presented to GPT for resolution. The tasks in this phase extended beyond the five selected assignments from Table 1 without any restrictions regarding the content or context of the task. Additionally, we posed follow-up questions to the solutions to gain a deeper understanding of them.

Our investigation revealed that the simple directive “solve the following task” proved insufficient, as it predominantly elicited mere calculations without substantive justifications or validation. In response, we designed follow-up questions to assess how GPT responds when prompted for reasoning. However, we meticulously avoided queries that could bias the content directionally, such as incorporating environmental considerations in a fuel consumption task (e.g., task 3). Instead, we focused on neutral inquiries like “Verify your calculation”, “Give a validation for the task”, or “Are there alternative solutions?”.

This approach taught us that instructions such as “Solve and check your results” minimized external influences. Consequently, we adopted these formulations to craft solutions for our further analyses. This strategy ensured that the responses remained focused on the logic and integrity of the reasoning process rather than deviating into subjective interpretations or unrelated areas. To eventually create solutions to the five tasks from Table 1 (see also Appendix A), which were used for assessing modelling competencies, we agreed on the following formulations for GPT-3.5 and GPT-4.0: “Solve the following task and calculate concrete results. Check your results. Describe your solution steps”. The originally formulated task followed this instruction. The solution provided formed the basis for assessment in the subsequent sections. No further inquiries were made regarding the solutions to ensure the process was as standardized as possible.
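
The standardized prompt described above can be sketched as a fixed instruction followed by the unchanged task text. The instruction string is quoted from the paragraph above; the function name `build_prompt`, the blank line used as a separator, and the placeholder task text are assumptions, since the exact layout of the prompt is not specified here.

```python
# Fixed instruction, quoted from the text above.
INSTRUCTION = (
    "Solve the following task and calculate concrete results. "
    "Check your results. Describe your solution steps."
)


def build_prompt(task_text):
    """Prepend the fixed instruction to the originally formulated task.

    The blank line between instruction and task is an assumption.
    """
    return f"{INSTRUCTION}\n\n{task_text.strip()}"


# Placeholder text; the actual tasks are listed in Appendix A.
print(build_prompt("<task text as originally formulated>"))
```

Using the same instruction for every task and every model version keeps the prompt constant, so that differences between solutions can be attributed to the task or the model rather than to the wording of the request.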

In addition to GPT-3.5 and GPT-4.0, we created a GPT Math Modeller (GPT-MM), which was given more detailed background knowledge about mathematical modelling and could, therefore, be expected to produce more satisfactory results. GPT-MM was instructed to base its approach on foundational literature and to follow the procedures outlined in the article by Greefrath and Maaß [33]. Apart from this, GPT-MM received the same instructions as GPT-3.5 and GPT-4.0. This approach aimed to simulate a more “mathematically educated” GPT, comparable to a student who has learned the fundamentals of mathematical modelling. The article by Greefrath and Maaß [33] was selected because it provides clear, practicable instructions and definitions for each competency and step in the modelling process, and it uses focused language. The concept of mathematical modelling used in this article is standard and widely accepted; it agrees, for example, with [2]. Appendix A documents the three solutions (GPT-3.5, GPT-4.0, and GPT-MM) to each of the tasks.

4.3. Qualitative Analyses

We conducted qualitative content analysis [57,58] using a deductive approach to organise and analyse the solutions of GPT based on pre-existing theory or concepts. This approach started with a theoretical framework or hypothesis guiding data categorisation into specific themes or categories. To assess and, therefore, categorise modelling competencies, we referred to Maaß [59,60], who makes suggestions for both primary and secondary education. Here, she states that evaluation criteria are often proposed to assist teachers not only in considering the purely mathematical aspects of modelling, but also in addressing all aspects of modelling. As we wanted to evaluate all aspects of modelling in the solutions of GPT, we took the latest list of criteria for written student solutions to modelling tasks for secondary education ([33]; see also chapter “Modelling Activity” above), which is summarised in Table 2. The deductive method allows for a structured and theory-driven analysis, ensuring that the data interpretation is closely aligned with established theories. We used the following categories to rate each competence in the modelling cycle: 0—not done, 1—done, but mathematically incorrect, 2—done, but incomplete, 3—done and, from the coders’ point of view, sufficient, and 4—not necessary due to the task.

Three coders independently evaluated the solutions with respect to the three research questions. After this independent analysis, they met to merge their ratings in detail, ensuring consistency and communicative validation as suggested by Kvale [61]. During this process, ratings that were labelled differently but had identical meanings were combined into one rating.

Several methods were employed to gauge the reliability and agreement among the coders. To assess agreement concerning the coded solutions, we explored four different statistical measures: Gwet’s AC [62], Krippendorff’s α [63], Conger’s κ [64], and simple percentage agreement [65]. These methods effectively evaluate data coded by multiple individuals across various categories. The calculations for intercoder reliability were performed using the R programming language [66] and the irrCAC package [67].
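To make two of these measures concrete, the following sketch computes simple percentage agreement and Gwet's AC1 for nominal codes given by several coders. It is an illustration under stated assumptions, not a replacement for the irrCAC package used in the study: the function names are hypothetical, every item is assumed to be rated by the same number of coders with no missing values, and the example ratings are invented for demonstration.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(ratings):
    """Mean share of agreeing coder pairs per rated item."""
    shares = []
    for item in ratings:
        pairs = list(combinations(item, 2))
        shares.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(shares) / len(shares)


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for nominal codes (same number of coders per item assumed)."""
    n_items = len(ratings)
    n_coders = len(ratings[0])
    q = len(categories)
    p_a = percent_agreement(ratings)  # observed agreement
    counts = Counter(code for item in ratings for code in item)
    # Share of all ratings that fall into each category.
    pi = [counts[c] / (n_items * n_coders) for c in categories]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)  # chance agreement
    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Invented ratings on the 0-4 scale: one tuple per item, one entry per coder.
    example = [(3, 3, 3), (2, 2, 3), (0, 0, 0), (1, 2, 3)]
    print(round(percent_agreement(example), 3))
    print(round(gwet_ac1(example, categories=[0, 1, 2, 3, 4]), 3))
```

Whether the number of categories is taken from the full rating scale (as here) or only from the codes that actually occur is a design choice that changes the chance-agreement term, so results from such a sketch need not match those of a statistics package with different defaults.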

5. Results

5.1. GPTs’ Solutions

As mentioned above, Appendix A lists the five tasks and the solutions. In the following sections, we cite parts of the solutions for argumentation purposes. In Appendix A, screenshots show the unchanged solutions.

5.2. Rating of Solutions

The rating of the solutions offered by the three versions (GPT-3.5, GPT-4.0, and GPT-MM) led to 360 (three coders × three versions × five tasks × eight competences of the modelling cycle) rating decisions. We synthesised the analyses of the three independent coders in the coders’ meeting. Table 3 summarises the synthesised ratings for the solutions by GPT-3.5. Table 4 does the same with respect to GPT-4.0 and Table 5 with respect to GPT-MM. For most tasks, the three coders agreed on the categories of the competences of GPT when solving the modelling tasks. Examination of intercoder agreement of the five tasks yielded coefficients ranging from -0.057 to 0.667 for GPT-3.5, from 0.143 to 0.917 for GPT-4.0, and from 0.110 to 0.833 for GPT-MM, indicating substantial agreement among the coders, especially with respect to the solutions by GPT-4.0 [68]. Table 6 lists the coefficients in detail. Here, we see that some of the coefficients were not significant and showed low values, especially Krippendorff’s α and Conger’s κ. This result can be traced back to a discussed problem of the coefficient Krippendorff’s α, where “for nominal codings with uneven marginal distributions, Krippendorff’s α may be used only when the coding task is very difficult” [69] (p. 20). Due to the uneven distributions of our categories, we must deal with this weakness of Krippendorff’s α. To best address the shortcomings of the individual coefficients, four coefficients are always reported so that the interrater reliability results can be analysed in the best possible way. The first part of Table 6 (GPT-3.5) has the lowest values, which can be traced back to the GPT-3.5 solutions. The solutions are very general and not very precise, making categorisation difficult. The content of the solutions is interpreted in the next section.

5.3. Interpretation of Solutions

In the remaining part of this section, we describe the three different solutions offered by the GPT versions for each task. We look at similarities and differences among the solutions. Reference is made to the individual competences in the modelling cycle. Note that the solutions to tasks 1, 2, and 3 were initially phrased in German, and we, therefore, translated them into English.

5.3.1. Task 1: Rubber Ball

Task 1 has a “constructed context”. While physically unrealistic, it does not contain any superfluous or missing information. The mathematical background is self-contained and can be directly formalized. The task is considered to be a closed task with a low level of modelling complexity.

All three versions of GPT began by mathematically modelling the situation, describing the maximum height of the rubber ball before and after the first bounce and subsequent bounces. They precisely detailed how much the ball fell and ascended before and after each bounce:

“To solve this problem, we need to understand that each time the rubber ball touches the ground, it reaches half of its previous height. When it first falls, it reaches a height of 18 m, then it bounces back to half of that height, which is 9 m. On the second bounce, it falls from 9 m, then bounces back to half of that, which is 4.5 m. On the third bounce, it falls from 4.5 m, then bounces back to half of that, which is 2.25 m. This process continues.”
GPT-3.5

While both GPT-4 variants (plain and math modeller) correctly calculated the total distance by summing the falls and ascents up to the third bounce, GPT-3.5 employed a geometric series to calculate the total distance, resulting in an incorrect outcome. All versions then reported the calculated results to answer the question without considering its real-world context. It may seem surprising that GPT-3.5 could not solve this straightforward task. The incorrect mathematical approach adopted by GPT-3.5 might stem from a misinterpretation of the statement that the height is halved after each bounce.
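The finite sum of falls and ascents can be checked with a few lines of code. The sketch below is an illustration only; the function name and parameters are hypothetical, while the numbers (18 m drop, rebound to half the height, third ground contact) come from the task.

```python
def distance_at_touch(drop_height: float, rebound_ratio: float, touches: int) -> float:
    """Total path length when the ball touches the ground for the given time."""
    distance = drop_height          # first fall from the roof
    height = drop_height
    for _ in range(touches - 1):
        height *= rebound_ratio     # rebound height after the bounce
        distance += 2 * height      # way up plus way back down
    return distance


if __name__ == "__main__":
    # 18 m down, 9 m up and down, 4.5 m up and down
    print(distance_at_touch(18, 0.5, 3))  # 45.0
```

The loop stops at the third ground contact, which is the point of the task: only a finite number of bounces contributes to the distance.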

5.3.2. Task 3: Refuelling

The refuelling modelling task, developed by Blum [70], addresses whether it is worth travelling 20 km from Trier to Luxembourg to a different gas station to save money. This task is considered to be open, i.e., it lacks necessary information, and has a moderate modelling complexity. At a minimum, one must assume the car’s fuel consumption. Other applicable, but not essential, information might include the time required to travel from Trier to Luxembourg and the size of the fuel tank.

“To check whether the trip is worthwhile for Mr. Stein, we need to calculate the cost of the trip and compare it with the potential savings from the cheaper petrol price in Luxembourg.”
GPT-3.5

All three GPT versions correctly asserted that a cost–benefit analysis should form the basis for deciding whether a trip to Luxembourg for refuelling is worthwhile. However, all three versions considered only monetary factors.

“The amount of petrol required for the trip is $\frac{40\,\text{km}}{15\,\text{km/liter}} \approx 2.67\,\text{liters}$. At the Luxembourg petrol price, the cost of the round trip in terms of fuel is 2.67 × €1.05 = €2.8”
GPT-MM

All three versions addressed the task by realistically assuming gasoline consumption and calculating the costs of the trip from Trier to Luxembourg and back. From this point, the approaches of GPT-3.5 and both GPT-4.0 versions diverged.

“Based on this information, we can calculate the fuel consumption for the 20 km trip:
Consumption = (20 km/100 km) ×6 L/100 km = 1.2 L
Now, we can calculate the fuel costs both in Tier and Luxembourg:
In Tier: 1.1 Euro/liter ×1.2 L = 1.32 Euro
In Luxembourg: 0.85 Euro/liter ×1.2 L = 1.02 Euro
The difference between the costs in Tier and Luxembourg is:
Savings = Costs in Tier − Costs in Luxembourg
= 1.32 Euro − 1.02 Euro
= 0.3 Euro”
GPT-3.5

GPT-3.5 calculated the fuel consumption required for a 20 km trip and then incorrectly concluded that the trip to Luxembourg would be worthwhile because 1.2 L of gasoline is cheaper in Luxembourg than in Trier.

“To determine if the trip is worth it, we need to calculate the break-even point, where the savings on petrol equal the cost of the journey. Let x be the liters of petrol needed to break even.
0.15x = 2.80
$x = \frac{2.80}{0.15} \approx 18.67\,\text{liters}$”
GPT-MM

GPT-4.0 and GPT-MM both calculated the amount of gasoline required so that the savings would offset the cost of travelling back and forth, and they considered whether the amount of refuelling needed in Luxembourg to achieve this was realistic. Aside from a forgotten unit in the middle of an equation, both solutions were correct, and their presentations were understandable.
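The break-even reasoning quoted above can be retraced as follows. This is a sketch under assumptions: the function name and parameter names are hypothetical, the consumption of 15 km per liter and the Luxembourg price of €1.05 are taken from the quoted GPT-MM solution, and the home price of €1.20 is inferred from the saving of €0.15 per liter used in the quoted equation.

```python
def break_even_litres(round_trip_km, km_per_litre, price_home, price_abroad):
    """Litres that must be bought abroad so that the saving pays for the trip."""
    trip_litres = round_trip_km / km_per_litre   # fuel burnt on the way there and back
    trip_cost = trip_litres * price_abroad       # valued at the cheaper price, as in the quote
    saving_per_litre = price_home - price_abroad
    return trip_cost / saving_per_litre


if __name__ == "__main__":
    # 1.20 is an inferred value (1.05 + 0.15), not stated in the quoted solution.
    litres = break_even_litres(40, 15, price_home=1.20, price_abroad=1.05)
    print(round(litres, 2))
```

Because the assumed consumption and the prices are explicit parameters, the same function can be used to see how strongly the answer depends on them, which is the kind of reflection the task invites.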

In summary, GPT-3.5 not only produced an incorrect result, but its entire approach was flawed, and it seemingly did not “grasp” the situation in detail. In contrast, both GPT-4.0 and GPT-MM produced correct results with reasonable explanations. However, their solutions lacked depth. Ecological and time factors were not considered. The only reflection was that, under realistic conditions, the assumptions might not be entirely correct. Still, the methodology and implications were not questioned, even though the models were instructed to do so.

5.3.3. Task 4: Water Supply in Traffic Jam

Task 4 addresses how long the water supply needs to last in a long traffic jam on a highway (20 km) during a hot summer day. This task is considered to be an open task with superfluous information and a moderate-to-high modelling complexity, as it requires the modelling of several different aspects and their combination to solve the task.

The solutions offered by the GPT-3.5 and GPT-4 variants differed significantly. GPT-3.5 neither filtered out irrelevant information nor made reasonable assumptions for the missing information.

“Let’s denote:
D as the distance of the traffic jam (20 km)
t as the time it takes for the truck to supply all people with water
r as the rate at which the truck supplies water (in km/h)
We can set up the equation: r × t = D”
GPT-3.5

GPT-3.5 started by identifying the necessary information. Although the approach of determining the delivery rate $r$ through the equation $r \times t = D$ was correct, the challenge in solving this task arose when determining $t$.

“Given that Christina has been stuck in the traffic jam for 6 h, we know the total time is 6 h. We can set up the equation with this information:
$r \times 6 = 20$
To solve for r, we divide both sides by 6:
$r = \frac{20}{6} \to r = 6$
So, the truck is supplying water at a rate of 3.33 km/h.
Now, to find out how long it will take for the truck to supply all people with water, we can use the same rate:
$r \times t = D$
$3.33 \times t = 20$
[…]
$t \approx 6$ h.
So, it will take approximately 6 h for the truck to supply all people with water.”
GPT-3.5

GPT-3.5 then solved the problem without additional assumptions, simply using D = 20 km and t = 6 h. It arbitrarily took the 6 h as the delivery time, as if this were given in the text, an approach that lacks both meaning and justification and does not address the complexity of the problem. In summary, GPT-3.5 failed to process the data accurately to model the situation correctly. Furthermore, it could not distinguish between relevant and irrelevant information or identify missing information, as it did not incorporate the context correctly.

“The average distance between cars in the jam to estimate the number of cars. How much time it takes for the truck to serve water to each car.”
GPT-4.0

Both GPT-4.0 and GPT-MM adopted a completely different approach from that of GPT-3.5. They started by sorting the information and by making assumptions about car size and the space between cars, albeit somewhat unrealistic ones. They used this information to calculate the number of vehicles in one lane of the traffic jam.

“Traffic jam length = 20 km = 20,000 m
Average distance between cars = 10 m”
GPT-4.0

“Assuming an average car length of 4.5 m (including spacing between cars), we can estimate the number of cars in a 20 km traffic jam.”
GPT-MM

This assumption, in both cases, was unfounded. Nevertheless, both versions then made a plausible assumption regarding the time required to provide water per car and explicitly calculated the total time needed to serve water to every car. However, both versions implicitly assumed that the highway had only one traffic lane, simplifying the problem significantly. In the end, there was little to no reflection on the solution, although some general statements were made.
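The structure of this estimate, and the effect of the single-lane assumption, can be sketched as follows. The function and parameter names are hypothetical. The jam length of 20,000 m and the 10 m per car are taken from the quoted GPT-4.0 solution; the 30 s of service time per car is an invented value for illustration, since the figure used by GPT is not quoted here.

```python
def supply_time_hours(jam_length_m, metres_per_car, lanes, seconds_per_car):
    """Estimated time for one truck to serve every car in the jam."""
    cars = jam_length_m / metres_per_car * lanes  # cars across all lanes
    return cars * seconds_per_car / 3600          # seconds converted to hours


if __name__ == "__main__":
    for lanes in (1, 2, 3):
        # 30 s per car is an illustrative assumption, not a value from the study.
        hours = supply_time_hours(20_000, 10, lanes, seconds_per_car=30)
        print(lanes, round(hours, 1))
```

The estimate scales linearly with the number of lanes, so the implicit single-lane assumption alone can change the result by a factor of two or three.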

5.3.4. Tasks 2 and 5

Tasks 2 and 5 exhibited a very low modelling complexity and differed in their level of openness. All three versions yielded similar results. In task 2, the objective was to determine the price of a 50-entry card based on the cost of a single-entry card or a 20-entry card, respectively. All versions delivered similar outcomes, though GPT-MM was the only one that set the price lower than the price calculated from the 20-entry card. In task 5, the goal was to ascertain how much money could be collected in a shop if customers rounded up the amount to pay to the nearest ten cents. Both tasks received valid solutions from all versions of GPT, albeit with slight variations in detail; again, GPT-4.0 and GPT-MM offered more detailed solutions.

6. Discussion

As we observed, GPT, in all its variants, has at least basic mathematical problem-solving competencies. GPT seems to be able to use several complex mathematical tools to solve mathematical tasks. As explained before, we did not vary the mathematical complexity of the tasks; we varied only the complexity of the modelling contexts. In summary, we observed that, as the complexity of the context and the required modelling competencies increased, all versions of GPT tended to struggle more. Furthermore, the openness of the tasks did not seem to impact GPT’s modelling capabilities significantly. Our observations are consistent with the findings of Wardat et al. [49] and Dao and Le [48], which suggest that the mathematical complexity of tasks and the complexity of the context may influence the outcomes. We can also confirm that GPT-4.0 and its more ‘educated’ version, GPT-MM, produce much more accurate results than GPT-3.5. In fact, GPT-3.5 began to struggle at a very low level of complexity. This finding supports Frieder et al. [50], who maintained that the complexity of a task, mathematical or contextual, significantly influences the performance of GPT. GPT-4.0 and GPT-MM appeared to be much more capable of correctly solving modelling tasks, but their reflections on their solutions were superficial, even at moderate complexity. In moderate-to-complex contexts, the GPT-4.0 variants also did not seem to “comprehend” the context sufficiently. We observed minimal to no differences between GPT-4.0 and GPT-MM. However, a more sophisticated prompt engineering approach or different background information might have led to more significant differences.

The implications for mathematics teaching vary depending on the level and type of education. Some possible implications are:

Students can use GPT to solve simple modelling tasks. GPT models can be effectively integrated into educational settings for simple mathematical tasks and basic problem-solving exercises. Teachers should leverage GPT tools to assist with routine calculations and basic logical tasks, freeing up classroom time for more complex problem-solving activities that require human guidance and a deeper understanding.

The solutions often sounded professional and well thought out, although they were frequently, at best, incomplete or even incorrect. The tendency to use GPT models for solving even moderately complex problems may undermine the development of students’ critical thinking and independent problem-solving skills. There is a risk that students may become overly reliant on AI tools, potentially hindering their ability to tackle challenging tasks without technological assistance. Thus, teachers need training to understand the capabilities and limitations of GPT models. This will enable them to incorporate these tools effectively into their teaching strategies, ensuring that students benefit from AI assistance without becoming overly reliant on it for complex problem-solving.

GPT models should be viewed as supplementary educational tools rather than primary problem-solving resources. They often struggle with the nuanced understanding required for complex modelling tasks. This limitation underscores the broader challenge of AI use in education: the inability of current models to fully grasp contextual subtleties, which are crucial for high-level problem solving and critical thinking. GPT models’ inability to consistently handle complex mathematical modelling tasks means that they should support, not replace, traditional teaching methods and human oversight in advanced mathematical contexts.

GPT’s struggle to comprehend contexts in mathematical modelling tasks can also be seen as an opportunity for math educators to teach how AI, particularly large language models, works. A detailed error analysis of GPT’s mistakes may illustrate how GPT operates. Certain words or phrases may trigger GPT to use inappropriate methods because these methods are more commonly associated with those words (see, for example, GPT-3.5’s error in the ball task).

Moreover, the limitations of GPT models in complex scenarios highlight the continued importance of developing students’ critical thinking and independent problem-solving skills. Relying on AI tools for solving simple tasks might inadequately prepare students for real-world challenges that require comprehensive problem-solving abilities. Education systems must ensure that students are not just proficient in using AI tools, but also capable of addressing complex problems independently, a skill which is crucial for their future professional and personal success. Schools should emphasize these competencies, ensuring that students are equipped to tackle everyday problems without overreliance on AI.

7. Limitation

Since this is a qualitative analysis, inherent limitations exist, such as potential biases introduced by personal interpretations. We reduced this by involving at least three independent coders. We did not systematically analyse GPT’s responses, although we employed a qualitative approach to capture a broad scope of the types of answers it could provide. Furthermore, we did not engage in any prompt engineering to enhance GPT’s results, as we aimed to create a situation comparable to that in which a typical student in an everyday school context might find themselves. Consequently, a quantitative analysis based on these findings may yield more reliable results.

8. Conclusions

In conclusion, consistent with the findings of OpenAI et al. [38], GPT models exhibit considerable potential as tools for solving mathematical and STEM-related problems when appropriately instructed. Despite this potential, GPT models, including GPT-3.5, GPT-4.0, and GPT-MM, face significant limitations as general problem-solving tools, particularly in mathematical contexts of varying complexity. The study demonstrated that, while these models can produce accurate results in simple tasks, they struggle significantly with moderately to highly complex modelling problems.

We did not analyse how GPT handles increasingly difficult math problems, but we did vary the complexity of the math modelling tasks. Even in low-to-moderate complexity modelling tasks, all versions of GPT tended to produce unsatisfactory results: problems with authentic contexts or at higher mathematical levels were solved only at a very basic level or even incorrectly.

Consequently, the educational implications suggest that, while GPT can aid students in solving simple tasks, its limitations in handling complex problems necessitate careful integration into educational contexts. Educators should receive training on the effective use of GPT tools to enhance their teaching methods, ensuring students understand the boundaries of these technologies and develop robust problem-solving skills independent of AI assistance.

Author Contributions

Conceptualization, C.S., O.S. and S.Z.; methodology, C.S., O.S., S.Z. and K.M.; software, C.S., O.S. and S.Z.; formal analysis, C.S., O.S. and S.Z.; investigation, C.S., O.S. and S.Z.; writing—original draft preparation, C.S., O.S. and S.Z.; writing—review and editing, C.S., O.S., S.Z. and K.M.; visualization, C.S., O.S. and S.Z.; project administration, C.S., O.S. and S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data is available in Appendix A. The solutions created in ChatGPT are available with screenshots.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Task 1: Rubber Ball

Source: Jordan et al. [71]

Original information in German

Wenn ein Gummiball zu Boden fällt, springt er die Hälfte der Strecke wieder hoch. Der Ball wird von einem 18 Meter hohen Dach fallen gelassen. Welche gesamte Entfernung hat der Ball zurückgelegt, wenn er das dritte Mal den Boden berührt?

Translation into English

When a rubber ball falls to the ground, it bounces halfway up again. The ball is dropped from a roof 18 metres high. What is the total distance travelled by the ball when it touches the ground for the third time?

Solution GPT-3.5 (screenshots: German, English)

Solution GPT-4.0 (screenshots: German, English)

Solution GPT-MM (screenshot: only available in English)

Appendix A.2. Task 2: Ferry Price

Source: Jordan et al. [71]

Original information in German

An der Anlegestelle einer großen Fähre findet sich diese Preistabelle:

Die Fährgesellschaft will eine Blockkarte für 50 Personen einführen. Was wäre dafür ein angemessener Preis? Begründe deine Antwort.

Translation into English

This price table can be found at the landing stage of a large ferry:

The ferry company wants to introduce a block ticket for 50 people. What would be a reasonable price? Give reasons for your answer.

Solution GPT-3.5 (screenshots: German, English)

Solution GPT-4.0 (screenshots: German, English)

Solution GPT-MM (screenshot: only available in English)

Appendix A.3. Task 3: Refuelling

Source: Blum [70]

Original information in German

Herr Stein wohnt in Tier, 20 km von der Grenze zu Luxemburg entfernt. Er fährt mit seinem VW Golf zum Tanken nach Luxemburg, wo sich direkt hinter der Grenze eine Tankstelle befindet. Dort kostet der Liter Benzin nur 0.85 Euro, im Gegensatz zu 1.1 Euro in Tier.

Lohnt sich diese Fahrt für Herrn Stein? Begründe deine Antwort.

Translation into English

Mr Stein lives in Tier, 20 km from the border with Luxembourg. He drives his VW Golf to Luxembourg, where there is a petrol station just over the border, to fill the tank with petrol. There, a litre of petrol costs just 0.85 euros, compared to 1.1 euros in Tier.

Is this journey worthwhile for Mr Stein? Give reasons for your answer.

Solution GPT-3.5 (screenshots: German, English)

Solution GPT-4.0 (screenshots: German, English)

Solution GPT-MM (screenshot: only available in English)

Appendix A.4. Task 4: Water Supply in the Traffic Jam

Source: Maaß & Gurlitt [3]

Original information in English

Traffic jams often occur at the beginning of summer holidays. Christina has been stuck in a 20 km traffic jam for 6 h. It is very warm and she is extremely thirsty. There is a rumour that a small truck is supposed to supply the people with water, but she has not yet received anything. How long will it take for the truck to supply all people with water?

Solution GPT-3.5

Solution GPT-4.0

Solution GPT-MM

Appendix A.5. Task 5: Round Up Please

Source: Flößer [72]

Original information in English

“That is thirty-nine dollars and ninety-seven cents, please!” Oh dear, why always such an odd amount of money? And what should I then do with the 3 cents left?

A typical situation at the checkouts in German supermarkets: you get 1, 2, or 5 cent coins as change and these then fill up your wallet. What is the solution? How about making a donation! The founders of the donation campaign “Germany rounds up” had a brilliant idea: customers can donate the amount of change they get by rounding up the amount they have to pay to the nearest ten by saying ‘round up please’ at the checkout. A donation that does not harm anyone and helps reduce child poverty! The campaign is still active in many retail markets to this day and has already raised a lot of money, as over 8.3 million euros have been donated so far!

But how much money can be collected in an average supermarket in a day?

Solution GPT-3.5

Solution GPT-4.0

Solution GPT-MM

References

Kaiser, G.; Bracke, M.; Göttlich, S.; Kaland, C. Authentic Complex Modelling Problems in Mathematics Education. In Educational Interfaces between Mathematics and Industry; Damlamian, A., Rodrigues, J.F., Sträßer, R., Eds.; New ICMI Study Series; Springer International Publishing: Berlin/Heidelberg, Germany, 2013; Volume 16, pp. 287–297.
Blum, W. ICMI Study 14: Applications and modelling in mathematics education—Discussion document. Educ. Stud. Math. 2002, 51, 149–171.
Maaß, K.; Gurlitt, J. Designing a Teacher Questionnaire to Evaluate Professional Development in Modelling. In Proceedings of the CERME 6, Lyon, France, 28 January–1 February 2009; Available online: http://www.inrp.fr/editions/editions-electroniques/cerme6/ (accessed on 22 June 2024).
Krainer, K. Powerful tasks: A contribution to a high level of acting and reflecting in mathematics instruction. Educ. Stud. Math. 1993, 24, 65–93.
Henn, H.-W. Why Sometimes Cats Fall from the Sky … or … about Good and Bad Models [Warum manchmal Katzen vom Himmel fallen … oder … von Guten und von Schlechten Modellen]. In Model Building, Computers and Mathematics Instruction [Modellbildung Computer und Mathematikunterricht]; Hischer, H., Ed.; Franzbecker: Hildesheim, Germany, 2000; pp. 9–17.
Klieme, E.; Neubrand, M.; Lüdtke, O. Mathematical Basic Education: Test Design and Results [Mathematische Grundbildung: Testkonzeption und Ergebnisse]. In PISA 2000: Basic Competencies of Students in an International Comparison [PISA 2000: Basiskompetenzen von Schülerinnen und Schülern im Internationalen Vergleich]; PISA Consortium, Ed.; VS Verlag für Sozialwissenschaften: Wiesbaden, Germany, 2001; pp. 139–190.
Matos, J.F. Mathematics Learning and Modelling: Theory and Practice. In Mathematical Modelling Teaching and Assessment in a Technology-Rich World; Galbraith, P., Blum, W., Booker, G., Huntley, I., Eds.; Horwood: Chichester, UK, 1998; pp. 21–27.
Alsina, C. Neither a Microscope nor a Telescope Just a Mathscope. In Mathematical Modelling Teaching and Assessment in a Technology-Rich World; Galbraith, P., Blum, W., Booker, G., Huntley, I., Eds.; Horwood: Chichester, UK, 1998; pp. 3–10.
Galbraith, P. Modelling Teaching Reflecting—What I Have Learned. In Advances and Perspectives in the Teaching of Mathematical Modelling and Applications; Sloyer, C.W., Huntley, I., Blum, W., Eds.; Water Street Mathematics: Yorklyn, DE, USA, 1995; pp. 21–45.
Niss, M.; Blum, W.; Galbraith, P. Introduction. In Modelling and Applications in Mathematics Education; Blum, W., Galbraith, P.L., Henn, H.-W., Niss, M., Eds.; New ICMI Study Series; Springer: Boston, MA, USA, 2007; Volume 10, pp. 3–32.
Maaß, K. Mathematical Modelling in the Classroom: Results of an Empirical Study [Mathematisches Modellieren im Unterricht: Ergebnisse einer empirischen Studie]. In Texts on Mathematical Research and Teaching [Texte zur Mathematischen Forschung und Lehre]; Franzbecker: Hildesheim, Germany, 2004; Volume 30.
Kaiser, G.; Sriraman, B. A global survey of international perspectives on modelling in mathematics education. ZDM Math. Educ. 2006, 38, 302–310.
English, L.D. Advancing Mathematics Education Research within a STEM Environment. In Research in Mathematics Education in Australasia 2012–2015; Makar, K., Dole, S., Visnovska, J., Goos, M., Bennison, A., Fry, K., Eds.; Springer: Singapore, 2016; pp. 353–371.
Maaß, K.; Doorman, M.; Jonker, V.; Wijers, M. Promoting active citizenship in mathematics teaching. ZDM Math. Educ. 2019, 51, 991–1003.
Maaß, K.; Zehetmeier, S.; Weihberger, A.; Flößer, K. Analysing mathematical modelling tasks in light of citizenship education using the COVID-19 pandemic as a case study. ZDM Math. Educ. 2023, 55, 133–145.
Kaiser-Meßmer, G. Applications in Mathematics Education [Anwendungen im Mathematikunterricht]; Franzbecker: Hildesheim, Germany, 1986.
Blum, W. Application contexts in mathematics education—Trends and perspectives [Anwendungsbezüge im Mathematikunterricht—Trends und Perspektiven]. Schriftenreihe Didakt. Math. 1996, 23, 15–38.
Stern, E. Mathematics. In Encyclopedia of Psychology: Practice Areas. Series I Educational Psychology Vol. 3. Psychology of Teaching and School [Enzyklopädie der Psychologie: Themenbereich d. Praxisgebiete. Serie I Pädagogische Psychologie Bd. 3. Psychologie des Unterrichts und der Schule]; Weinert, F.E., Ed.; Hogrefe: Göttingen, Germany, 1997; pp. 398–426.
Blomhøj, M.; Jensen, T.H. What’s All the Fuss about Competencies? In Modelling and Applications in Mathematics Education; Blum, W., Galbraith, P.L., Henn, H.-W., Niss, M., Eds.; New ICMI Study Series; Springer: New York, NY, USA, 2007; Volume 10, pp. 45–56.
Verschaffel, L.; de Corte, E.; Lasure, S.; van Vaerenbergh, G.; Bogaerts, H.; Ratinckx, E. Learning to solve mathematical application problems: A design experiment with fifth graders. Math. Think. Learn. 1999, 1, 195–229.
Verschaffel, L.; de Corte, E.; Greer, B. Making Sense of Word Problems; Contexts of Learning; Swets & Zeitlinger: Lisse, The Netherlands, 2000; Volume 8.
Burkhardt, H. Mathematical modelling in the curriculum. In Applications and Modelling in Learning and Teaching Mathematics; Blum, W., Berry, J.S., Biehler, I., Huntley, I., Kaiser-Meßmer, G., Profke, L., Eds.; Horwood: New York, NY, USA, 1989; pp. 1–11.
Kaiser, G. Reality-Related Aspects in Mathematics Education—An Overview of the Current and Historical Discussion [Realitätsbezüge im Mathematikunterricht—Ein Überblick über die Aktuelle und Historische Diskussion]. In Series of the ISTRON Group. Materials for a Reality-Related Mathematics Education [Schriftenreihe der ISTRON-Gruppe. Materialien für Einen Realitätsbezogenen Mathematikunterricht]; Graumann, G., Ed.; Franzbecker: Hildesheim, Germany, 1995; Volume 2, pp. 66–84.
OECD. The PISA 2003 Assessment Framework; OECD: Paris, France, 2003.
Franke, M. Didactics of Arithmetic in Elementary School [Didaktik des Sachrechnens in der Grundschule]. In Mathematics for Primary and Secondary Education [Mathematik Prima-und Sekundarstufe]; Springer Spektrum: Berlin/Heidelberg, Germany, 2003.
Büchter, A.; Leuders, T. Developing Math Tasks on Your Own: Promoting Learning—Assessing Performance [Mathematikaufgaben Selbst Entwickeln: Lernen Fördern—Leistung Überprüfen]; Cornelsen Scriptor: Berlin, Germany, 2005.
Bruder, R. Construct-Select-Accompany: On Dealing with Tasks [Konstruieren-Auswählen-Begleiten: Über den Umgang mit Aufgaben]. 2003.
Jordan, A.; Krauss, S.; Löwen, K.; Blum, W.; Neubrand, M.; Brunner, M.; Kunter, M.; Baumert, J. Tasks in the COACTIV project: Evidence of the cognitive activation potential in German mathematics instruction [Aufgaben im COACTIV-Projekt: Zeugnisse des kognitiven Aktivierungspotentials im deutschen Mathematikunterricht]. J. Für Math.-Didakt. 2008, 29, 83–107. [Google Scholar] [CrossRef]
Blomhøj, M.; Jensen, T.H. Developing mathematical modelling competence: Conceptual clarification and educational planning. Teach. Math. Its Appl. 2003, 22, 123–139. [Google Scholar] [CrossRef]
Brand, S. Acquisition of Modelling Competences: Empirical Comparison of a Holistic and an Atomistic Approach to Fostering Modelling Competences [Erwerb von Modellierungskompetenzen: Empirischer Vergleich Eines Holistischen und Eines Atomistischen Ansatzes zur Förderung von Modellierungskompetenzen]. In Perspectives of Mathematics Education; Springer Fachmedien Wiesbaden: Wiesbaden, Germany, 2014. [Google Scholar]
Maaß, K. Classification scheme for modelling tasks. J. Für Math.-Didakt. 2010, 31, 285–311. [Google Scholar] [CrossRef]
Blum, W.; Leiß, D. Modelling in class with the “Refueling” task [Modellieren im Unterricht mit der “Tanken”-Aufgabe]. Math. Lehren 2005, 128, 18–21. [Google Scholar]
Greefrath, G.; Maaß, K. Diagnosis and Evaluation in Mathematical Modelling [Diagnose und Bewertung beim Mathematischen Modellieren]. In Modelling Competences—Diagnosis and Evaluation [Modellierungskompetenzen—Diagnose und Bewertung]; Greefrath, G., Maaß, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1–19. [Google Scholar]
Zhang, K.; Aslan, A.B. AI technologies for education: Recent research & future directions. Comput. Educ. Artif. Intell. 2021, 2, 100025. [Google Scholar] [CrossRef]
Chassignol, M.; Khoroshavin, A.; Klimova, A.; Bilyatdinova, A. Artificial Intelligence trends in education: A narrative overview. Procedia Comput. Sci. 2018, 136, 16–24. [Google Scholar] [CrossRef]
Weßels, D. ChatGPT—A milestone in AI development [ChatGPT—Ein Meilenstein der KI-Entwicklung]. Mitt. Dtsch. Math.-Ver. 2023, 31, 17–19. [Google Scholar] [CrossRef]
Wu, T.; He, S.; Liu, J.; Sun, S.; Liu, K.; Han, Q.-L.; Tang, Y. A brief overview of ChatGPT: The history, status quo, and potential future development. IEEE/CAA J. Autom. Sin. 2023, 10, 1122–1136. [Google Scholar] [CrossRef]
OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
Bishop, C.M. Neural networks and their applications. Rev. Sci. Instrum. 1994, 65, 1803–1832. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Lo, C.K. What is the impact of ChatGPT on education? A rapid review of the literature. Educ. Sci. 2023, 13, 410. [Google Scholar] [CrossRef]
Lund, B.; Ting, W. Chatting about ChatGPT: How may AI and GPT impact academia and libraries? Libr. Hi Tech News 2023, 40, 26–29. [Google Scholar] [CrossRef]
Lin, S.-M.; Chung, H.-H.; Chung, F.-L.; Lan, Y.-J. Concerns about Using ChatGPT in Education. In Lecture Notes in Computer Science. Innovative Technologies and Learning: 6th International Conference; Huang, Y.-M., Rocha, T., Eds.; Springer International: Berlin/Heidelberg, Germany, 2023; Volume 14099, pp. 37–49. [Google Scholar] [CrossRef]
Yu, H. Reflection on whether Chat GPT should be banned by academia from the perspective of education and teaching. Front. Psychol. 2023, 14, 1181712. [Google Scholar] [CrossRef]
Helfrich-Schkarbanenko, A. Mathematics and ChatGPT [Mathematik und ChatGPT]; Springer: Berlin/Heidelberg, Germany, 2023. [Google Scholar] [CrossRef]
Korkmaz Guler, N.; Dertli, Z.G.; Boran, E.; Yildiz, B. An artificial intelligence application in mathematics education: Evaluating ChatGPT’s academic achievement in a mathematics exam. Pedagog. Res. 2024, 9, em0188. [Google Scholar] [CrossRef]
Plevris, V.; Papazafeiropoulos, G.; Jiménez Rios, A. Chatbots put to the test in math and logic problems: A comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI 2023, 4, 949–969. [Google Scholar] [CrossRef]
Dao, X.-Q.; Le, N.-B. Investigating the effectiveness of ChatGPT in mathematical reasoning and problem solving: Evidence from the Vietnamese national high school graduation examination. arXiv 2023. [Google Scholar] [CrossRef]
Wardat, Y.; Tashtoush, M.A.; AlAli, R.; Jarrah, A.M. ChatGPT: A revolutionary tool for teaching and learning mathematics. Eurasia J. Math. Sci. Technol. Educ. 2023, 19, em2286. [Google Scholar] [CrossRef]
Frieder, S.; Pinchetti, L.; Chevalier, A.; Griffiths, R.-R.; Salvatori, T.; Lukasiewicz, T.; Petersen, P.C.; Berner, J. Mathematical capabilities of ChatGPT. Adv. Neural Inf. Process. Syst. 2023, 36, 1–37. [Google Scholar] [CrossRef]
Shakarian, P.; Koyyalamudi, A.; Ngu, N.; Mareedu, L. An independent evaluation of ChatGPT on mathematical word problems (MWP). arXiv 2023. [Google Scholar] [CrossRef]
Zong, M.; Krishnamachari, B. Solving math word problems concerning systems of equations with GPT-3. Proc. AAAI Conf. Artif. Intell. 2023, 37, 15972–15979. [Google Scholar]
McGee, R.W. Is Chat GPT Biased against Conservatives? An Empirical Study. SSRN 2023. [Google Scholar] [CrossRef]
Wan, Y.; Pu, G.; Sun, J.; Garimella, A.; Chang, K.-W.; Peng, N. “Kelly is a Warm Person, Joseph is a Role Model”: Gender biases in LLM-generated reference letters. arXiv 2023, arXiv:2310.09219. [Google Scholar]
Schukajlow, S.; Kolter, J.; Blum, W. Scaffolding mathematical modelling with a solution plan. ZDM Math. Educ. 2015, 47, 1241–1254. [Google Scholar] [CrossRef]
Hankeln, C.; Beckschulte, C. Partial Competences of Modelling and Their Assessment—Presentation of a Test Development [Teilkompetenzen des Modellierens und ihre Erfassung—Darstellung einer Testentwicklung]. In Modelling Competences—Diagnosis and Evaluation [Modellierungskompetenzen—Diagnose und Bewertung]; Greefrath, G., Maaß, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; pp. 65–86. [Google Scholar]
Mayring, P. Qualitative content analysis. Forum Qual. Sozialforschung/Forum: Qual. Soc. Res. 2000, 1, 20. [Google Scholar] [CrossRef]
Mayring, P. Qualitative Content Analysis: Theoretical Foundation, Basic Procedures and Software Solution. 2014. Available online: https://nbn-resolving.org/urn:nbn:de:0168-ssoar-395173 (accessed on 22 June 2024).
Maaß, K. Modelling in mathematics education at the lower secondary level [Modellieren im Mathematikunterricht der Sekundarstufe I]. J. Für Math.-Didakt. 2005, 26, 114–142. [Google Scholar] [CrossRef]
Maaß, K. What are modelling competencies? ZDM Math. Educ. 2006, 38, 113–142. [Google Scholar] [CrossRef]
Kvale, S. Doing Interviews; SAGE Publications Ltd.: Thousand Oaks, CA, USA, 2007. [Google Scholar] [CrossRef]
Gwet, K.L. Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol. 2008, 61, 29–48. [Google Scholar] [CrossRef]
Krippendorff, K. Content Analysis: An Introduction to Its Methodology; Sage Publications Inc.: Thousand Oaks, CA, USA, 2004. [Google Scholar]
Conger, A.J. Integration and generalization of kappas for multiple raters. Psychol. Bull. 1980, 88, 322–328. [Google Scholar] [CrossRef]
Lombard, M.; Snyder-Duch, J.; Bracken, C.C. Content analysis in mass communication: Assessment and reporting of intercoder reliability. Hum. Commun. Res. 2002, 28, 587–604. [Google Scholar] [CrossRef]
R Core Team. R: A Language and Environment for Statistical Computing [Computer Software]. R Foundation for Statistical Computing. 2021. Available online: https://www.R-project.org/ (accessed on 22 June 2024).
Gwet, K.L. irrCAC: Computing Chance-Corrected Agreement Coefficients (CAC). 2019. Available online: https://CRAN.R-project.org/package=irrCAC (accessed on 22 June 2024).
Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef]
Feng, G.C. Mistakes and how to avoid mistakes in using intercoder reliability indices. Methodology 2015, 11, 13–22. [Google Scholar] [CrossRef]
Blum, W. Modelling Tasks in Mathematics Education—Challenges for Students and Teachers [Modellierungsaufgaben im Mathematikunterricht—Herausforderung für Schüler und Lehrer]. In Realworld Mathematics Education: From the Subject and for Practice; Festschrift for Hans-Wolfgang Henn’s 60th Birthday [Realitätsnaher Mathematikunterricht: Vom Fach aus und für die Praxis; Festschrift für Hans-Wolfgang Henn zum 60. Geburtstag]; Büchter, A., Ed.; Franzbecker: Hildesheim, Germany, 2006; pp. 8–23. [Google Scholar]
Jordan, A.; Ross, N.; Krauss, S.; Baumert, J.; Blum, W.; Neubrand, M.; Löwen, K.; Brunner, M.; Kunter, M. Classification Scheme for Maths Tasks: Documentation of Task Categorisation in the COACTIV Project. [Klassifikationsschema für Mathematikaufgaben: Dokumentation der Aufgabenkategorisierung im COACTIV-Projekt.]; Materialien aus der Bildungsforschung; Max-Planck-Inst. für Bildungsforschung: Berlin, Germany, 2006; Volume 81. [Google Scholar]
Flößer, K. Round Up, Please! [Aufrunden, Bitte!]. 2018. Available online: https://icse.ph-freiburg.de/problemdesquartals/das-problem-des-quartals-mathe-edition-aufrunden-bitte/ (accessed on 22 June 2024).

Table 1. Characteristics of modelling tasks.

	Data	Nature of Relationship to Reality	Type of Representation	Openness of Task
Task 1: rubber ball	Matching	Embedded, intentionally artificial	Text	Ascertaining task
Task 2: ferry price	Missing	Embedded	Text	Ascertaining task
Task 3: refuelling	Missing	Authentic, close to reality	Text, picture, situation	Ascertaining problem
Task 4: water supply	Missing	Close to reality	Text, situation	Ascertaining problem
Task 5: round up	Superfluous and missing	Authentic	Text, situation	Ascertaining problem

Note. For a description of the categories, see Section 2.6.

Table 2. Partial competences of modelling [33].

Partial Competence	Description
Understanding	Students construct their own mental model of a given problem situation and, thus, understand the question.
Collecting information, analysing sources (simplifying)	Students separate important and unimportant information with respect to a real-life situation.
Mathematising	Students translate suitably simplified real-life situations into mathematical models (e.g., term, equation, figure, diagram, function).
Using mathematics	Students work with the mathematical model.
Interpreting	Students relate the results obtained from the model to the real situation and, thus, achieve real results.
Validating *	Students check the real results from the situation model for appropriateness. Students compare and evaluate different mathematical models with respect to a real-life situation.
Discussing (possibly) contradicting results	Students relate the answers obtained from the situation model to the real situation and, thus, answer the question.

Note. * Validating involves two steps, which is why this competence is split for the evaluation of the solutions into “validation in the situation model” and “validation in the context”.

Table 3. Ratings for the solutions offered by GPT-3.5.

Partial Competence	Task 1: Rubber Ball	Task 2: Ferry Price	Task 3: Refuelling	Task 4: Water Supply	Task 5: Round Up
Understanding	2	3	0	0	3
Collecting information, analysing sources (simplifying)	4	4	4	2	2
Mathematising	1	3	2	2	2
Using mathematics	3	3	2	2	2
Interpreting	3	2	2	2	3
Validating in the situation model	0	2	0	2	3
Validating in the context	4	0	0	0	2
Discussing (possibly) contradicting results	1	1	1	1	3

Note. 0 = not done; 1 = done, but mathematically incorrect; 2 = done, but incomplete; 3 = done and, from the coders’ point of view, sufficient; 4 = not necessary due to the task.

Table 4. Ratings for the solution offered by GPT-4.0.

Partial Competence	Task 1: Rubber Ball	Task 2: Ferry Price	Task 3: Refuelling	Task 4: Water Supply	Task 5: Round Up
Understanding	3	3	3	3	3
Collecting information, analysing sources (simplifying)	4	4	4	3	3
Mathematising	3	3	3	2	3
Using mathematics	3	3	3	3	3
Interpreting	3	2	3	2	3
Validating in the situation model	3	2	3	2	3
Validating in the context	4	0	0	0	3
Discussing (possibly) contradicting results	3	2	3	3	3

Note. 0 = not done; 1 = done, but mathematically incorrect; 2 = done, but incomplete; 3 = done and, from the coders’ point of view, sufficient; 4 = not necessary due to the task.

Table 5. Ratings for the solution offered by GPT-MM.

Partial Competence	Task 1: Rubber Ball	Task 2: Ferry Price	Task 3: Refuelling	Task 4: Water Supply	Task 5: Round Up
Understanding	3	3	3	3	3
Collecting information, analysing sources (simplifying)	4	4	4	3	3
Mathematising	3	3	3	2	3
Using mathematics	3	3	3	3	3
Interpreting	3	3	3	3	3
Validating in the situation model	3	3	2	3	3
Validating in the context	4	2	0	0	3
Discussing (possibly) contradicting results	3	3	3	3	3

Note. 0 = not done; 1 = done, but mathematically incorrect; 2 = done, but incomplete; 3 = done and, from the coders’ point of view, sufficient; 4 = not necessary due to the task.

Table 6. Coefficients of intercoder agreement for the coding of the solutions.

Coefficient	Estimate	S.E.	p-Value
GPT-3.5
Task 1: rubber ball
Gwet’s AC₁	0.592	0.158	<0.01
Krippendorff’s α	0.564	0.163	0.011
Conger’s κ	0.562	0.145	<0.01
Simple agreement (proportion)	0.667	0.126	<0.01
Task 2: ferry price
Gwet’s AC₁	0.273	0.170	0.152
Krippendorff’s α	0.294	0.182	0.150
Conger’s κ	0.296	0.157	0.101
Simple agreement (proportion)	0.417	0.137	0.019
Task 3: refuelling
Gwet’s AC₁	0.381	0.133	0.024
Krippendorff’s α	0.376	0.154	0.045
Conger’s κ	0.396	0.123	0.014
Simple agreement (proportion)	0.500	0.109	<0.01
Task 4: water supply
Gwet’s AC₁	0.241	0.117	0.079
Krippendorff’s α	0.154	0.160	0.366
Conger’s κ	0.200	0.120	0.139
Simple agreement (proportion)	0.375	0.098	<0.01
Task 5: round up
Gwet’s AC₁	0.156	0.054	0.024
Krippendorff’s α	−0.057	0.082	0.513
Conger’s κ	0.000	0.040	≈1
Simple agreement (proportion)	0.292	0.042	<0.01
GPT-4.0
Task 1: rubber ball
Gwet’s AC₁	0.694	0.161	<0.01
Krippendorff’s α	0.566	0.182	0.017
Conger’s κ	0.560	0.167	0.012
Simple agreement (proportion)	0.750	0.122	<0.01
Task 2: ferry price
Gwet’s AC₁	0.397	0.123	0.014
Krippendorff’s α	0.296	0.221	0.223
Conger’s κ	0.309	0.182	0.134
Simple agreement (proportion)	0.500	0.109	<0.01
Task 3: refuelling
Gwet’s AC₁	0.901	0.099	<0.01
Krippendorff’s α	0.828	0.188	<0.01
Conger’s κ	0.822	0.184	<0.01
Simple agreement (proportion)	0.917	0.083	<0.01
Task 4: water supply
Gwet’s AC₁	0.790	0.138	<0.01
Krippendorff’s α	0.743	0.178	<0.01
Conger’s κ	0.733	0.175	<0.01
Simple agreement (proportion)	0.833	0.109	<0.01
Task 5: round up
Gwet’s AC₁	0.306	0.174	0.123
Krippendorff’s α	0.143	0.150	0.372
Conger’s κ	0.232	0.089	0.034
Simple agreement (proportion)	0.500	0.109	<0.01
GPT-MM
Task 1: rubber ball
Gwet’s AC₁	0.778	0.163	<0.01
Krippendorff’s α	0.681	0.168	<0.01
Conger’s κ	0.673	0.157	<0.01
Simple agreement (proportion)	0.833	0.109	<0.01
Task 2: ferry price
Gwet’s AC₁	0.784	0.155	<0.01
Krippendorff’s α	0.649	0.205	0.016
Conger’s κ	0.644	0.188	0.011
Simple agreement (proportion)	0.833	0.109	<0.01
Task 3: refuelling
Gwet’s AC₁	0.799	0.142	<0.01
Krippendorff’s α	0.691	0.160	<0.01
Conger’s κ	0.683	0.151	<0.01
Simple agreement (proportion)	0.833	0.109	<0.01
Task 4: water supply
Gwet’s AC₁	0.391	0.152	0.037
Krippendorff’s α	0.110	0.166	0.530
Conger’s κ	0.135	0.143	0.377
Simple agreement (proportion)	0.500	0.109	<0.01
Task 5: round up
Gwet’s AC₁	0.627	0.249	0.040
Krippendorff’s α	0.274	0.138	0.087
Conger’s κ	0.294	0.109	0.031
Simple agreement (proportion)	0.750	0.122	<0.01

Note. Estimates, their standard errors (S.E.), and p-values associated with the testing of the hypothesis that the respective coefficient equals zero for each measure of intercoder agreement.
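
To illustrate what two of the coefficients in Table 6 measure, the following minimal Python sketch computes simple agreement and Gwet’s AC₁ for several coders who rate the same items on a categorical scale. It follows the standard multi-rater definitions (share of agreeing coder pairs per item; chance agreement derived from the mean category shares) and is not the irrCAC code behind the reported values. The function names, the example ratings, and the choice to treat the five rating codes 0 to 4 as the category set are assumptions made for illustration; standard errors and p-values are not computed.

```python
from collections import Counter


def simple_agreement(ratings):
    """Mean share of agreeing coder pairs per item.

    `ratings` is a list of items; each item is a list with one code per coder
    (at least two coders per item are assumed).
    """
    total = 0.0
    for item in ratings:
        r = len(item)
        counts = Counter(item)
        # Number of ordered agreeing pairs divided by all ordered pairs.
        total += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
    return total / len(ratings)


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for multiple coders and nominal categories."""
    n = len(ratings)
    q = len(categories)
    pa = simple_agreement(ratings)

    # pi[k]: average share of coders who chose category k, over all items.
    pi = {k: 0.0 for k in categories}
    for item in ratings:
        counts = Counter(item)
        for k in categories:
            pi[k] += counts[k] / len(item) / n

    # Chance agreement as defined for AC1.
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (pa - pe) / (1 - pe)


# Hypothetical ratings: 4 items, 3 coders, codes 0-4 (not data from the study).
example = [
    [3, 3, 3],
    [2, 2, 3],
    [0, 0, 0],
    [1, 2, 3],
]
codes = [0, 1, 2, 3, 4]

print(round(simple_agreement(example), 3))  # 0.583
print(round(gwet_ac1(example, codes), 3))   # 0.496
```

In this hypothetical example, two of the four items are rated identically by all coders, one by two of three coders, and one by none, which gives a simple agreement of 7/12; AC₁ then discounts the part of that agreement expected by chance.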

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

