Article

Transformers and State-Space Models: Fine-Tuning Techniques for Solving Differential Equations

Laboratory for Social & Cognitive Informatics, National Research University Higher School of Economics, St. Petersburg 192148, Russia
*
Author to whom correspondence should be addressed.
Sci 2025, 7(3), 130; https://doi.org/10.3390/sci7030130
Submission received: 11 August 2025 / Revised: 2 September 2025 / Accepted: 9 September 2025 / Published: 11 September 2025
(This article belongs to the Special Issue Generative AI: Advanced Technologies, Applications, and Impacts)

Abstract

Large language models (LLMs) have recently demonstrated remarkable capabilities in natural language processing, mathematical reasoning, and code generation. However, their potential for solving differential equations—fundamental to applied mathematics, physics, and engineering—remains insufficiently explored. To the best of our knowledge, we are the first to apply LLMs as translators from the textual form of an equation into the textual representation of its analytical solution for a broad class of equations. More precisely, we introduced a benchmark and fine-tuning protocol for differential equation solving with pre-trained LLMs. We curated a dataset of 300,000 differential equations and corresponding solutions to fine-tune T5-small, Phi-4-mini, DeepSeek-R1-Distill-Qwen, and two Mamba variants (130M and 2.8B parameters). Performance was evaluated using the BLEU and TeXBLEU metrics. Phi-4-mini achieved the best results, with average BLEU > 0.9 and TeXBLEU > 0.78 across all considered equation classes, which indicates strong generalization abilities. This model therefore merits further investigation on a broader class of differential equations and can potentially be used as part of mathematical agents for more complex applied tasks, for example in physics or engineering. In our experiments, DeepSeek-R1-Distill-Qwen consistently underperformed, while T5 showed strong results for the most frequent equation type but degraded on less common ones. The Mamba models achieved the highest TeXBLEU scores despite relatively low BLEU, attributable to their production of lengthy outputs mixing correct expressions with irrelevant ones.

1. Introduction

First of all, we would like to introduce some basic notions used in our work. A large language model (LLM) is a deep neural network, trained on massive text corpora to learn statistical patterns of language and perform tasks such as text generation, reasoning, and problem-solving [1,2]. To adapt an LLM to a specific task or domain, fine-tuning is applied, which involves updating the model’s parameters on smaller, task-specific datasets while leveraging the knowledge acquired during pre-training [3,4]. However, excessive adaptation during fine-tuning may lead to overfitting, a phenomenon in which the model captures noise or spurious patterns in the training data, thereby reducing its generalization ability to unseen inputs [5,6].
Recent advances in LLMs have demonstrated remarkable capabilities in reasoning over natural language, code, and even mathematical problems. Transformer-based architectures, when scaled to billions of parameters, can perform surprisingly well on arithmetic, algebra, and word-problem benchmarks with little or no task-specific adaptation [1]. For example, GPT-3 (175B) achieved strong few-shot performance on various math tasks (including multi-digit arithmetic) without any fine-tuning, simply by virtue of its scale and in-context learning ability. More specialized efforts—for instance, Minerva—have further advanced the state of the art in STEM reasoning by augmenting LLM pre-training with technical content and employing targeted chain-of-thought prompting [7]. Minerva, built on the PaLM model [8], was additionally trained on a large corpus of scientific papers and mathematical web data and uses step-by-step solution prompts to greatly improve performance on quantitative reasoning tasks. Yet despite these successes, the capability of LLMs to solve differential equations—a cornerstone of modeling in physics, engineering, and applied mathematics—remains largely unexplored. Filling this gap is an important task, given that differential equations (including partial differential equations) are fundamental to describing countless physical and biological processes. Empowering LLMs with the ability to solve differential equations would significantly broaden their utility in scientific computing and could lead to new AI-assisted tools for simulation and analysis.
In parallel, the field of scientific machine learning has embraced physics-informed neural networks (PINNs) [9] as a way to learn differential equation solutions directly from data. PINNs embed the governing differential equations into the training loss as soft constraints, ensuring that the neural network’s predictions approximately satisfy the physical laws at hand. This approach has achieved impressive results on a wide range of partial differential equations, both for forward and inverse problems. By minimizing residuals of the PDE together with data errors, PINNs can successfully approximate many complex systems [10]. However, PINN models require bespoke network architectures and physics-specific loss functions, and they do not leverage the vast language-based pre-training that LLMs enjoy.
Recently, there have been exploratory attempts to repurpose generative LLMs as ad hoc math solvers by prompting them to mimic symbolic algebra systems or numerical integrators. Shalyt et al. [11] introduce ASyMOB, a benchmark that tests LLMs on symbolic mathematical operations including integration and solving differential equations. Their evaluations reveal that even advanced models suffer substantial drops in accuracy under slight problem perturbations, suggesting that current LLM reasoning in math often relies on pattern matching rather than true generalization. Overall, a systematic study of fine-tuning LLMs on curated differential equation datasets—to directly teach them equation-solving skills—is still lacking in the literature.
In this work, we aim to fill this gap by developing a controlled benchmark and fine-tuning protocol for differential equation solving with pre-trained LLMs. Building on our previously released AGDES package [12]—a Python package for generating differential equations and their solutions in LaTeX format—we construct a diverse dataset designed for training LLMs and fine-tune several Transformer models and state-space models of varying capacity, from moderate-sized (hundreds of millions of parameters) up to state-of-the-art billion-scale models. Our objectives are twofold: (1) to quantify the extent to which targeted fine-tuning can imbue an LLM with the ability to generate correct analytic solutions to differential equations, and (2) to assess how this capability scales with model size and pre-training pedigree, as well as with the distribution of equation types in the training data. The experimental setup uses consistent training data and prompts across models, enabling fair comparisons of their solution accuracy and generalization. We evaluate model outputs using two metrics: (1) the character-level BLEU score between generated and reference solutions; and (2) the TeXBLEU metric, which was specifically designed for comparing LaTeX expressions.
Our findings provide the first comprehensive insight into LLMs as differential equation solvers. We observe how fine-tuning affects each model’s ability to handle various equation types and report trends with respect to model scale. We also identify common failure modes (e.g., algebraic manipulation errors or integration mistakes) through manual inspection of outputs. These results shed light on the promises and limitations of merging language model training with mathematical problem-solving. Below, we summarize our contributions:
  • Benchmark and Protocol: We extend the AGDES package to generate a more diverse collection of differential equations and their solutions. We also provide code with standardized prompts and evaluation scripts, allowing reproducible fine-tuning and comparison between models.
  • Empirical Study: We fine-tune four different types of LLMs and report their solution accuracy and generalization ability.
  • Insights and Analysis: We examine the error patterns of the models and discuss the need to develop new metrics to assess the quality of the solutions generated in LaTeX format.
The rest of our work is organized as follows. In Section 2, we provide a literature review on the application of LLMs to math problems. Section 3 details the benchmark construction and fine-tuning methodology for each considered LLM, and contains the description of quality metrics used to estimate the generated solutions. Section 4 describes the experimental results and analyzes the strengths and limitations observed in the fine-tuned models. In Section 5, we summarize our findings and describe possible future research directions.

2. Related Work

2.1. LLMs in Mathematics

Over the past five years, research on LLMs has moved rapidly from showing whether such models can do mathematics to how they can be made reliable mathematical reasoners. Early evaluations were not very promising: Hendrycks et al. [13] introduced the MATH competition benchmark and found that even very large Transformers achieved single-digit accuracy on many problem types. Similar results on grade-school word-problem datasets such as GSM8K motivated Cobbe et al. [14] to pair a generator with a verifier, showing that post hoc checking improves answer quality but does not fully eliminate reasoning errors.
Further, two complementary research threads emerged. Prompt-engineering work showed that much of the raw capability is already latent in pre-trained LLMs and can be elicited at test time. Nye et al. [15] first trained models to write out “scratchpads,” demonstrating that exposing intermediate calculations unlocks multi-step arithmetic. Building on this idea, Wei et al. [16] introduced chain-of-thought (CoT) prompting, in which a few worked examples guide the model to generate its own step-by-step solutions; PaLM-540B with CoT more than doubled GSM8K accuracy compared with direct answering. Kojima et al. [17] showed that simply appending “Let’s think step by step” elicits zero-shot reasoning, and Zelikman et al. [18] proposed STaR, an iterative fine-tuning loop that bootstraps a model on its own successful CoT traces. Later refinements such as self-consistency decoding [19] and least-to-most or tree-of-thought prompting [20] sample many diverse chains and select the answer agreed on by most valid paths, raising GSM8K accuracies well above 80%.
The second thread focuses on training and tool use. Lewkowycz et al. [7] fine-tuned PaLM on technical corpora to create Minerva, pushing quantitative-reasoning state-of-the-art without external calculators. Gao et al. [21] argued that LLMs should delegate brittle arithmetic to dependable software: their Program-Aided Language model (PAL) prompts the LLM to emit Python code whose execution yields the final answer, outperforming much larger language-only baselines. Schick et al. [22] generalized this idea with Toolformer, showing that modest models can learn, in context, when to call a calculator or search API. More recently, Luo et al. [23] combined instruction fine-tuning, reinforcement learning, and evolutionary data augmentation to produce WizardMath, a 70B-parameter LLaMA-2 variant that rivals GPT-3.5 on MATH while being fully open source.
Researchers have also tackled self-correction. Madaan et al. [24] introduced Self-Refine, where the model critiques and repairs its own solutions, while Chen et al. [25] separated “program” tokens from “reasoning” tokens in Program-of-Thought prompting, reducing hallucinated arithmetic. Based on these studies, one can conclude the following: the strongest systems combine (i) explicit intermediate reasoning [7,15,16,17,18,19,20], (ii) access to symbolic tools [21,22,25], and (iii) training objectives that reward correct process, not just correct answers [18,23].
Despite rapid progress, open challenges remain. Benchmarks reveal lingering weaknesses in geometry, spatial reasoning, and problems requiring real-world grounding [26]. CoT explanations often look plausible but contain subtle logical gaps, so reliable verification is still essential [27,28]. Recent works show that high benchmark performance in mathematical reasoning often conflates genuine reasoning with memorization and overfitting. Several studies demonstrate that popular datasets such as GSM8K and MATH suffer from training-set contamination, with models achieving inflated scores by recalling seen solutions rather than reasoning, and dropping substantially on newly curated test sets [29]. Similar concerns arise in theorem proving, where random splits permit proof memorization, motivating harder benchmarks that enforce novelty [30]. Large-scale analyses further indicate that while math tasks elicit less rote memorization than factual QA, LLMs may still rely on superficial heuristics instead of systematic reasoning [31]. Moreover, models have been shown to leak verbatim training data [32] and exhibit bounded memorization capacity [33], underscoring the need for careful benchmark design and evaluation protocols to ensure progress reflects genuine mathematical reasoning rather than dataset memorization.

2.2. Solving Differential Equations with LLMs

Since the application of LLMs to solving differential equations has only recently begun to emerge, the number of studies in this area remains very limited. Lample & Charton [34] showed that Transformers trained on synthetic data can successfully solve many first- and second-order ODEs, outperforming Mathematica (https://www.wolfram.com/mathematica/, accessed on 8 September 2025). General-purpose LLMs (e.g., GPT-4, Minerva) reach 60–70% accuracy on textbook ODE benchmarks with chain-of-thought prompting, but algebraic slips limit reliability. The key advance is tool augmentation: asking the model to emit executable Python (versions 2.7–3.10) or SymPy code. In the PAL framework [21] and the Mamo benchmark [35], code execution boosts GPT-4 ODE accuracy above 90% and lets smaller GPT-3.5 models nearly match that score, because robust libraries handle the actual integration. Note that in [35], only numerical answers to problems involving ordinary differential equations were examined.
Work on PDEs is a newer direction. Sun et al. [36] train a multimodal “foundation model” that jointly learns PDE operators and solution fields, enabling zero-shot extrapolation. Separately, Du et al. [37] use an LLM-guided symbolic-regression loop to rediscover governing PDEs (e.g., Burgers, Navier–Stokes) directly from data, achieving accuracy on par with state-of-the-art regression methods.
In summary, LLMs can already solve many ODEs outright and excel when paired with numerical or symbolic solvers. Moreover, the possibilities for solving partial differential equations are emerging thanks to hybrid neural–symbolic methods and equation-discovery pipelines. Thus, these techniques demonstrate that LLMs possess latent mathematical ability; yet, beyond a few isolated efforts, systematic benchmark-driven fine-tuning aimed at constructing analytical solutions to differential equations remains surprisingly underexplored, motivating the present work.

3. Materials and Methods

3.1. Dataset Description

To fine-tune large language models (LLMs) for solving differential equations, we compiled a dataset of 300,000 equation–solution pairs. The dataset is publicly available at https://github.com/hse-scila/dif_equations_fine_tune/tree/main/data (accessed on 8 September 2025). The construction process involved two main steps. First, we collected a broad set of differential equations from open-source problem books, primarily from [38,39]. Second, we augmented this collection with synthetic samples generated using the AGDES package (version 1) [12], which also provides a detailed description of the generation procedure.
The synthetic portion of the dataset includes several families of equation–solution pairs, outlined below:
  • Polynomial equations, of the form $y' = \sum_{k=1}^{n} a_k x^k$.
  • Separable-variable equations, of the form $y' = f(x)$, where $f(x)$ is an integrable function.
  • Homogeneous equations of second and third order, i.e., $y'' + a y' + b y = 0$ and $y''' + a y'' + b y' + c y = 0$.
  • Inhomogeneous equations of second and third order, i.e., $y'' + a y' + b y = f(x)$ and $y''' + a y'' + b y' + c y = f(x)$, with $f(x)$ being an arbitrary function.
The distribution of equations across these categories is reported in Table 1.
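To make the synthetic generation step concrete, the sketch below illustrates how an equation–solution pair for the second-order homogeneous family could be produced with SymPy and exported to LaTeX. This is a minimal illustration under our own assumptions (random integer coefficients, one family only), not the actual AGDES implementation; see [12] for the latter.

```python
import random
import sympy as sp

def sample_second_order_homogeneous(seed=None):
    """Illustrative generator (not the AGDES code): y'' + a*y' + b*y = 0
    with random integer coefficients, solved symbolically by SymPy."""
    rng = random.Random(seed)
    x = sp.symbols("x")
    y = sp.Function("y")
    a, b = rng.randint(-9, 9), rng.randint(-9, 9)
    equation = sp.Eq(y(x).diff(x, 2) + a * y(x).diff(x) + b * y(x), 0)
    solution = sp.dsolve(equation, y(x))           # general solution with constants C1, C2
    return sp.latex(equation), sp.latex(solution)  # LaTeX strings for the dataset

if __name__ == "__main__":
    eq_tex, sol_tex = sample_second_order_homogeneous(seed=0)
    print("equation:", eq_tex)
    print("solution:", sol_tex)
```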
We used 95% of the dataset for training the LLMs and the remaining 5% for testing.
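For orientation, a minimal sketch of this split using the Hugging Face datasets library is shown below; the CSV file name and the "equation"/"solution" column names are our assumptions about the data layout, and the random seed is arbitrary.

```python
from datasets import load_dataset

# File name and column names ("equation", "solution") are assumptions about the released data layout.
data = load_dataset("csv", data_files="equations.csv")["train"]
split = data.train_test_split(test_size=0.05, seed=42)  # 95% for training, 5% for testing
train_ds, test_ds = split["train"], split["test"]
print(len(train_ds), len(test_ds))
```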

3.2. Description of Models and Fine-Tuning Protocols

To tackle our problem, we considered several LLMs. Our first focus was the T5 architecture [4], a Transformer-based encoder–decoder model that reformulates diverse NLP problems within a unified text-to-text framework. Its central idea is that tasks as varied as translation, classification, and summarization can all be expressed as generating an output text sequence conditioned on an input sequence. The model was originally pre-trained on the large-scale Colossal Clean Crawled Corpus (C4) using a denoising objective known as span corruption. Despite being introduced five years ago, T5 remains a popular choice because of its versatility and its adoption across a wide range of domains, including clinical NLP [40,41], legal [42,43,44] and financial applications [45], meteorology [46], and even encrypted traffic classification [47]. Moreover, recent work has explored novel fine-tuning approaches for adapting T5 to complex reasoning tasks [48].
In our experiments, we employed the “t5-small” variant (https://huggingface.co/google-t5/t5-small, accessed on 8 September 2025), implemented via the “transformers.T5ForConditionalGeneration” class. This version contains roughly 60 million parameters, with 6 encoder layers and 6 decoder layers, a hidden dimension of 512, and 8 attention heads. Fine-tuning was performed on our training dataset, where each input was formatted as “Solve: <equation>” and the target output was the corresponding analytical solution. This setup keeps the model consistent with the text-to-text paradigm. Training was carried out for 50 epochs with a batch size of 8, a learning rate of 3 × 10⁻⁴, and the AdamW optimizer, using a weight decay of 0.01 and a linear learning-rate scheduler.
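For reference, a minimal sketch of this fine-tuning setup with the Hugging Face transformers library is given below. The hyperparameters follow the values reported above, while the column names, maximum sequence length, and output directory are our own assumptions; train_ds refers to the training split from the earlier sketch.

```python
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def preprocess(batch):
    # "equation" / "solution" are assumed column names of the training split.
    inputs = ["Solve: " + eq for eq in batch["equation"]]
    enc = tokenizer(inputs, max_length=256, truncation=True, padding="max_length")
    labels = tokenizer(batch["solution"], max_length=256, truncation=True,
                       padding="max_length")["input_ids"]
    # In practice, pad label ids are usually replaced with -100 so the loss ignores them.
    enc["labels"] = labels
    return enc

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-diffeq",
    num_train_epochs=50,
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",   # AdamW is the Trainer's default optimizer
)
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=train_ds.map(preprocess, batched=True))
# trainer.train()
```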
Second, we considered the Mamba model [49], which is a sequence model based on structured state-space models. Its core is a selective state-space (SSM) layer whose parameters depend on each input token, allowing content-based gating of information flow along the sequence. The architecture omits Transformer-style attention and feedforward layers, resulting in a homogeneous SSM-based network. This model is widely used in NLP, demonstrating results that exceed those of Transformers [50]. It is also used in computer vision [51], genomics [52], audio classification [53], and time-series forecasting [54]. In our computer experiments, we used two versions of the Mamba model, namely, Mamba 130M (https://huggingface.co/state-spaces/mamba-130m-hf, accessed on 8 September 2025) and Mamba 2.8B (https://huggingface.co/state-spaces/mamba-2.8b-hf, accessed on 8 September 2025). Mamba 130M has 130 million parameters, 24 layers, and a model dimension of 768, while Mamba 2.8B has 2.8 billion parameters, 64 layers, and a model dimension of 2560. Both models were used within a causal language modeling pipeline and fine-tuned on training examples of the form “Translate <type> differential equation: <equation>. Solution: <answer>”. To test the models, prompts of the form “Translate <type> differential equation: <equation>. Solution:” were used, i.e., the information on the equation type was also provided to the model. For Mamba 130M, the following training arguments were used: batch size = 16, learning rate = 3 × 10⁻⁵, number of epochs = 5, and the AdamW optimizer with a weight decay of 0.005 and a cosine learning-rate scheduler. For Mamba 2.8B, the training parameters were set as follows: batch size = 8, learning rate = 1 × 10⁻⁵, number of epochs = 5, and the AdamW optimizer with a weight decay of 0.005 and a cosine learning-rate scheduler.
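A minimal sketch of the corresponding causal language modeling setup is shown below; the prompt template follows the description above, while the field names, maximum sequence length, and padding handling are assumptions on our part.

```python
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling)

model_name = "state-spaces/mamba-130m-hf"   # or "state-spaces/mamba-2.8b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # needed for batch padding
model = AutoModelForCausalLM.from_pretrained(model_name)

def build_training_text(example):
    # Prompt template from the paper; "type", "equation", "solution" are assumed field names.
    return (f"Translate {example['type']} differential equation: "
            f"{example['equation']}. Solution: {example['solution']}")

def tokenize(example):
    return tokenizer(build_training_text(example), truncation=True, max_length=512)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # standard next-token objective
# At test time the prompt stops right after "Solution:" and the model generates the answer:
# prompt = f"Translate {example['type']} differential equation: {example['equation']}. Solution:"
```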
Third, we considered the DeepSeek-R1-Distill-Qwen-7B model (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, accessed on 8 September 2025), a recent 7-billion-parameter language model distilled from DeepSeek-R1 [55] using the open-source Qwen-2.5-Math-7B as the base checkpoint. This distillation transfers advanced chain-of-thought reasoning and mathematical competence into a compact architecture. The model is based on a decoder-only Transformer architecture with a context length of 128K tokens. It demonstrated good results on the MATH-500 (https://huggingface.co/datasets/HuggingFaceH4/MATH-500, accessed on 8 September 2025) and AIME2024 (https://huggingface.co/datasets/Maxwell-Jia/AIME_2024, accessed on 8 September 2025) datasets. To fine-tune the model, training samples of the form “Solve: <equation> \n Answer: <solution>” were used. For model evaluation, the prompts “Solve: <equation> \n Answer:” were utilized. The training arguments were as follows: batch size = 4, learning rate = 3 × 10⁻⁵, number of epochs = 10, and the AdamW_torch_fused optimizer with weight_decay = 0.01 and a linear scheduler.
Fourth, we considered the Phi-4-mini model [56], a recent compact decoder-only Transformer model developed by Microsoft. It shows good results on logic- and math-intensive tasks and provides an effective balance between reasoning capabilities, context handling, and resource efficiency. In the computer experiments, we utilized the Phi-4-mini-instruct model (https://huggingface.co/microsoft/Phi-4-mini-instruct, accessed on 8 September 2025), which has 3.8 billion parameters and a context length of 128K tokens. The model was fine-tuned with the following training arguments: batch size = 4, learning rate = 3 × 10⁻⁵, number of epochs = 10, and the AdamW_torch_fused optimizer with weight_decay = 0.01 and a linear scheduler. Each training sample consisted of a prompt of the form “Solve: <equation> \n Answer: <solution>”. For model evaluation, the prompts “Solve: <equation> \n Answer:” were used.
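The sketch below illustrates how the two decoder-only models (Phi-4-mini-instruct and DeepSeek-R1-Distill-Qwen-7B) can be queried at evaluation time with the prompt format described above; the generation settings (greedy decoding, 256 new tokens) and the dtype/device placement are our own assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "microsoft/Phi-4-mini-instruct"   # or "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto")

def solve(equation_latex: str, max_new_tokens: int = 256) -> str:
    prompt = f"Solve: {equation_latex} \n Answer:"        # evaluation prompt from the paper
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated = output[0][inputs["input_ids"].shape[1]:]  # keep only the newly generated tokens
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

# Hypothetical call:
# print(solve(r"y'' + 3y' + 2y = 0"))
```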
The chosen models constitute a diverse testbed for evaluating architectural paradigms, efficiency trade-offs, and the capacity of LLMs to solve differential equations. Phi-4-mini and DeepSeek-R1-Distill-Qwen-7B exemplify state-of-the-art Transformer-based designs, while T5, although comparatively older, remains well-suited to problems expressed in LaTeX due to its text-to-text formulation. At the same time, the quadratic computational cost of the Transformer attention mechanism motivates the exploration of alternatives: Mamba models, with their linear scaling in sequence length, provide a promising direction for efficiently capturing long-range dependencies.

3.3. Metric Description

To evaluate the quality of generated solutions, specifically, to compare output produced by large language models (LLMs) with ground-truth solutions, we employ the BLEU and TeXBLEU metrics. The BLEU (Bilingual Evaluation Understudy) metric [57] is a widely used automatic evaluation metric for machine-generated text, particularly in machine translation. BLEU measures the n-gram overlap between a candidate sentence and one or more reference translations. The final BLEU score ranges from 0 (no overlap) to 1 (perfect match). While originally designed for word-level comparison in natural language, BLEU has also been adapted to character-level evaluation, especially in tasks such as formula recognition, image-to-LaTeX conversion, and symbolic equation solving, where token boundaries may be ambiguous or domain-specific [58,59]. In this case, BLEU reflects the sequential similarity between the predicted LaTeX expression and the reference.
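As an illustration, character-level BLEU between a reference and a generated LaTeX string can be computed as in the sketch below; the use of NLTK and the particular smoothing function are our choices and can shift absolute values slightly.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def char_bleu(reference: str, candidate: str) -> float:
    """Character-level BLEU: each LaTeX string is treated as a sequence of characters."""
    ref_chars = list(reference)
    cand_chars = list(candidate)
    smoothing = SmoothingFunction().method1  # avoids zero scores for short strings
    return sentence_bleu([ref_chars], cand_chars, smoothing_function=smoothing)

# Hypothetical example:
# print(char_bleu(r"y = x^2 + C", r"y = x^2 + C_1"))
```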
Despite its widespread use, several studies have raised concerns regarding the applicability of BLEU to mathematical domains. For example, Mahdavi and Zanibbi [60] show that character-level BLEU may not capture the structural or semantic correctness of formulae, especially when multiple LaTeX representations exist for the same expression. Wang et al. [58] further demonstrate that BLEU is poorly correlated with visual or functional equivalence and propose CDM, a metric that compares rendered formula images rather than source code. To address these limitations, Jung et al. [59] propose TeXBLEU, a refined variant of BLEU tailored to mathematical content. TeXBLEU incorporates improved tokenization strategies for LaTeX, specifically handling commands and mathematical symbols. Additionally, it leverages token embeddings and positional encodings to compute distances between tokens, which are then used to determine the n-gram similarity. According to the authors, TeXBLEU exhibits a stronger correlation with human judgments in formula generation tasks. Therefore, we also calculated TeXBLEU scores for our results.

4. Results

Table 2 presents the average BLEU scores and corresponding standard deviations achieved by various models across different classes of differential equations on the test dataset. The results reveal significant variability in performance both across models and equation types. Among all models, Phi-4-mini consistently outperforms others, achieving BLEU scores above 0.9 across all classes, including hand-labeled equations. This indicates a high degree of character-level alignment between the generated and reference solutions even for those types of equations that were underrepresented in the training dataset.
T5 exhibits moderate performance, with relatively high scores for equations with separable variables (0.7715) and polynomial equations (0.6030), but struggles with homogeneous equations, particularly with third-order homogeneous equations (0.08163) as well as the hand-labeled equations (0.0865). These results suggest T5’s limitations in generalizing to structurally complex and less frequently occurring equation types.
The Mamba models show contrasting trends depending on scale. The smaller Mamba 130M model performs better on equations with separable variables (0.4723 vs. 0.266), second- and third-order inhomogeneous equations (0.3298 vs. 0.2735 and 0.5048 vs. 0.4525), and hand-labeled equations (0.6346 vs. 0.6068), while the larger Mamba 2.8B model shows improved performance on homogeneous equations of second order (0.5496 vs. 0.1922) and third order (0.3433 vs. 0.2162). Overall, for both configurations of the Mamba model, the obtained BLEU scores indicate a relatively low level of similarity between the generated and the reference solutions.
DeepSeek Qwen demonstrates the weakest results on average, with BLEU scores remaining in a narrow range (0.2391–0.2807) across all classes, suggesting a limited ability to generate accurate or structurally aligned solutions. However, it is worth noting that this model exhibits a propensity for generating repetitive output, which is often prefixed with the token ‘Answer’ and may even include reasoning chains. This repetition and insertion of meta-commentary contribute significantly to the degradation of the BLEU score.
Table 3 presents the average TeXBLEU scores and standard deviations for the considered models across different classes of differential equations on the test dataset. This metric gives very high scores for the Mamba, Phi-4-mini, and DeepSeek Qwen models, and relatively high scores for the T5 model. However, we have noticed that these high scores do not imply a good similarity between the true and generated answers. For example, for the reference (true solution) $y = 3924x + 1195x^2 + 3191x^3 + C$ and the candidate (generated solution) $y = 3924x + 1195x^2 + 3191x^3 + C_1 + C_2 x^4 + C_5 x^6 + C_7 x^8 + C_8 x^9 + C_{10} x^{10} + C_{11} x^{12} + C_{13} x^{14} + C_{14} x^{18} + C_{15} x^{20} + C_{17} x^{22} + C_{18} x^{24} + C_{19} x^{26} + C_{20} x^{28} + C_{21} x^{30} + C_{22} x^{32} + C_{23} x^{34} + C_{25} x^{38} + C_{26} x^{42} + C_{27} x^{44} + C_{28} x^{46} + C_{29} x^{48} + C_{30} x^{50} + C_{31} x^{52} + C_{32} x^{54} + C_{33} x^{56} + C_{34} x^{58} + C_{35} x^{60} + C_{37} x^{62} + C_{38} x^{64}$, we obtain TeXBLEU = 1 while BLEU = 0.062. This means that the TeXBLEU metric overestimates the similarity between LaTeX expressions, especially when the candidate expression contains the reference expression but also includes other irrelevant terms.
Below, Table 4 demonstrates common error types generated by each model for each class of equations, which were found through manual inspection of our results. The analysis of Table 4 demonstrates that error types are highly model- and equation-dependent. Mamba models (130M and 2.8B) predominantly generate redundant and arbitrary mathematical terms. T5 exhibits systematic formatting errors alongside arbitrary symbolic outputs. Phi-4-mini frequently generates incorrect coefficients, reflecting challenges in maintaining mathematical accuracy. DeepSeek Qwen often yields either redundant expressions or incomplete solutions, capturing only partial elements of the correct answer.
Overall, the results highlight the importance of both model architecture and scale in capturing mathematical patterns, and point to the need for evaluation metrics that go beyond n-gram overlap for structurally rich domains such as differential equations and their solutions.

5. Discussion and Conclusions

In this work, we investigate the application of large language models (LLMs) to solving differential equations. To the best of our knowledge, we are the first to construct a dataset comprising 300,000 equation–solution pairs, which can be used by researchers for training and evaluating LLMs on the task of solving differential equations. The dataset was generated by sampling specific types of equations, including separable equations, homogeneous and inhomogeneous second-order equations, homogeneous and inhomogeneous third-order equations, and polynomial differential equations. In addition to the generated equations and their solutions, we included 1078 equations that were manually collected from various textbooks on differential equations. These textbook equations span a wide range of types beyond those represented in the synthetic subset.
Computational experiments were conducted using four types of architectures: Mamba, T5, Phi-4-mini, and DeepSeek-R1-Distill-Qwen. To evaluate the quality of the generated solutions, we compared the model outputs against ground-truth solutions using BLEU and TeXBLEU metrics.
Based on the results obtained, several key findings emerge. First, contrary to the claim made by the authors of the TeXBLEU metric regarding its suitability for evaluating LaTeX-based expressions, we observed that it tends to produce overly optimistic scores when the ground-truth solution is a subexpression of the generated one—even when the output contains a large number of irrelevant terms alongside the correct expression.
Second, among the models evaluated, Phi-4-mini achieved the best results, with BLEU scores in the range of 0.90–0.92, both for equation types that were well-represented in the training set and for underrepresented, manually collected equations. This indicates strong generalization and transfer capabilities, suggesting that Phi-4-mini may have acquired abstract mathematical reasoning patterns rather than relying solely on memorization. This makes it a promising tool for solving diverse mathematical problems beyond its fine-tuning distribution.
Third, the Mamba model, tested in two different configurations, demonstrated relatively low agreement between generated and true solutions. However, its TeXBLEU scores were the highest among all models, due to its tendency to generate long outputs that include both correct expressions and extended segments of irrelevant or “noisy” content.
Fourth, the T5 model achieved relatively strong performance (BLEU = 0.77) on the most prevalent class of equations, while its performance dropped for less represented types. This suggests that the model relies heavily on data coverage and highlights its limited ability to generalize to underrepresented equation types, emphasizing the importance of balanced training data distributions for robust model performance.
Fifth, the DeepSeek-R1-Distill-Qwen model produced uniformly low BLEU scores across all equation types and the lowest scores on average. Its poor performance, even after fine-tuning, indicates a limited capacity for symbolic mathematical problem-solving. This may reflect architectural or pre-training limitations that constrain the model’s ability to learn effective solution strategies, underscoring the need for task-specific adaptations or alternative training paradigms when applying general-purpose LLMs to mathematical domains.
Overall, our findings suggest that LLMs can be effectively used for solving differential equations. Among the models tested, Phi-4-mini appears most suitable for applied use by engineers and researchers since this model demonstrates high correspondence with the true solutions even for manually collected equations of diverse types, meaning that it possesses strong generalization capabilities. However, further investigation of its performance on additional types of equations is still warranted and constitutes a future direction of our work.
Speaking of potential future research directions, this work could be extended by investigating reasoning-enhanced LLMs for solving differential equations. A critical component of this approach would involve using secondary LLMs as specialized extractors to precisely isolate the final answer from the generated reasoning chains. Another possible direction would be to build a Multi-Agent Collaboration Platform (MCP) powered by fine-tuned LLMs. Within such a system, solving a differential equation could be handled by a sequence of specialized mathematical agents. For instance, an analytical solver agent could first attempt to derive a closed-form solution, the output of which would then be parsed and structured by a second agent. If no analytical solution is found, a third numerical solver agent could be activated to compute an approximate solution, thus creating a robust and multi-faceted problem-solving pipeline. Such a system would hold significant potential for addressing complex problems in domains including physics, engineering, and applied mathematics.
Regarding the limitations of our work, it is important to note that our dataset is imbalanced with respect to equation types, which may limit its utility for fine-tuning models with restricted generalization abilities. Therefore, constructing a class-balanced version of this dataset represents another important avenue for future research.
In addition to the aforementioned points, it is important to emphasize the need for developing specialized similarity metrics for comparing generated and reference solutions in LaTeX format. Since such expressions are not natural language text, standard evaluation metrics commonly used in machine translation tasks (such as BLEU) are not optimal for this domain. In this study, we also used TeXBLEU, a metric specifically proposed to compare LaTeX-formatted expressions. However, our results demonstrate that in its current form, this metric is of limited utility for evaluating complex and long mathematical expressions. It is worth noting that another possible approach to evaluating the quality of the generated solutions would be the computation of the L2 norm between the prediction and the reference. In practice, however, we encountered several challenges: (1) not all solutions are defined on a common interval, which makes it difficult to select a universal domain for evaluating the L2 norm; and (2) converting LaTeX strings from the generated solutions into NumPy-like functions is nontrivial, as even fine-tuned models may produce incorrect LaTeX code or hallucinate by generating arbitrary symbols or words. For these reasons, we chose not to employ this metric in the present study.
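For completeness, the sketch below shows how such an L2-based comparison could be implemented with SymPy and NumPy; it assumes that both LaTeX strings parse successfully, that free constants have already been fixed or matched, and that the interval [0, 1] is an acceptable common domain, which, as discussed above, does not hold in general.

```python
import numpy as np
import sympy as sp
from sympy.parsing.latex import parse_latex  # requires the antlr4 Python runtime

def l2_distance(reference_latex: str, candidate_latex: str,
                interval=(0.0, 1.0), num_points=1000) -> float:
    """Approximate L2 distance between two solutions y(x), given as LaTeX right-hand sides.
    Assumes both strings parse and are defined on the chosen interval."""
    x = sp.symbols("x")
    ref_fn = sp.lambdify(x, parse_latex(reference_latex), "numpy")
    cand_fn = sp.lambdify(x, parse_latex(candidate_latex), "numpy")
    xs = np.linspace(interval[0], interval[1], num_points)
    diff = ref_fn(xs) - cand_fn(xs)
    dx = (interval[1] - interval[0]) / (num_points - 1)
    return float(np.sqrt(np.sum(diff ** 2) * dx))  # simple Riemann approximation of the L2 norm

# Hypothetical example (free constants must be fixed beforehand):
# print(l2_distance(r"x^2 + 1", r"x^2 + 1.05"))
```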

Author Contributions

Conceptualization, S.K. and A.S.; methodology, S.K. and V.Z.; software, V.Z., V.I., and A.S.; validation, V.Z. and A.S.; formal analysis, V.Z. and V.I.; investigation, A.S., V.Z., and V.I.; resources, V.Z. and A.S.; data curation, V.Z.; writing—original draft preparation, A.S. and V.I.; writing—review and editing, V.I. and A.S.; visualization, V.Z. and V.I.; supervision, S.K.; project administration, V.I.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Russian Science Foundation (project No. 24-21-00332).

Data Availability Statement

The original data and code presented in the study are openly available on GitHub at https://github.com/hse-scila/dif_equations_fine_tune/tree/main, accessed on 8 September 2025.

Acknowledgments

This research was supported in part through computational resources of the HPC facilities at HSE University. During the preparation of this manuscript, the authors used ChatGPT-4o for the purposes of correction of the paper’s style and grammar. During the preparation of this study, the authors used t5-small, Mamba 130M, Mamba 2.8B, DeepSeek-R1-Distill-Qwen-7B, and Phi-4-mini models for the purpose of fine-tuning and testing large language models to generate solutions to differential equations. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large Language Model
BLEU: Bilingual Evaluation Understudy
PINN: Physics-Informed Neural Network
STEM: Science, Technology, Engineering, and Mathematics
SSM: State-Space Model

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  2. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  3. Howard, J.; Ruder, S. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 328–339. [Google Scholar] [CrossRef]
  4. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  5. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; The MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  6. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 2021, 64, 107–115. [Google Scholar] [CrossRef]
  7. Lewkowycz, A.; Andreassen, A.; Dohan, D.; Dyer, E.; Michalewski, H.; Ramasesh, V.; Slone, A.; Anil, C.; Schlag, I.; Gutman-Solo, T.; et al. Solving quantitative reasoning problems with language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS’22, Red Hook, NY, USA, 28 November 2022. [Google Scholar]
  8. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 240:1–240:113. [Google Scholar]
  9. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
  10. Cuomo, S.; Di Cola, V.S.; Giampaolo, F.; Rozza, G.; Raissi, M.; Piccialli, F. Scientific Machine Learning Through Physics–Informed Neural Networks: Where we are and What’s Next. J. Sci. Comput. 2022, 92, 88. [Google Scholar] [CrossRef]
  11. Shalyt, M.; Elimelech, R.; Kaminer, I. ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark. arXiv 2025, arXiv:2505.23851. [Google Scholar] [CrossRef]
  12. Zakharov, V.; Surkov, A.; Koltcov, S. AGDES: A Python package and an approach to generating synthetic data for differential equation solving with LLMs. Procedia Comput. Sci. 2025, 258, 1169–1178. [Google Scholar] [CrossRef]
  13. Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; Steinhardt, J. Measuring Mathematical Problem Solving with the MATH Dataset. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 5852–5864. [Google Scholar]
  14. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training Verifiers to Solve Math Word Problems. arXiv 2021, arXiv:2110.14168. [Google Scholar] [CrossRef]
  15. Nye, M.; Andreassen, A.J.; Gur-Ari, G.; Michalewski, H.; Austin, J.; Bieber, D.; Dohan, D.; Lewkowycz, A.; Bosma, M.; Luan, D.; et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv 2021, arXiv:2112.00114. [Google Scholar] [CrossRef]
  16. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 24824–24837. [Google Scholar]
  17. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 22199–22213. [Google Scholar]
  18. Zelikman, E.; Wu, Y.; Mu, J.; Goodman, N.D. STaR: Self-taught reasoner bootstrapping reasoning with reasoning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS’22, Red Hook, NY, USA, 9 December 2022. [Google Scholar]
  19. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  20. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS’23, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
  21. Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-aided Language Models. In Machine Learning Research, Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR: New York, NY, USA, 2023; Volume 202, pp. 10764–10799. [Google Scholar]
  22. Schick, T.; Dwivedi-Yu, J.; Dessi, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  23. Luo, H.; Sun, Q.; Xu, C.; Zhao, P.; Lou, J.G.; Tao, C.; Geng, X.; Lin, Q.; Chen, S.; Tang, Y.; et al. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  24. Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative Refinement with Self-Feedback. arXiv 2023, arXiv:2303.17651. [Google Scholar] [CrossRef]
  25. Chen, W.; Ma, X.; Wang, X.; Cohen, W.W. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv 2023, arXiv:2211.12588. [Google Scholar] [CrossRef]
  26. Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.W.; Galley, M.; Gao, J. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv 2024, arXiv:2310.02255. [Google Scholar] [CrossRef]
  27. Zhao, C.; Tan, Z.; Ma, P.; Li, D.; Jiang, B.; Wang, Y.; Yang, Y.; Liu, H. Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens. arXiv 2025, arXiv:2508.01191. [Google Scholar]
  28. Ling, Z.; Fang, Y.; Li, X.; Huang, Z.; Lee, M.; Memisevic, R.; Su, H. Deductive Verification of Chain-of-Thought Reasoning. arXiv 2023, arXiv:2306.03872. [Google Scholar]
  29. Zhang, H.; Da, J.; Lee, D.; Robinson, V.; Wu, C.; Song, W.; Zhao, T.; Raja, P.; Zhuang, C.; Slack, D.; et al. A Careful Examination of Large Language Model Performance on Grade School Arithmetic. arXiv 2024, arXiv:2405.00332. [Google Scholar] [CrossRef]
  30. Yang, K.; Swope, A.M.; Gu, A.; Chalamala, R.; Song, P.; Yu, S.; Godil, S.; Prenger, R.; Anandkumar, A. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. arXiv 2023, arXiv:2306.15626. [Google Scholar]
  31. Nikankin, Y.; Reusch, A.; Mueller, A.; Belinkov, Y. Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics. arXiv 2025, arXiv:2410.21272. [Google Scholar]
  32. Nasr, M.; Carlini, N.; Hayase, J.; Jagielski, M.; Cooper, A.F.; Ippolito, D.; Choquette-Choo, C.A.; Wallace, E.; Tramèr, F.; Lee, K. Scalable Extraction of Training Data from (Production) Language Models. arXiv 2023, arXiv:2311.17035. [Google Scholar] [CrossRef]
  33. Morris, J.X.; Sitawarin, C.; Guo, C.; Kokhlikyan, N.; Suh, G.E.; Rush, A.M.; Chaudhuri, K.; Mahloujifar, S. How much do language models memorize? arXiv 2025, arXiv:2505.24832. [Google Scholar] [CrossRef]
  34. Lample, G.; Charton, F. Deep Learning for Symbolic Mathematics. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  35. Huang, X.; Shen, Q.; Hu, Y.; Gao, A.; Wang, B. LLMs for Mathematical Modeling: Towards Bridging the Gap between Natural and Mathematical Languages. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 2678–2710. [Google Scholar] [CrossRef]
  36. Sun, J.; Liu, Y.; Zhang, Z.; Schaeffer, H. Towards a foundation model for partial differential equations: Multioperator learning and extrapolation. Phys. Rev. E 2025, 111, 035304. [Google Scholar] [CrossRef]
  37. Du, M.; Chen, Y.; Wang, Z.; Nie, L.; Zhang, D. Large language models for automatic equation discovery of nonlinear dynamics. Phys. Fluids 2024, 36, 097121. [Google Scholar] [CrossRef]
  38. Kuriam, J. Differential-Equations. 2016. Available online: https://github.com/JaKXz/Differential-Equations (accessed on 9 August 2025).
  39. Filippov, A. Sbornik Zadach Po Differentsialnym Uravneniiam [Collection of Problems on Differential Equations]; Regular and Chaotic Dynamics: Izhevsk, Russia, 2000. [Google Scholar]
  40. Lu, Q.; Dou, D.; Nguyen, T. ClinicalT5: A Generative Language Model for Clinical Text. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5436–5443. [Google Scholar] [CrossRef]
  41. Li, Y.; Harrigian, K.; Zirikly, A.; Dredze, M. Are Clinical T5 Models Better for Clinical Text? arXiv 2024, arXiv:2412.05845. [Google Scholar] [CrossRef]
  42. Athugodage, M.; Mitrofanove, O.; Gudkov, V. Transfer Learning for Russian Legal Text Simplification. In Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024, Torino, Italy, 20 May 2024; pp. 59–69. [Google Scholar]
  43. Zhang, W.; Shen, H.; Lei, T.; Wang, Q.; Peng, D.; Wang, X. GLQA: A Generation-based Method for Legal Question Answering. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Queensland, Australia, 18–23 June 2023; pp. 1–8. [Google Scholar] [CrossRef]
  44. Poornima, A.; Nagaraja, K.V.; Venugopalan, M. Legal Contract Analysis and Risk Assessment Using Pre-Trained Legal-T5 and Law-GPT. In Proceedings of the 2025 3rd International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India, 21–22 February 2025; pp. 1–8. [Google Scholar] [CrossRef]
  45. Li, Y.; Wang, S.; Ding, H.; Chen, H. Large Language Models in Finance: A Survey. In Proceedings of the Fourth ACM International Conference on AI in Finance, ICAIF’23, New York, NY, USA, 27–29 November 2023; pp. 374–382. [Google Scholar] [CrossRef]
  46. Xia, Y.; Huang, Y.; Qiu, Q.; Zhang, X.; Miao, L.; Chen, Y. A Question and Answering Service of Typhoon Disasters Based on the T5 Large Language Model. ISPRS Int. J. Geo-Inf. 2024, 13, 165. [Google Scholar] [CrossRef]
  47. Luo, J.; Chen, Z.; Chen, W.; Lu, H.; Lyu, F. A study on the application of the T5 large language model in encrypted traffic classification. Peer-Netw. Appl. 2025, 18, 15. [Google Scholar] [CrossRef]
  48. Liao, X.; Zhu, B.; He, J.; Liu, G.; Zheng, H.; Gao, J. A Fine-Tuning Approach for T5 Using Knowledge Graphs to Address Complex Tasks. arXiv 2025, arXiv:2502.16484. [Google Scholar] [CrossRef]
  49. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  50. Waleffe, R.; Byeon, W.; Riach, D.; Norick, B.; Korthikanti, V.; Dao, T.; Gu, A.; Hatamizadeh, A.; Singh, S.; Narayanan, D.; et al. An Empirical Study of Mamba-based Language Models. arXiv 2024, arXiv:2406.07887. [Google Scholar] [CrossRef]
  51. Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A Survey on Visual Mamba. Appl. Sci. 2024, 14, 5683. [Google Scholar] [CrossRef]
  52. Schiff, Y.; Kao, C.H.; Gokaslan, A.; Dao, T.; Gu, A.; Kuleshov, V. Caduceus: Bi-directional equivariant long-range DNA sequence modeling. In Proceedings of the 41st International Conference on Machine Learning, JMLR.org, ICML’24, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  53. Erol, M.H.; Senocak, A.; Feng, J.; Chung, J.S. Audio Mamba: Bidirectional State Space Model for Audio Representation Learning. IEEE Signal Process. Lett. 2024, 31, 2975–2979. [Google Scholar] [CrossRef]
  54. Ma, H.; Chen, Y.; Zhao, W.; Yang, J.; Ji, Y.; Xu, X.; Liu, X.; Jing, H.; Liu, S.; Yang, G. A Mamba Foundation Model for Time Series Forecasting. arXiv 2024, arXiv:2411.02941. [Google Scholar] [CrossRef]
  55. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  56. Microsoft; Abouelenin, A.; Ashfaq, A.; Atkinson, A.; Awadalla, H.; Bach, N.; Bao, J.; Benhaim, A.; Cai, M.; Chaudhary, V.; et al. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs. arXiv 2025, arXiv:2503.01743. [Google Scholar]
  57. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL’02, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  58. Wang, B.; Wu, F.; Ouyang, L.; Gu, Z.; Zhang, R.; Xia, R.; Zhang, B.; He, C. Image over Text: Transforming Formula Recognition Evaluation with Character Detection Matching. arXiv 2025, arXiv:2409.03643. [Google Scholar]
  59. Jung, K.; Kim, N.J.; Ryu, H.G.; Hyeon, S.; Lee, S.J.; Lee, H.J. TeXBLEU: Automatic Metric for Evaluate LaTeX Format. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
  60. Mahdavi, M.; Zanibbi, R. Tree-based structure recognition evaluation for math expressions. In Proceedings of the GREC, Sydney, Australia, 20–21 September 2019; pp. 9–10. [Google Scholar]
Table 1. The distribution of the number of equations by each class.
Class | Number of Equations
Polynomial (generated) | 7600
With separable variables (generated) | 242,753
Second-order homogeneous (generated) | 9405
Third-order homogeneous (generated) | 6498
Second-order inhomogeneous (generated) | 15,751
Third-order inhomogeneous (generated) | 6369
Manually collected from textbooks | 1078
Table 2. BLEU scores for each class of differential equations from test dataset.
Model | Separable Variables | Inhomog. 2nd Order | Homog. 2nd Order | Polynomial | Inhomog. 3rd Order | Homog. 3rd Order | Hand-Labeled
Mamba 130M | 0.47230 (±0.39) | 0.3298 (±0.17) | 0.1922 (±0.15) | 0.1013 (±0.08) | 0.5048 (±0.21) | 0.2162 (±0.19) | 0.6346 (±0.36)
Mamba 2.8B | 0.2666 (±0.19) | 0.2735 (±0.17) | 0.5496 (±0.46) | 0.1688 (±0.24) | 0.4525 (±0.19) | 0.3433 (±0.41) | 0.6068 (±0.38)
T5 | 0.7715 (±0.08) | 0.5328 (±0.06) | 0.1477 (±0.05) | 0.6030 (±0.17) | 0.5142 (±0.08) | 0.08163 (±0.02) | 0.08653 (±0.11)
Phi-4-mini | 0.9102 (±0.12) | 0.9216 (±0.14) | 0.9140 (±0.07) | 0.9219 (±0.006) | 0.9009 (±0.39) | 0.9107 (±0.02) | 0.9279 (±0.23)
Deepseek Qwen | 0.2715 (±0.11) | 0.2773 (±0.16) | 0.2681 (±0.05) | 0.2731 (±0.24) | 0.2728 (±0.27) | 0.2807 (±0.02) | 0.2391 (±0.07)
Table 3. TeXBLEU scores for each class of differential equations from test dataset.
Model | Separable Variables | Inhomog. 2nd Order | Homog. 2nd Order | Polynomial | Inhomog. 3rd Order | Homog. 3rd Order | Hand-Labeled
Mamba 130M | 0.999994 (±0.0001) | 1 (±0) | 0.999937 (±0.0004) | 1 (±0) | 0.999999 (±0.00001) | 1 (±0) | 0.999995 (±0.00004)
Mamba 2.8B | 1 (±0) | 1 (±0.00001) | 1 (±0) | 1 (±0) | 1 (±0) | 1 (±0) | 0.999968 (±0.00014)
T5 | 0.753553 (±0.015) | 0.733906 (±0.008) | 0.795395 (±0.005) | 0.884749 (±0.0569) | 0.741548 (±0.008) | 0.790599 (±0.009) | 0.791326 (±0.061)
Phi-4-mini | 0.926765 (±0.083) | 0.982052 (±0.046) | 0.943722 (±0.078) | 0.999815 (±0.002) | 0.957788 (±0.078) | 0.917312 (±0.08) | 0.787775 (±0.05)
Deepseek Qwen | 0.926764 (±0.08) | 0.982052 (±0.017) | 0.943722 (±0.076) | 0.999815 (±0.0004) | 0.957787 (±0.036) | 0.917312 (±0.08) | 0.787775 (±0.051)
Table 4. Common error types for each class of differential equations.
Equation Type | Mamba 130M | Mamba 2.8B | T5 | Phi-4-mini | Deepseek Qwen
separable variables | redundant terms of the form $C_i x^i$, arbitrary symbols and words from natural language | redundant arbitrary math expressions | absence of "\" and "{}" in math functions; arbitrary math expressions | incorrect functions (for example, arctg instead of cos) | incorrect functions (e.g., exponential instead of trigonometric)
inhomog. 2nd order | long redundant arbitrary math expressions | redundant arbitrary math expressions | absence of "\" and "{}" in math functions | incorrect coefficients in math functions | redundant expressions
homog. 2nd order | redundant terms of the form $k x_i^n$, arbitrary symbols | redundant terms of the form $k x_i^n$ and $C_i e^{kx}$ | absence of "\" and "{}" in math functions | incorrect coefficients in exponential functions | no typical mistakes
polynomial | many redundant terms of the form $C_i x^i$ | many redundant terms of the form $C_i x^i$ and $x_k^n$ | absence of "\" and "{}" in math functions | incorrect coefficients in power functions | many redundant power functions
inhomog. 3rd order | long redundant arbitrary math expressions | redundant arbitrary terms with trigonometric functions | absence of "\" and "{}" in math functions; shorter expressions than the true answers | generation of shorter expressions containing parts of true answers | generation of shorter expressions containing parts of true answers
homog. 3rd order | redundant arbitrary math expressions, symbols and words from natural language | redundant arbitrary math terms | absence of "\" before math functions and "{}" in math functions | incorrect coefficients in exponential functions | no typical mistakes
hand-labeled | arbitrary math expressions | redundant arbitrary math expressions and symbols | absence of "\" and "{}" in math functions; arbitrary math expressions | arbitrary math functions | arbitrary math functions
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ignatenko, V.; Surkov, A.; Zakharov, V.; Koltcov, S. Transformers and State-Space Models: Fine-Tuning Techniques for Solving Differential Equations. Sci 2025, 7, 130. https://doi.org/10.3390/sci7030130

AMA Style

Ignatenko V, Surkov A, Zakharov V, Koltcov S. Transformers and State-Space Models: Fine-Tuning Techniques for Solving Differential Equations. Sci. 2025; 7(3):130. https://doi.org/10.3390/sci7030130

Chicago/Turabian Style

Ignatenko, Vera, Anton Surkov, Vladimir Zakharov, and Sergei Koltcov. 2025. "Transformers and State-Space Models: Fine-Tuning Techniques for Solving Differential Equations" Sci 7, no. 3: 130. https://doi.org/10.3390/sci7030130

APA Style

Ignatenko, V., Surkov, A., Zakharov, V., & Koltcov, S. (2025). Transformers and State-Space Models: Fine-Tuning Techniques for Solving Differential Equations. Sci, 7(3), 130. https://doi.org/10.3390/sci7030130
