Article

A Small-Scale Evaluation of Large Language Models Used for Grammatical Error Correction in a German Children’s Literature Corpus: A Comparative Study

by Phuong Thao Nguyen 1,*, Bernd Nuss 2, Roswita Dressler 2 and Katie Ovens 1,*

1 Department of Computer Science, Faculty of Science, University of Calgary, Calgary, AB T2N 1N4, Canada
2 Werklund School of Education, University of Calgary, Calgary, AB T2N 1N4, Canada
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2476; https://doi.org/10.3390/app15052476
Submission received: 11 January 2025 / Revised: 14 February 2025 / Accepted: 18 February 2025 / Published: 25 February 2025
(This article belongs to the Special Issue Applications of Natural Language Processing to Data Science)

Abstract
Grammatical error correction (GEC) has become increasingly important for enhancing the quality of OCR-scanned texts. This small-scale study explores the application of Large Language Models (LLMs) for GEC in German children’s literature, a genre with unique linguistic challenges due to modified language, colloquial expressions, and complex layouts that often lead to OCR-induced errors. While conventional rule-based and statistical approaches have been used in the past, advancements in machine learning and artificial intelligence have introduced models capable of more contextually nuanced corrections. Despite these developments, limited research has been conducted on evaluating the effectiveness of state-of-the-art LLMs, specifically in the context of German children’s literature. To address this gap, we fine-tuned encoder-based models GBERT and GELECTRA on German children’s literature, and compared their performance to decoder-based models GPT-4o and Llama series (versions 3.2 and 3.1) in a zero-shot setting. Our results demonstrate that all pretrained models, both encoder-based (GBERT, GELECTRA) and decoder-based (GPT-4o, Llama series), failed to effectively remove OCR-generated noise in children’s literature, highlighting the necessity of a preprocessing step to handle structural inconsistencies and artifacts introduced during scanning. This study also addresses the lack of comparative evaluations between encoder-based and decoder-based models for German GEC, with most prior work focusing on English. Quantitative analysis reveals that decoder-based models significantly outperform fine-tuned encoder-based models, with GPT-4o and Llama-3.1-70B achieving the highest accuracy in both error detection and correction. Qualitative assessment further highlights distinct model behaviors: GPT-4o demonstrates the most consistent correction performance, handling grammatical nuances effectively while minimizing overcorrection. Llama-3.1-70B excels in error detection but occasionally relies on frequency-based substitutions over meaning-driven corrections. Unlike earlier decoder-based models, which often exhibited overcorrection tendencies, our findings indicate that state-of-the-art decoder-based models strike a better balance between correction accuracy and semantic preservation. By identifying the strengths and limitations of different model architectures, this study enhances the accessibility and readability of OCR-scanned German children’s literature. It also provides new insights into the role of preprocessing in digitized text correction, the comparative performance of encoder- and decoder-based models, and the evolving correction tendencies of modern LLMs. These findings contribute to language preservation, corpus linguistics, and digital archiving, offering an AI-driven solution for improving the quality of digitized children’s literature while ensuring linguistic and cultural integrity. Future research should explore multimodal approaches that integrate visual context to further enhance correction accuracy for children’s books with image-embedded text.

1. Introduction

Advancements in machine learning and artificial intelligence (AI) have significantly enhanced the field of GEC, which is particularly important for improving the accuracy of texts processed through Optical Character Recognition (OCR). This is a crucial step in transforming digitized versions of historical and literary works into clean, usable formats. Large language models (LLMs) have emerged as powerful tools for this purpose, demonstrating a remarkable ability to understand and correct complex linguistic patterns. However, some types of texts present unique challenges; German children’s literature, for example, is characterized by modified language, colloquial expressions, and narrative styles tailored to young readers. Additionally, children’s books often contain a mix of images and text, which can make the OCR process more complex. The presence of illustrations, decorative fonts, and varied text layouts can also introduce additional noise and errors during scanning, leading to a higher rate of misrecognition. Therefore, cleaning texts digitized using OCR can be challenging due to the wide variety of errors that can be introduced, including distorted characters, misplaced text, and other layout-related issues.
The field of GEC in OCR-scanned documents has seen substantial advancements, driven by the need to improve the quality of digitized texts for analysis and reading. Early approaches relied on rule-based methods that utilized predefined grammar rules and dictionaries to correct errors, but these methods often lacked the flexibility to handle context-specific variations [1]. With the advent of machine learning, statistical models offered greater adaptability, and more recently, neural network-based models, including large language models, have become the standard for GEC tasks [2,3].
German-specific GEC faces unique challenges due to the morphological complexity of the language. Evaluating the robustness of a model to OCR-induced noise is critical for ensuring accuracy in processed texts [4]. Despite the progress in this area, there is a notable gap in evaluating the performance of state-of-the-art LLMs on the task of GEC in German texts, specifically those found in children’s literature. While LLMs have shown promise in general error correction tasks, their effectiveness in handling the particular linguistic features of German children’s books has not been fully explored. Addressing this gap is crucial for leveraging the full potential of LLMs, allowing for a more automated and accurate approach to cleaning OCR-scanned documents.
The goal of this study is to address this research gap by comparing the performance of various LLMs in correcting grammatical errors in OCR-scanned German children’s literature. By evaluating different models, we aim to identify the one that performs best for this specific task, considering the unique linguistic aspects of the texts.
The significance of this research lies in its potential to improve the accessibility and readability of OCR-scanned German children’s literature. Given that OCR processes frequently introduce errors, effective grammatical correction is essential for preserving the original meaning and educational value of these works. By comparing different large language models, this study seeks to identify the strengths and weaknesses of these models, including the most suitable model for handling such corrections, contributing valuable insights into the practical applications of AI in language preservation. Furthermore, this work holds significance for corpus linguistics research by providing cleaner, more accurate textual data, which is crucial for analyzing linguistic patterns and studying language use in culturally significant genres. Ultimately, this research contributes to expanding the available sources of German literature, particularly children’s literature, helping to address the scarcity of such resources. By improving the quality of digitized texts, it accelerates the process of analyzing text complexity in German children’s literature [5], facilitating more comprehensive linguistic and educational studies.

2. Literature Review

GEC is an NLP task that aims to improve text quality by detecting and correcting grammatical, syntactic, and stylistic errors. Effective GEC enhances text clarity, making it particularly useful in educational applications, automated proofreading, and document digitization [2]. Early approaches to GEC relied on rule-based methods [1,6], which were effective in structured contexts but lacked adaptability. These were later replaced by statistical models that utilized large corpora to improve correction accuracy [3]. The field has since evolved with the advent of deep learning, particularly transformer-based architectures [7,8,9], which have demonstrated superior performance in grammatical correction through advanced contextual understanding.
OCR is a technique utilized to convert printed text into digital format, yet it often introduces errors that compromise text integrity. These errors stem from character misrecognition, incorrect word segmentation, and inconsistencies in formatting, particularly when dealing with complex layouts and degraded print quality [10]. Correcting OCR-generated text requires robust GEC models capable of handling both grammatical and OCR-specific errors. Traditional approaches, such as rule-based spell-checkers, have shown limited success due to their inability to capture contextual meaning. More recent research has explored the application of neural-based GEC models to address OCR-induced errors, improving overall text quality and usability [7]. However, these challenges are particularly pronounced in children’s literature, which presents additional OCR-specific difficulties. Unlike conventional texts, children’s books often feature short, fragmented text interwoven with illustrations, highly stylized fonts, and playful layouts, making OCR misrecognition more prevalent. The presence of decorative elements and non-standard typography can lead to misclassified characters, missing words, and misplaced text blocks, further complicating grammatical error correction. While neural-based models have improved text restoration for general OCR tasks, limited research has investigated their effectiveness in the unique context of digitized children’s literature. The heavy reliance on visual context in these books also presents a challenge, as current text-only GEC models struggle to infer meaning when words are extracted from image-based elements. However, there remains a need to assess the performance of different model architectures in correcting OCR-generated errors, particularly in complex domains such as digitized literature.
German presents unique challenges for GEC due to its morphologically complex nature and syntactic variability. Inflectional variations, case-based word changes, and flexible word order introduce additional difficulties in identifying and correcting errors. Unlike English, German exhibits a verb-second (V2) structure in main clauses and verb-final positioning in subordinate clauses, requiring models to account for long-distance dependencies in sentence structure [11]. Another key challenge is German’s extensive use of compound words, which are often misrecognized by OCR systems and incorrectly segmented in text-processing tasks. For example, OCR may incorrectly split “Donaudampfschifffahrtsgesellschaftskapitän” into multiple words, leading to syntactic errors. As a result, effective GEC in German must incorporate advanced contextual understanding to distinguish between correct compound structures and OCR-induced errors. While previous work has demonstrated that fine-tuned encoder-based models, such as GBERT and GELECTRA, offer improvements in handling German syntax [12], recent research suggests that decoder-based models, such as GPT-4o and Llama, provide competitive performance without fine-tuning [13,14].
Recent advancements in deep learning have led to the development of two dominant approaches for GEC: encoder-based and decoder-based models. Encoder-based models, such as GBERT and GELECTRA, process input bidirectionally, allowing them to capture syntactic relationships more effectively [12]. These models have the potential to perform well when fine-tuned on domain-specific datasets. However, their reliance on labeled training data limits their adaptability to new text domains, such as OCR-corrected literature.
In contrast, decoder-based models, such as GPT-4o and Llama, operate in an autoregressive manner, generating text one token at a time based on prior context [15]. These models have gained popularity due to their zero-shot and few-shot learning capabilities, enabling them to generalize across different languages and text structures without extensive fine-tuning [7,8,9,16]. Recent studies have also explored whether improved prompt engineering can enhance the grammatical correction capabilities of decoder-based models, such as GPT-3.5. Coyne et al. [17] found that carefully structured prompts, incorporating explicit correction instructions and contextual constraints, significantly improved the ability of GPT-3.5 to maintain semantic integrity while reducing unnecessary corrections. Despite their advantages, decoder-based models may exhibit overcorrection tendencies, where they modify text unnecessarily, altering its original meaning [13]. In scanned documents of children’s literature—especially where illustrations are omitted—the sentences are often quite short and may contain very few words. If language models like GPT-3.5 or others attempt to enhance these sentences by adding extra words, they risk altering the simplicity and clarity that are essential for young readers. This could lead to the loss of the original meaning, making the text more complex or introducing unintended interpretations.
While significant progress has been made in GEC, several research gaps remain:
  • Limited research on GEC for OCR-generated German text, particularly in the domain of digitized literature.
  • A lack of comparative studies on encoder-based vs. decoder-based models for German GEC, with most prior work focusing on English.
  • An insufficient analysis of overcorrection tendencies in decoder-based models, which may impact text accuracy in real-world applications like digitized children’s literature.
This study aims to address these gaps by systematically evaluating different GEC models for OCR-generated German text and assessing their strengths and weaknesses in handling grammatical errors in children’s literature.

3. Methods

3.1. Data

The dataset for this study consists of OCR-scanned German children’s literature from the library of an elementary school where students receive up to 50% of instruction in German. Permission for scanning was obtained from the school principal and librarian, within a project under ethical approval from both the University of Calgary and the school board. The collection, developed as a disposable corpus for research purposes, primarily includes modern narrative books for children aged 5–12, as well as a smaller selection of children’s non-fiction books. Books were either purchased by the school or donated by parents over the German program’s 40-year history. Corpus creation was funded by the University of Calgary and the German government, completed between May and June 2022. The documents were scanned using the JoyUsing L140 (Fujian Joyusing Technology Co., Ltd., Fuzhou, China) document scanner with the Abbyy OCR system.
For model evaluation, we used seven documents from this corpus, totaling 297 sentences and 3049 words. Sentence and word counts varied across the documents, with sentence counts ranging from 21 to 72 and word counts from 209 to 533, as shown in Table 1. Ground truth labels for detecting and correcting errors in these documents were created by a German language expert, ensuring high accuracy and reliability. The expert provided us with the corrected and uncorrected versions of the texts.

3.2. Preprocessing

There are several significant challenges related to the quality and structure of the dataset. The nature of these challenges was diverse, ranging from technical scanning errors to contextual inaccuracies in word usage. Understanding the specific characteristics of these challenges is crucial for developing an effective preprocessing method to prepare the dataset for our GEC pipeline.
  • The Presence of Meaningless Symbols: The dataset contained numerous instances of meaningless symbols within the documents, likely artifacts of the text conversion or scanning process. These extraneous symbols posed a significant hurdle in accurately interpreting and processing the text data.
  • Scanning Errors: A notable problem was the misrecognition of letters by the scanning system. A common error observed was the scanner mistaking ‘tt’ for ‘H’. Such errors could lead to substantial misinterpretations of the text, affecting the integrity of our text complexity analysis.
  • Contextual Inaccuracies in Word Usage: Words were incorrectly used in context. For example, the dataset used “horten” (to hoard) instead of “hörten” (heard), as seen in the sentence: “Aber da horten sie hinter sich schon das gefährliche Knurren von Stormellas Wölfen”. These inaccuracies could potentially skew the assessment of text complexity and readability.
  • Redundant Information: The dataset retained headers, footers, and tables of contents within the documents. These elements, while essential in a book’s layout, are extraneous for the purpose of analyzing text complexity and thus need to be removed.
To enhance the document’s conciseness and relevance, a pattern-matching algorithm was developed. This algorithm is particularly tailored to search through documents and methodically identify as well as remove sections deemed non-essential, thereby streamlining the textual content. The current results of the algorithm are illustrated in Figure 1.
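As a rough illustration of this approach, the sketch below filters a document line by line against a set of regular-expression patterns. The specific patterns shown (bare page numbers, symbol-only lines, table-of-contents dot leaders, TOC headings) are assumptions for illustration; the rules actually used were tailored to the layout quirks of this corpus.

```python
import re

# Illustrative noise patterns; the study's rules were tailored to the corpus.
NOISE_PATTERNS = [
    re.compile(r"^\s*\d+\s*$"),            # bare page numbers
    re.compile(r"^\W+$"),                  # lines containing only stray symbols
    re.compile(r"\.{4,}\s*\d+\s*$"),       # table-of-contents dot leaders
    re.compile(r"^\s*Inhalt(sverzeichnis)?\s*$", re.IGNORECASE),  # TOC headings
]

def clean_document(lines: list[str]) -> list[str]:
    """Keep only lines that match none of the noise patterns."""
    return [ln for ln in lines if not any(p.search(ln) for p in NOISE_PATTERNS)]

raw = ["Inhaltsverzeichnis", "Kapitel 1 .......... 5", "12",
       "Aber da hörten sie das Knurren.", "***"]
print(clean_document(raw))  # -> ['Aber da hörten sie das Knurren.']
```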
The algorithm is highly accurate: it correctly identifies and removes 95.7% of the lines classified as non-critical, and it deletes superfluous lines more than sixteen times as often as it inappropriately discards necessary ones.
Further, Figure 2 provides an illustrative example of the algorithm in action. It takes a segment of a document and demonstrates the transformative effect of removing redundant parts. This visual representation exhibits the enhancement in the readability and clarity of the document post-processing. The elimination of superfluous elements declutters the text and significantly improves the ease with which the reader can assimilate and comprehend the text.

3.3. Language Models

In this study, we evaluate the performance of various language models on grammatical error detection and correction tasks specific to German children’s literature, focusing on both fine-tuned encoder-based and decoder-based models. The encoder-based models, GBERT and GELECTRA, are German-specific transformers that have been fine-tuned to enhance their grammatical correction capabilities, especially in German children’s literature, leveraging their specialized language processing for German text. Alongside these, we assess zero-shot, large-scale, decoder-based, multipurpose models like GPT and Llama, which are designed for diverse linguistic tasks across multiple languages, including German. Even though GBERT and GELECTRA are much smaller than the GPT and Llama models, smaller models have in the past shown surprising effectiveness under constrained resource settings [18], so there was reason to hope they could outperform these much larger models. By comparing the encoder-based models’ focused, language-specific capabilities with the broader adaptability of the decoder-based models, this study aims to provide insights into each model’s suitability for nuanced grammatical corrections in OCR-scanned German literature.

3.3.1. Encoder-Based Models

The development of large language models (LLMs) in natural language processing (NLP) has progressed through several key innovations, each building on the capabilities of its predecessors. BERT (Bidirectional Encoder Representations from Transformers), introduced in 2019 [19], marked a significant shift by training models bidirectionally. Unlike previous models that read text in one direction, BERT’s approach allowed it to consider the context from both the left and right sides of a word, greatly enhancing its performance in understanding context and correcting grammar. Following BERT, ELECTRA was introduced [20], offering a more efficient training mechanism for language models. Instead of using BERT’s masked language modeling, where words are randomly masked and then predicted, ELECTRA trains by replacing words and having the model detect whether each word is original or replaced. This “replaced token detection” approach enabled ELECTRA to learn from a larger number of examples in the same amount of time, making it faster and more effective for tasks like grammatical error correction and handling noisy text, such as OCR outputs. Building on the foundations of BERT and ELECTRA, adaptations like GBERT and GELECTRA emerged, specifically tailored for the German language [12]. GBERT adapts the bidirectional training of BERT for German text, fine-tuning the model on German-specific linguistic features, while GELECTRA applies the efficiency of ELECTRA’s replaced token detection to German corpora. These adaptations make GBERT and GELECTRA particularly well-suited for understanding the complex morphological and syntactic structures of the German language, offering improved performance in grammatical error correction tasks for German texts.
We used the original architectures of GBERT and GELECTRA introduced by Chan et al. [12]. Both GBERT and GELECTRA were then fine-tuned on the dataset of German children’s literature mentioned in Section 3.1, after splitting each document into individual sentences. For training, 80% of the sentences were allocated to the training set, with 10% designated for validation and the remaining 10% for testing. These subsets were randomly sampled from the complete pool of sentences across all documents, ensuring a representative distribution for evaluation purposes.
Originally pretrained on an extensive German text corpus, GBERT was fine-tuned for a targeted token classification task to suit the specific demands of this study. The model underwent customization by modifying its final layer to predict two categories (incorrect or correct), in alignment with the dataset’s requirements. To address the class imbalance prevalent in the dataset, a bespoke loss function was integrated through an extension of the Trainer class, allowing class weights to balance the impact of each category during training. The fine-tuning process involved training the model on a carefully assembled dataset of German sentences, annotated explicitly for token classification, while the validation set was used to monitor the model’s generalization performance. This approach enabled the model to acclimate to the distinct linguistic and contextual nuances present within the data. Hyperparameters were selected through multiple tuning trials: a learning rate of 0.0001, batch sizes of 16 for both the training and evaluation phases, and the AdamW optimizer.
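For concreteness, the following is a minimal sketch of such a class-weighted token-classification setup with the Hugging Face transformers library, assuming the GBERT checkpoint released by Chan et al. [12]. The class-weight values and the dataset variables (train_ds, val_ds) are placeholders; only the hyperparameters reported above (learning rate 0.0001, batch size 16, AdamW) come from the study.

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder weights for [correct, incorrect]; in practice they would be
# derived from the class imbalance in the training split.
CLASS_WEIGHTS = torch.tensor([1.0, 10.0])

class WeightedTrainer(Trainer):
    """Trainer extension that applies class weights to the token-level loss."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = CrossEntropyLoss(weight=CLASS_WEIGHTS.to(outputs.logits.device))
        loss = loss_fct(outputs.logits.view(-1, model.config.num_labels),
                        labels.view(-1))
        return (loss, outputs) if return_outputs else loss

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "deepset/gbert-base", num_labels=2)  # 0 = correct token, 1 = incorrect token

args = TrainingArguments(
    output_dir="gbert-gec",
    learning_rate=1e-4,                # as reported above
    per_device_train_batch_size=16,    # as reported above
    per_device_eval_batch_size=16,
    optim="adamw_torch",
)
# trainer = WeightedTrainer(model=model, args=args,
#                           train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```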
Like GBERT, GELECTRA benefits from pretraining on a vast corpus of German-language texts, equipping it with a deep understanding of the linguistic intricacies of German. The potential advantage of GELECTRA for this task lies in its unique pretraining approach, which focuses on generating and distinguishing between correct and artificially generated incorrect tokens. This approach could theoretically provide a more nuanced understanding of language errors, making it particularly suitable for the error detection phase of word correction. The fine-tuning process for GELECTRA mirrored that of GBERT, using similar parameters.

3.3.2. Decoder-Based Models

The evolution of LLMs continued with the release of the GPT (Generative Pretrained Transformer) series, including GPT-2 in 2019 [21], GPT-3 in 2020 [15], and GPT-4 in 2023 [22], which introduced autoregressive modeling for generating coherent and contextually accurate paragraphs. Unlike BERT and ELECTRA, which focus on understanding text, GPT models excel in generating and completing sentences, making them versatile for a wider range of language tasks, from text generation to nuanced error correction.
Following GPT, Meta AI’s Llama (Large Language Model Meta AI) [14] has aimed to combine high performance with greater accessibility. Llama offers several advantages over previous models, including being open-source, which allows researchers and developers to access its model weights and code freely. This openness encourages innovation, enabling the community to fine-tune the model for specific applications or research needs. Additionally, Llama is designed with efficiency in mind, requiring fewer computational resources compared to models like GPT-4. It is able to run effectively on smaller hardware setups while still achieving competitive performance, making it an attractive option for those without access to extensive computing infrastructure.
To evaluate the capabilities of various LLMs in performing nuanced linguistic tasks, we conducted a zero-shot evaluation on five models: GPT-4o, Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B, and Llama-3.1-70B. In this context, zero-shot evaluation refers to assessing each model’s ability to perform grammatical error detection and correction on a dataset of German children’s literature without any additional fine-tuning specific to this task. Zero-shot learning allows the models to rely on their inherent language processing capabilities derived from extensive pretraining, providing insights into their adaptability to tasks they were not explicitly trained to handle.
Each model brings unique attributes to this evaluation. GPT-4o, the latest high-intelligence flagship model, is designed for complex, multi-step tasks and represents an advancement in the GPT series with over 200 billion parameters, which surpasses GPT-4’s 175 billion. This allows for nuanced and highly detailed responses while maintaining efficiency and speed, making GPT-4o suitable for intricate tasks at a lower cost and faster response time than its predecessor, GPT-4 Turbo. However, using GPT-4o still incurs a cost, with pricing set at USD 2.50 per million input tokens and USD 10.00 per million output tokens, with a discounted rate of USD 1.25 per million input tokens and USD 5.00 per million output tokens through the Batch API. Llama-3.2, a more recent release from Meta AI, offers lightweight and highly efficient models with parameter counts of 1.24 billion for Llama-3.2-1B and 3.21 billion for Llama-3.2-3B, aimed at decoder-based text applications. Meanwhile, Llama-3.1, an earlier model, includes larger parameter versions like the 8B and 70B we tested, delivering potentially richer linguistic and contextual understanding due to their higher parameter counts. Unlike GPT-4o, all Llama models are open-source and freely accessible on the Hugging Face and Meta websites, allowing for broader experimentation and customization without associated usage costs. For this evaluation, we utilized the OpenAI API to access GPT-4o, while all Llama models were tested through their instruct versions available on Hugging Face.
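A minimal sketch of this zero-shot setup for GPT-4o via the OpenAI API follows; the German instruction prompt is our own illustrative wording, not the prompt used in the study, and the Llama models would be queried analogously through their instruct checkpoints on Hugging Face.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative instruction: "Correct all grammatical and OCR errors in the
# following German sentence. Return only the corrected sentence."
PROMPT = ("Korrigiere alle Grammatik- und OCR-Fehler im folgenden deutschen "
          "Satz. Gib nur den korrigierten Satz zurück.")

def correct_sentence(sentence: str) -> str:
    """Zero-shot correction of one sentence with GPT-4o."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": sentence}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content.strip()

print(correct_sentence("Aber da horten sie hinter sich schon das Knurren."))
```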

4. Results

Prior to assessing the models’ effectiveness in correcting errors, we first evaluated their proficiency in detecting errors, as accurate detection is crucial to the overall quality of the error correction process. The results of the models are summarized in Table 2.
The pretrained GBERT and fine-tuned GBERT exhibit notable differences in their error detection performance. The pretrained GBERT has an accuracy of 0.65, with a precision of 0.06, recall of 0.24, and an F1 score of 0.09. In contrast, the fine-tuned GBERT showed improved accuracy at 0.82, though its precision remained low at 0.07 and recall slightly decreased to 0.19, leading to a marginal increase in the F1 score to 0.10. These results indicate that fine-tuning substantially raises overall accuracy, mainly by reducing false positives on the dominant (correct-token) class, while precision stays nearly unchanged and recall declines slightly.
The comparison between the pretrained and fine-tuned GELECTRA models highlights notable differences in their performance for error detection. The pretrained model achieves an accuracy of 0.43, a precision of 0.06, and a recall of 0.48, resulting in an F1 score of 0.11. The fine-tuned model demonstrates significant improvement, with an accuracy of 0.64, precision of 0.19, recall of 0.38, and an F1 score of 0.26. This improvement indicates a better balance between precision and recall, though GELECTRA’s performance still lags behind GBERT in terms of accuracy.
Among all models, GPT-4o and Llama models demonstrated the best performance in error detection. GPT-4o achieved an accuracy of 0.97, a precision of 0.88, and a recall of 0.83, resulting in an F1 score of 0.86. This reflects its ability to effectively detect errors while maintaining high precision. The Llama models followed closely, with Llama-3.2-1B achieving an accuracy of 0.92, but with a much lower precision of 0.26 and recall of 0.63, leading to an F1 score of 0.36. Llama-3.2-3B performed better with an accuracy of 0.96, precision of 0.78, recall of 0.81, and F1 score of 0.80. Llama-3.1-8B achieved an accuracy of 0.97, precision of 0.84, recall of 0.85, and an F1 score of 0.85, demonstrating a more balanced performance between precision and recall. The best-performing model was Llama-3.1-70B, with an accuracy of 0.98, precision of 0.87, recall of 0.89, and an F1 score of 0.88, making it the most robust model in terms of both precision and recall.
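For reference, the detection metrics in Table 2 can be computed from per-token binary labels (1 = erroneous token) as in the sketch below; the toy label vectors are illustrative only, while the real evaluation aligns each model’s predictions with the expert-annotated ground truth.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy per-token labels: 1 = erroneous token, 0 = correct token.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"accuracy={accuracy_score(y_true, y_pred):.2f} "
      f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```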
Based on these results, we proceeded to the grammatical error correction stage with GPT-4o, Llama-3.1-8B, and Llama-3.1-70B, as they exhibited the most consistent and reliable performance in error detection.
The results in Table 3 show that GPT-4o achieves the highest average correction accuracy at 92.52%, maintaining strong performance across most documents, with a peak of 97.26% (Document 7) and a low of 50.00% (Document 2). Llama-3.1-70B follows closely with 89.68% accuracy, also performing well across documents, ranging from 96.77% (Document 3) to 50.00% (Document 2). Both models demonstrate reliability in grammatical error correction, with slight fluctuations based on text complexity. In contrast, Llama-3.1-8B shows greater variability, averaging 76.92%. While it achieves a perfect 100.00% accuracy in Document 2, it drops significantly to 14.29% in Document 4, indicating inconsistency in handling diverse linguistic structures. Overall, GPT-4o and Llama-3.1-70B are the most reliable for error correction, while Llama-3.1-8B’s fluctuating performance suggests it may require further fine-tuning for stability.
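Reading the caption of Table 3, the per-document score appears to be defined as follows (our interpretation, expressed as a formula):

```latex
\text{correction accuracy} =
  \frac{\#\{\text{detected erroneous tokens that were correctly fixed}\}}
       {\#\{\text{detected erroneous tokens}\}} \times 100\%
```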
Following the quantitative evaluation, the qualitative analysis further examines the nature of errors detected and corrected by each model. As can be seen in Figure 3, Llama-8B primarily focuses on surface-level corrections, effectively handling typos and morphological adjustments but struggling with deeper grammatical and semantic errors. It correctly detected and fixed “mall” (“mall” → “mal”, meaning “time” or “once”), demonstrating strong lexical correction abilities, and identified “Uchtgeistem” as incorrect but only modified it to “Uchtgeistern” instead of the correct “Lichtgeistern” (“Uchtgeistem” → “Lichtgeistern”, meaning “light spirits”). This suggests that the model relies on high-probability substitutions, favoring common pluralization patterns rather than meaning-based corrections. Its failure to detect grammatical errors, such as the OCR-induced “Sich” (“oneself”) where the imperative “Sieh” (“look”) was intended, highlights its limitations in syntactic restructuring. Rather than applying linguistic rules, Llama-8B selects corrections based on frequency in its training data, often improving word forms while missing deeper context.
Llama-70B (Figure 4) improves upon Llama-8B, particularly in lexical retrieval and probabilistic grammar correction. It successfully corrected “Lcuchtnase” (“Lcuchtnase” → “Leuchtnase” meaning “glowing nose”), demonstrating strong lexical retrieval likely derived from a large German corpus. However, its correction of “cs” to “sie” (“they/her”) instead of “es” (“it”) suggests it prioritizes high-frequency word choices over strict grammatical rules. While it effectively handles spelling errors due to clear one-to-one mappings, its reliance on frequency-based substitutions results in occasional errors in pronoun selection and syntactic agreement. This tendency indicates that Llama-70B has a better grasp of surface-level language patterns but still lacks deep structural understanding.
GPT-4o (Figure 5) demonstrates the strongest grammatical and structural correction capabilities among the three models. It correctly fixed “cs” to “es”, suggesting a strong grasp of German pronoun structures, and accurately inferred “zu” (“7u” → “zu”, meaning “to”) from a garbled token, indicating robust syntactic processing. However, it incorrectly changed “Rentieronkcl” to “Rentieronkel” (“reindeer uncle”) instead of “Rentiere” (“reindeer” in plural), suggesting a bias toward phonetic and morphological similarity over semantic correctness. While GPT-4o significantly outperforms Llama-8B and Llama-70B in error detection and correction, it still occasionally prioritizes frequent patterns over meaning-driven replacements. Overall, GPT-4o is the most reliable for grammatical error correction, but its occasional misinterpretation of context suggests room for improvement in semantic reasoning.

5. Discussion

This study evaluated the performance of several LLMs for GEC in OCR-scanned German children’s literature, including both encoder-based models (GBERT and GELECTRA) and state-of-the-art decoder-based models (GPT-4o, Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B, and Llama-3.1-70B). Our results indicate that advanced decoder-based models evaluated zero-shot, particularly GPT-4o and the larger Llama models, outperformed the fine-tuned encoder-based models. This finding suggests that state-of-the-art decoder-based models possess robust language processing capabilities that generalize well to specific tasks, such as GEC, even without task-specific fine-tuning.
One of the primary challenges at the initial stage of this study was the high level of noise in the original OCR-scanned documents. The OCR scans of German children’s literature often contain a mix of headers, footers, and text extracted from images, which introduces substantial irregularities in the scanned text. Unlike traditional books with continuous paragraphs, children’s books frequently feature short texts embedded in illustrations, highly stylized fonts, and fragmented text structures. Due to these layout complexities, no model, whether encoder-based or decoder-based, was fully capable of cleaning all or most of the noise from the text. As a result, a pattern-matching preprocessing step was essential to remove redundant or misrecognized text elements before feeding the data into the models for GEC. However, the pattern-matching steps used in this study depend heavily on the characteristics of these particular documents, so they may not transfer effectively to every OCR-scanned document.
Furthermore, since certain text elements originate from images and are closely tied to the visual content, the models struggled to fully comprehend and correct such text in a meaningful way. These words often lacked sufficient standalone context, making their interpretation particularly difficult. For instance, if a book features an image of a bear with the text “großer Bär” (big bear) underneath, the models might struggle to correct any OCR-induced errors without visual context. Given this limitation, a multimodal approach, where text-based LLMs are integrated with vision models, could significantly enhance the ability to process and correct text derived from images. Future research could explore the potential of multimodal models in German children’s literature, particularly in addressing OCR errors that arise from the interplay between visual and textual elements.
Among the models tested, Llama-3.1-70B exhibited the best performance in error detection, accurately identifying grammatical mistakes across various texts. However, GPT-4o demonstrated superior performance in error correction, successfully fixing detected errors with high accuracy and minimal overcorrection—an issue previously observed in older models like GPT-3.5. While Llama-70B effectively identified errors, its corrections were often frequency-based rather than meaning-driven, leading to instances where it selected plausible but incorrect substitutions. In contrast, GPT-4o showed stronger grammatical and structural correction capabilities, particularly in handling pronoun agreement and infinitive verb structures, though it occasionally exhibited biases toward phonetic similarity over semantic accuracy.
Llama models, particularly Llama-70B, provide a reliable, open-source alternative to GPT-4o, offering strong performance in grammatical error detection while allowing for greater flexibility in customization. Its accessibility as an open-source model on platforms like Hugging Face makes it an attractive option for budget-constrained applications, where fine-tuning on domain-specific text could help to mitigate some of its correction limitations. Meanwhile, GPT-4o’s ability to apply grammatical rules more effectively makes it the most suitable choice for applications requiring precise and context-aware error correction.
The implications of these findings are significant for digital archiving, language preservation, and educational applications. Effective GEC is essential for enhancing the readability and accessibility of digitized children’s literature, which often suffers from unique OCR-induced errors. The zero-shot performance of GPT-4o and Llama models suggests that institutions and researchers can rely on these models to process OCR-scanned texts with high accuracy, even without fine-tuning on extensive domain-specific data.
However, this study also highlights important limitations. The small dataset of German children’s literature restricted the potential for fine-tuning encoder-based models like GBERT and GELECTRA, limiting their ability to compete with the generalization power of decoder-based models in zero-shot scenarios. This limitation underscores the need for larger, genre-specific datasets in children’s literature to facilitate more effective fine-tuning for language-specific models.
Beyond technical considerations, using AI for children’s literature comes with ethical concerns. While AI can improve accessibility and text quality, its corrections may introduce subtle biases, altering tone, style, or cultural nuances, especially important in books for young readers. Ensuring linguistic diversity and avoiding homogenization is key to preserving the richness of children’s literature. Transparency is also crucial so that educators, parents, and publishers understand the role of AI in content modification. Future research should address these challenges, ensuring AI-assisted text refinement respects pedagogical and cultural sensitivities while supporting creative storytelling.

6. Conclusions

This study demonstrates that zero-shot evaluation with decoder-based models like GPT-4o and Llama is a viable and effective approach for GEC in OCR-scanned texts, especially when data limitations hinder model fine-tuning. By identifying high-performing models for this task, our research contributes valuable insights to the use of AI for preserving and enhancing accessibility to culturally significant children’s literature.
In future work, expanding the dataset to include a broader range of German children’s literature would allow for more effective fine-tuning and enhance the adaptability of encoder-based models to this task. Additionally, exploring further applications of zero-shot learning with state-of-the-art decoder-based models in other genres and languages would provide deeper insights into their capabilities. Integrating these LLMs directly within OCR workflows could further streamline the text-cleaning process, improving overall quality and reducing manual correction requirements.

Author Contributions

Conceptualization, P.T.N., R.D. and K.O.; methodology, P.T.N. and K.O.; data curation, B.N. and R.D.; writing—original draft preparation, P.T.N.; writing—review and editing, B.N., R.D. and K.O.; supervision, K.O. and R.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), grant number RGPIN-2023-04245.

Data Availability Statement

The data are not publicly available due to privacy restrictions.

Acknowledgments

The authors would like to thank the school and district which allowed us access to the school library. We would like to acknowledge funding from the Vice Provost Research, University of Calgary, and Sonderprogramm USA/Kanada des Auswärtigen Amts zur Förderung der deutschen Sprache [USA/Canada Special Program of the External Agency for the Support of the German Language] which enabled the creation of the corpus.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sidorov, G.; Gupta, A.; Tozer, M.; Catala, D.; Catena, A.; Fuentes, S. Rule-based system for automatic grammar correction using syntactic n-grams for English language learning (L2). In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, Sofia, Bulgaria, 8–9 August 2013; pp. 96–101. [Google Scholar]
  2. Bryant, C.; Yuan, Z.; Qorib, M.R.; Cao, H.; Ng, H.T.; Briscoe, T. Grammatical error correction: A survey of the state of the art. Comput. Linguist. 2023, 49, 643–701. [Google Scholar] [CrossRef]
  3. Rozovskaya, A.; Roth, D. Grammatical error correction: Machine translation and classifiers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 2205–2215. [Google Scholar]
  4. Heigold, G.; Varanasi, S.; Neumann, G.; van Genabith, J. How robust are character-based word embeddings in tagging and MT against word scrambling or random noise? In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Boston, MA, USA, 17–21 March 2018; pp. 68–80. [Google Scholar]
  5. Dressler, R.; Nuss, B.; Mueller, K. The readability of books for immersion schools: Understanding the role of text complexity, context, and literary aspects. J. Immers. Content-Based Lang. Educ. 2024. [Google Scholar] [CrossRef]
  6. Sidorov, G. Syntactic dependency-based n-grams in rule-based automatic English as second language grammar correction. Int. J. Comput. Linguist. Appl. 2013, 4, 169–188. [Google Scholar]
  7. Lytvyn, V.; Pukach, P.; Vysotska, V.; Vovk, M.; Kholodna, N. Identification and correction of grammatical errors in Ukrainian texts based on machine learning technology. Mathematics 2023, 11, 904. [Google Scholar] [CrossRef]
  8. Li, Y.; Qin, S.; Huang, H.; Li, Y.; Qin, L.; Hu, X.; Jiang, W.; Zheng, H.T.; Yu, P.S. Rethinking the roles of large language models in Chinese grammatical error correction. arXiv 2024, arXiv:2402.11420. [Google Scholar]
  9. Park, J.; Park, C.; Lim, H. ChatLang-8: An LLM-based synthetic data generation framework for grammatical error correction. arXiv 2024, arXiv:2406.03202. [Google Scholar]
  10. Volk, M.; Furrer, L.; Sennrich, R. Strategies for reducing and correcting OCR errors. In Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series; Springer: Berlin, Germany, 2011; pp. 3–22. [Google Scholar]
  11. Gorrell, P. Parsing theory and phrase-order variation in German V2 clauses. J. Psycholinguist. Res. 1996, 25, 135–156. [Google Scholar] [CrossRef]
  12. Chan, B.; Schweter, S.; Möller, T. German’s next language model. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 6788–6796. [Google Scholar]
  13. Katinskaia, A.; Yangarber, R. GPT-3.5 for Grammatical Error Correction. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italy, 20–25 May 2024; pp. 7831–7843. Available online: https://aclanthology.org/2024.lrec-main.692/ (accessed on 15 January 2025).
  14. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
  15. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 159, pp. 1–25. [Google Scholar]
  16. Xiang, Y.; Zhang, Y.; Wang, X.; Wei, C.; Zheng, W.; Zhou, X.; Hu, Y.; Qin, Y. Grammatical error correction using feature selection and confidence tuning. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, 14–18 October 2013; pp. 1067–1071. [Google Scholar]
  17. Coyne, S.; Sakaguchi, K.; Galvan-Sosa, D.; Zock, M.; Inui, K. Analyzing the performance of GPT-3.5 and GPT-4 in grammatical error correction. arXiv 2023, arXiv:2303.14342. [Google Scholar]
  18. Schick, T.; Schütze, H. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 2339–2352. [Google Scholar] [CrossRef]
  19. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  20. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  21. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  22. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
Figure 1. Bar chart showing the number of correctly and incorrectly deleted lines as well as missed lines after running the pattern-matching algorithm on the children’s literature texts.
Figure 2. A part of the document before (left) and after (right) the pattern-matching code. This text is from one of our German children’s documents. It is still not perfectly correct grammatically.
Figure 3. Example of Llama-8B’s GEC performance. The left side shows the text before processing, the middle section presents the model’s output, and the right side displays the ground truth. In this example, Llama-8B successfully corrected a grammatical error (e.g., changing “mall” to “mal”), but failed to correct another detected error (“Uchtgeistem”). Translation: “‘Look’ Santa’s wife called. ‘There’s a lot going on with the light spirits today.’”.
Figure 4. Example of Llama-70B’s GEC performance. The left side shows the text before processing, the middle section presents the model’s output, and the right side displays the ground truth. In this example, Llama-70B successfully corrected a grammatical error (e.g., changing “Lcuchtnase” to “Leuchtnase”), but failed to correct another detected error (“cs”) and it missed one error (“Sich”). Translation: “Blitzen’s son was different from the other reindeer and now everyone knew it. The dwarves made jokes about the strange glowing nose and the reindeer shook their heads.”.
Figure 5. Example of GPT-4o’s GEC performance. The left side shows the text before processing, the middle section presents the model’s output, and the right side displays the ground truth. In this example, GPT-4o successfully corrected 2 grammatical errors (e.g., changing “cs” to “es” and “7u” to “zu”), but failed to correct another detected error (“Rentieronkcl”). Translation: “Enthralled by the white splendor, he frolicked through the snow, sending up dusty clouds. Three reindeer also came to greet Rudolph.”.
Table 1. Sentence and word counts for each document.

Document | Number of Sentences | Number of Words
Document 1 | 50 | 507
Document 2 | 21 | 209
Document 3 | 47 | 472
Document 4 | 44 | 533
Document 5 | 26 | 380
Document 6 | 72 | 443
Document 7 | 37 | 505
Table 2. Performance metrics of different models for error detection.

Model | Accuracy | Precision | Recall | F1 Score
GBERT (Pretrained) | 0.65 | 0.06 | 0.24 | 0.09
GBERT (Fine-tuned) | 0.82 | 0.07 | 0.19 | 0.10
GELECTRA (Pretrained) | 0.43 | 0.06 | 0.48 | 0.11
GELECTRA (Fine-tuned) | 0.64 | 0.19 | 0.38 | 0.26
GPT-4o | 0.97 | 0.88 | 0.83 | 0.86
Llama-3.2-1B | 0.92 | 0.26 | 0.63 | 0.36
Llama-3.2-3B | 0.96 | 0.78 | 0.81 | 0.80
Llama-3.1-8B | 0.97 | 0.84 | 0.85 | 0.85
Llama-3.1-70B | 0.98 | 0.87 | 0.89 | 0.88
Table 3. Percentage of detected erroneous tokens that were correctly fixed by each model across different documents.

Document | Llama-70B | Llama-8B | GPT-4o
1 | 93.75 | 78.79 | 93.94
2 | 50.00 | 100.00 | 50.00
3 | 96.77 | 86.67 | 96.88
4 | 57.14 | 14.29 | 71.43
5 | 89.47 | 69.09 | 87.93
6 | 94.00 | 83.33 | 91.84
7 | 86.30 | 79.45 | 97.26
Average | 89.68 | 76.92 | 92.52
