1. Introduction
Advancements in machine learning and artificial intelligence (AI) have significantly enhanced the field of grammatical error correction (GEC), which is particularly important for improving the accuracy of texts processed through Optical Character Recognition (OCR). This is a crucial step in transforming digitized versions of historical and literary works into clean, usable formats. Large language models (LLMs) have emerged as powerful tools for this purpose, demonstrating a remarkable ability to understand and correct complex linguistic patterns. However, some types of texts present unique challenges, such as German children’s literature, which is characterized by modified language, colloquial expressions, and narrative styles tailored to young readers. Additionally, children’s books often contain a mix of images and text, which can make the OCR process more complex. The presence of illustrations, decorative fonts, and varied text layouts can also introduce additional noise and errors during scanning, leading to a higher rate of misrecognition. Therefore, cleaning texts digitized using OCR can be challenging due to the wide variety of errors that can be introduced, including distorted characters, misplaced text, and other layout-related issues.
The field of GEC in OCR-scanned documents has seen substantial advancements, driven by the need to improve the quality of digitized texts for analysis and reading. Early approaches relied on rule-based methods that utilized predefined grammar rules and dictionaries to correct errors, but these methods often lacked the flexibility to handle context-specific variations [1]. With the advent of machine learning, statistical models offered greater adaptability, and more recently, neural network-based models, including large language models, have become the standard for GEC tasks [2,3].
German-specific GEC faces unique challenges due to the morphological complexity of the language. Evaluating the robustness of a model to OCR-induced noise is critical for ensuring accuracy in processed texts [4]. Despite the progress in this area, there is a notable gap in evaluating the performance of state-of-the-art LLMs on the task of GEC in German texts, specifically those found in children’s literature. While LLMs have shown promise in general error correction tasks, their effectiveness in handling the particular linguistic features of German children’s books has not been fully explored. Addressing this gap is crucial for leveraging the full potential of LLMs, allowing for a more automated and accurate approach to cleaning OCR-scanned documents.
The goal of this study is to address this research gap by comparing the performance of various LLMs in correcting grammatical errors in OCR-scanned German children’s literature. By evaluating different models, we aim to identify the one that performs best for this specific task, considering the unique linguistic aspects of the texts.
The significance of this research lies in its potential to improve the accessibility and readability of OCR-scanned German children’s literature. Given that OCR processes frequently introduce errors, effective grammatical correction is essential for preserving the original meaning and educational value of these works. By comparing different large language models, this study seeks to identify the strengths and weaknesses of these models and determine which is most suitable for handling such corrections, contributing valuable insights into the practical applications of AI in language preservation. Furthermore, this work holds significance for corpus linguistics research by providing cleaner, more accurate textual data, which is crucial for analyzing linguistic patterns and studying language use in culturally significant genres. Ultimately, this research contributes to expanding the available sources of German literature, particularly children’s literature, helping to address the scarcity of such resources. By improving the quality of digitized texts, it accelerates the process of analyzing text complexity in German children’s literature [5], facilitating more comprehensive linguistic and educational studies.
2. Literature Review
GEC is a natural language processing (NLP) task that aims to improve text quality by detecting and correcting grammatical, syntactic, and stylistic errors. Effective GEC enhances text clarity, making it particularly useful in educational applications, automated proofreading, and document digitization [2]. Early approaches to GEC relied on rule-based methods [1,6], which were effective in structured contexts but lacked adaptability. These were later replaced by statistical models that utilized large corpora to improve correction accuracy [3]. The field has since evolved with the advent of deep learning, particularly transformer-based architectures [7,8,9], which have demonstrated superior performance in grammatical correction through advanced contextual understanding.
OCR is a technique utilized to convert printed text into digital format, yet it often introduces errors that compromise text integrity. These errors stem from character misrecognition, incorrect word segmentation, and inconsistencies in formatting, particularly when dealing with complex layouts and degraded print quality [10]. Correcting OCR-generated text requires robust GEC models capable of handling both grammatical and OCR-specific errors. Traditional approaches, such as rule-based spell-checkers, have shown limited success due to their inability to capture contextual meaning. More recent research has explored the application of neural-based GEC models to address OCR-induced errors, improving overall text quality and usability [7]. These challenges are particularly pronounced in children’s literature, which presents additional OCR-specific difficulties. Unlike conventional texts, children’s books often feature short, fragmented text interwoven with illustrations, highly stylized fonts, and playful layouts, making OCR misrecognition more prevalent. The presence of decorative elements and non-standard typography can lead to misclassified characters, missing words, and misplaced text blocks, further complicating grammatical error correction. While neural-based models have improved text restoration for general OCR tasks, limited research has investigated their effectiveness in the unique context of digitized children’s literature. The heavy reliance on visual context in these books also presents a challenge, as current text-only GEC models struggle to infer meaning when words are extracted from image-based elements. There thus remains a need to assess the performance of different model architectures in correcting OCR-generated errors, particularly in complex domains such as digitized literature.
German presents unique challenges for GEC due to its morphologically complex nature and syntactic variability. Inflectional variations, case-based word changes, and flexible word order introduce additional difficulties in identifying and correcting errors. Unlike English, German exhibits a verb-second (V2) structure in main clauses and verb-final positioning in subordinate clauses, requiring models to account for long-distance dependencies in sentence structure [11]. Another key challenge is German’s extensive use of compound words, which are often misrecognized by OCR systems and incorrectly segmented in text-processing tasks. For example, OCR may incorrectly split “Donaudampfschifffahrtsgesellschaftskapitän” into multiple words, leading to syntactic errors. As a result, effective GEC in German must incorporate advanced contextual understanding to distinguish between correct compound structures and OCR-induced errors. While previous work has demonstrated that fine-tuned encoder-based models, such as GBERT and GELECTRA, offer improvements in handling German syntax [12], recent research suggests that decoder-based models, such as GPT-4o and Llama, provide competitive performance without fine-tuning [13,14].
Recent advancements in deep learning have led to the development of two dominant approaches for GEC: encoder-based and decoder-based models. Encoder-based models, such as GBERT and GELECTRA, process input bidirectionally, allowing them to capture syntactic relationships more effectively [12]. These models have the potential to perform well when fine-tuned on domain-specific datasets. However, their reliance on labeled training data limits their adaptability to new text domains, such as OCR-corrected literature.
In contrast, decoder-based models, such as GPT-4o and Llama, operate in an autoregressive manner, generating text one token at a time based on prior context [15]. These models have gained popularity due to their zero-shot and few-shot learning capabilities, enabling them to generalize across different languages and text structures without extensive fine-tuning [7,8,9,16]. Recent studies have also explored whether improved prompt engineering can enhance the grammatical correction capabilities of decoder-based models, such as GPT-3.5. Coyne et al. [17] found that carefully structured prompts, incorporating explicit correction instructions and contextual constraints, significantly improved the ability of GPT-3.5 to maintain semantic integrity while reducing unnecessary corrections. Despite their advantages, decoder-based models may exhibit overcorrection tendencies, where they modify text unnecessarily, altering its original meaning [13]. In scanned documents of children’s literature, especially where illustrations are omitted, the sentences are often quite short and may contain very few words. If language models like GPT-3.5 or others attempt to enhance these sentences by adding extra words, they risk altering the simplicity and clarity that are essential for young readers. This could lead to the loss of the original meaning, making the text more complex or introducing unintended interpretations.
While significant progress has been made in GEC, several research gaps remain:
Limited research on GEC for OCR-generated German text, particularly in the domain of digitized literature.
A lack of comparative studies on encoder-based vs. decoder-based models for German GEC, with most prior work focusing on English.
An insufficient analysis of overcorrection tendencies in decoder-based models, which may impact text accuracy in real-world applications like digitized children’s literature.
This study aims to address these gaps by systematically evaluating different GEC models for OCR-generated German text and assessing their strengths and weaknesses in handling grammatical errors in children’s literature.
3. Methods
3.1. Data
The dataset for this study consists of OCR-scanned German children’s literature from the library of an elementary school where students receive up to 50% of instruction in German. Permission for scanning was obtained from the school principal and librarian, as part of a project with ethical approval from both the University of Calgary and the school board. The collection, developed as a disposable corpus for research purposes, primarily includes modern narrative books for children aged 5–12, as well as a smaller selection of children’s non-fiction books. Books were either purchased by the school or donated by parents over the German program’s 40-year history. Corpus creation was funded by the University of Calgary and the German government and was completed between May and June 2022. The documents were scanned using the JoyUsing L140 (Fujian Joyusing Technology Co., Ltd., Fuzhou, China) document scanner with the ABBYY OCR system.
For model evaluation, we used seven documents from this corpus, totaling 297 sentences and 3049 words. Sentence and word counts varied across the documents, with sentence counts ranging from 21 to 72 and word counts from 109 to 533, as shown in Table 1. Ground truth labels for detecting and correcting errors in these documents were created by a German language expert, ensuring high accuracy and reliability. The expert provided us with the corrected and uncorrected versions of the texts.
3.2. Preprocessing
The dataset presented several significant challenges related to its quality and structure. These challenges were diverse, ranging from technical scanning errors to contextual inaccuracies in word usage. Understanding their specific characteristics is crucial for developing an effective preprocessing method to prepare the dataset for our GEC pipeline.
The Presence of Meaningless Symbols: The dataset contained numerous instances of meaningless symbols within the documents, likely artifacts of the text conversion or scanning process. These extraneous symbols posed a significant hurdle in accurately interpreting and processing the text data.
Scanning Errors: A notable problem was the misrecognition of letters by the scanning system. A common error observed was the scanner mistaking ‘tt’ for ‘H’. Such errors could lead to substantial misinterpretations of the text, affecting the integrity of our text complexity analysis.
Contextual Inaccuracies in Word Usage: Words were incorrectly used in context. For example, the dataset used “horten” (to hoard) instead of “hörten” (heard), as seen in the sentence: “Aber da horten sie hinter sich schon das gefährliche Knurren von Stormellas Wölfen”. These inaccuracies could potentially skew the assessment of text complexity and readability.
Redundant Information: The dataset retained headers, footers, and tables of contents within the documents. These elements, while essential in a book’s layout, are extraneous for the purpose of analyzing text complexity and thus need to be removed.
To enhance the documents’ conciseness and relevance, a pattern-matching algorithm was developed. The algorithm is tailored to the structure of these documents: it searches through each document, identifies sections deemed non-essential, and removes them, thereby streamlining the textual content. The current results of the algorithm are illustrated in Figure 1.
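The concrete patterns were derived from the characteristics of the scanned books and are not reproduced here; the following minimal sketch only illustrates the general approach, with hypothetical regular expressions for bare page numbers, dotted table-of-contents entries, and lines containing no letters at all.

```python
import re

# Hypothetical noise patterns; the actual rules were tailored to the scanned books.
NOISE_PATTERNS = [
    re.compile(r"^\s*\d+\s*$"),                # bare page numbers
    re.compile(r"^.{0,40}\.{3,}\s*\d+\s*$"),   # dotted table-of-contents entries
    re.compile(r"^[^A-Za-zÄÖÜäöüß]+$"),        # lines without any letters (symbol debris)
]

def is_noise(line: str) -> bool:
    """Return True if a line matches any of the noise patterns."""
    return any(pattern.match(line) for pattern in NOISE_PATTERNS)

def clean_document(text: str) -> str:
    """Drop lines classified as non-essential and keep the remaining text."""
    kept = [line for line in text.splitlines() if line.strip() and not is_noise(line)]
    return "\n".join(kept)
```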
The algorithm demonstrates a high rate of accuracy: it correctly identifies and removes 95.7% of the lines classified as non-critical, and it removes superfluous lines at more than sixteen times the rate at which it inappropriately discards necessary lines.
Further, Figure 2 provides an illustrative example of the algorithm in action, showing a segment of a document before and after the removal of redundant parts. The elimination of superfluous elements declutters the text and significantly improves its readability and clarity.
3.3. Language Models
In this study, we evaluate the performance of various language models on grammatical error detection and correction tasks specific to German children’s literature, focusing on both fine-tuned encoder-based and zero-shot decoder-based models. The encoder-based models, GBERT and GELECTRA, are German-specific transformers that were fine-tuned to enhance their grammatical correction capabilities on German children’s literature, leveraging their specialized processing of German text. Alongside these, we assess zero-shot, large-scale, decoder-based, multipurpose models such as GPT and Llama, which are designed for diverse linguistic tasks across multiple languages, including German. Although GBERT and GELECTRA are much smaller than the GPT and Llama models, smaller models have previously demonstrated surprising effectiveness under constrained resource settings [18], so there is reason to hope that they can compete with these much larger models. By comparing the encoder-based models’ focused, language-specific capabilities with the broader adaptability of the decoder-based models, this study aims to provide insights into each model’s suitability for nuanced grammatical corrections in OCR-scanned German literature.
3.3.1. Encoder-Based Models
The development of LLMs in NLP has progressed through several key innovations, each building on the capabilities of its predecessors. BERT (Bidirectional Encoder Representations from Transformers), introduced in 2019 [19], marked a significant shift by training models bidirectionally. Unlike previous models that read text in one direction, BERT’s approach allowed it to consider the context from both the left and right sides of a word, greatly enhancing its performance in understanding context and correcting grammar. Following BERT, ELECTRA was introduced [20], offering a more efficient training mechanism for language models. Instead of using BERT’s masked language modeling, where words are randomly masked and then predicted, ELECTRA trains by replacing words and having the model detect whether each word is original or replaced. This “replaced token detection” approach enabled ELECTRA to learn from a larger number of examples in the same amount of time, making it faster and more effective for tasks like grammatical error correction and handling noisy text, such as OCR outputs. Building on the foundations of BERT and ELECTRA, adaptations like GBERT and GELECTRA emerged, specifically tailored for the German language [12]. GBERT adapts the bidirectional training of BERT for German text, fine-tuning the model on German-specific linguistic features, while GELECTRA applies the efficiency of ELECTRA’s replaced token detection to German corpora. These adaptations make GBERT and GELECTRA particularly well-suited for understanding the complex morphological and syntactic structures of the German language, offering improved performance in grammatical error correction tasks for German texts.
We used the original architectures of GBERT and GELECTRA introduced by Chan et al. [12]. Both GBERT and GELECTRA were then fine-tuned on the dataset of German children’s literature mentioned in Section 3.1, after splitting each document into individual sentences. For training, 80% of the sentences were allocated to the training set, with 10% designated for validation and the remaining 10% for testing. These subsets were randomly sampled from the complete pool of sentences across all documents, ensuring a representative distribution for evaluation purposes.
Originally pretrained on an extensive German text corpus, GBERT was fine-tuned for a targeted token classification task to suit the specific demands of this study. The model was customized by modifying its final layer to predict two categories (incorrect or correct), in alignment with the dataset’s requirements. To address the class imbalance prevalent in the dataset, a bespoke loss function was integrated through an extension of the Trainer class. This adjustment allowed for the application of class weights, thereby balancing the impact of each category during the training phase. The fine-tuning process involved training the model on a carefully assembled dataset of German sentences, annotated explicitly for token classification. The validation set was employed to monitor the model’s generalization performance during training. This approach enabled the model to acclimate to the distinct linguistic and contextual nuances present within the data. The fine-tuning hyperparameters were selected through multiple tuning trials: a learning rate of 0.0001, batch sizes of 16 for both the training and evaluation phases, and the AdamW optimizer.
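A minimal sketch of this setup, using the Hugging Face Transformers library, is shown below. The checkpoint name, the class-weight values, and the dataset variables are illustrative assumptions rather than the exact configuration used in the study; only the learning rate, batch size, and optimizer choice follow the values reported above.

```python
import torch
from torch import nn
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed checkpoint; the study fine-tunes GBERT for binary token classification.
MODEL_NAME = "deepset/gbert-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)

class WeightedTrainer(Trainer):
    """Trainer extension applying class weights to counter the imbalance
    between correct and incorrect tokens."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Illustrative weights: erroneous tokens are rare, so they are up-weighted.
        weights = torch.tensor([1.0, 10.0], device=outputs.logits.device)
        loss_fct = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

training_args = TrainingArguments(
    output_dir="gbert-gec",
    learning_rate=1e-4,              # reported learning rate
    per_device_train_batch_size=16,  # reported batch sizes
    per_device_eval_batch_size=16,
)  # AdamW is the Trainer's default optimizer

# Assuming `train_ds` and `val_ds` are tokenized datasets with token-level labels:
# trainer = WeightedTrainer(model=model, args=training_args,
#                           train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```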
Like GBERT, GELECTRA benefits from pretraining on a vast corpus of German language texts, equipping it with a deep understanding of the linguistic intricacies inherent to the German language. The potential advantage of GELECTRA for this task lies in its unique pretraining approach, which focuses on generating and distinguishing between correct and artificially generated incorrect tokens. This approach could theoretically provide a more nuanced understanding of language errors, making it particularly suitable for the error detection phase of word correction. The fine-tuning process for GELECTRA mirrored that of GBERT and used similar hyperparameters.
3.3.2. Decoder-Based Models
The evolution of LLMs continued with the release of the GPT (Generative Pretrained Transformer) series, including GPT-2 in 2019 [21], GPT-3 in 2020 [15], and GPT-4 in 2023 [22], which introduced autoregressive modeling for generating coherent and contextually accurate paragraphs. Unlike BERT and ELECTRA, which focus on understanding text, GPT models excel in generating and completing sentences, making them versatile for a wider range of language tasks, from text generation to nuanced error correction.
Following GPT, Meta AI’s Llama (Large Language Model Meta AI) [14] has aimed to combine high performance with greater accessibility. Llama offers several advantages over previous models, including being open-source, which allows researchers and developers to access its model weights and code freely. This openness encourages innovation, enabling the community to fine-tune the model for specific applications or research needs. Additionally, Llama is designed with efficiency in mind, requiring fewer computational resources compared to models like GPT-4. It is able to run effectively on smaller hardware setups while still achieving competitive performance, making it an attractive option for those without access to extensive computing infrastructure.
To evaluate the capabilities of various LLMs in performing nuanced linguistic tasks, we conducted a zero-shot evaluation on five models: GPT-4o, Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B, and Llama-3.1-70B. In this context, zero-shot evaluation refers to assessing each model’s ability to perform grammatical error detection and correction on a dataset of German children’s literature without any additional fine-tuning specific to this task. Zero-shot learning allows the models to rely on their inherent language processing capabilities derived from extensive pretraining, providing insights into their adaptability to tasks they were not explicitly trained to handle.
Each model brings unique attributes to this evaluation. GPT-4o, the latest high-intelligence flagship model, is designed for complex, multi-step tasks and represents an advancement in the GPT series, with a parameter count estimated at over 200 billion. This allows for nuanced and highly detailed responses while maintaining efficiency and speed, making GPT-4o suitable for intricate tasks at a lower cost and faster response time than its predecessor, GPT-4 Turbo. However, using GPT-4o still incurs a cost, with pricing set at USD 2.50 per million input tokens and USD 10.00 per million output tokens, and a discounted rate of USD 1.25 per million input tokens and USD 5.00 per million output tokens through the Batch API. Llama-3.2, a more recent release from Meta AI, offers lightweight and highly efficient models with parameter counts of 1.24 billion for Llama-3.2-1B and 3.21 billion for Llama-3.2-3B, aimed at efficient text-based applications. Meanwhile, Llama-3.1, an earlier model, includes larger parameter versions such as the 8B and 70B models we tested, delivering potentially richer linguistic and contextual understanding due to their higher parameter counts. Unlike GPT-4o, all Llama models are open-source and freely accessible on the Hugging Face and Meta websites, allowing for broader experimentation and customization without associated usage costs. For this evaluation, we utilized the OpenAI API to access GPT-4o, while all Llama models were tested through their instruct versions available on Hugging Face.
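A minimal sketch of this zero-shot setup is given below. The prompt wording is an illustrative assumption (the exact instructions used in the study may differ), and the Llama call requires a recent Transformers version with chat-template support; the model identifiers correspond to the publicly available GPT-4o and Llama instruct checkpoints.

```python
from openai import OpenAI
from transformers import pipeline

# Hypothetical zero-shot instruction; the study's exact prompt may differ.
PROMPT = ("Korrigiere alle Grammatik- und OCR-Fehler im folgenden deutschen Satz. "
          "Gib nur den korrigierten Satz zurück.\n\n{sentence}")

def correct_with_gpt4o(sentence: str) -> str:
    """Zero-shot correction through the OpenAI API (reads OPENAI_API_KEY)."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
    )
    return response.choices[0].message.content.strip()

def correct_with_llama(sentence: str,
                       model_id: str = "meta-llama/Llama-3.1-8B-Instruct") -> str:
    """Zero-shot correction with a Llama instruct model from Hugging Face."""
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    messages = [{"role": "user", "content": PROMPT.format(sentence=sentence)}]
    output = generator(messages, max_new_tokens=128)
    # The pipeline returns the full chat; the last message is the model's reply.
    return output[0]["generated_text"][-1]["content"].strip()
```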
4. Results
Prior to assessing the models’ effectiveness in correcting errors, we first evaluated their proficiency in detecting errors, as accurate detection is crucial to the overall quality of the error correction process. The results of the models are summarized in Table 2.
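For reference, the detection metrics reported below can be computed as in the following sketch, which assumes token-level binary labels (1 marking an erroneous token) and uses scikit-learn; it illustrates the metric definitions rather than the study's actual evaluation script.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def detection_scores(y_true, y_pred):
    """Token-level detection metrics, where label 1 marks an erroneous token."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1, zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}

# Example with ten tokens, two true errors, and three flagged by the model:
# detection_scores([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], [0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
```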
The pretrained GBERT and fine-tuned GBERT exhibit notable differences in their error detection performance. The pretrained GBERT has an accuracy rate of 0.65, with a precision of 0.06, recall of 0.24, and an F1 score of 0.09. In contrast, the fine-tuned GBERT showed improved accuracy at 0.82, though its precision remained low at 0.07 and recall slightly decreased to 0.19, leading to a marginal increase in the F1 score to 0.10. These results indicate that fine-tuning substantially improves the model’s overall accuracy, largely by reducing false positives, although precision and recall remain low.
The comparison between the pretrained and fine-tuned GELECTRA models highlights notable differences in their performance for error detection. The pretrained model achieves an accuracy of 0.43, a precision of 0.06, and a recall of 0.48, resulting in an F1 score of 0.11. The fine-tuned model demonstrates significant improvement, with an accuracy of 0.64, precision of 0.19, recall of 0.38, and an F1 score of 0.26. This improvement indicates a better balance between precision and recall, though GELECTRA’s performance still lags behind GBERT in terms of accuracy.
Among all models, GPT-4o and Llama models demonstrated the best performance in error detection. GPT-4o achieved an accuracy of 0.97, a precision of 0.88, and a recall of 0.83, resulting in an F1 score of 0.86. This reflects its ability to effectively detect errors while maintaining high precision. The Llama models followed closely, with Llama-3.2-1B achieving an accuracy of 0.92, but with a much lower precision of 0.26 and recall of 0.63, leading to an F1 score of 0.36. Llama-3.2-3B performed better with an accuracy of 0.96, precision of 0.78, recall of 0.81, and F1 score of 0.80. Llama-3.1-8B achieved an accuracy of 0.97, precision of 0.84, recall of 0.85, and an F1 score of 0.85, demonstrating a more balanced performance between precision and recall. The best-performing model was Llama-3.1-70B, with an accuracy of 0.98, precision of 0.87, recall of 0.89, and an F1 score of 0.88, making it the most robust model in terms of both precision and recall.
Based on these results, we proceeded to the grammatical error correction stage with GPT-4o, Llama-3.1-8B, and Llama-3.1-70B, as they exhibited the most consistent and reliable performance in error detection.
The results in Table 3 show that GPT-4o achieves the highest average correction accuracy at 92.52%, maintaining strong performance across most documents, with a peak of 97.26% (Document 7) and a low of 50.00% (Document 2). Llama-3.1-70B follows closely with 89.68% accuracy, also performing well across documents, ranging from 96.77% (Document 3) to 50.00% (Document 2). Both models demonstrate reliability in grammatical error correction, with slight fluctuations based on text complexity. In contrast, Llama-3.1-8B shows greater variability, averaging 76.92%. While it achieves a perfect 100.00% accuracy in Document 2, it drops significantly to 14.29% in Document 4, indicating inconsistency in handling diverse linguistic structures. Overall, GPT-4o and Llama-3.1-70B are the most reliable for error correction, while Llama-3.1-8B’s fluctuating performance suggests it may require further fine-tuning for stability.
Following the quantitative evaluation, the qualitative analysis further examines the nature of errors detected and corrected by each model. As can be seen in Figure 3, Llama-8B primarily focuses on surface-level corrections, effectively handling typos and morphological adjustments but struggling with deeper grammatical and semantic errors. It correctly detected and fixed “mall” (“mall” → “mal”, meaning “time” or “once”), demonstrating strong lexical correction abilities, and identified “Uchtgeistem” as incorrect but only modified it to “Uchtgeistern” instead of the correct “Lichtgeistern” (“Uchtgeistem” → “Lichtgeistern”, meaning “light spirits”). This suggests that the model relies on high-probability substitutions, favoring common pluralization patterns rather than meaning-based corrections. Its failure to detect grammatical errors, such as the imperative “Sieh” (“look”) instead of “Sich” (“oneself”), highlights its limitations in syntactic restructuring. Rather than applying linguistic rules, Llama-8B selects corrections based on frequency in its training data, often improving word forms while missing deeper context.
Llama-70B (Figure 4) improves upon Llama-8B, particularly in lexical retrieval and probabilistic grammar correction. It successfully corrected “Lcuchtnase” (“Lcuchtnase” → “Leuchtnase”, meaning “glowing nose”), demonstrating strong lexical retrieval likely derived from a large German corpus. However, its correction of “cs” to “sie” (“they/her”) instead of “es” (“it”) suggests it prioritizes high-frequency word choices over strict grammatical rules. While it effectively handles spelling errors due to clear one-to-one mappings, its reliance on frequency-based substitutions results in occasional errors in pronoun selection and syntactic agreement. This tendency indicates that Llama-70B has a better grasp of surface-level language patterns but still lacks deep structural understanding.
GPT-4o (Figure 5) demonstrates the strongest grammatical and structural correction capabilities among the three models. It correctly fixed “cs” to “es”, showing a strong grasp of German pronoun structures, and accurately inferred “zu” (“7u” → “zu”, meaning “to”) from a garbled token, indicating robust syntactic processing. However, it incorrectly changed “Rentieronkcl” to “Rentieronkel” (“reindeer uncle”) instead of “Rentiere” (“reindeer” in plural), suggesting a bias toward phonetic and morphological similarity over semantic correctness. While GPT-4o significantly outperforms Llama-8B and Llama-70B in error detection and correction, it still occasionally prioritizes frequent patterns over meaning-driven replacements. Overall, GPT-4o is the most reliable for grammatical error correction, but its occasional misinterpretation of context suggests room for improvement in semantic reasoning.
5. Discussion
This study evaluated the performance of several LLMs for GEC in OCR-scanned German children’s literature, including both encoder-based models (GBERT and GELECTRA) and state-of-the-art decoder-based models (GPT-4o, Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B, and Llama-3.1-70B). Our results indicate that zero-shot evaluation of advanced decoder-based models, particularly GPT-4o and the Llama models, outperformed fine-tuned encoder-based models. This finding suggests that state-of-the-art decoder-based models possess robust language processing capabilities that generalize well to specific tasks, such as GEC, even without task-specific fine-tuning.
One of the primary challenges at the initial stage of this study was the high level of noise in the original OCR-scanned documents. The OCR scans of German children’s literature often contain a mix of headers, footers, and text extracted from images, which introduces substantial irregularities in the scanned text. Unlike traditional books with continuous paragraphs, children’s books frequently feature short texts embedded in illustrations, highly stylized fonts, and fragmented text structures. Due to these layout complexities, no model, whether encoder-based or decoder-based, was fully capable of cleaning all or most of the noise from the text. As a result, a pattern-matching preprocessing step was essential to remove redundant or misrecognized text elements before feeding the data into the models for GEC. However, the pattern-matching steps used in this study rely heavily on the characteristics of the documents and they might not work efficiently for every OCR-scanned document.
Furthermore, since certain text elements originate from images and are closely tied to the visual content, the models struggled to fully comprehend and correct such text in a meaningful way. These words often lacked sufficient standalone context, making their interpretation particularly difficult. For instance, if a book features an image of a bear with the text “großer Bär” (big bear) underneath, the models might struggle to correct any OCR-induced errors without visual context. Given this limitation, a multimodal approach, where text-based LLMs are integrated with vision models, could significantly enhance the ability to process and correct text derived from images. Future research could explore the potential of multimodal models in German children’s literature, particularly in addressing OCR errors that arise from the interplay between visual and textual elements.
Among the models tested, Llama-3.1-70B exhibited the best performance in error detection, accurately identifying grammatical mistakes across various texts. However, GPT-4o demonstrated superior performance in error correction, successfully fixing detected errors with high accuracy and minimal overcorrection—an issue previously observed in older models like GPT-3.5. While Llama-70B effectively identified errors, its corrections were often frequency-based rather than meaning-driven, leading to instances where it selected plausible but incorrect substitutions. In contrast, GPT-4o showed stronger grammatical and structural correction capabilities, particularly in handling pronoun agreement and infinitive verb structures, though it occasionally exhibited biases toward phonetic similarity over semantic accuracy.
Llama models, particularly Llama-70B, provide a reliable, open-source alternative to GPT-4o, offering strong performance in grammatical error detection while allowing for greater flexibility in customization. Its accessibility as an open-source model on platforms like Hugging Face makes it an attractive option for budget-constrained applications, where fine-tuning on domain-specific text could help to mitigate some of its correction limitations. Meanwhile, GPT-4o’s ability to apply grammatical rules more effectively makes it the most suitable choice for applications requiring precise and context-aware error correction.
The implications of these findings are significant for digital archiving, language preservation, and educational applications. Effective GEC is essential for enhancing the readability and accessibility of digitized children’s literature, which often suffers from unique OCR-induced errors. The zero-shot performance of GPT-4o and Llama models suggests that institutions and researchers can rely on these models to process OCR-scanned texts with high accuracy, even without fine-tuning on extensive domain-specific data.
However, this study also highlights important limitations. The small dataset of German children’s literature restricted the potential for fine-tuning encoder-based models like GBERT and GELECTRA, limiting their ability to compete with the generalization power of decoder-based models in zero-shot scenarios. This limitation underscores the need for larger, genre-specific datasets in children’s literature to facilitate more effective fine-tuning for language-specific models.
Beyond technical considerations, using AI for children’s literature comes with ethical concerns. While AI can improve accessibility and text quality, its corrections may introduce subtle biases, altering tone, style, or cultural nuances, especially important in books for young readers. Ensuring linguistic diversity and avoiding homogenization is key to preserving the richness of children’s literature. Transparency is also crucial so that educators, parents, and publishers understand the role of AI in content modification. Future research should address these challenges, ensuring AI-assisted text refinement respects pedagogical and cultural sensitivities while supporting creative storytelling.
6. Conclusions
This study demonstrates that zero-shot evaluation with decoder-based models like GPT-4o and Llama is a viable and effective approach for GEC in OCR-scanned texts, especially when data limitations hinder model fine-tuning. By identifying high-performing models for this task, our research contributes valuable insights to the use of AI for preserving and enhancing accessibility to culturally significant children’s literature.
In future work, expanding the dataset to include a broader range of German children’s literature would allow for more effective fine-tuning and enhance the adaptability of encoder-based models to this task. Additionally, exploring further applications of zero-shot learning with state-of-the-art decoder-based models in other genres and languages would provide deeper insights into their capabilities. Integrating these LLMs directly within OCR workflows could further streamline the text-cleaning process, improving overall quality and reducing manual correction requirements.