Article

Automatic Text Simplification for Lithuanian: Transforming Administrative Texts into Plain Language

by Justina Mandravickaitė 1,*, Eglė Rimkienė 1, Danguolė Kotryna Kapkan 1, Danguolė Kalinauskaitė 1, Antanas Čenys 2,* and Tomas Krilavičius 1

1 Faculty of Informatics, Vytautas Magnus University, Kaunas District, 53361 Akademija, Lithuania
2 Faculty of Fundamental Sciences, Vilnius Gediminas Technical University, 10223 Vilnius, Lithuania
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(3), 465; https://doi.org/10.3390/math13030465
Submission received: 16 December 2024 / Revised: 17 January 2025 / Accepted: 27 January 2025 / Published: 30 January 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract: In this study, we present the results of experiments on text simplification for the Lithuanian language, in which we aim to simplify administrative-style texts to the Plain Language level. We selected mT5, mBART, and LT-Llama-2 as the foundational models and fine-tuned them for the text simplification task. Additionally, we evaluated ChatGPT for this purpose. We then conducted a comprehensive assessment of the simplification results provided by these models, both quantitatively and qualitatively. The results demonstrated that mBART was the most effective model for simplifying Lithuanian administrative text, achieving the highest scores across all the evaluation metrics. A qualitative evaluation of the simplified sentences complemented our quantitative findings. Attention analysis provided insights into model decisions, highlighting strengths in lexical and syntactic simplifications but revealing challenges with longer, complex sentences. Our findings contribute to advancing text simplification for lesser-resourced languages, with practical applications for more effective communication between institutions and the general public, which is the goal of Plain Language.

1. Introduction

Text simplification means reducing vocabulary and syntactic complexity while preserving the essential information necessary for understanding. This process increases the accessibility of information for a wide range of groups, including individuals with cognitive disorders, non-native speakers, and children [1]. Additionally, simplifying text is crucial for improving the general public’s comprehension of legal and administrative documents. These types of texts often act as communication bridges between institutions and audiences with varying reading abilities [2].
In this paper, we report the results of text simplification for Lithuanian, targeting texts written in the administrative (clerical) style. Public authorities often employ quasilegal language when communicating with their audiences, which can hinder the effective dissemination of information to groups or individuals who lack specialized knowledge [2]. Text simplification can facilitate reading comprehension by transforming complex natural language into a more straightforward form, simplifying vocabulary, sentence structure, and other key elements while maintaining the core content of the original text. As far as we know, this is the first such study for the Lithuanian language.
Our objective is to simplify Lithuanian administrative texts to the level of Plain Language. Plain Language is characterized by clear wording, structure, and design, so the target audience can easily find, understand, and use the information [3], without requiring special education or knowledge [4]. Therefore, we aimed to simplify Lithuanian administrative texts according to the principles of Plain Language. Additionally, we conducted a comprehensive analysis of the simplification results to identify which sentence elements and structures were effectively simplified by our models, as well as to detect potential errors and develop strategies to mitigate them.
For our experiments, we selected mT5, mBART, and LT-Llama-2 as the baseline models and fine-tuned them for the text simplification task. These models were chosen because they support the Lithuanian language, an important consideration given the continuing lack of large language models that adequately support lesser-resourced languages. Furthermore, we evaluated the performance of ChatGPT (GPT-4o) on this task. Our approach integrated lexical and syntactic simplification techniques to reduce sentence complexity and replace complex words or phrases with simpler and more commonly used alternatives.
The rest of the paper is structured as follows: in Section 2, we briefly introduce the related work; in Section 3, we describe the data and methods; and in Section 4, we present the results. Finally, in Section 5 we discuss these results and end the paper with conclusions in Section 6.

2. Related Work

In recent years, significant progress has been made in the field of text simplification, moving from rule-based methodologies (e.g., [5,6]) to data-driven methods (e.g., [7,8]). Contemporary research has shown that, with these methods, texts can be simplified according to diverse goals and perspectives, such as by specifying the desired reading level [9] or directly defining the necessary simplification operations [10]. For example, the BERT model has been used for lexical simplification [11], text simplification framed as monolingual machine translation [12], and hybrid text simplification combining different methods [13]. Similarly, the T5 model has been applied to controllable text simplification [14,15] and to simplification tasks in low-resource scenarios [16,17]. In particular, the SimpleT5 model, designed to rank and explain complex concepts in a text, outperformed GPT-3 in key metrics related to feedback and time constraints [18]. Additionally, BART has been used for controllable text simplification [14], as well as to simplify entire paragraphs [19] and full documents [20].
Among the various large language models (LLMs), GPT models have attracted particular interest, especially in low-resource settings [21,22,23]. For example, ChatGPT has shown potential in text simplification tasks [24,25]. However, performance evaluations of ChatGPT show that while it can effectively simplify the text, it may miss important information [26] and perform less accurately and in depth than human experts [27]. On the other hand, when provided with additional context, ChatGPT has been observed to simplify complex texts more effectively [28] and provide information that is essentially correct [29].
Various methods have been used to analyze, evaluate, and interpret the results of automatic text simplification. One such approach focuses on controlling and predicting text complexity. A key component of this task is to determine whether a text requires simplification and to identify the specific segments that need editing. These tasks can be accomplished using a range of models, including LSTM [30] and BERT [31], alongside model-agnostic techniques such as Local Interpretable Model-Agnostic Explanations (LIME) [32] and SHapley Additive exPlanations (SHAP) [33], which help to gain insight into model decisions [34]. For instance, LIME has been applied with logistic regression (LR) and LSTM classifiers, while SHAP has been used with LR to explain text complexity predictions, as demonstrated in [30]. The authors of [30] also explored the potential of extractive adversarial networks for simultaneous prediction and explanation. These methods improve the explainability of text simplification models by providing insights into their behavior and decision-making processes.
Given that text simplification is a complex and multifaceted task, and is often treated as a generative task, automatic evaluation metrics alone are not sufficient to capture all its aspects. Consequently, new evaluation metrics for text generation have been proposed. In particular, trained scoring metrics have been developed, such as InstructScore [35], which is based on a fine-tuned Llama model that generates diagnostic reports aligned with human judgment, and TIGERScore [36], which uses Llama-2 to analyze errors based on natural language instructions. Furthermore, LENS [37] has demonstrated stronger correlations with human judgment than existing metrics. LENS is a learnable evaluation metric that decomposes and reconstructs sentences to assess both similarity and difficulty within a text simplification system. These examples represent a few of the innovative metrics that have been introduced. Furthermore, new formulas have been proposed to integrate grammaticality, meaning retention, and simplicity into a unified metric for the evaluation of automatically simplified text [38], revealing strong correlations with established metrics such as BERTScore and BLEU.
Moreover, document-level planning has been proposed to ensure controlled simplification and clarity [39]. This approach involved developing a planning model that assigned appropriate simplification operations to each sentence within a document for better control and explanation of the generation process. To further analyze the processes and errors associated with text simplification, an analytical evaluation framework has been introduced, which includes a comprehensive taxonomy of simplification strategies and errors [40]. This study revealed a discrepancy between human and system approaches, where systems often relied on deletions and local changes without incorporating new information essential for meaningful simplification.
Additionally, several studies have explored the use of guided text generation for simplification purposes. For example, SIMSUM [41] directed text generation via the main keywords of a source text. Similarly, research in [10] focused on simplifying text for specific grade levels via the prediction of necessary edit operations. Another approach involved a reinforcement learning system equipped with a readability classifier for iterative simplification until the text reached a desired readability level [42]. Furthermore, the definition and usage of simplification operations have been investigated, highlighting LLMs’ role in the automated recognition of these operations [43].
The Rhetorical Structure Theory (RST) framework proposed by [44] was used to analyze and interpret the results of text simplification, taking into account the contribution of structural aspects to text complexity. Additionally, explainable text simplification was explored through the use of knowledge graphs (KGs), such as KGSimple [45], which simplifies texts at both the graph and text levels, providing explainable progress at each simplification step. This highlights the potential of KGs to increase the interpretability of text simplification models.
Recent research in text simplification has employed a variety of approaches to enhance the explainability and interpretability of model behavior and simplification outcomes. As the field advances, increasing attention is being directed toward addressing the discrepancies between human judgments and the outputs produced by simplification systems. Although simplified texts have been shown to improve reading comprehension [46], existing methods often overlook the inherent complexity of individual inputs, which results in simplifications that require editing and corrections [10]. Additionally, most current text simplification systems do not adequately consider contextual information, which leads to outputs that are difficult to control [47]. Finally, persistent issues related to cultural and commonsense knowledge highlight the necessity for ongoing research in this area [48].

3. Materials and Methods

3.1. Materials

Simplification guidelines. We aimed to simplify Lithuanian administrative texts so that they would meet the standard of Plain Language and become more accessible and easier to understand for the general public. To accomplish this, lexical and syntactic simplification rules were developed for our two corpora, Parallel Corpus 1 and Parallel Corpus 2. These rules were primarily based on cross-lingual Plain Language principles [49,50]. Additionally, we incorporated text simplification rules from languages with grammatical structures similar to Lithuanian where applicable [51,52]. Furthermore, we developed rules tailored specifically to the Lithuanian language, particularly for managing participles. As a result, the Plain Lithuanian guidelines are organized into three levels of simplification operations, which can be outlined as follows (a minimal rule-check sketch is given after the list):
  • Paragraph-level simplification.
    (a)
    Sentence splitting: sentences longer than 12 words should be broken down into smaller units, preferably by turning embedded relative clauses into independent clauses.
    (b)
    List creation: where appropriate, homogeneous elements need to be transformed into vertical lists, i.e., if there are more than two coordinated elements (e.g., object or subject noun phrases and clauses) with a homogeneous function in a sentence, they need to be converted into vertical lists.
  • Lexical-level simplification.
    (a)
    Preference should be given to the more frequent synonyms determined by the Lithuanian Frequency Dictionary [53], even if the normal formal requirements of the register are not followed.
    (b)
    Avoid metaphors and uncommon acronyms.
    (c)
    Define obscure terms in separate sentences.
  • Syntactic-level simplification.
    (a)
    Transform passive voice constructions into active voice.
    (b)
    Replace active participle and gerund constructions with relative clauses.
    (c)
    Minimize the use of nominalizations.
    (d)
    Prefer affirmative constructions over negatives.
    (e)
    If necessary, introduce demonstrative pronouns and nouns for clarity.
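To make the paragraph-level splitting rule concrete, the following minimal sketch (ours, not part of the published pipeline; the example sentence is invented) flags sentences that exceed the 12-word limit and are therefore candidates for splitting:

```python
# Hypothetical helper illustrating the Plain Lithuanian splitting rule:
# sentences longer than 12 words should be broken into smaller units.

def needs_splitting(sentence: str, max_words: int = 12) -> bool:
    """Return True if the sentence exceeds the 12-word limit."""
    return len(sentence.split()) > max_words

# Invented 14-word administrative-style sentence with an embedded relative clause.
sentence = ("Pareiškėjas, pateikęs prašymą, kuris buvo užregistruotas "
            "savivaldybės administracijoje, per dešimt darbo dienų gauna atsakymą.")
print(needs_splitting(sentence))  # True -> split, e.g., by making the clause independent
```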
Corpora for Fine-Tuning. To explore the effect of differently prepared data on model performance, we created two datasets, which we used for fine-tuning mT5, mBART, and LT-Llama-2 (see Table 1):
  • Parallel Corpus 1—a dataset where each original (complex) sentence had 1 simplified equivalent. The complex sentences were taken from websites of governmental and non-governmental public institutions.
  • Parallel Corpus 2—a dataset in which Parallel Corpus 1 was augmented with additional complex sentences, some of which had more than 1 simplified counterpart (2–3), based on text simplification corpora such as SimPA [54] and Human Simplification with Sentence Fusion Dataset (HSSF) [55]. For complex sentences, we used the same list of sources as for Parallel Corpus 1.
Both corpora for this study were developed using administrative texts from websites of Lithuanian governmental and non-governmental organizations, covering diverse topics such as social benefits, migration, utilities, copyright, etc. Texts were selected to ensure a broad representation of administrative communication styles and topics. Potential bias related to the over-representation of certain topics was mitigated by diversifying sources and including texts with varying complexity. Additionally, manual review by 4 experts ensured that manually simplified versions of original (complex) sentences in the corpora would follow Plain Lithuanian guidelines.
Data for Testing. We used 554 sentence pairs that were not included in Parallel Corpora 1 and 2, although the list of sources was the same. The dataset was created based on the criteria of topic diversity and sentence complexity and was prepared following the same guidelines and procedures as the corpora used for fine-tuning.

3.2. Methods

mT5. For our task, we fine-tuned the mT5 model, a multilingual adaptation of T5 (Text-to-Text Transfer Transformer) [56]. This model reformulates all language-processing tasks as text generation problems, following the foundational principles of the T5 architecture [57]. For its pre-training, the developers of mT5 utilized a multilingual version of the C4 dataset, which comprises textual data in 101 languages, including Lithuanian, sourced from a Common Crawl web scrape. Because the mT5 model has been pre-trained on diverse multilingual data, it is well suited for handling less-resourced languages such as Lithuanian.
mT5 is also a sequence-to-sequence model trained with the span corruption objective [57].
  • Input:
    The input X is tokenized into subword units and corrupted by masking spans of tokens.
    The model predicts the masked spans $Z = \{z_1, z_2, \ldots, z_k\}$.
  • Span-corruption objective:
    The model is trained to maximize the log-likelihood of the output spans:
    $$\mathcal{L}_{\text{mT5}} = \sum_{i=1}^{k} \log P(z_i \mid X, z_{<i})$$
    where P is the probability distribution modeled by mT5, which predicts the likelihood of the next span $z_i$; $z_i$ is the i-th span being predicted; X is the input sequence, corrupted by masking spans of tokens; and $z_{<i}$ denotes all spans predicted prior to $z_i$, ensuring that predictions are made sequentially and contextually.
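As an illustration of this objective (ours, using the canonical English example from the T5 paper; the sentinel-token naming follows the Hugging Face convention, and mT5 applies the same scheme across languages), masked spans are replaced with sentinel tokens in the input, and the target enumerates the masked spans in order:

```python
# Span corruption as used by T5/mT5 (canonical example from the T5 paper).
original        = "Thank you for inviting me to your party last week ."
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week ."
target_spans    = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"

# Pre-training maximizes sum_i log P(z_i | X, z_<i) over the sentinel-delimited
# spans z_i of `target_spans`, conditioned on the corrupted input X.
```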
mBART. We also fine-tuned mBART, a multilingual version of the BART (Bidirectional and Auto-Regressive Transformers) model, which integrates both auto-encoder and auto-regressive components [58]. Although mBART was originally developed for machine translation, its flexible architecture makes it well suited for text simplification tasks. It also supports Lithuanian, even though Lithuanian is only a small part of its pre-training data (only 1835 characters in a 13.7-gigabyte corpus) [59].
The primary training objective of mBART is denoising sequence-to-sequence reconstruction, where the model learns to reconstruct corrupted input sequences [59].
  • Encoder–Decoder architecture:
    The encoder processes the input sequence $X = \{x_1, x_2, \ldots, x_n\}$, generating hidden states $H = \{h_1, h_2, \ldots, h_n\}$.
    The decoder generates the output sequence $Y = \{y_1, y_2, \ldots, y_m\}$, conditioned on the encoder’s hidden states and previous decoder outputs.
  • Denoising sequence-to-sequence reconstruction:
    Denoising involves corrupting the input sequence X via token masking and sentence permutation. The model minimizes the negative log-likelihood:
    $$\mathcal{L}_{\text{mBART}} = -\sum_{t=1}^{m} \log P(y_t \mid y_{<t}, H)$$
    where m is the total number of tokens in the output sequence Y; t is the index of the current token being predicted; $y_t$ is the token at position t in $Y = \{y_1, y_2, \ldots, y_m\}$; $y_{<t}$ denotes all tokens in the output sequence that precede $y_t$; H is the encoder’s hidden states; and $P(y_t \mid y_{<t}, H)$ is the conditional probability of the token $y_t$ given the preceding tokens $y_{<t}$ and the encoded input H.
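For concreteness, the sketch below (ours; the checkpoint name and the Hugging Face API usage are assumptions rather than details reported in this paper, and the sentence pair is invented) shows how this sequence-to-sequence loss is obtained when a complex/simple pair is passed as input and labels:

```python
# Sketch: computing L_mBART for one invented complex/simple sentence pair.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer.src_lang = "lt_LT"  # Lithuanian source and target; assumed language code
tokenizer.tgt_lang = "lt_LT"

complex_sent = "Prašymas pateikiamas savivaldybės administracijai."  # invented
simple_sent = "Prašymą pateikiame savivaldybės administracijai."     # invented

batch = tokenizer(complex_sent, text_target=simple_sent, return_tensors="pt")
loss = model(**batch).loss  # token-averaged -sum_t log P(y_t | y_<t, H)
print(float(loss))
```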
LT-Llama-2. LT-Llama-2 was the last model we fine-tuned for the simplification of Lithuanian administrative texts. It was pre-trained on a Lithuanian question/answer dataset and on popular LLM benchmarks translated into Lithuanian, a total dataset of 2 trillion tokens. The model was then fine-tuned on publicly available instructional datasets and supplemented with manually annotated data via Reinforcement Learning with Human Feedback (RLHF). As reported in [60], the performance of LT-Llama-2 was generally comparable with that of many open-source alternatives.
Llama-2 is an autoregressive transformer-based model [61], and its key mathematical components include the following.
  • Autoregressive component:
    The model generates the next token $y_t$ based on the sequence of previous tokens $y_{<t}$:
    $$P(Y) = \prod_{t=1}^{m} P(y_t \mid y_{<t})$$
    where P(Y) is the probability of generating the entire output sequence $Y = \{y_1, y_2, \ldots, y_m\}$; m is the total number of tokens in the output sequence Y; t is the index of the current token being predicted; $y_t$ is the token at position t; $y_{<t}$ denotes all tokens generated before $y_t$; and $P(y_t \mid y_{<t})$ is the conditional probability of the token $y_t$ given the tokens that precede it.
  • Transformer architecture:
    The self-attention mechanism computes the attention scores
    $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
    where Q, K, and V are the query, key, and value matrices; $d_k$ is the dimensionality of the keys; softmax is a normalization function applied to the scaled dot product; and $QK^{\top}$ is the product of the query matrix Q and the transpose of the key matrix K, which computes the similarity scores between the queries and all keys.
  • Optimization:
    Llama-2 uses adaptive optimization techniques and fine-tunes pre-trained weights for specific tasks.
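A minimal NumPy sketch of the scaled dot-product attention defined above (ours, for illustration only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled
    return softmax(scores) @ V       # attention-weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, d_k = 8
print(attention(Q, K, V).shape)  # (4, 8)
```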
ChatGPT. We evaluated ChatGPT for the text simplification task without fine-tuning. ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) model [62] that leverages a self-attention mechanism that allows the model to understand and process the words in a sentence or document based on their relationship to each other [63,64]. To explore its effectiveness in a low-resource scenario, we tested ChatGPT (GPT-4o) on Lithuanian text simplification using its standard configuration through OpenAI’s browser interface. Although ChatGPT was not specifically designed for the Lithuanian language, it possesses multilingual capabilities to handle general Lithuanian text-processing tasks.

3.3. Evaluation Methods

To achieve a comprehensive assessment, our evaluation consisted of several steps:
  • Preliminary evaluation:
    (a)
    Preliminary manual inspection to obtain a general idea of the results of all the models used in this study.
    (b)
    Quantitative assessment of the impact of our two corpora on model performance. We used SARI [65], BERTScore [66], and three variants of ROUGE (ROUGE-1, ROUGE-2, and ROUGE-L) [67] for this task.
  • In-depth evaluation of results of the best performing model:
    (a)
    Quantitative and qualitative approaches.
    We utilized the EASSE [68] and multilingual tseval [69] libraries to facilitate and standardize the automatic evaluation of our best text simplification model. The EASSE library includes reference-based evaluation metrics and methods, such as BLEU [70], SARI [65], and Levenshtein similarity [68] (a minimal usage sketch is given after this list). In contrast, the tseval library provides reference-less simplification assessment metrics and methods, including the proportions of additions and deletions made during the simplification process.
    To complement these quantitative metrics, we conducted a qualitative evaluation of the text simplification outputs based on three widely recognized criteria: simplicity, meaning retention, and grammaticality (whether the model-simplified sentence is grammatical and understandable) [71,72]. Two experts independently evaluated model-simplified sentences. Additionally, these qualitative assessments were integrated with the tseval library to investigate the correlation between automatic evaluation metrics and the qualitative criteria.
    (b)
    Attention analysis:
    To gain more insights into the model’s decision-making process, we also applied BertViz [73] to the best and worst examples of the model’s simplified sentences.
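To illustrate the reference-based part of this pipeline, the sketch below (ours; the function names follow the public EASSE API and should be treated as an assumption, and the sentences are invented) computes corpus-level SARI and BLEU for a single sentence pair:

```python
# Sketch: reference-based evaluation with EASSE (invented single-sentence corpus).
from easse.sari import corpus_sari
from easse.bleu import corpus_bleu

orig = ["Prašymas pateikiamas savivaldybės administracijai."]    # complex input
sys_out = ["Prašymą pateikiame savivaldybės administracijai."]   # model output
refs = [["Prašymą pateikiame savivaldybei."]]                    # one reference set

print(corpus_sari(orig_sents=orig, sys_sents=sys_out, refs_sents=refs))
print(corpus_bleu(sys_sents=sys_out, refs_sents=refs))
```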

4. Results

4.1. Experimental Setup

We fine-tuned three models, mT5, mBART and LT-Llama-2, for text simplification. All three models were fine-tuned directly and separately on Parallel Corpus 1 and Parallel Corpus 2. The parameter configuration (batch size, learning rate, and number of epochs) was chosen based on the best results observed in previous iterations of fine-tuning. For the LT-Llama-2 model (Lt-Llama-2-7b-instruct-hf, accessible at https://huggingface.co/neurotechnology/Lt-Llama-2-7b-instruct-hf (accessed on 30 October 2024)), we applied 4-bit quantization for resource optimization, preserving performance without excessive memory demands. Additionally, Low-Rank Adaptation (LoRA) [74] with a dropout rate of 0.05 was applied to specific linear layers. Custom prompt formatting was used to guide the model toward producing simplified Lithuanian text responses, in alignment with our simplification task requirements. Finally, we included ChatGPT (GPT-4o) in our study by adding simplification rules from our simplification guidelines to the prompt, to test its text simplification capabilities in Lithuanian without fine-tuning.
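A configuration sketch of this setup (ours; the 4-bit quantization and the LoRA dropout of 0.05 are as reported above, while the rank, alpha, and target modules are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization for resource optimization, as described above.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "neurotechnology/Lt-Llama-2-7b-instruct-hf",
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,                                  # assumed rank (not reported)
    lora_alpha=32,                         # assumed scaling (not reported)
    lora_dropout=0.05,                     # dropout rate as reported
    target_modules=["q_proj", "v_proj"],   # assumed "specific linear layers"
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```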

4.2. Fine-Tuning Process

The fine-tuning focused on the following key aspects:
  • Hyperparameters: Based on our previous research, we selected the hyperparameters (batch size and learning rate) that provided the best results.
  • Data Augmentation: We evaluated the impact of differently prepared data, using the original data (Parallel Corpus 1) and the updated (augmented) dataset (Parallel Corpus 2), on the results of fine-tuned mT5 and mBART.
The parameter configuration was chosen based on the best results observed in previous iterations of fine-tuning. For mBART, a batch size of 4 and a learning rate of 10⁻⁴ were chosen, while for mT5, a batch size of 2 and a learning rate of 10⁻⁴ were used. Given that mT5 showed better results with more epochs, the number of epochs was increased from 8 to 16, while for mBART and LT-Llama-2, 8 epochs were used. Also, to ensure efficient model adaptation while balancing memory and computational efficiency, we used a batch size of 4 and a learning rate of 2 × 10⁻⁴ for LT-Llama-2.
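Expressed as code, the reported mBART configuration could look as follows (a sketch using the Hugging Face Trainer API; dataset preparation is omitted and assumed, and all unlisted arguments are illustrative defaults):

```python
from transformers import (MBartForConditionalGeneration, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart-lt-simplification",
    per_device_train_batch_size=4,  # reported batch size for mBART
    learning_rate=1e-4,             # reported learning rate
    num_train_epochs=8,             # reported number of epochs
    predict_with_generate=True,
)

# `train_dataset` is assumed to be a tokenized version of Parallel Corpus 1 or 2
# (complex sentences as inputs, Plain Language sentences as labels).
trainer = Seq2SeqTrainer(model=model, args=training_args,
                         train_dataset=train_dataset)
trainer.train()
```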
By systematically varying these hyperparameters, we aimed to identify the optimal settings that maximize the performance of models on the text simplification task. This approach allowed us to assess the impact of fine-tuning strategies and data augmentation on the quality of the simplified texts produced by the models and on the desired level of readability, i.e., Plain Language.

4.3. Preliminary Evaluation

4.3.1. Fine-Tuning Results: Parallel Corpus 1 vs. Parallel Corpus 2

We evaluated the performance of fine-tuned mT5 and mBART using a selected set of metrics (SARI, BERTScore and ROUGE) to assess their capacity to simplify text while maintaining the original meaning and intent. This analysis provided insights into the effectiveness of each model in generating high-quality simplified Lithuanian text. A summary of the results is presented in Table 2. For testing our models, we used our test dataset (see Section 3).
Contrary to expectations, the “old” mBART model (SARI score of 72.9781), which was fine-tuned on Parallel Corpus 1, outperformed the “new” mBART model (SARI score of 57.2374) as well as both versions of mT5. This suggests that the “old” mBART was more effective in simplifying text by retaining, adding, and deleting words during the simplification process. Consistent with this trend, the “old” mBART also achieved the highest BERTScore, surpassing the other three models and indicating that its generated simplifications were semantically more aligned with the reference sentences.
In terms of ROUGE scores, which evaluate n-gram overlap between model outputs and reference texts, the “old” mBART consistently scored higher across all ROUGE variants: ROUGE-1 (0.7797), ROUGE-2 (0.6753), and ROUGE-L (0.7555). This shows that the “old” mBART was more effective in capturing the relevant information from the original (complex) sentences.
Overall, the findings show that the “old” mBART was superior to the “new” mBART and both mT5 variants on all evaluation metrics. It generated simplified sentences with higher precision, semantic proximity to the reference texts, and higher readability.

4.3.2. Results of Fine-Tuned LT-Llama-2

We fine-tuned LT-Llama-2 using Parallel Corpus 1, as this configuration had previously given better results with the mT5 and mBART models. However, the outcomes were inadequate. In many cases, instead of simplifying the provided sentences, the fine-tuned model simply expanded them by adding information that was not present in the original complex sentences. Also, while the majority of the generated sentences were grammatically correct, the model exhibited excessive repetition of words and word sequences.
The foundational LT-Llama-2 was primarily fine-tuned for instructional or conversational tasks [60], whereas sentence simplification typically requires restructuring rather than open-ended generation. Therefore, the model appears to be better suited to content expansion or dialogue than to text simplification. Furthermore, although LT-Llama-2’s seven billion parameters make it a substantial model, the complexity of the text simplification task may demand more nuanced understanding and generation capabilities, which suggests that a larger model might perform the task more effectively. After evaluating the performance and tendencies of the fine-tuned LT-Llama-2, we decided not to use it for further development of our simplification system.

4.3.3. ChatGPT Results

To test ChatGPT (GPT-4o) on simplifying Lithuanian administrative texts, we devised a prompt with 18 Plain Lithuanian rules accompanied by examples of complex and simplified sentences. The provided rules offered clear indications regarding the replacement of syntactic structures and lexical elements, as well as sentence splitting. However, the experimental results demonstrated that, while ChatGPT responded quite well to a simple and short prompt asking it to simplify the given text without any further elaboration on the desired outcome, it was prone to disregarding the majority of the rules outlined in the more comprehensive prompt.
While ChatGPT can demonstrate a basic understanding of English grammar, this was not the case for Lithuanian. For example, it was unable to correctly identify the Lithuanian passive voice, nor was it able to convert it into an active voice. This can be explained by the general lack of Lithuanian resources, as studies such as [75] have shown that ChatGPT has limited metalinguistic awareness in the context of minor languages. Hence, while ChatGPT could be employed for non-professional purposes to summarize or simplify a variety of texts, the results of our experiment indicated that its results are too unstable to use in the real-world simplification of Lithuanian administrative texts.

4.4. In-Depth Evaluation

As “old” mBART, fine-tuned on Parallel Corpus 1, provided the best results, we performed a full in-depth evaluation only of this model and its outputs. When a comparison was needed, we compared “old” mBART with “old” mT5 (for the sake of simplicity, in the following subsections, these models will be referred to as mBART and mT5) to obtain more comprehensive insights.

4.4.1. Qualitative Evaluation

To assess the results of fine-tuned mBART and mT5, two linguists familiar with Plain Lithuanian independently evaluated model-simplified sentences from our test dataset (see Section 3) according to the three criteria indicated in Table 3, i.e., simplicity, meaning retention, and grammaticality. The linguists rated every sentence on each criterion on a scale from 0 to 5. Though mBART performed best in both the quantitative and qualitative assessments, we included mT5 in the qualitative evaluation for the sake of comparison.
Table 3 demonstrates that mBART consistently outperformed the other model across all three criteria. A more detailed qualitative analysis revealed that both mBART and mT5 were relatively effective at simplifying shorter sentences. However, their performance varied significantly in handling longer sentences, which are characteristic of the Lithuanian administrative language [76]. Specifically, mT5 tended to perform inadequately with extended passages, often generating nonsensical word repetitions instead of coherent sentences or omitting entire sections of the original text rather than transforming them into new, shorter sentences. Consequently, this significantly reduced its scores across all three evaluation criteria.
Additionally, both models showed higher scores in meaning retention and grammaticality compared to simplicity. This occurred because, on some occasions, the sentences were only slightly simplified, while certain elements that needed simplification were left unchanged. These largely unchanged sentences remained grammatically correct and fully preserved the original meaning, resulting in a score of 5.
During our qualitative analysis, we observed that both models had successfully captured key linguistic patterns characteristic of Plain Lithuanian. These patterns include the insertion of subject and possessive pronouns, the use of demonstratives, and the conversion of passive constructions into active voice. Although transforming passive sentences could potentially create challenges in identifying the subject, particularly when the original passive sentence lacked an oblique agent phrase, this issue was rarely encountered. In most cases, it was sufficient to simply add a first-person plural pronoun in the subject position. Additionally, the mBART model excelled at converting participial clauses into relative clauses with finite verbs. However, a significant limitation was that very few sentences achieved perfect scores of 5-5-5. This suggests that our models may need larger and more varied training data to achieve the level of robustness required for text simplification.

4.4.2. EASSE Report for mBART Results

Comparison with Baselines

This section provides a comparative analysis of our fine-tuned mBART against two baselines: the Identity baseline and the Truncate baseline [77]. The results were evaluated using various metrics calculated with the EASSE library (see Table 4). The Identity baseline outputs the input text unchanged, providing a reference for evaluating how much a system modifies the input. The Truncate baseline generates simplified text by truncating the input text to a specific length or removing less important parts without rewriting.
The mBART model outperformed both the Identity and Truncate baselines across the calculated key metrics. Notably, mBART achieved a SARI score of 49.63, substantially higher than the Identity baseline’s 13.43 and the Truncate baseline’s 23.41, which indicated the robust simplification capabilities of our model. In terms of readability, mBART attained an FKGL (Flesch–Kincaid Grade Level) [78] score of 13.84, lower than the Identity baseline’s score (17.41) and the Truncate baseline’s score (15.98), showing the better readability of mBART outputs.
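For reference, FKGL is the standard Flesch–Kincaid readability formula [78], in which higher values indicate harder text:

$$\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$

Thus, mBART’s score of 13.84 corresponds to noticeably more readable output than the Identity baseline’s 17.41.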
The Compression Ratio [79] for mBART was 0.94, showing that the model effectively preserved content compared to the Truncate baseline’s 0.76. Additionally, mBART recorded a Sentence Splits value of 1.04, which was slightly above the Truncate baseline’s 0.98, which may reflect occasional sentence splitting while maintaining natural text structure.
Regarding text modifications, mBART achieved a Levenshtein Similarity [68] of 0.84, indicating substantial yet meaningful edits compared with the Identity baseline’s score of 1.0, while staying close to the Truncate baseline’s 0.85. Only 19% of the model’s output sentences remained exact copies of the input, whereas the Identity baseline had 100% and the Truncate baseline 0%, which highlights that simplification was performed effectively.
Furthermore, mBART exhibited an Additions Proportion [80] of 0.20, higher than both baselines, which meant that texts were enriched during simplification where necessary. Meanwhile, the Deletions Proportion for mBART was 0.23, less aggressive than the Truncate baseline’s 0.26, revealing a balance between simplicity and informativeness. The Lexical Complexity [68] of mBART-simplified sentences decreased to 9.66, compared with the Identity baseline’s 9.73 and the Truncate baseline’s 9.84, which showed the model’s efforts to choose simpler vocabulary during simplification.
In summary, the mBART model performed well in simplifying text effectively while retaining essential content and improving readability. It had a balanced approach to content preservation, selective additions and deletions, and vocabulary simplification in comparison to both baselines.

Analysis of Sentence-Length Effect

Table 5 illustrates mBART’s performance across different intervals of sentence length, measured in characters. Short sentences ([8;55]) showed markedly better performance on several metrics, e.g., achieving higher BLEU [70] (51.04) and SARI (58.51) scores than longer sentences ([242;4830]), which achieved a BLEU score of 22.10 and a SARI score of 45.06. This disparity may be attributed to the inherent complexity of longer sentences [81], which negatively affected the model’s performance. Additionally, simplification metrics such as Levenshtein Similarity and Exact Copies [68] declined with increasing sentence length, which indicated more intensive editing of longer sentences. Specifically, short sentences maintained greater structural integrity, with a Levenshtein Similarity of 0.87 and a lower proportion of deletions (0.13) compared with longer sentences (0.78 and 0.33, respectively).
Further analysis revealed that the FKGL score increased from 8.77 for shorter sentences to 16.80 for longer ones, suggesting that simplifications of longer sentences remained less readable due to structural or vocabulary complexity. The Compression Ratio decreased from 1.13 for short sentences to 0.81 for longer sentences, which indicates more intensive length reduction when simplifying longer texts [79]. Sentence Splits showed a slight increase from 1.00 to 1.11 as sentence length increased, reflecting the more frequent splitting of longer sentences during simplification.
The Exact Copies scores significantly decreased from 0.51 for short sentences to 0.04 for long sentences, demonstrating that mBART makes more extensive modifications to longer sentences. On the other hand, the Additions Proportion remained relatively stable across sentence lengths, ranging from 0.17 to 0.23, except for the longest sentences. In contrast, the Deletions Proportion [80] rose significantly from 0.13 for short sentences to 0.33 for longer ones, which highlighted the need for more substantial content removal in longer sentences. Additionally, the Lexical Complexity score decreased from 10.36 for shorter sentences to 8.88 for longer sentences, revealing a more significant reduction in vocabulary complexity during the simplification of longer sentences.
  • The mBART model effectively simplified shorter sentences, achieving high BLEU and SARI scores while preserving readability and structural coherence. The metrics indicated a balanced approach to simplification, with minimal length reduction and selective sentence splitting for shorter texts.
  • For longer sentences ([242;4830]), mBART’s performance decreased, as shown by lower BLEU and SARI scores, increased FKGL, and greater reliance on deletions and structural modifications. These challenges emphasized the complexity of maintaining both structure and meaning in the simplification of longer sentences.

Analysis of Best and Worst Simplifications

As our analysis revealed, mBART can perform one of the most essential text simplification operations in Lithuanian, i.e., replacing participial constructions, which are typical of the more formal Lithuanian language varieties but infrequent in spoken and less formal language, with finite verb forms. Lithuanian has a wide variety of passive and active participles [82], so this simplification operation involves correctly identifying the noun phrase arguments of the verb from which the participle is formed and changing their cases accordingly (Lithuanian has a system of 7 noun cases, with the nominative usually, though not always, used to mark the subject, while the other verb arguments are generally marked with the accusative, genitive, or dative case). This is especially relevant for the transformation of passive voice constructions, which also involve participles, into active clauses. The best simplifications according to SARI (part of the EASSE report) exemplify instances in which the model performs a variety of such operations correctly (Figure 1 and Figure 2).
While case adaptation is not necessary for active participle constructions (Figure 1), it is essential when changing the passive voice into active (Figure 2). The report showed that mBART does not always perform this operation correctly, as in Figure 3 where the highlighted noun suffixes are supposed to be accusative. However, mBART can also successfully add a personal pronoun which boosts the readability of the texts—Lithuanian being a pro-drop language [83], the pronouns in finite clauses are often dropped in administrative-style texts (Figure 2). In general, the best simplifications according to SARI are all examples of fairly successful transformations from administrative style to Plain Language. Thus, the EASSE report provided reliable results in this section.
The examples of the lowest SARI-scoring pairs of sentences did not involve any simplifications, i.e., the text remained unchanged between the original and the simplified versions. The lowest scores were applied both when simplifications were necessary but the model did not apply any, as well as when simplifications were advisable but not required, as a short phrase (usually 2–3 words) was already simple enough. Additionally, some simplifications with significant clause deletion were also assigned low scores. This is not ideal, as clause deletion in itself should not be penalized—Lithuanian administrative texts contain a significant amount of unnecessary information that sometimes should be deleted, instead of the model trying to simplify the titles of specific laws or other documents that are being referred to in the text.
However, quite understandably, EASSE does not have any means to distinguish between the deletions applied to essential information, resulting in incomprehensible sentences, and those that eliminate overspecification. The most significant drawback of the EASSE report on the worst simplifications according to SARI is that they rarely included wrong simplifications that did change the text but failed to produce grammatical and comprehensible sentences. Thus, this part of the report is less reliable.
Meanwhile, simplifications with the most compression provided by EASSE mainly included the following:
  • Sentences that were shortened due to clause deletion, which may sometimes result in incomprehensible sentences when essential information is deleted, as already mentioned in the worst SARI-scoring simplifications.
  • Sentences that were shortened due to a replacement of a noun phrase with an anaphoric pronoun, which, again, may or may not be a desirable simplification operation, depending on the complexity of the noun phrase and on the importance of the information it provides. Generally, anaphoric pronouns are discouraged in Plain Language guidelines [3].
Regarding the simplifications with the highest amount of paraphrasing, EASSE proposed only a few pairs of sentences with actual paraphrases. Mainly, it did not recognize the same root in both the original and the simplified version, most likely due to the morphological complexity of the Lithuanian language. Morphological complexity is surely also an issue for mBART: it can be seen in ungrammatical verb forms that sometimes appear in simplified texts, such as pakeičiuoja instead of pakeičia ‘changes’, probably by analogy with a common Lithuanian conjugation paradigm that has a -uoja suffix in the 3rd person present (e.g., skaičiuoti ‘to calculate’, skaičiuoja ‘calculates’; matuoti ‘to measure’, matuoja) but does not apply to the verb pakeisti ‘to change’.
Finally, in some cases, the sentences selected by EASSE as including paraphrasing were those where mBART performed poorly and did not retain the original meaning of the sentence (e.g., 1 žingsnis ’Step 1’ simplified as Antras žingsnis ‘Second step’), or where mBART rewrote dates in words, such as ‘2022.04–2023.03’—nuo 2022 metų balandžio pradžios iki 2023 metų balandžio pabaigos ‘from the beginning of April 2022 until the end of April (sic!) 2023’.

4.4.3. Tseval Results for mBART

To streamline further analysis, we processed the qualitatively evaluated model-simplified sentences with a classification approach for the simplicity, meaning retention, and grammaticality scores, in order to explore the correlation between the quantitative evaluation metrics and the qualitative evaluation criteria. Initially, these scores ranged from 0 to 5, as the evaluators who qualitatively assessed the model-simplified sentences used this scheme. To classify them, we converted the continuous scores into categorical labels based on defined thresholds:
  • Bad: Scores between 0 and 2.5 indicate a low simplification evaluation level.
  • Ok: Scores between 2.5 and 3.5 represent a medium or acceptable simplification evaluation level.
  • Good: Scores between 3.5 and 5 indicate a high simplification evaluation level.
This transformation enabled us to categorize the evaluation metrics, thereby simplifying their interpretation via a diverse range of methods that do not require reference texts for comparison. To achieve this, we utilized the multilingual tseval library [69], which offered functions for assessing the quality of text simplification, including scores for simplicity, meaning retention, and grammaticality. We conducted this analysis only on the results from our best-performing model, mBART. The label distribution of processed data is shown in Figure 4.
The following tables summarize the key metrics and correlations for each of the three qualitative criteria in assessing text simplification results. Pearson correlation coefficients [84] were used to calculate the correlations. Only the metrics that showed statistically significant correlations are presented, i.e., those with p ≤ 0.05.
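Both processing steps can be sketched as follows (ours; the scores and metric values are invented for illustration):

```python
# Sketch: binning 0-5 expert scores into Bad/Ok/Good labels (thresholds as
# defined above) and testing a metric-criterion correlation with Pearson's r.
from scipy.stats import pearsonr

def to_label(score: float) -> str:
    if score <= 2.5:
        return "Bad"    # low simplification evaluation level
    if score <= 3.5:
        return "Ok"     # medium/acceptable level
    return "Good"       # high level

simplicity_scores = [4.5, 2.0, 3.0, 5.0]           # invented expert scores
labels = [to_label(s) for s in simplicity_scores]  # ['Good', 'Bad', 'Ok', 'Good']

metric_values = [0.81, 0.42, 0.63, 0.95]           # invented tseval metric values
r, p = pearsonr(metric_values, simplicity_scores)
if p <= 0.05:                                      # keep only significant metrics
    print(f"r = {r:.2f}, p = {p:.3f}")
```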

Simplicity

Table 6 reports 10 metrics that show statistically significant correlations with the simplicity criterion, although the correlation coefficients are weak (<0.19) [85]. Positive correlations were found for the proportion of deleted words, Flesch Reading Ease (FRE) [86], compression ratio, and Levenshtein distance [87]. This indicated that sentences with higher deletion proportions, better readability scores, shorter lengths, and more substantial editing tended to be simpler.
On the other hand, negative correlations were observed for smoothed BLEU [88], average cosine similarity between pre-trained word embeddings [89] of complex and simplified sentences, lexical complexity, the average position of words in simplified sentences in a frequency table, percentage of common lemmas between complex and simplified sentences, and Levenshtein similarity. These findings suggested that sentences with greater lexical overlap, semantic similarity, or complexity were less simplified.
All in all, the aforementioned metrics aid in evaluating sentence simplicity by measuring the extent of reduction, readability, and semantic alterations between original (complex) and simplified sentences. However, the weak correlations imply that these metrics, while statistically significant, are not strong predictors of simplicity individually.

Meaning Preservation

Table 7 presents 12 metrics that reveal statistically significant correlations with the meaning preservation criterion, although with weak correlation coefficients (<0.19). Positive correlations were observed for smoothed BLEU, Levenshtein similarity, the percentage of words left unchanged after simplification, the percentage of shared word forms between complex and simplified sentences, and Hungarian cosine similarity between the pre-trained word embeddings of complex and simplified sentences [89]. These positive correlations demonstrated that greater overlap in content, vocabulary, and semantic similarity contributed to better meaning retention.
Conversely, negative correlations were identified for the percentage of deleted words during simplification, the number of words per sentence, the number of characters per sentence, the number of syllables per sentence, the maximum position of output words in the frequency table, FKGL scores, the difference in the number of characters between original (complex) and simplified sentences, and lexical complexity scores. These negative correlations suggested that excessive content reduction or overly simplified sentences may result in a loss of meaning.
In summary, while these metrics provide valuable insights into factors influencing meaning preservation by assessing reduction extent, readability, and semantic changes between original and simplified texts, the weak correlation scores imply that they are not strong individual predictors of meaning retention. Therefore, a comprehensive evaluation incorporating multiple factors may be necessary.

Grammaticality

Table 8 presents 14 metrics that demonstrated statistically significant although weak correlations with the grammaticality criterion.
Positive correlations were identified for smoothed BLEU, Levenshtein similarity between complex and simplified sentences [87], the percentage of words left unchanged during simplification, the percentage of lemmas shared between complex and simplified sentences, and the percentage of words retained during simplification. These results suggested that retaining more of the original (complex) sentence structure and vocabulary maintained the grammatical correctness of the simplified sentences.
However, negative correlations were found for the number of words in the simplified text, the number of characters in the simplified text, the number of syllables per sentence, the Levenshtein distance between original and simplified sentences, the number of words per sentence, the number of characters per sentence, the maximum position of output words in the frequency table, the percentage of common lemmas between complex and simplified sentences, the percentage of common word forms between complex and simplified sentences, and FKGL. These negative correlations indicated that increasing the complexity of simplified sentences may lead to grammatical errors.
In summary, while the evaluation metrics provided insights into factors that influenced grammaticality by assessing the preservation of sentence structure and vocabulary, weak correlation scores implied that these metrics are not strong individual predictors of grammatical correctness. Therefore, a more complex approach incorporating multiple metrics may be necessary.

4.4.4. Attention Analysis with BertViz

BertViz facilitates the visualization of attention weights, which makes it possible to examine which segments of the input text the model emphasizes during the simplification process [73]. For comparative analysis, BertViz was applied to simplified examples with the highest and lowest SARI scores, representing high-quality and low-quality simplifications, respectively. In transformer models, attention mechanisms allow the model to focus on different parts of the input when generating predictions [64]. Each attention head in the model’s layers calculates a weight matrix that measures how much attention each token gives to the other tokens in the sequence, which determines their relevance. BertViz’s attention plots visualize the weight distributions, showing which specific words, subwords, or other elements the model focuses on for each token in the input sequence. This analysis focuses on the model’s last layer, as it usually carries the most relevant information for the fine-tuned task, which in our case is text simplification.
Cross-attention, also known as encoder–decoder attention [90], shows how each token in the decoder’s output (the simplified sentence) relates to tokens in the encoder’s input (the complex sentence). This mechanism reveals how the model uses the original sentence to create its simplified version. Strong attention connections indicate which elements in the input sentence influence specific (sub)words and other types of tokens in the output. The effective simplification or paraphrasing is marked by logical alignments between key elements of the input and output. Conversely, misaligned or weak attention may signal difficulties in simplification or loss of important context.
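A sketch of this visualization workflow (ours; the checkpoint name, the invented sentence pair, and the BertViz argument names follow public documentation and should be treated as assumptions):

```python
# Sketch: rendering encoder, decoder, and cross-attention with BertViz.
from bertviz import model_view
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50", output_attentions=True
)

complex_sent = "Prašymas pateikiamas savivaldybės administracijai."  # invented
simple_sent = "Prašymą pateikiame savivaldybei."                     # invented

enc = tokenizer(complex_sent, return_tensors="pt")
dec = tokenizer(text_target=simple_sent, return_tensors="pt")
outputs = model(input_ids=enc.input_ids, decoder_input_ids=dec.input_ids)

model_view(
    encoder_attention=outputs.encoder_attentions,
    decoder_attention=outputs.decoder_attentions,
    cross_attention=outputs.cross_attentions,
    encoder_tokens=tokenizer.convert_ids_to_tokens(enc.input_ids[0]),
    decoder_tokens=tokenizer.convert_ids_to_tokens(dec.input_ids[0]),
)
```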
Figure 5 and Figure 6 illustrate cross-attention patterns for sentences with the highest and lowest SARI scores, respectively. In the low-SARI example, attention in most heads relied heavily on delimiters, suggesting the model struggled to identify meaningful parts of the input to focus on [91]. A similar but weaker trend appeared in the high-SARI example. In this case, some attention heads focused on (sub)words that predicted the target (sub)word without directly focusing on the target (sub)word itself, while others distributed attention more evenly across the entire sentence. This suggested that the model was attempting to consider the whole sentence rather than giving priority to specific tokens or segments.
Decoder attention shows how each token in the simplified sentence interacts with other tokens within the same generated sequence. This provides insight into how the model maintains coherence and fluency in its output [92]. For example, it demonstrates how the model ensures grammaticality, such as aligning pronouns with their antecedents or matching verbs with their subjects. Repetitive or diagonal attention patterns often indicate that the model relies on recently generated (sub)words for maintaining fluency.
Figure 7 and Figure 8 display decoder attention for the sentences with the highest and lowest SARI scores, respectively. In both cases, attention heads mostly focused on the same (sub)words or a small number (mostly up to 5) of preceding (sub)words. This pattern was more pronounced in the example with the lowest SARI score, which suggested that the model often copied (sub)words from the input and limited its attention to a few recently generated tokens instead of considering the broader sequence. While this behavior can benefit text simplification tasks that require minimal changes to the original sentence, it also highlighted potential limitations in the model’s ability to capture long-range dependencies or fully understand the overall context of the input, which became apparent in the analysis of model performance across sentence lengths.
Encoder attention visualizes the attention weights within the encoder layers, providing insight into how the model processes the original (complex) sentence before generating a simplified version. This visualization highlights which parts of the input sentence were most important for encoding its meaning [93]. Strong attention weights between specific (sub)words indicate that the model identified these (sub)words as significant or related, which is essential for building a meaningful contextual representation.
Figure 9 and Figure 10 show encoder attention patterns. In the low-SARI example, most attention heads focused on delimiters, which suggests that the model leaned heavily on special tokens to segment the input text. Thus, this reliance on structural cues may have caused the model to overlook key content (sub)words that carry the sentence’s core meaning. In contrast, the high-SARI example showed a different pattern. While some attention heads still focused on delimiters, in many cases they distributed their attention more evenly across all words in the sentence. This suggested that the encoder was working to develop a more balanced understanding of the sentence while taking all components into account to capture the overall meaning or context. Such an approach may improve the model’s ability to generalize to diverse input data, especially when the input does not follow consistent patterns.

5. Discussion

Our study on the simplification of Lithuanian administrative texts to Plain Language yielded several findings related to performance, data augmentation, evaluation metrics, and the capabilities of different language models. These findings are discussed further in this section.

5.1. Data Augmentation and Training Data Quality

  • Contrary to expectations, data augmentation did not improve the performance of our text simplification models. The original dataset alone was sufficient to achieve satisfactory results, which highlighted the importance of high-quality and representative data in low-resource language contexts [94]. Instead of enhancing the training data, augmented data may have introduced inconsistencies or reduced the overall quality of the dataset, ultimately decreasing the models’ performance.
  • When compared to related studies in text simplification and low-resource NLP tasks, these findings reveal a notable contrast. While data augmentation has been effective for other languages, e.g., English [95], its ineffectiveness in this study points to challenges associated with Lithuanian. This could be due to the broader limitations of simplistic augmentation techniques [96] or specific features of the Lithuanian language.
Unlike studies targeting high-resource languages such as English, where many large-scale datasets and pre-trained models are available, we demonstrate the feasibility of simplification in a lower-resource context. Our results underscore the importance of carefully curated, high-quality datasets with close linguistic alignment. Relying on data augmentation as a universal solution may not always be effective; instead, the priority should be on developing representative datasets tailored to the characteristics of the language to ensure robust model performance.
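To make concrete what such simplistic augmentation techniques [96] look like, the sketch below applies EDA-style random swap and deletion at the token level; it illustrates the general approach rather than the exact procedure used in our experiments. For a highly inflected language like Lithuanian, these operations easily break case agreement and word-order cues, which is one plausible way augmentation can degrade training data quality.

```python
# Illustrative EDA-style token-level augmentation (random swap + deletion).
# This is a generic sketch of the kind of technique critiqued above, not
# the augmentation pipeline used in our experiments.
import random

def eda_augment(sentence: str, p_delete: float = 0.1,
                n_swaps: int = 1, seed: int = 42) -> str:
    rng = random.Random(seed)
    tokens = sentence.split()
    # Random swap: exchange two token positions.
    for _ in range(n_swaps):
        if len(tokens) > 1:
            i, j = rng.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
    # Random deletion: drop each token with probability p_delete,
    # falling back to the swapped tokens if everything is deleted.
    kept = [t for t in tokens if rng.random() > p_delete] or tokens
    return " ".join(kept)

# Inflectional endings stay attached to displaced words, so case agreement
# with the new neighbours is easily broken.
print(eda_augment("Prašymas pateikiamas atsakingai institucijai raštu."))
```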

5.2. Limitations of Automated Metrics

Our study highlighted the limitations of automated metrics like BLEU and ROUGE in evaluating the nuances of text simplification. Human evaluation proved essential for capturing aspects such as grammaticality, semantic accuracy, and the overall quality of the simplified outputs; it uncovered trends and patterns that automated metrics missed, such as whether a simplification drew on the appropriate context.
For instance, BLEU prioritized exact matches between simplified and reference sentences and penalized valid paraphrases or structural changes that improved the simplification. This limitation underscored the need for more advanced evaluation frameworks. Human evaluators, on the other hand, were able to assess qualitative aspects of simplification that automated metrics overlooked, which emphasized the importance of integrating human judgment into evaluation processes.
These findings aligned with discussions in the NLP field that advocated for hybrid evaluation approaches that combine automated metrics and human assessments [97,98,99]. Such frameworks combine the scalability of automated metrics with the nuanced insights from human evaluations, which enable a more comprehensive and accurate evaluation of text simplification quality.
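To make the contrast concrete, the sketch below computes BLEU with sacrebleu and SARI with EASSE [68] on a single hypothetical sentence pair; the Lithuanian sentences are placeholders. Because BLEU rewards n-gram overlap with the reference, a valid paraphrase can score poorly even when the simplification itself is acceptable.

```python
# Illustration of BLEU's surface bias: BLEU via sacrebleu, SARI via
# EASSE [68]. The Lithuanian sentence pair is a hypothetical placeholder.
import sacrebleu
from easse.sari import corpus_sari

orig = ["Prašymas privalo būti pateiktas institucijai raštu."]
system = ["Prašymą institucijai reikia pateikti raštu."]  # valid paraphrase
refs = [["Prašymą reikia pateikti institucijai raštu."]]  # one reference set

bleu = sacrebleu.corpus_bleu(system, refs)
sari = corpus_sari(orig_sents=orig, sys_sents=system, refs_sents=refs)
# The paraphrase keeps the meaning and simplifies the structure, yet its
# word-order divergence from the reference lowers the n-gram overlap.
print(f"BLEU = {bleu.score:.2f}, SARI = {sari:.2f}")
```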

5.3. Model Performance

  • While mBART performed well in structural simplifications, such as adding pronouns or transforming passive-voice constructions into the active voice, it struggled with very long or syntactically complex sentences, which indicated areas needing further improvement. In comparison, mT5 had difficulties processing longer passages, likely due to memory limitations or insufficient fine-tuning. This suggested that methods like hierarchical modeling [100] or pre-processing strategies such as sentence splitting could enhance its performance.
  • In addition, experiments with LT-Llama-2 revealed limitations such as the unintended expansion of content, where the model added information not present in the original sentence. These issues likely arose from the model’s design, which is better suited to conversational or instructional tasks than to the structural rewriting typical of sentence simplification. These findings highlighted the potential need for larger models, alternative architectures, or more targeted fine-tuning techniques, such as parameter-efficient adaptation with LoRA (see the sketch after this list), to better align the model with the requirements of text simplification.
  • Finally, ChatGPT, though effective in many general-purpose applications, faced challenges with Lithuanian-specific grammatical rules. This raised a question about the suitability of general-purpose language models for low-resource tasks without extensive customization, such as Reinforcement Learning with Human Feedback (RLHF) [101].
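As an example of such targeted fine-tuning, the sketch below configures parameter-efficient adaptation with LoRA [74] via the peft library. The base checkpoint name and all hyperparameters are illustrative assumptions, not the configuration used in our experiments.

```python
# A minimal sketch (assumed configuration, not the one used in this study):
# parameter-efficient fine-tuning of a causal LLM with LoRA [74] via peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights train
```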

5.4. Sentence-Length Dependency and Structural Changes

  • A more in-depth analysis of mBART’s performance showed a clear dependency on sentence length. The model performed well on short to moderately long sentences but struggled with longer, more complex inputs. While techniques like sentence splitting and lexical simplification were effective for shorter sentences, simplifications of longer sentences occasionally lost essential information through deletions.
  • This decline in performance for longer sentences can be attributed to increased syntactic complexity and semantic density. Longer sentences often have complex structures and contain a higher concentration of information [102,103], which makes them more difficult to simplify without losing or altering their meaning. For example, the analysis of input–output pairs revealed that mBART performed well with simple sentence structures but struggled with compound or complex sentences, occasionally producing incomplete simplifications.
To address these challenges, strategies such as iterative simplification [47] or introducing semantic constraints during generation [104] could help reduce information loss when splitting sentences. Additionally, balancing lexical simplification with content preservation is essential: incorporating semantic similarity metrics into model training [105] could help ensure that the simplified text preserves its core meaning while reducing complexity.
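A minimal sketch of such a meaning-preservation check is given below, using multilingual sentence embeddings; the encoder name and threshold are assumptions for illustration, and any multilingual encoder covering Lithuanian could be substituted.

```python
# Sketch of a meaning-preservation check with multilingual sentence
# embeddings. The encoder name and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

original = ("Institucija, gavusi prašymą, privalo jį išnagrinėti "
            "per dvidešimt darbo dienų.")
simplified = "Institucija prašymą išnagrinėja per 20 darbo dienų."

emb = encoder.encode([original, simplified], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {similarity:.3f}")

THRESHOLD = 0.8  # illustrative cut-off
if similarity < THRESHOLD:
    print("flag: possible loss of essential information")
```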

5.5. Evaluation Metrics and Correlation Analysis

In our in-depth study, we identified weak correlations between the automated evaluation metrics and the human-judged criteria of simplicity, meaning retention, and grammaticality. This highlighted the challenges of evaluating text simplification, as automated metrics often disagree with human perceptions: simpler outputs tended to reduce lexical complexity, whereas retaining the original meaning required a high degree of structural and lexical similarity between the original and simplified versions.
These weak correlations likely resulted from a misalignment between automated metrics and human evaluation criteria. Metrics such as BLEU and ROUGE focus on surface-level similarities, like word overlap, and fail to capture deeper aspects [38] such as meaning preservation, consistency, and grammatical correctness. Alternative evaluation methods or task-specific metrics that capture contextual and semantic relationships between words may provide more accurate assessments. Finally, incorporating diverse perspectives, e.g., not only domain experts but also target audiences, could improve evaluation frameworks, given the subjective and complex nature of text simplification.
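For reference, correlations of the kind reported in Tables 6–8 can be computed as in the sketch below; the metric scores and human ratings shown are illustrative stand-ins, not our data.

```python
# Sketch: correlating an automated metric with human ratings, as in
# Tables 6-8. The values below are illustrative stand-ins, not our data.
from scipy.stats import pearsonr

metric_scores = [41.2, 55.7, 33.9, 60.1, 47.5, 28.4, 52.0, 38.8]  # e.g., BLEU
human_ratings = [3, 4, 2, 4, 3, 2, 4, 3]                          # 1-5 scale

r, p_value = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.4f}, p = {p_value:.4f}")
```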

5.6. Attention Mechanisms and Model Behavior

  • An analysis of attention mechanisms revealed that mBART effectively linked input and output content but relied too heavily on delimiters, which impacted semantic richness. Specifically, the encoder attention frequently focused on delimiters, indicating that the model relied on these structural cues rather than giving sufficient attention to content (sub)words. On some occasions, this reliance may have caused the model to disregard important information in the text.
  • On the other hand, decoder attention prioritized fluency by focusing mainly on identical (sub)words or a few preceding (sub)words. This was particularly evident in the examples with lower SARI scores. While this strategy supported grammatical correctness and coherence, it limited the model’s ability to capture long-range dependencies or fully interpret the broader context of the input [106]. This indicated that mBART primarily focused on local context instead of processing and integrating information from the entire sentence, which could limit its ability to produce semantically meaningful and contextually accurate simplifications.
To address these limitations, potential improvements could include implementing hierarchical attention mechanisms [100] or memory augmentation techniques to help the model better capture long-range dependencies, e.g., as in [107,108,109]. Bidirectional decoding strategies could also be explored to achieve a more comprehensive understanding of the input sentence, which would allow the model to generate simplifications that are semantically richer and more accurate. These strategies could improve model performance not only in Lithuanian text simplification but also in other tasks involving low-resource languages.

6. Conclusions

In this study, we investigated text simplification for Lithuanian administrative documents, aiming to improve information accessibility through the use of Plain Language. We fine-tuned the mT5, mBART, and LT-Llama-2 models and evaluated the capabilities of ChatGPT. Among the evaluated models, mBART demonstrated the best performance, effectively handling lexical and syntactic simplifications that corresponded to the requirements of Plain Language.
However, challenges remain in processing longer and more complex sentences, highlighting opportunities for further improvements in sentence-splitting techniques and syntactic complexity management. Also, the results emphasized the importance of high-quality, representative datasets for low-resource languages and underlined the limitations of general-purpose models like ChatGPT for language-specific tasks. Human evaluations proved essential in assessing grammaticality, meaning retention, and simplicity, highlighting the need for task-specific metrics that align with human judgment.
Our future plans will focus on improving model performance through refined fine-tuning techniques, experimenting with larger and more diverse datasets, and optimizing our models for handling longer and more complex texts. While the data we used for fine-tuning the models were carefully curated, the reliance on publicly available administrative texts may introduce sampling bias, particularly in underrepresented domains. The sample size, although sufficient for this study, may limit generalizability to other administrative or informal text types. Therefore, future work will focus on expanding the dataset to include a more comprehensive range of administrative texts.
Additionally, we aim to deepen our understanding of model behavior by analyzing the decision-making processes regarding simplifications, with particular attention to improving factual accuracy and identifying potential biases. These advancements aim to broaden the applicability of text simplification tools in facilitating communication between institutions and the general public.

Author Contributions

Formal analysis, D.K.K.; Investigation, J.M.; Data curation, E.R.; Writing—original draft, D.K.; Writing—review & editing, A.Č.; Supervision, T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by The Research Council of Lithuania (LMTLT), grant agreement No. S-LIP-22-77.

Data Availability Statement

The data used for fine-tuning are available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BART: Bidirectional and Auto-Regressive Transformers
BERT: Bidirectional Encoder Representations from Transformers
BLEU: Bilingual Evaluation Understudy
EASSE: Easier Automatic Sentence Simplification Evaluation
FKGL: Flesch–Kincaid Grade Level
GPT: Generative Pre-trained Transformer
HSSF: Human Simplification with Sentence Fusion
KG: Knowledge Graph
LIME: Local Interpretable Model Agnostic Explanation
Llama: Large Language Model Meta AI
LLMs: Large Language Models
LoRA: Low-Rank Adaptation
LR: Logistic Regression
LSTM: Long Short-Term Memory
mBART: Multilingual BART
mT5: Multilingual T5
RLHF: Reinforcement Learning with Human Feedback
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
RST: Rhetorical Structure Theory
SARI: System Output Against References and Against the Input Sentence
SHAP: SHapley Additive exPlanations
T5: Text-to-Text Transfer Transformer
Glosses:
ptcp: participle
pst: past tense
pr: present tense
1: first person
sg: singular
pl: plural
nom: nominative case
acc: accusative case
f: feminine gender
m: masculine gender
pass: passive voice
act: active voice

References

  1. Štajner, S. Automatic text simplification for social good: Progress and challenges. ACL-IJCNLP 2021, 2021, 2637–2652. [Google Scholar]
  2. François, T.; Müller, A.; Rolin, E.; Norré, M. AMesure: A Web platform to assist the clear writing of administrative texts. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations, Suzhou, China, 4–7 December 2020; pp. 1–7. [Google Scholar]
  3. Adler, M. The Plain Language Movement. In The Oxford Handbook of Language and Law; Tiersma, P.M., Solan, L.M., Eds.; Oxford University Press: Oxford, UK, 2012. [Google Scholar]
  4. Maaß, C. Easy Language—Plain Language—Easy Language Plus: Balancing Comprehensibility and Acceptability; Frank & Timme: Berlin, Germany, 2020. [Google Scholar]
  5. Rennes, E.; Jönsson, A. A tool for automatic simplification of swedish texts. In Proceedings of the 20th Nordic Conference of Computational Linguistics, Vilnius, Lithuania, 11–13 May 2015; Linköping University Electronic Press: Linköping, Sweden, 2015; pp. 317–320. [Google Scholar]
  6. Suter, J.; Ebling, S.; Volk, M. Rule-based Automatic Text Simplification for German. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), Bochum, Germany, 19–21 September 2016; pp. 279–287. [Google Scholar] [CrossRef]
  7. Štajner, S.; Saggion, H. Data-driven text simplification. In Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts, Santa Fe, NM, USA, 20 August 2018; pp. 19–23. [Google Scholar]
  8. Srikanth, N.; Li, J.J. Elaborative simplification: Content addition and explanation generation in text simplification. arXiv 2020, arXiv:2010.10035. [Google Scholar]
  9. Huang, C.Y.; Wei, J.; Huang, T.H. Generating Educational Materials with Different Levels of Readability using LLMs. arXiv 2024, arXiv:2406.12787. [Google Scholar]
  10. Agrawal, S.; Carpuat, M. Controlling Pre-trained Language Models for Grade-Specific Text Simplification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12807–12819. [Google Scholar]
  11. Qiang, J.; Li, Y.; Zhu, Y.; Yuan, Y.; Wu, X. Lexical simplification with pretrained encoders. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8649–8656. [Google Scholar]
  12. Alissa, S.; Wald, M. Text simplification using transformer and BERT. Comput. Mater. Contin. 2023, 75, 3479–3495. [Google Scholar] [CrossRef]
  13. Maddela, M.; Alva-Manchego, F.; Xu, W. Controllable text simplification with explicit paraphrasing. arXiv 2020, arXiv:2010.11004. [Google Scholar]
  14. Sheang, K.C.; Saggion, H. Controllable sentence simplification with a unified text-to-text transfer transformer. In Proceedings of the 14th International Conference on Natural Language Generation (INLG), Aberdeen, UK, 20–24 September 2021; Association for Computational Linguistics: Aberdeen, UK, 2021. [Google Scholar]
  15. Seidl, T.; Vandeghinste, V. Controllable Sentence Simplification in Dutch. Comput. Linguist. Neth. J. 2024, 13, 31–61. [Google Scholar]
  16. Monteiro, J.; Aguiar, M.; Araújo, S. Using a pre-trained SimpleT5 model for text simplification in a limited corpus. In Proceedings of the Working Notes of CLEF, Bologna, Italy, 5–8 September 2022. [Google Scholar]
  17. Schlippe, T.; Eichinger, K. Multilingual Text Simplification and its Performance on Social Sciences Coursebooks. In Proceedings of the International Conference on Artificial Intelligence in Education Technology, Berlin, Germany, 30 June–2 July 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 119–136. [Google Scholar]
  18. Ohnesorge, F.; Gutiérrez, M.Á.; Plichta, J. CLEF 2023: Scientific Text Simplification and General Audience. In Proceedings of the Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 18–21 September 2023. [Google Scholar]
  19. Devaraj, A.; Wallace, B.C.; Marshall, I.J.; Li, J.J. Paragraph-level simplification of medical texts. In Proceedings of the Conference Association for Computational Linguistics, North American Chapter, Meeting, Online Event, 6–11 June 2021; NIH Public Access: Bethesda, MD, USA, 2021; Volume 2021, p. 4972. [Google Scholar]
  20. Vásquez-Rodríguez, L.; Shardlow, M.; Przybyła, P.; Ananiadou, S. Document-level Text Simplification with Coherence Evaluation. In Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability, Varna, Bulgaria, 7 September 2023; pp. 85–101. [Google Scholar]
  21. Wen, Z.; Fang, Y. Augmenting low-resource text classification with graph-grounded pre-training and prompting. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 506–516. [Google Scholar]
  22. Deilen, S.; Hernandez Garrido, S.; Lapshinova-Koltunski, E.; Maaß, C. Using ChatGPT as a CAT tool in Easy Language translation. In Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability, Varna, Bulgaria, 7 September 2023; Štajner, S., Saggio, H., Shardlow, M., Alva-Manchego, F., Eds.; INCOMA Ltd.: Shoumen, Bulgaria, 2023; pp. 1–10. [Google Scholar]
  23. Li, Z.; Shardlow, M.; Alva-Manchego, F. Comparing Generic and Expert Models for Genre-Specific Text Simplification. In Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability, Varna, Bulgaria, 7 September 2023; Štajner, S., Saggio, H., Shardlow, M., Alva-Manchego, F., Eds.; INCOMA Ltd.: Shoumen, Bulgaria, 2023; pp. 51–67. [Google Scholar]
  24. Ayre, J.; Mac, O.; McCaffery, K.; McKay, B.R.; Liu, M.; Shi, Y.; Rezwan, A.; Dunn, A.G. New frontiers in health literacy: Using ChatGPT to simplify health information for people in the community. J. Gen. Intern. Med. 2024, 39, 573–577. [Google Scholar] [CrossRef]
  25. Sudharshan, R.; Shen, A.; Gupta, S.; Zhang-Nunes, S. Assessing the Utility of ChatGPT in Simplifying Text Complexity of Patient Educational Materials. Cureus 2024, 16, e55304. [Google Scholar] [CrossRef]
  26. Tariq, R.; Malik, S.; Roy, M.; Islam, M.Z.; Rasheed, U.; Bian, J.; Zheng, K.; Zhang, R. Assessing ChatGPT for Text Summarization, Simplification and Extraction Tasks. In Proceedings of the 2023 IEEE 11th International Conference on Healthcare Informatics (ICHI), Houston, TX, USA, 26–29 June 2023; pp. 746–749. [Google Scholar] [CrossRef]
  27. Guo, B.; Zhang, X.; Wang, Z.; Jiang, M.; Nie, J.; Ding, Y.; Yue, J.; Wu, Y. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv 2023, arXiv:2301.07597. [Google Scholar] [CrossRef]
  28. Doshi, R.; Amin, K.S.; Khosla, P.; Bajaj, S.; Chheang, S.; Forman, H.P. Utilizing Large Language Models to Simplify Radiology Reports: A comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Bard, and Microsoft Bing. medRxiv 2023. [Google Scholar] [CrossRef]
  29. Jeblick, K.; Schachtner, B.; Dexl, J.; Mittermeier, A.; Stüber, A.T.; Topalis, J.; Weber, T.; Wesp, P.; Sabel, B.O.; Ricke, J.; et al. ChatGPT makes medicine easy to swallow: An exploratory case study on simplified radiology reports. Eur. Radiol. 2023, 34, 2817–2825. [Google Scholar] [CrossRef] [PubMed]
  30. Garbacea, C.; Guo, M.; Carton, S.; Mei, Q. Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online Event, 1–6 August 2021. [Google Scholar]
  31. Ormaechea, L.; Tsourakis, N.; Schwab, D.; Bouillon, P.; Lecouteux, B. Simple, Simpler and Beyond: A Fine-Tuning BERT-Based Approach to Enhance Sentence Complexity Assessment for Text Simplification. In Proceedings of the International Conference on Natural Language and Speech Processing, Online Event, 16–17 December 2023; pp. 120–133. [Google Scholar]
  32. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  33. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar]
  34. Qiao, Y.; Li, X.; Wiechmann, D.; Kerz, E. (Psycho-)Linguistic Features Meet Transformer Models for Improved Explainable and Controllable Text Simplification. arXiv 2022, arXiv:2212.09848. [Google Scholar]
  35. Xu, W.; Wang, D.; Pan, L.; Song, Z.; Freitag, M.; Wang, W.Y.; Li, L. INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback. arXiv 2023, arXiv:2305.14282. [Google Scholar]
  36. Jiang, D.; Li, Y.; Zhang, G.; Huang, W.; Lin, B.Y.; Chen, W. TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks. Trans. Mach. Learn. Res. 2023, 2024, 1–29. Available online: https://openreview.net/pdf?id=EE1CBKC0SZ (accessed on 26 January 2025).
  37. Maddela, M.; Dou, Y.; Heineman, D.; Xu, W. LENS: A Learnable Evaluation Metric for Text Simplification. arXiv 2022, arXiv:2212.09739. [Google Scholar]
  38. Ajlouni, A.B.A.; Li, J.; Ajlouni, M.A. Towards a Comprehensive Metric for Evaluating Text Simplification Systems. In Proceedings of the 2023 14th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 21–23 November 2023; pp. 1–6. [Google Scholar]
  39. Cripwell, L.; Legrand, J.; Gardent, C. Document-Level Planning for Text Simplification. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; pp. 993–1006. [Google Scholar]
  40. Yamaguchi, D.; Miyata, R.; Shimada, S.; Sato, S. Gauging the Gap Between Human and Machine Text Simplification Through Analytical Evaluation of Simplification Strategies and Errors. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia, 2–6 May 2023; pp. 359–375. [Google Scholar]
  41. Blinova, S.; Zhou, X.; Jaggi, M.; Eickhoff, C.; Bahrainian, S.A. SIMSUM: Document-level Text Simplification via Simultaneous Summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 9927–9944. [Google Scholar]
  42. Alkaldi, W.; Inkpen, D. Text Simplification to Specific Readability Levels. Mathematics 2023, 11, 2063. [Google Scholar] [CrossRef]
  43. Cardon, R.; Bibal, A. On Operations in Automatic Text Simplification. In Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability (TSAR), Varna, Bulgaria, 7 September 2023; pp. 116–130. [Google Scholar]
  44. Hewett, F. APA-RST: A Text Simplification Corpus with RST Annotations. In Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023), Toronto, ON, Canada, 13–14 July 2023. [Google Scholar]
  45. Colas, A.; Ma, H.; He, X.; Bai, Y.; Wang, D.Z. Can Knowledge Graphs Simplify Text? In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 379–389. [Google Scholar]
  46. Ivchenko, O.; Grabar, N. Impact of the Text Simplification on Understanding. Stud. Health Technol. Inform. 2022, 294, 634–638. [Google Scholar] [CrossRef]
  47. Wang, J. Research on Text Simplification Method Based on BERT. In Proceedings of the 2022 7th International Conference on Multimedia Communication Technologies (ICMCT), Xiamen, China, 7–9 July 2022; IEEE Computer Society: Washington, DC, USA, 2022; pp. 78–81. [Google Scholar] [CrossRef]
  48. Corti, L.; Yang, J. ARTIST: ARTificial Intelligence for Simplified Text. arXiv 2023, arXiv:2308.13458. [Google Scholar]
  49. Harris, L.; Kleimann, S.; Mowat, C. Setting plain language standards. Clarity J. 2010, 64, 16–25. [Google Scholar]
  50. Martinho, M. International standard for clarity—We bet this works for all languages. Clarity J. 2018, 79, 17–20. [Google Scholar]
  51. Brunato, D.; Dell’Orletta, F.; Venturi, G.; Montemagni, S. Design and Annotation of the First Italian Corpus for Text Simplification. In Proceedings of the 9th Linguistic Annotation Workshop, Denver, CO, USA, 5 June 2015; pp. 31–41. [Google Scholar] [CrossRef]
  52. Dębowski, Ł.; Broda, B.; Nitoń, B.; Charzyńska, E. Jasnopis—A program to compute readability of texts in Polish based on psycholinguistic research. In Proceedings of the 12th Workshop on Natural Language Processing and Cognitive Science (NLPCS 2015), Krakow, Poland, 22–23 September 2015; pp. 51–61. [Google Scholar]
  53. Utka, A. Dažninis Rašytinės Lietuvių Kalbos žodynas; Vytautas Magnus University Press: Kaunas, Lithuania, 2009. [Google Scholar]
  54. Scarton, C.; Paetzold, G.; Specia, L. Simpa: A sentence-level simplification corpus for the public administration domain. In Proceedings of the LREC 2018, Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
  55. Schwarzer, M.; Tanprasert, T.; Kauchak, D. Improving human text simplification with sentence fusion. In Proceedings of the 15th Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15), Mexico City, Mexico, 11 June 2021; pp. 106–114. [Google Scholar]
  56. Zhang, J.; Zhao, H.; Boyd-Graber, J. Contextualized Rewriting for Text Simplification. Proc. Trans. Assoc. Comput. Linguist. 2021, 9, 1525–1540. [Google Scholar] [CrossRef]
  57. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online Event, 6–11 June 2021; pp. 483–498. [Google Scholar]
  58. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
  59. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual Denoising Pre-training for Neural Machine Translation. arXiv 2020, arXiv:2001.08210. [Google Scholar] [CrossRef]
  60. Nakvosas, A.; Daniušis, P.; Mulevičius, V. Open Llama2 Model for the Lithuanian Language. arXiv 2024, arXiv:2408.12963. [Google Scholar]
  61. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  62. Yenduri, G.; Ramalingam, M.; Chemmalar Selvi, G.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Deepti Raj, G.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. GPT (Generative Pre-trained Transformer)—A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions. IEEE Access 2024, 12, 54608–54649. [Google Scholar] [CrossRef]
  63. Rothman, D. Transformers for Natural Language Processing: Build, Train, and Fine-Tune Deep Neural Network Architectures for NLP with Python, Hugging Face, and OpenAI’s GPT-3, ChatGPT, and GPT-4; Packt Publishing Ltd.: Birmingham, UK, 2022. [Google Scholar]
  64. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  65. Xu, W.; Napoles, C.; Chen, Q.; Callison-Burch, C. Optimizing Statistical Machine Translation for Text Simplification. Trans. Assoc. Comput. Linguist. 2016, 4, 401–415. [Google Scholar] [CrossRef]
  66. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  67. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  68. Alva-Manchego, F.; Martin, L.; Scarton, C.; Specia, L. EASSE: Easier Automatic Sentence Simplification Evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, Hong Kong, China, 3–7 November 2019; pp. 49–54. [Google Scholar] [CrossRef]
  69. Stodden, R.; Kallmeyer, L. A multi-lingual and cross-domain analysis of features for text simplification. In Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), Marseille, France, 11 May 2020; pp. 77–84. [Google Scholar]
  70. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the ACL 2002, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  71. Nisioi, S.; Štajner, S.; Ponzetto, S.P.; Dinu, L.P. Exploring neural text simplification models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 85–91. [Google Scholar]
  72. Alva-Manchego, F.; Scarton, C.; Specia, L. Data-driven sentence simplification: Survey and benchmark. Comput. Linguist. 2020, 46, 135–187. [Google Scholar] [CrossRef]
  73. Vig, J. A Multiscale Visualization of Attention in the Transformer Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, 28 July–2 August 2019; pp. 37–42. [Google Scholar] [CrossRef]
  74. Hu, J.E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  75. Massaro, A.; Samo, G. Prompting Metalinguistic Awareness in Large Language Models: ChatGPT and Bias Effects on the Grammar of Italian and Italian Varieties. Verbum 2023, 14, 1–11. [Google Scholar] [CrossRef]
  76. Vladarskienė, R. Lietuvių bendrinės ir administracinės kalbos santykis. Bendrinė Kalba (Iki 2014 Metų–Kalbos KultūRa) 2007, 80, 55–63. [Google Scholar]
  77. Stodden, R. Reproduction of German Text Simplification Systems. In Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context@ LREC-COLING 2024, Turin, Italy, 21 May 2024; pp. 1–15. [Google Scholar]
  78. Magrath, W.J.; Shneyderman, M.; Bauer, T.; Placer, P.N.; Best, S.; Akst, L. Readability Analysis and Accessibility of Online Materials About Transgender Voice Care. Otolaryngol. Head Neck Surg. 2022, 167, 952–958. [Google Scholar] [CrossRef] [PubMed]
  79. Vadlamannati, S.; Şahin, G.G. Metric-Based In-context Learning: A Case Study in Text Simplification. arXiv 2023, arXiv:2307.14632. [Google Scholar]
  80. Chamovitz, E.; Abend, O. Cognitive Simplification Operations Improve Text Simplification. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL) (Hybrid Event), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 241–265. [Google Scholar]
  81. Iavarone, B.; Brunato, D.; Dell’Orletta, F. Sentence Complexity in Context. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Online Event, 10 June 2021. [Google Scholar]
  82. Nau, N.; Spraunienė, B.; Žeimantienė, V. The passive family in Baltic. Balt. Linguist. 2020, 11, 27–128. [Google Scholar] [CrossRef]
  83. Ramonaitė, J.T. Pronoun Variants in Standard Lithuanian: Diamesic Dimension. Liet. Kalba 2021, 16, 8–24. [Google Scholar] [CrossRef]
  84. Field, A.P.; Gillett, R. How to do a meta-analysis. Br. J. Math. Stat. Psychol. 2010, 63, 665–694. [Google Scholar] [CrossRef]
  85. Eddington, D. Statistics for Linguists: A Step-by-Step Guide for Novices; Cambridge Scholars Publishing: Newcastle, UK, 2016. [Google Scholar]
  86. Aziz Hussin, A. Refining the Flesch Reading Ease formula for intermediate and high-intermediate ESL learners. Int. J. e-Learn. High. Educ. (IJELHE) 2015, 3, 123–142. [Google Scholar]
  87. Po, D.K. Similarity based information retrieval using Levenshtein distance algorithm. Int. J. Adv. Sci. Res. Eng. 2020, 6, 6–10. [Google Scholar] [CrossRef]
  88. Chen, B.; Cherry, C. A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; pp. 362–367. [Google Scholar]
  89. Zhou, K.; Ethayarajh, K.; Card, D.; Jurafsky, D. Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; pp. 401–423. [Google Scholar]
  90. Bai, Y.; Yi, J.; Tao, J.; Tian, Z.; Wen, Z.; Zhang, S. Fast end-to-end speech recognition via non-autoregressive models and cross-modal knowledge transferring from BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1897–1911. [Google Scholar] [CrossRef]
  91. Clark, K.; Khandelwal, U.; Levy, O.; Manning, C.D. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the BlackboxNLP@ACL, Florence, Italy, 1 August 2019. [Google Scholar]
  92. Manakul, P.; Gales, M. Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 9359–9368. [Google Scholar]
  93. Raganato, A.; Tiedemann, J. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 287–297. [Google Scholar]
  94. Lamar, A.K.; Kaya, Z. Measuring the Impact of Data Augmentation Methods for Extremely Low-Resource NMT. In Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023), Dubrovnik, Croatia, 6 May 2023. [Google Scholar] [CrossRef]
  95. Van, H. Mitigating Data Scarcity for Large Language Models. arXiv 2023, arXiv:2302.01806. [Google Scholar]
  96. Okimura, I.; Reid, M.; Kawano, M.; Matsuo, Y. On the impact of data augmentation on downstream performance in natural language processing. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, Dublin, Ireland, 26 May 2022; pp. 88–93. [Google Scholar]
  97. Sai, A.B.; Dixit, T.; Sheth, D.Y.; Mohan, S.; Khapra, M.M. Perturbation CheckLists for Evaluating NLG Evaluation Metrics. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7219–7234. [Google Scholar]
  98. Zhang, S.; Bansal, M. Finding a Balanced Degree of Automation for Summary Evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6617–6632. [Google Scholar]
  99. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 2511–2522. [Google Scholar]
  100. He, Z.; Qin, Z.; Prakriya, N.; Sun, Y.; Cong, J. HMT: Hierarchical Memory Transformer for Long Context Language Processing. arXiv 2024, arXiv:2405.06067. [Google Scholar]
  101. Wu, Z.; Hu, Y.; Shi, W.; Dziri, N.; Suhr, A.; Ammanabrolu, P.; Smith, N.A.; Ostendorf, M.; Hajishirzi, H. Fine-grained human feedback gives better rewards for language model training. Adv. Neural Inf. Process. Syst. 2023, 36, 59008–59033. [Google Scholar]
  102. Niklaus, C.; Cetto, M.; Freitas, A.; Handschuh, S. Discourse-Aware Text Simplification: From Complex Sentences to Linked Propositions. arXiv 2023, arXiv:2308.00425. [Google Scholar]
  103. Salman, M.; Haller, A.; Rodríguez Méndez, S.J. Syntactic Complexity Identification, Measurement, and Reduction Through Controlled Syntactic Simplification. arXiv 2023, arXiv:2304.07774. [Google Scholar]
  104. Hijazi, R.; Espinasse, B.; Gala, N. GRASS: A Syntactic Text Simplification System based on Semantic Representations. Data Sci. Mach. Learn. 2022, 12, 221–236. [Google Scholar] [CrossRef]
  105. Lu, J.; Li, J.; Wallace, B.C.; He, Y.; Pergola, G. NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia, 2–6 May 2023; pp. 1079–1091. [Google Scholar]
  106. Guan, J.; Mao, X.; Fan, C.; Liu, Z.; Ding, W.; Huang, M. Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online Event, 1–6 August 2021; pp. 6379–6393. [Google Scholar]
  107. Yu, H.; Wang, C.; Zhang, Y.; Bi, W. TRAMS: Training-free Memory Selection for Long-range Language Modeling. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 4966–4972. [Google Scholar]
  108. Fang, J.; Tang, L.; Bi, H.; Qin, Y.; Sun, S.; Li, Z.; Li, H.; Li, Y.; Cong, X.; Lin, Y.; et al. Unimem: Towards a unified view of long-context large language models. arXiv 2024, arXiv:2402.03009. [Google Scholar]
  109. He, Z.; Karlinsky, L.; Kim, D.; McAuley, J.; Krotov, D.; Feris, R. CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory. arXiv 2024, arXiv:2402.13449. [Google Scholar]
Figure 1. Best simplifications according to SARI: active participle construction.
Figure 2. Best simplifications according to SARI: passive participle construction.
Figure 3. Worst simplifications according to SARI.
Figure 4. Label distribution.
Figure 5. Cross attention of the last layer: example of best simplification.
Figure 6. Cross attention of the last layer: example of worst simplification.
Figure 7. Decoder attention of the last layer: example of best simplification.
Figure 8. Decoder attention of the last layer: example of worst simplification.
Figure 9. Encoder attention of the last layer: example of best simplification.
Figure 10. Encoder attention of the last layer: example of worst simplification.
Table 1. Basic statistics of corpora.
| Statistic | Parallel Corpus 1: Original | Parallel Corpus 1: Simplified | Parallel Corpus 2: Original | Parallel Corpus 2: Simplified |
| Number of sentences | 2142 | 2521 | 3123 | 2999 |
| Number of words | 36,404 | 34,702 | 64,936 | 52,382 |
| Average sentence length | 14.75 | 12.43 | 16.97 | 13.53 |
| Average word length | 7.10 | 6.79 | 7.03 | 6.68 |
Table 2. Comparison of mBART and mT5 models for text simplification in the Lithuanian language.
| Metric | ‘New’ mBART * | ‘New’ mT5 * | ‘Old’ mBART * | ‘Old’ mT5 * |
| SARI | 57.2374 | 54.1182 | 72.9781 | 56.0943 |
| BERTScore | 0.8633 | 0.8342 | 0.9155 | 0.8498 |
| ROUGE-1 | 0.6396 | 0.5931 | 0.7797 | 0.6205 |
| ROUGE-2 | 0.4703 | 0.4323 | 0.6753 | 0.4652 |
| ROUGE-L | 0.5993 | 0.5593 | 0.7555 | 0.5875 |
* New mBART and mT5 were fine-tuned on Parallel Corpus 2; old mBART and mT5, on Parallel Corpus 1.
Table 3. Qualitative evaluation scores (averages).
| Model | Simplicity | Meaning Retention | Grammaticality |
| mT5 | 2.67 | 3.16 | 3.33 |
| mBART | 2.80 | 3.72 | 3.88 |
Table 4. System vs. reference comparison table.
| System | SARI | FKGL | Compression Ratio | Sentence Splits | Levenshtein Similarity | Exact Copies | Additions Proportion | Deletions Proportion | Lexical Complexity Score |
| System output | 49.63 | 13.84 | 0.94 | 1.04 | 0.84 | 0.19 | 0.20 | 0.23 | 9.66 |
| Identity baseline | 13.43 | 17.41 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 9.73 |
| Truncate baseline | 23.41 | 15.98 | 0.76 | 0.98 | 0.85 | 0.0 | 0.05 | 0.26 | 9.84 |
Table 5. Results by sentence length (characters).
| Length | BLEU | SARI | FKGL | Compression Ratio | Sentence Splits | Levenshtein Similarity | Exact Copies | Additions Proportion | Deletions Proportion | Lexical Complexity Score |
| [8;55] | 51.04 | 58.51 | 8.77 | 1.13 | 1.00 | 0.87 | 0.51 | 0.23 | 0.13 | 10.36 |
| [55;102] | 41.26 | 50.77 | 10.35 | 0.97 | 0.98 | 0.86 | 0.19 | 0.21 | 0.20 | 10.14 |
| [102;164] | 40.04 | 48.63 | 13.72 | 0.92 | 1.02 | 0.87 | 0.15 | 0.17 | 0.21 | 9.59 |
| [164;242] | 40.04 | 51.73 | 14.13 | 0.89 | 1.09 | 0.81 | 0.07 | 0.22 | 0.29 | 9.23 |
| [242;4830] | 22.10 | 45.06 | 16.80 | 0.81 | 1.11 | 0.78 | 0.04 | 0.18 | 0.33 | 8.88 |
Table 6. Metrics that correlated with the simplicity criterion.
| Metric | Pearson | p-Value |
| BLEUSmoothed | −0.1277 | 0.0071 |
| AverageCosine | −0.1114 | 0.0190 |
| LexicalComplexity | −0.1100 | 0.0206 |
| ProportionDeletedWords | 0.1067 | 0.0247 |
| FleshReadingEase | 0.1061 | 0.0255 |
| AvgPositionWordsFreqTable | −0.1021 | 0.0317 |
| CompressionRatio | 0.1011 | 0.0334 |
| LemmasInCommon | −0.0988 | 0.0377 |
| LevenshteinDistance | 0.0954 | 0.0448 |
| LevenshteinSimilarity | −0.0954 | 0.0448 |
Table 7. Metrics that correlated with the meaning preservation criterion.
| Metric | Pearson | p-Value |
| BLEUSmoothed | 0.1726 | 0.0003 |
| ProportionDeletedWords | −0.1660 | 0.0005 |
| WordsPerSentence | −0.1648 | 0.0005 |
| CharactersPerSentence | −0.1648 | 0.0005 |
| SyllablesPerSentence | −0.1641 | 0.0005 |
| MaxPositionWordsFreqTable | −0.1461 | 0.0021 |
| FKGL | −0.1409 | 0.0030 |
| CharsPerSentenceDifference | −0.1293 | 0.0064 |
| UnchangedWordsProportion | 0.1269 | 0.0075 |
| LexicalComplexityScore | −0.1130 | 0.0174 |
| KeptWordsProportion | 0.1117 | 0.0187 |
| HungarianCosine | 0.0998 | 0.0358 |
Table 8. Metrics that correlated with the grammaticality criterion.
| Metric | Pearson | p-Value |
| BLEUSmoothed | 0.1651 | 0.0005 |
| Words | −0.1472 | 0.0019 |
| Characters | −0.1472 | 0.0019 |
| SyllablesPerSentence | −0.1463 | 0.0020 |
| LevenshteinDistance | −0.1386 | 0.0035 |
| LevenshteinSimilarity | 0.1386 | 0.0035 |
| UnchangedWordsProportion | 0.1310 | 0.0057 |
| WordsPerSentence | −0.1275 | 0.0072 |
| CharactersPerSentence | −0.1275 | 0.0072 |
| MaxPositionWordsFreqTable | −0.1258 | 0.0080 |
| SyllablesPerSentence | −0.1247 | 0.0086 |
| LemmasInCommon | 0.1243 | 0.0088 |
| ProportionKeptWords | 0.1179 | 0.0131 |
| FKGL | −0.0977 | 0.0399 |
