Article

Dynamic Aggregation and Augmentation for Low-Resource Machine Translation Using Federated Fine-Tuning of Pretrained Transformer Models

by Emmanuel Agyei 1, Xiaoling Zhang 1,*, Ama Bonuah Quaye 2, Victor Adeyi Odeh 1 and Joseph Roger Arhin 1

1 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
2 School of Public Administration, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4494; https://doi.org/10.3390/app15084494
Submission received: 10 February 2025 / Revised: 16 April 2025 / Accepted: 17 April 2025 / Published: 18 April 2025

Abstract:
Machine Translation (MT) for low-resource languages such as Twi remains a persistent challenge in natural language processing (NLP) due to the scarcity of extensive parallel datasets. Traditional methods, which rely heavily on high-resource data, frequently fall short and underserve low-resource languages. To address this, we propose a fine-tuned T5 model trained with a Cross-Lingual Optimization Framework (CLOF), a method that dynamically modifies gradient weights to balance low-resource (Twi) and high-resource (English) datasets. This cross-lingual learning framework leverages the strengths of federated training to improve translation performance while remaining scalable to other low-resource languages. To maximize the quality of model input, the study uses a carefully selected parallel English-Twi corpus that has been aligned and tokenized. Translation quality is evaluated with SpBLEU, ROUGE (ROUGE-1, ROUGE-2, and ROUGE-L), and Word Error Rate (WER) metrics. A pretrained mT5 model establishes baseline performance and serves as the reference for the optimized model. Experimental results show notable gains: the fine-tuned model raises SpBLEU from 2.16% to 71.30%, improves ROUGE-1 from 15.23% to 65.24%, and reduces WER from 183.16% to 68.32%. These findings highlight the effectiveness of CLOF in addressing the challenges of low-resource MT and enhancing the quality of Twi translations. This work demonstrates the potential of combining cross-lingual learning and federated training to advance NLP for underrepresented languages, paving the way for more inclusive and scalable translation systems.

1. Introduction

Machine translation (MT) has made significant strides in artificial intelligence, but it still faces a major challenge: the lack of high-quality digital resources for many low-resource languages. While neural machine translation excels for high-resource languages like English, it struggles with thousands of underrepresented languages due to insufficient parallel corpora and training data. This scarcity often results in poor translation quality for low-resource languages, which are essential for preserving cultural and community identities. Traditional MT methods face similar limitations, yielding subpar translations when data are sparse or unreliable [1,2,3]. Low-resource languages are particularly vulnerable to issues like data inconsistencies in techniques such as backtranslation, which heavily depend on the availability of high-quality data [4,5]. As a result, improving MT for these languages requires innovative strategies that better utilize the available resources.
In response to this challenge, we propose a dynamic dataset aggregation framework that adjusts gradient weights during training to optimize the use of limited data. Drawing inspiration from federated learning principles, our method allows the use of diverse data sources without exposing raw data, making it ideal for low-resource languages. Additionally, the approach integrates both parallel and monolingual data, which enhances the translation quality for underrepresented languages like Twi. By adapting to the specific characteristics and scale of the target dataset, this framework significantly improves translation accuracy and maximizes the effectiveness of the available resources.
This work addresses key gaps in the current machine translation landscape. While federated learning has primarily been explored for data privacy, its potential to enhance machine translation (MT) for low-resource languages remains underexplored. Additionally, the power of cross-lingual learning, especially between high-resource and low-resource languages, has not been fully harnessed for languages such as Twi, a widely spoken African language. While traditional federated learning is primarily motivated by concerns of data privacy and decentralized computation, our approach—referred to as federated-like training—serves a different yet equally critical objective. Rather than enforcing strict decentralization, we adopt this paradigm to enable personalized and weighted optimization across heterogeneous data distributions, specifically between high-resource (e.g., English) and low-resource (e.g., Twi) language domains. This controlled gradient reweighting ensures that the model learns balanced representations without allowing high-resource language data to dominate, which is a common pitfall in centralized multitask learning. By simulating a federated environment, our method retains the modular benefits of federated architectures while adapting them to centralized but diverse data settings, making it both effective and feasible for low-resource machine translation. Our contributions include the following:
  • A novel personalized Cross-Lingual Optimization Framework (CLOF) that dynamically adjusts gradient aggregation weights throughout training to enhance the robustness of translation models;
  • A scalable framework for cross-lingual learning that efficiently utilizes high-resource language data to improve translation performance for low-resource languages;
  • Empirical insights about linguistic interaction patterns in federated learning and their influence on model convergence;
  • A practical application showcasing improved translation quality for Twi, thereby extending NLP capabilities for African languages;
  • A flexible methodology that can be adapted to other low-resource languages, which is especially advantageous for underrepresented language communities.
The remainder of the paper is organized as follows: First, we provide a comprehensive review of related research in low-resource machine translation, federated learning, and dataset aggregation techniques. Next, we present our methodology, which includes the architecture of our proposed system and the dynamic dataset aggregation processes. This is followed by detailed experimental results that validate the effectiveness of our approach. Finally, we discuss the implications of our findings and propose directions for future research in this rapidly evolving field.

2. Related Works

This section provides a systematic review of existing research on machine translation (MT) for low-resource languages, focusing on key challenges, methodological advancements, and their contributions to the field. The discussion is organized around four main themes: (1) challenges and approaches in low-resource MT, (2) data augmentation and transfer learning, (3) federated learning, and (4) advancements in pretrained Transformer models and dataset aggregation to improve translation performance for languages like Twi. Each theme is grounded in a rigorous analysis of the recent literature, highlighting gaps and opportunities for future research. These approaches aim to address challenges in data scarcity and enhance translation accuracy.

2.1. Challenges and Approaches in Low-Resource Machine Translation

Machine translation (MT) for low-resource languages presents several critical challenges, primarily stemming from limited parallel corpora, complex morphological structures, and extensive linguistic diversity. Studies, such as those by Jinyue Qi [6], have highlighted how languages like Tibetan, Uyghur, and Urdu suffer from a dearth of digital resources, which severely limits the training of neural machine translation (NMT) systems. These challenges contribute to issues such as vocabulary misalignment, hallucinations, and morphological discrepancies, particularly in languages with rich inflectional structures. This issue is compounded by the typological and orthographic diversity found in Indic languages. Languages such as Kannada and Arabic exhibit complex syntactic and lexical variations across their many dialects, presenting significant hurdles for accurate translation [7,8]. Furthermore, languages with extensive morphological richness, like Kinyarwanda, have benefited from the incorporation of morphological modeling. Other research has demonstrated that decomposing words into stems and affixes can improve the translation of such languages by addressing these morphological complexities [9,10]. These findings underscore the importance of adapting NMT models to the unique linguistic characteristics of low-resource languages.
In response to these challenges, hybrid models have emerged as promising solutions. By combining supervised learning on small parallel datasets with unsupervised techniques that utilize large monolingual datasets, these models offer a means to mitigate data scarcity. For instance, Egyptian Arabic translation has seen improvements due to unsupervised pretraining methods that leverage abundant monolingual data, bridging the gap where parallel data are insufficient [8]. However, further exploration into the optimal integration of supervised and unsupervised methods is needed to maximize the effectiveness of these hybrid models.
Moreover, pivot prompting (an approach that uses high-resource languages as intermediaries) has shown effectiveness in translating low-resource Asian languages. By leveraging syntactic and semantic alignments through intermediary languages, this method reduces errors inherent in the direct translation of low-resource language pairs. Multilingual pretrained models such as mBART and mT5 have demonstrated improvements for Indic languages by leveraging shared representations across related languages [11]. While these advances are promising, more research is required to understand the underlying mechanisms of how these common representations are learned and used.

2.2. Data Augmentation and Transfer Learning for Low-Resource MT

The issue of data scarcity in low-resource MT has catalyzed the development of several data augmentation strategies aimed at expanding training corpora. One widely used method is backtranslation, where a target language text is translated back into the source language to generate synthetic parallel data. However, as shown in Figure 1, the effectiveness of backtranslation is highly dependent on the quality of the backward translation system, as poor translation models can introduce errors into the synthetic data, which in turn degrade model performance [12,13,14,15,16]. To mitigate these issues, constrained sampling methods have been introduced, which use discriminator models to reject low-quality synthetic data by evaluating factors like semantic similarity and syntactic coherence.
In addition to backtranslation, Masked Language Models (MLMs) have emerged as a more robust augmentation tool. Unlike traditional synonym replacement methods, MLMs paraphrase sentences while preserving their contextual meaning, leading to better semantic integrity. This approach has been demonstrated to outperform earlier augmentation methods, particularly for tasks involving semantic understanding [13,14,15,16]. Furthermore, edit-distance-based sampling techniques focus on maximizing data variability while maintaining semantic consistency, which is essential for languages with complex inflectional morphology.
Moreover, advancements in multilingual models like mBERT and XLM-R have enabled effective transfer learning across language pairs, even when one of the languages is low-resource. These models utilize knowledge from high-resource languages to improve translation accuracy for low-resource pairs. Despite the success of these models, further research is necessary to examine the long-term stability of transferred knowledge, particularly with regard to its adaptability to different language pairs. Additionally, techniques such as adversarial learning and lexical constraints ensure that generated translations remain both linguistically accurate and scalable [17,18,19,20].

2.3. Federated Learning in Machine Translation

Federated learning (FL) has emerged as a novel approach to machine translation, allowing for decentralized training of models while ensuring data privacy. Unlike traditional centralized training, FL enables updates to be made to the model based on distributed datasets without directly accessing sensitive data. The integration of techniques such as secure aggregation and differential privacy helps protect individual data points, and homomorphic encryption ensures secure communication during model updates [21,22]. While these privacy-preserving techniques are promising, they introduce trade-offs, particularly regarding the balance between model accuracy and privacy.
One of the most compelling aspects of FL in low-resource MT is its potential to leverage linguistic diversity across decentralized datasets. By incorporating region-specific patterns into a global pretrained model, FL can improve translation quality for underrepresented language pairings without compromising data privacy [23,24]. However, the algorithms needed to effectively integrate these regional patterns into the global model remain underexplored, and further research is needed to optimize this process.
Additionally, FL addresses the challenge of data heterogeneity that often arises in low-resource settings. By balancing local and global model updates, FL provides a solution to the variations in dataset quality and structure that typically hinder traditional MT approaches. This decentralized approach not only reduces computational costs but also minimizes communication latency, making FL a promising solution for low-resource MT systems that also require stringent privacy protections.

2.4. Advancements in Pretrained Transformer Models and Dataset Aggregation

The rise of pretrained Transformer models has been a game-changer for low-resource machine translation. These models use transfer learning to leverage large datasets from high-resource languages, applying that knowledge to improve translations for low-resource pairs. As demonstrated in a study, fine-tuning specific layers, such as the cross-attention layers, can produce results similar to full model fine-tuning while reducing storage and computational costs [25]. The Shared Layer Shift (SLaSh) method, which targets specific layers for each task, further optimizes performance, particularly for low-resource languages [26].
Recent developments in Low-Rank Adaptation (LoRA) have further enhanced the scalability and computational efficiency of these models. By reducing the number of trainable parameters, LoRA has made it possible to scale pretrained Transformer models more effectively, even for low-resource languages [27]. Moreover, dynamic dataset aggregation, where domain-specific data are continuously incorporated into multilingual models, has proven effective in improving translation quality. For example, combining backtranslation with layer-wise coordination has shown promise in aligning context and improving translation accuracy for resource-limited languages. These developments offer great promise for cross-language communication, making efficient use of computational resources without degrading joint translation performance [28,29]. The adaptMLLM method, which injects domain-specific knowledge into general-purpose multilingual models, has also demonstrated improvements in translation performance for historically low-resourced languages [30,31].
Despite these advancements, significant gaps remain in the literature. The long-term impact of continuous dataset aggregation on model performance and stability is still unclear. This lack of understanding can limit the optimization of the process and potentially lead to inconsistent translation quality. Moreover, the combination of fine-tuning strategies, data augmentation methods, and federated learning techniques, which could offer a comprehensive solution for low-resource MT, is underexplored. Future research should focus on finding the best ways to integrate these techniques to enhance translation capabilities.

3. Methodology

Our approach focuses on cross-lingual learning and dynamic dataset aggregation to improve translation for low-resource languages, particularly Twi. We leverage pretrained multilingual Transformer models (e.g., mT5) and apply a dynamic weighting scheme to prioritize the low-resource language while still benefiting from high-resource language data.

3.1. General Setup

Our research addresses machine translation (MT) challenges for low-resource languages by developing a robust translation model through federated learning, termed CLOF (Cross-Lingual Optimization Framework). As illustrated in Figure 2, this approach leverages multiple datasets $\{D_i\}_{i=1}^{n}$, $n \geq 2$, representing various language distributions, including high-resource languages such as English and low-resource languages like Twi. Formally, the goal is to minimize the expected loss over the distribution $D$ representing the target language pair (English-Twi), defined as follows:
$$f_{D}(x) = \mathbb{E}_{\xi \sim D}\left[f(x, \xi)\right]$$
where $x \in \mathbb{R}^{d}$ denotes the model parameters, $\xi$ is a sample from the distribution $D$, and $f(x, \xi)$ represents the loss function for the model on sample $\xi$. Given the unknown true data distribution $D$, we approximate $D$ using a finite dataset $\hat{D}$ drawn from $D$, referred to as the target dataset. The empirical loss is then as follows:
$$f_{\hat{D}}(x) = \frac{1}{|\hat{D}|} \sum_{\xi \in \hat{D}} f(x, \xi)$$

3.2. Dataset

This study uses a carefully curated dataset of 6043 English-Twi parallel sentence pairs, divided into 80% training data and 20% validation data for diverse linguistic representation. The target dataset $D_1$ consists of data sampled from the low-resource distribution $D$ and is pivotal for direct optimization on Twi translation tasks. The auxiliary datasets $\{D_i\}_{i=2}^{n}$ are derived from high-resource languages, such as English, to enable cross-lingual transfer learning. These auxiliary corpora facilitate knowledge transfer from well-represented languages to Twi, significantly improving translation quality.
The English-Twi data were compiled from a mix of publicly available bilingual resources and manually curated translations. A major portion of the corpus was extracted from the Seventh-day Adventist Sabbath School quarterly books, which address a broad array of life-related topics. These books, published four times per year, served as a rich bilingual resource spanning 2019 to 2022, resulting in 12 editions. The sentence-level alignment was performed manually in collaboration with native Twi speakers and language experts, following structured curation guidelines:
  • Semantic alignment: Only sentence pairs with clear contextual equivalence were retained;
  • Direct one-to-one mapping: Ensured consistent parallelism across the corpus;
  • Noise reduction: Irrelevant commentary, redundancies, or non-alignable segments were excluded;
  • Normalization: Variations in punctuation, casing, and formatting were standardized for consistency.
This rigorous process ensures that the dataset captures both formal written structures and colloquial expressions, exposing the model to a wide range of syntactic and semantic patterns. Although modest in size, the dataset is highly task-specific and plays a critical role in advancing low-resource machine translation research and supporting inclusive NLP development for underrepresented languages like Twi.
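To make the split above concrete, the following minimal Python sketch shows how a parallel corpus of this kind could be divided into 80% training and 20% validation data; the file name, column names, and random seed are illustrative assumptions rather than details from the released corpus.

```python
# Minimal sketch of the 80/20 train/validation split described above;
# the file name, column names, and seed are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

pairs = pd.read_csv("english_twi_pairs.tsv", sep="\t", names=["english", "twi"])

train_df, val_df = train_test_split(
    pairs,
    test_size=0.2,    # 20% held out for validation
    random_state=42,  # fixed seed for reproducibility (assumed)
    shuffle=True,
)

print(f"Training pairs: {len(train_df)}, validation pairs: {len(val_df)}")
```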

3.3. Data Preprocessing

The English and Twi datasets underwent rigorous preprocessing techniques to improve machine translation model quality by addressing common text data challenges like noise and sentence length variability. Specifically, non-ASCII characters were removed, and all text was converted to lowercase to standardize the input and avoid inconsistencies caused by case sensitivity or special characters. The cleaning process can be formalized as follows:
$$x_{\text{cleaned}} = \text{remove\_non\_ascii}(x)$$
where $x$ represents the input text, and the function $\text{remove\_non\_ascii}(x)$ removes any characters outside the ASCII range. After non-ASCII characters are removed, all text is converted to lowercase:
$$x_{\text{lower}} = \text{lowercase}(x_{\text{cleaned}})$$
where the function $\text{lowercase}(x_{\text{cleaned}})$ converts the cleaned text to lowercase. Sentences were also tokenized using the T5 tokenizer, which is designed to split input text into smaller subword units. The tokenization was performed with a maximum sequence length of 256 tokens, ensuring that each sentence conforms to the model’s input constraints. The tokenization procedure is mathematically represented as follows:
$$T_{\text{sentence}} = \text{T5\_tokenizer}(\text{sentence}, \ \text{max\_length} = 256)$$
where $T_{\text{sentence}}$ represents the tokenized sentence, and $\text{T5\_tokenizer}$ is the tokenizer function from the T5 model, ensuring all sentences do not exceed 256 tokens. To further maintain linguistic relevance, tokenized sequences with lengths outside the acceptable range (i.e., either too short or too long) were filtered out. Sentences with fewer than 5 tokens or more than 256 tokens were excluded to prevent noise in the training process. This filtering condition can be formally expressed as follows:
$$\text{Filter}(T_{\text{sentence}}) = \begin{cases} 1 & \text{if } 5 \leq |T_{\text{sentence}}| \leq 256 \\ 0 & \text{otherwise} \end{cases}$$
where $|T_{\text{sentence}}|$ represents the number of tokens in the sentence. Sentences are retained only if they satisfy $5 \leq |T_{\text{sentence}}| \leq 256$. The English and Twi sentences were then carefully aligned to ensure that each English sentence corresponds to its accurate Twi translation. This step is crucial for the creation of a reliable parallel corpus. The alignment process can be defined as follows:
$$\text{Align}(E_i, T_i) = \begin{cases} 1 & \text{if the English and Twi sentences are contextually and linguistically aligned} \\ 0 & \text{otherwise} \end{cases}$$
where $E_i$ and $T_i$ represent the $i$-th English and Twi sentences, respectively. Only sentence pairs for which $\text{Align}(E_i, T_i) = 1$ were retained. Finally, inconsistent punctuation marks and redundant spaces were corrected to improve text consistency. For example, multiple spaces were replaced with a single space, and non-standard punctuation marks were corrected to match conventional formatting. This can be formally represented as follows:
$$x_{\text{normalized}} = \text{normalize\_spaces\_and\_punctuation}(x_{\text{lower}})$$
where the function $\text{normalize\_spaces\_and\_punctuation}(x_{\text{lower}})$ ensures that all redundant spaces and non-standard punctuation are corrected.
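The preprocessing steps above can be summarized in a short Python sketch, assuming the HuggingFace tokenizer for an mT5 checkpoint; the helper function names mirror the formulas, and the checkpoint name and example sentences are illustrative assumptions.

```python
# Minimal sketch of the preprocessing pipeline described above; helper names mirror
# the formulas and the mT5 checkpoint name is an assumption.
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

def remove_non_ascii(text: str) -> str:
    # Drops characters outside the ASCII range (note: Twi letters such as ɛ/ɔ are
    # non-ASCII, so in practice this step would likely be relaxed for the Twi side).
    return text.encode("ascii", errors="ignore").decode("ascii")

def normalize_spaces_and_punctuation(text: str) -> str:
    # Collapse repeated whitespace and standardize common punctuation variants.
    text = re.sub(r"\s+", " ", text).strip()
    return text.replace("“", '"').replace("”", '"').replace("’", "'")

def keep_pair(english: str, twi: str, min_len: int = 5, max_len: int = 256) -> bool:
    """Apply cleaning, lowercasing, tokenization, and the 5–256 token length filter."""
    cleaned = [normalize_spaces_and_punctuation(remove_non_ascii(s).lower())
               for s in (english, twi)]
    lengths = [len(tokenizer(s, max_length=max_len, truncation=True)["input_ids"])
               for s in cleaned]
    return all(min_len <= n <= max_len for n in lengths)

print(keep_pair("How are you?", "Wo ho te sɛn?"))
```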

3.4. Baseline Model

Our baseline model is mT5 (Multilingual T5), a state-of-the-art multilingual Transformer-based model designed to perform a variety of natural language processing (NLP) tasks by treating all tasks as text-to-text problems. The model is a multilingual variant of the original T5 (Text-to-Text Transfer Transformer), which was initially designed to work primarily with the English language. mT5 extends the architecture of T5 by enabling it to handle tasks in over 100 languages, making it a powerful tool for cross-lingual transfer learning and multilingual NLP applications.
The mT5 model shares the same architecture as T5, which is based on the Transformer model introduced by Vaswani et al. [32]. The Transformer architecture features self-attention mechanisms and feed-forward neural networks, enabling effective processing of sequential data. The self-attention helps capture long-range dependencies in sequences, while feed-forward networks perform non-linear transformations. Like T5, mT5 operates within a text-to-text framework. Here, input and output sequences are treated as text, and tasks like text classification, question answering, summarization, and machine translation are all framed as text generation problems. This unified approach simplifies task handling and allows mT5 to be fine-tuned across diverse natural language processing tasks.
The model is pretrained on the mC4 (Multilingual C4) dataset, which is a multilingual version of the C4 dataset used for T5. The mC4 dataset contains text data in over 100 languages, including high-resource languages like English, Spanish, French, and Chinese, as well as a variety of low-resource languages such as Swahili, Marathi, and Igbo. These diverse training data allow mT5 to learn general linguistic representations that can be applied across different languages, enabling cross-lingual transfer. Additionally, mT5 is scalable and available in different sizes (e.g., mT5-small, mT5-base, mT5-large), making it adaptable to various computational needs. mT5, like T5, is pretrained using the denoising objective. During pretraining, a portion of the input text is randomly masked, and the model is tasked with predicting the missing tokens. This is performed by leveraging a span-based denoising objective, where spans of text are masked, and the model must recover them. The objective is formally expressed as follows:
$$\mathcal{L} = - \sum_{i \in \text{masked positions}} \log p\left(x_i \mid x_{\setminus i}\right)$$
where $x$ is the input sequence, $x_i$ represents a masked token, and $x_{\setminus i}$ denotes the surrounding unmasked context. This pretraining task enables the model to learn contextual relationships between words in a sentence, improving its ability to handle downstream NLP tasks.
Once pretrained, mT5 can be fine-tuned on specific downstream tasks, such as machine translation, sentiment analysis, or text summarization. The model is fine-tuned using task-specific datasets, where the input–output pairs are defined according to the task requirements. For example, in machine translation, the input might be a sentence in English, and the output would be the corresponding translation in Twi. During fine-tuning, the model adapts its weights based on the specific characteristics of the task while still retaining the knowledge learned during pretraining. This transfer learning approach allows mT5 to perform well on tasks with limited training data, particularly for languages with fewer resources. Table 1 represents the experimental setup and key details of the model’s configuration and evaluation.
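As a hedged illustration of this fine-tuning setup, the snippet below loads a pretrained mT5 checkpoint and computes the loss for one English-Twi pair framed as text-to-text; the checkpoint name, the task prefix, and the example pair are assumptions for illustration only.

```python
# Sketch of loading mT5 and computing the loss for one English→Twi training pair;
# the checkpoint name, task prefix, and example pair are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

source = "translate English to Twi: How are you?"  # input framed as text-to-text
target = "Wo ho te sɛn?"                            # reference translation

inputs = tokenizer(source, return_tensors="pt", max_length=256, truncation=True)
labels = tokenizer(target, return_tensors="pt", max_length=256, truncation=True).input_ids

# A single fine-tuning step: the model returns the cross-entropy loss directly.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
print(f"loss: {outputs.loss.item():.4f}")
```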

3.5. Algorithmic Framework

To address data scarcity in $D_1$ and utilize the rich linguistic features in the auxiliary datasets, we introduce a federated learning approach, CLOF, as mentioned earlier. Inspired by the principles of personalized federated learning, CLOF leverages cross-lingual learning by aggregating gradients from both high-resource (English) and low-resource (Twi) datasets. The approach dynamically adjusts the contribution of each dataset during training; it prioritizes learning for the low-resource language while still leveraging the strengths of high-resource datasets to enhance overall model performance. The CLOF model updates are governed by the following:
$$M^{r+1} = M^{r} + \frac{\sum_{c=1}^{C} \alpha_c \, \Delta M_c^{r}}{\sum_{c=1}^{C} \alpha_c}$$
where $M^{r}$ is the global model at iteration $r$, $\Delta M_c^{r}$ is the local update from client $c$, and $\alpha_c$ is the aggregation weight for client $c$. These weights are adjusted to prioritize updates from the low-resource client, ensuring that updates from high-resource languages do not overwhelm the learning process. At each iteration, the global model is updated by aggregating the local updates from all clients, each weighted by $\alpha_c$; because updates from the low-resource dataset (Twi) receive larger weights, the high-resource dataset (English) cannot dominate training. The result is a more personalized model that adapts better to the specific characteristics of the target language.
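The update rule above can be sketched in a few lines of PyTorch. The code below is a simplified, hypothetical illustration of weighted aggregation of per-client parameter deltas; the toy parameters, client updates, and weight values are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the CLOF aggregation step:
# M^{r+1} = M^r + (sum_c alpha_c * dM_c^r) / (sum_c alpha_c).
import torch

def aggregate_updates(global_params, client_updates, client_weights):
    """Combine per-client parameter deltas into new global parameters.

    global_params:  {name: tensor}, the current global model M^r
    client_updates: list of {name: tensor}, local deltas dM_c^r
    client_weights: list of floats, aggregation weights alpha_c
    """
    total = sum(client_weights)
    return {
        name: param + sum(w * upd[name] for w, upd in zip(client_weights, client_updates)) / total
        for name, param in global_params.items()
    }

# Toy example with a two-parameter "model" and two clients (English, Twi);
# the low-resource (Twi) client is given a larger assumed weight.
global_params = {"w": torch.zeros(3), "b": torch.zeros(1)}
english_update = {"w": torch.tensor([0.3, 0.3, 0.3]), "b": torch.tensor([0.1])}
twi_update = {"w": torch.tensor([-0.2, 0.4, 0.1]), "b": torch.tensor([0.2])}

updated = aggregate_updates(global_params, [english_update, twi_update], [1.0, 2.0])
print(updated["w"], updated["b"])
```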

3.6. Gradient Weighting for Cross-Lingual Learning

In CLOF, we define two primary client datasets: the high-resource client ($D_{HR}$) and the low-resource client ($D_{LR}$). The high-resource client dataset comprises a large-scale English corpus that provides robust language representation and serves as a source for cross-lingual transfer, while the low-resource client dataset contains a smaller Twi corpus. CLOF assigns a higher aggregation weight ($\alpha_{LR}$) to updates from this client, ensuring that Twi signals are amplified during gradient aggregation. At each client, local updates are computed using the loss function $L_c$, specific to the client’s data, as follows:
$$\Delta M_c^{r} = -\eta \, \nabla L_c\left(M^{r}, D_c\right)$$
where $\eta$ is the learning rate. The aggregation step balances the gradients to maximize the contribution of $D_{LR}$:
$$M^{r+1} = M^{r} + \frac{\alpha_{HR} \, \Delta M_{HR}^{r} + \alpha_{LR} \, \Delta M_{LR}^{r}}{\alpha_{HR} + \alpha_{LR}}$$
Dynamic adjustment of $\alpha_c$ allows the framework to adapt as training progresses, ensuring that the low-resource dataset continues to receive adequate attention. This equation illustrates how the model aggregates the gradients from both high-resource and low-resource clients, with the weights $\alpha_{HR}$ and $\alpha_{LR}$ controlling the contribution from each dataset. Adjusting these weights throughout training keeps the low-resource language dataset sufficiently influential during learning, which is crucial for improving translation quality for Twi and other low-resource languages. The full procedure is laid out below (see Algorithm 1).
Algorithm 1: CLOF for low-resource machine translation
[Algorithm 1 pseudocode is presented as an image in the published article.]
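To make the per-round procedure of Algorithm 1 concrete, the hedged Python sketch below mirrors Sections 3.5-3.7: compute local updates on the English and Twi batches, adjust the aggregation weights, and merge the updates into the global model. The loss-based weight adjustment shown here is an illustrative assumption, not the exact heuristic used in the paper.

```python
# Hypothetical sketch of one CLOF training round (broadcast, local updates,
# dynamic reweighting, aggregation); the reweighting heuristic is an assumption.
import copy
import torch

def local_update(model, batch, lr=1e-4):
    """Compute a local parameter delta on one client's batch (gradient-descent step)."""
    local = copy.deepcopy(model)
    loss = local(**batch).loss
    loss.backward()
    delta = {n: -lr * p.grad for n, p in local.named_parameters() if p.grad is not None}
    return delta, loss.item()

def clof_round(model, english_batch, twi_batch, alpha_hr=1.0, alpha_lr=2.0):
    """One federated-like round with a higher weight for the low-resource (Twi) client."""
    delta_hr, loss_hr = local_update(model, english_batch)
    delta_lr, loss_lr = local_update(model, twi_batch)

    # Assumed heuristic: scale the Twi weight by its share of the total loss,
    # so the harder (low-resource) client keeps influencing the global model.
    alpha_lr = alpha_lr * 2 * loss_lr / (loss_hr + loss_lr)

    total = alpha_hr + alpha_lr
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in delta_hr and name in delta_lr:
                param += (alpha_hr * delta_hr[name] + alpha_lr * delta_lr[name]) / total
    return loss_hr, loss_lr

# Usage (assuming `model` is a seq2seq model and each batch contains input_ids,
# attention_mask, and labels):
#   for r in range(num_rounds):
#       clof_round(model, next(en_batches), next(twi_batches))
```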

3.7. Optimization and Training Process

The training process begins by broadcasting the global model $M^{r}$ to all clients at the start of each training round. This initial global model serves as the baseline from which each client will further customize and improve the model based on its specific dataset, whether it is high-resource (e.g., English) or low-resource (e.g., Twi). Each client independently computes local updates using a loss function $L_c$ that is specifically tailored to reflect the linguistic characteristics and needs of the data it holds. For the low-resource clients, such as those handling the Twi dataset, a higher aggregation weight $\alpha_{LR}$ is assigned to the updates. This weighting approach is crucial as it ensures that signals from the Twi language are adequately represented and do not get diluted by the more dominant linguistic features of the high-resource languages. The local update from each client is computed as shown in Equation (11). After all clients have computed their updates, these are sent back to the central server, where they are aggregated to form the new global model. The aggregation is carefully weighted to balance and maximize the contribution of both high-resource and low-resource datasets, as outlined in Equation (12).
CLOF not only enables the model to effectively converge but also guarantees that the translation quality for low-resource languages like Twi is substantially enhanced through this structured optimization and training process, as demonstrated by metrics such as SpBLEU, WER, and ROUGE. Therefore, this training methodology not only resolves the imbalance in data resources but also establishes the foundation for the expansion of the solution to other low-resource languages, thereby offering a generalized, robust solution to multilingual machine translation challenges.

3.8. Implementation Details

The computational setup for this study utilized Google Colab Pro (CUDA 11.2) with an NVIDIA Tesla A100 GPU to ensure efficient training and evaluation. The model implementation and fine-tuning were carried out using the HuggingFace Transformers library (version 4.10.0), while PyTorch (version 1.9.0) was employed for training and evaluation. The ‘Evaluate’ library was used to compute the relevant performance metrics, as detailed in Section 3.9. To manage memory effectively, gradient clipping was applied during training to prevent exploding gradients. The model was trained for a total of 100 epochs, with no early stopping criteria employed. Instead, a fixed number of epochs was used to ensure thorough fine-tuning, particularly on the low-resource language data (Twi). During training, the dynamic aggregation weights were adjusted using a heuristic approach. Specifically, higher weights were assigned to the low-resource language (Twi) to enhance its translation accuracy. These weights were dynamically adjusted based on the performance of the model in both high-resource (English) and low-resource (Twi) languages. A batch size of 8 sentences was used to balance computational efficiency with memory constraints.
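A hedged sketch of a HuggingFace training configuration consistent with these details is shown below; the argument values mirror the reported settings (100 epochs, batch size 8, gradient clipping), while the output directory and learning rate are placeholders.

```python
# Illustrative training configuration matching the reported settings
# (100 epochs, batch size 8, gradient clipping); paths and learning rate are placeholders.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./mt5-en-twi",        # placeholder output path
    num_train_epochs=100,             # fixed number of epochs, no early stopping
    per_device_train_batch_size=8,    # batch size of 8 sentences
    per_device_eval_batch_size=8,
    max_grad_norm=1.0,                # gradient clipping to prevent exploding gradients
    learning_rate=1e-4,               # assumed value; not specified in the text
    predict_with_generate=True,       # generate translations during evaluation
    logging_steps=100,
    save_strategy="epoch",
)
```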

3.9. Evaluation Metrics

To thoroughly assess the effectiveness of our translation models, we utilized a collection of well-established metrics that are frequently applied in machine translation studies. By comparing the produced translations with the reference translations along several dimensions, these metrics enable us to evaluate translation quality.

3.9.1. SpBLEU (Sentence-Piece BLEU)

SpBLEU is crucial for evaluating translation quality at the sentence level, especially in low-resource languages like Twi, where structure and fluency can vary significantly. Unlike traditional BLEU, SpBLEU operates on subword units, making it particularly effective in handling morphologically rich languages. It captures n-gram precision, ensuring that translations maintain lexical similarity while penalizing overly short outputs through a brevity penalty. This helps in assessing how well the model generates fluent and contextually appropriate translations for Twi. This fine-grained evaluation provides insights into the model’s strengths and weaknesses, helping to identify areas for improvement in translating complex sentences [33,34]. Formally, the SpBLEU score for a given sentence can be expressed as follows:
$$\text{SpBLEU}(S) = \text{BP} \times \left( \prod_{n=1}^{N} \text{precision}_n(S) \right)^{1/N}$$
where $S$ is the sentence being evaluated, $\text{precision}_n(S)$ is the modified precision for n-grams of order $n$ in the sentence $S$, $N$ is the maximum n-gram order considered (commonly 4), and $\text{BP}$ is the brevity penalty, which penalizes candidate translations that are shorter than the reference.
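For reference, SpBLEU can be computed with the sacrebleu package and its SentencePiece-based tokenizer, as in the hedged sketch below; the tokenizer flag and example sentences are illustrative assumptions rather than details reported in the paper.

```python
# Sketch of a corpus-level SpBLEU computation with sacrebleu's SentencePiece tokenizer;
# the tokenizer flag and sentences are illustrative (older versions use "spm"/"flores101").
import sacrebleu

hypotheses = ["wo ho te sɛn?"]      # model outputs
references = [["Wo ho te sɛn?"]]    # one reference stream covering all hypotheses

spbleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"SpBLEU: {spbleu.score:.2f}")
```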

3.9.2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

The study uses ROUGE to assess the model’s ability to recover meaningful content from reference translations, particularly in Twi, a low-resource language. Whereas SpBLEU focuses on n-gram precision, ROUGE emphasizes recall. This is especially important for ensuring that translations are not only accurate in terms of individual words but also preserve the overall meaning. ROUGE includes several variants:
  • ROUGE-1: Measures unigram recall, i.e., the overlap of individual words;
  • ROUGE-2: Measures bigram recall, focusing on consecutive word pairs;
  • ROUGE-L: Measures the longest common subsequence (LCS), capturing longer sequences of words that preserve sentence structure [35,36].
ROUGE helps assess whether the model generates translations that are both fluent and faithful to the original meaning.
$$\text{ROUGE-N} = \frac{\sum_{s \in S} \text{Recall}(N, s)}{\sum_{s \in S} \text{Reference\_Ngrams}(s)}$$
where $S$ is the set of sentences in the predicted translation, $\text{Recall}(N, s)$ is the recall of N-grams (unigrams, bigrams) for sentence $s$ in the predicted translation, and $\text{Reference\_Ngrams}(s)$ is the number of N-grams in the corresponding reference sentence.
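The ROUGE variants above can be computed with the HuggingFace `evaluate` library, as in the hedged sketch below; the example sentences are illustrative only.

```python
# Sketch of ROUGE-1/ROUGE-2/ROUGE-L computation with the `evaluate` library.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["me pɛ ahobrɛase"]   # illustrative model output
references = ["Me pɛ ahobrɛase"]    # illustrative reference translation

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"], scores["rouge2"], scores["rougeL"])
```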

3.9.3. WER (Word Error Rate)

The study uses WER to assess how closely the predicted translations align with the reference translations at the word level. In contrast to SpBLEU and ROUGE, which rely on n-gram matching and recall, WER is an edit-distance metric, and a lower WER indicates better alignment. WER is computed by counting the substitutions, deletions, and insertions needed to match the sequences, using the following formula:
$$\text{WER} = \frac{S + D + I}{N}$$
where $S$ is the number of substitutions (incorrect words in the predicted output), $D$ is the number of deletions (missing words in the predicted output), $I$ is the number of insertions (extra words in the predicted output), and $N$ is the total number of words in the reference translation [37,38].
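A corresponding hedged sketch for WER, using the `evaluate` wrapper around the jiwer backend, is shown below; the sentences are illustrative.

```python
# Sketch of WER computation with the `evaluate` library (jiwer backend).
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["ɔno ɔkɔɔ fie kɔtenaa adwuma"]   # illustrative poor baseline output
references = ["me pɛ ahobrɛase"]                # illustrative reference

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")   # can exceed 100% when many edits are needed
```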
These metrics offer a comprehensive evaluation of the model’s ability to generate translations that are lexically accurate (via SpBLEU), contextually fluent (via ROUGE), and edit-wise correct (via WER). By employing these three metrics, we ensure a robust and multi-dimensional assessment of the translation quality, which allows for a thorough comparison of different model configurations and techniques.

4. Results

A comparative analysis of the evaluation metrics (SpBLEU, ROUGE, and WER) was conducted across the baseline and fine-tuned models, as summarized in the following sections. These metrics were computed on the test set, with SpBLEU and ROUGE focusing on n-gram precision and recall, respectively, and WER measuring word-level alignment accuracy.

4.1. Quantitative Analysis of Baseline and Fine-Tuned Model Performance

The baseline model, based on the pretrained mT5-small, provided a reference for machine translation capabilities without fine-tuning the target language, Twi. The results (shown in Table 2) reveal that the pretrained mT5 model, while able to generate translations, exhibited suboptimal performance, especially for low-resource languages like Twi. The translation quality and fluency were noticeably weaker, highlighting the limitations of using a general multilingual pretrained model without adaptation to the target language.
Following fine-tuning of the English-Twi model on our parallel corpus, significant improvements were observed across all evaluation metrics (Figure 3). The fine-tuned model outperformed the baseline, demonstrating the effectiveness of transfer learning in enhancing performance for low-resource languages. Fine-tuning allowed the model to better capture the linguistic nuances of Twi, resulting in more accurate and fluent translations. Specifically, there were notable improvements in both SpBLEU and ROUGE scores, reflecting enhanced n-gram precision and overall translation quality. Additionally, the WER score was substantially reduced, indicating better word-level alignment and fewer errors in the predicted translations.
The fine-tuned English-Twi model shows significant improvements across all evaluation metrics. The SpBLEU score increased from 2.16% to 71.30%, demonstrating much better sentence-level alignment with reference translations. ROUGE-1 (unigram recall) improved by 50.01 percentage points (from 15.23% to 65.24%), and ROUGE-2 (bigram recall) improved by 53.04 percentage points (from 7.18% to 60.22%), reflecting enhanced word and phrase matching. The ROUGE-L score, which evaluates sentence fluency, improved by 47.90 percentage points, from 14.22% to 62.12%, indicating better sentence structure. Lastly, WER dropped by 114.84 percentage points (from 183.16% to 68.32%), showing improved alignment and accuracy in the translation. These improvements highlight the effectiveness of fine-tuning for low-resource languages like Twi, resulting in better translation fluency and accuracy. A heat map illustrating these improvements is included in Figure 3.
The reported 183.16% WER for the baseline model reflects unnormalized edit distance calculations for the full test set. This occurs when the system produces extremely poor translations requiring more edits than the reference length. We used the standard WER formula:
$$\text{WER} = \frac{S + D + I}{N} \times 100\%$$
where S = substitutions, D = deletions, I = insertions, and N = reference words. An example calculation from the test set is as follows:
Reference: “Me pɛ ahobrɛase” (4 words);
Baseline output: “Ɔno ɔkɔɔ fie kɔtenaa adwuma” (6 words);
Edit operations: 4 substitutions + 2 insertions.
$$\text{WER} = \frac{4 + 0 + 2}{4} \times 100\% = 150\%$$
We normalized by reference length variability using the following formula:
$$\text{nWER} = \frac{\text{WER}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}}$$
where $y_i$ is the reference length for sample $i$, and $\bar{y}$ is the average reference length across all samples. Table 3 provides both raw and length-normalized scores for further clarity:
We also conducted paired permutation tests (10,000 iterations) comparing model variants as shown in Table 4.
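For readers unfamiliar with the procedure, the hedged sketch below shows one common way to run a paired permutation test over per-sentence scores with 10,000 iterations; the score arrays are placeholders, and the exact statistic used for Table 4 is not specified beyond the description above.

```python
# Sketch of a paired permutation (sign-flip) test on per-sentence metric scores
# with 10,000 iterations; the score arrays below are placeholders.
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_iter=10_000, seed=0):
    """Two-sided p-value for the mean difference between paired score arrays."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diffs.mean())
    count = 0
    for _ in range(n_iter):
        signs = rng.choice([-1, 1], size=diffs.shape)  # randomly flip each paired difference
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / n_iter

# Placeholder per-sentence SpBLEU scores for two model variants.
fine_tuned = np.array([71.2, 68.9, 74.1, 70.5])
baseline = np.array([2.1, 3.4, 1.8, 2.5])
print(f"p-value: {paired_permutation_test(fine_tuned, baseline):.4f}")
```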

4.2. Analysis of Sample Translation

Our research demonstrates that the optimized English-Twi model significantly outperforms the baseline in both accuracy and fluency. Although minor errors occasionally occur—for example, the omission of “ho” in the translation of “I am learning machine translation”—the fine-tuned model consistently produces outputs that are much closer to the reference. For simple sentences like “How are you?” and “Where is the market?”, the translations align perfectly with the reference, yielding natural-sounding Twi. In more complex cases, such as “Machine translation is useful”, the model shows notable improvements in both accuracy and contextual relevance by better handling formal language and precise word usage. While slight fluency issues, such as a marginally more formal tone, were observed in a few instances, these are minimal compared to the baseline performance. A comprehensive comparison of the results is provided in Table 5.

4.3. Scalability of the Model

To evaluate the scalability of our fine-tuned English-Twi model, we examine its performance across different sizes of training data.
Figure 4 highlights the model’s improvement as the training data size increases. With only 500 pairs, the model’s performance is significantly lower, especially in SpBLEU and ROUGE-1. As the data size grows to 1000 and 2000 pairs, there is a noticeable improvement in translation quality, particularly in SpBLEU and ROUGE-1. Most errors at these sizes were grammatical or involved rare vocabulary, indicating room for future work with a larger corpus. Training on the full dataset yields the largest improvement, with SpBLEU rising to 71.30%, ROUGE-1 reaching 65.24%, and WER falling to 68.32%.

4.4. Learning Curves

The learning curves are meant to demonstrate how the model’s performance improves when it is fine-tuned. Throughout training, we can visually evaluate the model’s convergence and task adaptability by monitoring important metrics such as SpBLEU, ROUGE-1, and WER. From the curves illustrated in Figure 5, it is evident that our fine-tuned model consistently outperforms the baseline, demonstrating a clear progression in translation accuracy, fluency, and overall alignment with the reference. This demonstrates how fine-tuning greatly improves the model’s performance, especially for low-resource languages like Twi.

4.5. Impact of Gradient Weighting on Model Performance

Gradient weighting plays a crucial role in multilingual and federated-like training setups by dynamically adjusting the contributions of different languages, ensuring equitable learning across resource levels. Without proper weighting, high-resource languages such as English tend to dominate training, leading to suboptimal performance for low-resource languages like Twi. Our model employs an adaptive weighting approach that accounts for language resource availability, thereby improving alignment with Twi linguistic patterns, as demonstrated in Table 6. We evaluate three weighting strategies: equal weighting, static weighting, and dynamic weighting. Under equal weighting, high-resource languages overwhelm the training process, resulting in poor performance on Twi. Implementing a static weighting strategy with a predefined 1:2 high-resource to low-resource ratio enhances Twi’s performance, confirming that increasing its relative weight improves translation accuracy. However, the most effective approach is dynamic weighting, which optimizes language contributions per training round based on model loss. This strategy yields substantial improvements, achieving a more than 10-fold increase in SpBLEU and reducing the Word Error Rate (WER) from 183.16% to 68.32%. These results demonstrate that gradient weighting is essential for improving translation accuracy and fluency in low-resource language settings, offering a robust solution for enhancing Twi translations in multilingual models.

4.6. Role of Personalization in Model Optimization

In this study, personalization techniques enable the model to prioritize Twi-specific linguistic patterns, mitigating the influence of high-resource language structures that can distort translations. Unlike static multilingual models, which apply uniform training across languages, personalized models dynamically adapt to optimize learning for Twi, capturing its unique syntax and morphology, as demonstrated in Table 7. Without personalization, translations often exhibit unnatural phrasing and incorrect word choices, reflecting interference from dominant high-resource languages. In contrast, incorporating personalization techniques ensures that translations align more closely with Twi linguistic norms, enhancing both syntactic correctness and overall fluency. These findings highlight the critical role of personalization in improving translation quality for low-resource languages within multilingual modeling frameworks.

4.7. Error Analysis in Translation

Several kinds of errors that affected translation quality were found in the fine-tuned English-Twi translation model. Grammatical errors were detected when the model employed improper grammatical forms or misinterpreted phrase patterns. As illustrated in Table 8, for instance, in the sentence “The cat is under the table”, the model incorrectly translated “cat” as “ɛnɔma” (object), while the reference used “Pɔnkɔ” (horse), indicating that a wrong subject was used. Similarly, in “I will eat tomorrow”, the model used the wrong tense, translating “tomorrow” as “nnɛ” (today) instead of “ɔkyena” (tomorrow), highlighting a tense error. Lexical errors were seen when the model used incorrect or overly general word choices, such as translating “teacher” as “mpɔtam sika” instead of “kyerɛkyerɛfoɔ”, leading to unnatural phrasing. Additionally, in “This book is interesting”, the model used a phrase (“Ɛhɔ nneɛma no yɛ fɛfɛɛfɛ”) that was overly broad and did not match the specific meaning intended by the reference, reflecting overgeneralized vocabulary. Fluency errors occurred when the model failed to capture adverbs or modifiers accurately, as seen in “The child is running fast”, where the fine-tuned translation omitted the adverb “ntɛm ara”, resulting in a missing adverb. These errors highlight the difficulties the model faces in handling syntactic, lexical, and fluency issues in a low-resource language like Twi. Identifying these patterns provides insight into areas for further improvement in translation accuracy.

4.8. Comparison with Other Studies

In comparison to other studies, our approach demonstrates significant improvements in key translation metrics. As noted in Table 9, while previous studies, such as those by Fan et al. (2019) [39], Paulus et al. (2017) [40], and Nallapati et al. (2016) [41], have made notable contributions to various machine translation techniques, our study excels in SpBLEU, ROUGE, and WER, outperforming these works in translation accuracy, content preservation, and structural alignment. Specifically, our fine-tuned model achieves a 71.30 SpBLEU score, significantly higher than the other studies, indicating superior sentence-level translation. Additionally, our approach’s use of cross-lingual learning and dynamic dataset aggregation has led to more efficient utilization of high-resource language data, thereby enhancing performance for low-resource languages like Twi. While further improvements are still needed for rare words and phrases, our model’s performance across these evaluation metrics highlights the effectiveness of our proposed methods in the context of low-resource language translation.

4.9. Comparison with GPT-3.5 Using Dataset-Derived Prompts

To provide a comprehensive benchmark, we compared our fine-tuned English-Twi translation model against OpenAI’s GPT-3.5 and GPT-4, employing both zero-shot and few-shot prompting strategies using examples drawn directly from our dataset.
In the zero-shot setting, GPT-3.5 was given only the following instruction: “Translate the following sentence to Twi: <input_sentence>”.
In the few-shot setup, each model was provided with three English-Twi examples from the training dataset, followed by a new English sentence. This method simulates in-context learning, allowing the models to generalize with minimal examples.
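As an illustration of how such few-shot prompts could be assembled from the training data, the hedged sketch below builds a three-example prompt; the example pairs and prompt wording are placeholders, not the exact prompts used in the experiments.

```python
# Sketch of assembling a three-example few-shot prompt for English→Twi translation;
# the pairs below are placeholders standing in for examples drawn from the training set.
few_shot_pairs = [
    ("How are you?", "Wo ho te sɛn?"),
    ("I like humility.", "Me pɛ ahobrɛase."),
    ("<third English sentence>", "<its Twi reference>"),
]

def build_few_shot_prompt(new_sentence: str) -> str:
    lines = ["Translate the following sentences from English to Twi."]
    for en, tw in few_shot_pairs:
        lines.append(f"English: {en}\nTwi: {tw}")
    lines.append(f"English: {new_sentence}\nTwi:")
    return "\n\n".join(lines)

print(build_few_shot_prompt("Machine translation is useful."))
```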
Table 10 shows a side-by-side evaluation using SpBLEU, ROUGE-L, WER, and a new metric called TFS (Twi Fluency Score), which measures grammatical fluency and syntactic coherence in Twi translations on a scale of 0–100 (manually rated by two native speakers and averaged).
Our evaluation of GPT-3.5 and GPT-4 under both zero-shot and few-shot conditions revealed distinct patterns in their ability to handle English-to-Twi translation. In the zero-shot setting, GPT models struggled considerably with Twi’s agglutinative morphology. Common morphemes such as “mepɛ” (“I like”) were often mistranslated or broken into semantically incorrect tokens. Furthermore, zero-shot GPT-3.5 frequently produced translations with severe inaccuracies—highlighted by a WER exceeding 100%, suggesting that more edits were required than the length of the reference. This level of error typically occurred in over 30% of the test sentences, largely due to omission, hallucinated content, or structural breakdowns.
On the other hand, few-shot prompting significantly improved performance across the board. GPT-4, for instance, gained over 14 BLEU points when exposed to just three English-Twi example pairs before translating a new input. This demonstrates the effectiveness of in-context learning for moderate-length sentences. However, even in the few-shot setup, GPT models were most reliable when translating short, declarative statements (typically under 15 tokens), and their performance declined with longer or interrogative constructions. Additionally, both GPT-3.5 and GPT-4 consistently failed to translate culturally embedded terms like “Onyankopɔn” (God) and struggled with Twi’s verb serialization—a linguistic structure where multiple verbs are combined to express sequential actions—resulting in a 67% error rate on such constructs.
Overall, our fine-tuned mT5 model significantly outperformed both GPT-3.5 and GPT-4 in all evaluation metrics. It achieved superior SpBLEU and ROUGE-L scores, reflecting better lexical precision and alignment with human references. The WER dropped to 68.32%, in contrast to the 121.67% seen in GPT-3.5 zero-shot mode, indicating fewer word-level errors. Furthermore, the Twi Fluency Score (TFS), based on native speaker judgment, confirmed that our fine-tuned model produced more coherent and culturally accurate translations. These results reinforce the importance of task-specific, language-aware training, especially in low-resource scenarios where general-purpose models like GPT are prone to errors in morphology, grammar, and cultural nuance.

5. Discussions and Contributions

The improvements observed in our fine-tuned English-Twi model can be attributed to the innovative strategies we implemented, particularly cross-lingual learning and a federated-like training approach. These techniques allowed the model to effectively adapt to the unique linguistic features of Twi, a low-resource language, by leveraging a larger and more diverse training corpus that included English data. The cross-lingual learning method facilitated the transfer of knowledge from high-resource languages like English to low-resource languages such as Twi, leading to significant gains in both accuracy and fluency. By fine-tuning a pretrained mT5 model, we were able to leverage multilingual data to capture shared representations between the two languages, thereby enhancing translation quality across multiple evaluation metrics, including SpBLEU, ROUGE, and WER.
Additionally, our federated-like setup, where the model was trained on separate English and Twi batches, introduced a degree of personalization that further optimized translation performance for Twi. This personalization, coupled with gradient weighting during training, enabled the model to effectively address the imbalance between high-resource and low-resource languages. As a result, our approach significantly enhances the performance of natural language processing (NLP) for underrepresented languages like Twi, which has historically been overlooked in machine translation research and applications. This advancement provides a more robust translation system applicable in real-world contexts such as language learning, communication tools, and content generation, thereby improving access to technology for speakers of low-resource languages.

5.1. Application Beyond Twi: Potential for Other Low-Resource Languages

The practical implications of our work are far-reaching. By addressing the challenge of machine translation for underrepresented languages, particularly African languages like Twi, our model has the potential to significantly improve translation services and bridge communication gaps, especially in regions with large non-English-speaking populations. This can have wide-ranging applications across international communication, education, health services, and governance. Additionally, the cross-lingual transfer learning method we employed, which leverages high-resource language data, can be extended to other language pairs, making it applicable in situations where data disparity exists between the languages being translated.
This framework can also be adapted to a variety of low-resource languages, particularly those in Africa, Asia, and Latin America, contributing to the development of scalable multilingual NLP technologies. Beyond translation, the fine-tuned model can be integrated into multilingual chatbots and virtual assistants, significantly enhancing their ability to communicate in low-resource languages. This would broaden the reach and functionality of such systems in diverse linguistic contexts. The model could also be utilized in automated document translation systems, making documents more accessible to non-English speakers in critical sectors such as legal, healthcare, and governance, thereby improving communication and service delivery in multilingual communities.
The implications of this research extend to several audiences. For the research community, our work introduces a new approach to improving machine translation for underrepresented languages, opening avenues for further exploration in cross-lingual learning and data augmentation techniques. Researchers in federated learning and NLP can gain valuable insights on how to fine-tune models for resource-scarce languages using high-resource datasets. For policymakers and governments, especially those working in multilingual contexts, our advancements provide the potential to enhance communication and service delivery to speakers of low-resource languages. Educational institutions in multilingual countries can also benefit, as this research could help provide more equitable access to educational materials through improved machine translation systems. The tech industry can apply these findings to enhance AI-driven language translation tools, improving inclusivity and global accessibility in fields such as healthcare, business, and customer service.
Beyond its technical impact, this research holds significant societal implications, particularly in cultural preservation and digital inclusivity. Many low-resource languages, such as Twi, encapsulate rich cultural histories, oral traditions, and indigenous knowledge, yet they remain largely underrepresented in digital and technological spaces. By improving machine translation for these languages, our approach helps bridge the digital divide, ensuring that speakers of underrepresented languages can actively participate in online discourse. This fosters linguistic identity, educational empowerment, and access to essential information in healthcare, governance, and legal sectors. Furthermore, enhanced NLP capabilities for such languages support economic inclusion, enabling businesses and public institutions to communicate effectively with diverse linguistic communities. As digital transformation continues to shape global interactions, ensuring linguistic inclusivity will be crucial for building a more equitable technological landscape.
This study represents a significant contribution to the field of low-resource language processing, demonstrating how contemporary NLP techniques can be leveraged to improve translation systems for underrepresented languages. Its broad implications for real-world applications and future research further highlight its importance in advancing both technology and social equity.

5.2. Comparison with Other Personalization Approaches

In this study, we introduce the Cross-Lingual Optimization Framework (CLOF), which dynamically adjusts gradient aggregation weights to enhance translation for low-resource languages. To better understand its significance, it is essential to compare CLOF with other personalization methods, such as meta-learning and domain adaptation, which aim to improve model generalization across tasks and domains. Meta-learning, for example, enables a model to learn how to adapt to new tasks quickly, but it often requires additional meta-learners and task-specific adaptation mechanisms, which can be computationally expensive and impractical in low-resource settings where data are limited.
In contrast, domain adaptation focuses on fine-tuning models for specific domains using data from related domains. While effective in certain scenarios, domain adaptation relies heavily on sufficient data from the target domain. In low-resource languages, where data are often sparse or noisy, domain adaptation can struggle to perform well, requiring significant preprocessing and fine-tuning for each new domain, limiting its scalability in multilingual contexts.
CLOF offers a lightweight and modular alternative. Unlike meta-learning and domain adaptation, CLOF does not require additional meta-learners or complex task-specific fine-tuning. Instead, it adjusts gradient aggregation weights during training to prioritize the low-resource language while still benefiting from high-resource data. This simplicity makes CLOF especially well suited for low-resource environments where computational resources are limited and the data are imbalanced. By providing a scalable solution for multilingual machine translation, CLOF ensures that models can be adapted more efficiently without requiring complex adjustments or large amounts of task-specific data.
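To make this concrete, the sketch below (Python/PyTorch, assuming a Hugging Face-style seq2seq model whose forward pass returns a loss when labels are provided) illustrates the kind of weighted gradient aggregation CLOF performs. The function name, the default weight of 0.7, and the reuse of the clipping norm from Table 1 are illustrative assumptions rather than the exact implementation.

```python
# Illustrative sketch of CLOF-style weighted gradient aggregation (not the
# authors' released code). Gradients from a low-resource (Twi) batch and a
# high-resource (English) batch are combined with a weight w_lr that favours
# the low-resource direction before a single optimizer step is applied.
import torch


def clof_step(model, optimizer, lr_batch, hr_batch, w_lr=0.7):
    """One update that mixes gradients from both datasets.

    Assumes a Hugging Face-style seq2seq model whose forward pass returns an
    object with a .loss attribute when `labels` are included in the batch.
    The value w_lr = 0.7 is an illustrative default, not a tuned constant.
    """
    per_dataset_grads = []
    for batch in (lr_batch, hr_batch):
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        per_dataset_grads.append(
            [None if p.grad is None else p.grad.detach().clone()
             for p in model.parameters()]
        )

    lr_grads, hr_grads = per_dataset_grads
    optimizer.zero_grad()
    for p, g_lr, g_hr in zip(model.parameters(), lr_grads, hr_grads):
        if g_lr is None and g_hr is None:
            continue  # parameter received no gradient from either batch
        g_lr = torch.zeros_like(p) if g_lr is None else g_lr
        g_hr = torch.zeros_like(p) if g_hr is None else g_hr
        p.grad = w_lr * g_lr + (1.0 - w_lr) * g_hr  # weighted aggregation
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # max norm 1.0, as in Table 1
    optimizer.step()
```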

5.3. Generalizability of This Study to Other Low-Resource Languages

Although our study primarily focused on Twi, the techniques we employed—cross-lingual learning, federated learning, and dynamic dataset aggregation—are theoretically generalizable to other low-resource languages. Cross-lingual learning has already proven successful in transferring knowledge from high-resource languages (like English) to low-resource languages across various domains, thereby enabling better representation of linguistic structures through multilingual models. Several studies have demonstrated how large-scale multilingual models, such as mBART and mT5, can effectively transfer knowledge across languages, improving performance in low-resource settings [43]. Additionally, Conneau et al. (2020) showed that multilingual models trained on high-resource languages can substantially enhance the translation performance of languages with limited data [44].
The key advantage of our approach lies in its ability to adapt the training process by dynamically adjusting the weight assigned to datasets during the fine-tuning phase. This flexibility allows for a scalable solution that can accommodate multiple languages with varying resource levels, further enhancing translation performance for other low-resource languages. Previous studies have explored similar approaches, where language-specific weights were adjusted during training to optimize model performance, particularly in multilingual settings. For example, Agarwal et al. (2021) investigated methods to optimize neural machine translation (NMT) systems for African languages, addressing the challenges of balancing multiple datasets [42].
Moreover, the use of federated-like training methodologies presents a unique opportunity for extending this approach to other low-resource languages. Federated learning, with its focus on decentralizing data processing, eliminates the need for direct access to sensitive or scarce datasets, which is particularly important for low-resource languages where data privacy is often a concern. One study demonstrated that federated learning can effectively combine diverse datasets without compromising privacy, thus enabling the use of multilingual data for low-resource languages without exposing sensitive information [45]. By incorporating dynamic dataset aggregation, our model can prioritize the most valuable data for the low-resource language, ensuring high performance even with limited training data. This adaptability, supported by theoretical evidence from existing research, suggests that our framework could be successfully applied to a wide range of low-resource languages, such as those from different African, Indigenous, and minority language communities that face similar challenges of data scarcity.
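A minimal sketch of how such dynamic aggregation could be realized is given below; the update rule, step size, bounds, and the helper names train_one_round and evaluate_spbleu are hypothetical, and only indicate where per-round re-weighting would sit in a federated-like training loop.

```python
# Assumed (illustrative) update rule for the per-round aggregation weight:
# when the Twi validation score stops improving, the low-resource weight is
# increased; when it improves, the weight relaxes slightly toward the
# high-resource data. The step size and bounds are hypothetical choices.
def update_lr_weight(w_lr, prev_score, curr_score,
                     step=0.05, w_min=0.5, w_max=0.9):
    if curr_score <= prev_score:
        return min(w_max, w_lr + step)   # emphasise the low-resource data more
    return max(w_min, w_lr - step / 2)   # relax toward the high-resource data


# Sketch of how this would sit in a federated-like training loop
# (train_one_round and evaluate_spbleu are hypothetical helpers):
#
# w_lr, prev_score = 0.7, 0.0
# for round_idx in range(num_rounds):
#     train_one_round(model, optimizer, w_lr)   # e.g. repeated clof_step calls
#     score = evaluate_spbleu(model, twi_dev_set)
#     w_lr = update_lr_weight(w_lr, prev_score, score)
#     prev_score = score
```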

5.4. Computational Requirements and Feasibility in Low-Resource Environments

While our approach significantly improves translation quality for low-resource languages, its computational demands must be carefully considered, especially in settings with limited technological infrastructure. Fine-tuning the mT5 model on the English-Twi dataset took approximately 2 h on an NVIDIA Tesla T4 GPU when trained for 100 epochs with a batch size of 8. The memory usage during training reached approximately 15 GB of VRAM, which may pose a challenge for training on lower-end GPUs. During inference, the model processed around 100 sentences per second on a GPU, but this dropped to approximately 10–15 sentences per second on a CPU, underscoring the need for hardware acceleration in real-time applications.
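For reference, the reported setup corresponds roughly to the following Hugging Face Seq2SeqTrainer configuration. This is a hedged sketch: the checkpoint name and the pre-tokenized datasets train_ds and eval_ds are assumptions, while the batch size, epoch count, weight decay, and gradient-clipping norm follow the values reported above and in Table 1.

```python
# Sketch of a fine-tuning configuration consistent with the reported setup
# (batch size 8, 100 epochs, AdamW with weight decay 0.01, clipping at 1.0).
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "google/mt5-small"                      # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-en-twi",
    per_device_train_batch_size=8,   # as reported
    num_train_epochs=100,            # as reported
    weight_decay=0.01,               # AdamW regularization (Table 1)
    max_grad_norm=1.0,               # gradient clipping (Table 1)
    fp16=True,                       # helps fit the ~15 GB VRAM budget on a T4
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,          # pre-tokenized English-Twi pairs (assumed)
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```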
Despite these demands, the feasibility of deploying this method in low-resource environments can be enhanced through optimization techniques. Model compression methods, such as quantization (e.g., 8-bit or 4-bit), can significantly reduce the memory footprint, making it possible to run the model on devices with lower computational capacity. Furthermore, offline fine-tuning strategies allow for local adaptation of the model, eliminating the need for continuous access to cloud-based resources and making the model more accessible in areas with limited internet connectivity. By optimizing the model for mobile devices and leveraging AI accelerators, such as Google’s Edge TPU, real-time translation applications can be deployed even in computationally constrained environments.
These optimizations ensure that our model remains practical and accessible for low-resource languages, addressing both linguistic challenges and the technological limitations often faced by underrepresented language communities.
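As one concrete example of such an optimization, the sketch below applies generic post-training dynamic quantization in PyTorch; it is not the deployment pipeline used in this study, and the checkpoint name stands in for the fine-tuned English-Twi weights.

```python
# Sketch of post-training dynamic quantization: Linear layers are converted
# to 8-bit, shrinking the memory footprint for CPU inference. Output quality
# depends entirely on the (fine-tuned) weights that are loaded.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "google/mt5-small"  # placeholder for the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("How are you?", return_tensors="pt")
output_ids = quantized.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```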

6. Conclusions

In this research, we demonstrated the effectiveness of fine-tuning a pretrained multilingual model (mT5) for low-resource language translation, utilizing cross-lingual learning and a federated-like approach. The fine-tuned model exhibited significant improvements in translation quality, outperforming the baseline model across all key metrics, including SpBLEU, ROUGE, and WER. Our approach successfully bridged the gap between high-resource and low-resource languages, substantially enhancing the translation accuracy and fluency of Twi, a language with limited NLP resources.
This refined model has a broad range of potential applications. It can be integrated into real-time translation systems to facilitate communication between English and Twi speakers. Additionally, it holds promise for language learning tools, assisting in both learning and teaching Twi. The approach can also be applied to social media and communication platforms, helping Twi speakers overcome language barriers and increase their digital accessibility. The improved translation of Twi is especially important as it enhances NLP capabilities for a language that has historically been underrepresented in machine translation. This opens up new possibilities for integrating Twi into digital platforms, improving communication, and enhancing content generation for the large population of native speakers.
Despite the promising results, there are areas that warrant further improvement. Expanding the dataset with larger and more diverse corpora would enhance the model’s ability to generalize across different linguistic patterns. Additionally, leveraging newer pretrained models, such as mT5-Base or larger variants trained on more extensive multilingual corpora, could further refine translation performance. Future work should focus on improving accuracy for rare phrases by expanding the dataset, implementing subword tokenization, and exploring data augmentation techniques like backtranslation. Combining synthetic data generation with real-world corpora would enhance the model’s adaptability, especially for colloquial expressions and rare linguistic structures. Furthermore, adapter-based fine-tuning and prompt tuning offer promising avenues for efficiently updating the model, enabling scalability and resource-efficient adaptation to other low-resource languages.
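A minimal backtranslation sketch is shown below, assuming access to some reverse (Twi-to-English) model; the model name is a placeholder, and the generated English sentences would be paired with the original Twi to augment the English-to-Twi training data.

```python
# Hedged sketch of backtranslation for data augmentation: monolingual Twi
# sentences are translated into synthetic English with a reverse model, and
# the resulting pairs are added to the English-to-Twi training data.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

reverse_name = "twi-to-english-model"   # placeholder, not a specific released checkpoint
rev_tok = AutoTokenizer.from_pretrained(reverse_name)
rev_model = AutoModelForSeq2SeqLM.from_pretrained(reverse_name)


def backtranslate(twi_sentences, max_new_tokens=64):
    """Return (synthetic_english, original_twi) pairs for augmentation."""
    batch = rev_tok(twi_sentences, return_tensors="pt",
                    padding=True, truncation=True)
    outputs = rev_model.generate(**batch, max_new_tokens=max_new_tokens)
    synthetic_en = rev_tok.batch_decode(outputs, skip_special_tokens=True)
    return list(zip(synthetic_en, twi_sentences))


# augmented = backtranslate(["Wote sɛn?", "Ɛyɛ da a ɛyɛ fɛ."])
# train_pairs.extend(augmented)   # mixed with the real parallel corpus
```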
While our model represents a significant advancement in Twi translation, challenges such as overfitting due to limited data and adapting federated-like training to diverse linguistic structures remain. The model occasionally exhibits rigidity in sentence structures, struggles with highly inflected words, and sometimes over-prioritizes high-resource languages. To mitigate these issues, strategies such as data augmentation, regularization techniques, subword tokenization, and adaptive gradient weighting should be explored. These methods would help improve the model’s ability to better capture the linguistic nuances of Twi.
We also suggest incorporating additional techniques, such as adapter-based learning, LoRA, and multilingual fine-tuning baselines like mBART, to further improve translation performance and generalization across multiple languages. These techniques are expected to enable more efficient updates when incorporating new languages or domains, ensuring better scalability in low-resource environments.
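As an illustration of the LoRA option, the sketch below wraps a base seq2seq model with the peft library; the rank, scaling factor, dropout, and target modules are illustrative choices rather than values validated in this study.

```python
# Minimal LoRA sketch with the peft library: only low-rank adapter weights
# are trained, which keeps updates cheap when adding new languages or domains.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")  # assumed checkpoint

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # low-rank dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5/mT5 attention projection module names
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights will train
# The wrapped model can then be passed to the same Seq2SeqTrainer setup as above.
```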
The future of low-resource language translation relies not only on individual research efforts but also on collective contributions from the broader NLP community. Open-source, multilingual datasets and tools are essential for advancing machine translation models and fostering a more inclusive digital ecosystem. By fostering collaborations between academic researchers, industry leaders, and language communities, we can accelerate the development of open-access resources, ensuring sustainable progress in machine translation for underrepresented languages.

Author Contributions

Conceptualization, E.A. and V.A.O.; methodology, E.A.; software, E.A.; validation, E.A., V.A.O. and A.B.Q.; formal analysis, E.A.; investigation, E.A.; resources, E.A.; data curation, E.A.; writing—original draft preparation, E.A.; writing—review and editing, X.Z. and J.R.A.; visualization, E.A.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

FL | Federated learning
MT | Machine translation
CLOF | Cross-Lingual Optimization Framework
NLP | Natural language processing
SpBLEU | Sentence-Piece BLEU
ROUGE | Recall-Oriented Understudy for Gisting Evaluation
WER | Word Error Rate
LoRA | Low-Rank Adaptation
mT5 | Multilingual T5

References

  1. Her, W.; Kruschwitz, U. Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-Resourced Languages @ LREC-COLING 2024, Torino, Italy, 20–25 May 2024; pp. 155–167. [Google Scholar]
  2. Kowtal, N.; Deshpande, T.; Joshi, R. A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross Lingual Sentence Representations. In Proceedings of the 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), Pune, India, 5–7 April 2024; pp. 1–7. [Google Scholar]
  3. Sun, M.; Wang, H.; Pasquine, M.; Hameed, I.A. Machine Translation in Low-Resource Languages by an Adversarial Neural Network. Appl. Sci. 2021, 11, 10860. [Google Scholar] [CrossRef]
  4. Xia, M.; Kong, X.; Anastasopoulos, A.; Neubig, G. Generalized Data Augmentation for Low-Resource Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 5786–5796. [Google Scholar]
  5. Park, Y.; Choi, Y.; Yun, S.; Kim, S.; Lee, K. Robust Data Augmentation for Neural Machine Translation through EVALNET. Mathematics 2022, 11, 123. [Google Scholar] [CrossRef]
  6. Qi, J. Research on Methods to Enhance Machine Translation Quality Between Low-Resource Languages and Chinese Based on ChatGPT. J. Soc. Sci. Humanit. 2024, 6, 36–41. [Google Scholar] [CrossRef]
  7. Prasada, P.; Rao, M.V.P. Reinforcement of low-resource language translation with neural machine translation and backtranslation synergies. IAES Int. J. Artif. Intell. 2024, 13, 3478–3488. [Google Scholar] [CrossRef]
  8. Faheem, M.A.; Wassif, K.T.; Bayomi, H.; Abdou, S.M. Improving neural machine translation for low resource languages through non-parallel corpora: A case study of Egyptian dialect to modern standard Arabic translation. Sci. Rep. 2024, 14, 2265. [Google Scholar] [CrossRef]
  9. Nzeyimana, A. Low-resource neural machine translation with morphological modeling. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 182–195. [Google Scholar]
  10. Sreedeepa, H.S.; Idicula, S.M. Neural Network Based Machine Translation Systems for Low Resource Languages: A Review. In Proceedings of the 2nd International Conference on Modern Trends in Engineering Technology and Management, Kerala, India, 22 December 2023; pp. 330–336. [Google Scholar]
  11. Lalrempuii, C.; Soni, B. Low-Resource Indic Languages Translation Using Multilingual Approaches. In: High Performance Computing, Smart Devices and Networks. Springer Nat. 2023, 1087, 371–380. [Google Scholar]
  12. Mi, C.; Xie, S.; Fan, Y. Multi-granularity Knowledge Sharing in Low-resource Neural Machine Translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2024, 23, 31. [Google Scholar] [CrossRef]
  13. Gitau, C.; Marivate, V. Textual Augmentation Techniques Applied to Low Resource Machine Translation: Case of Swahili. J. Digit. Humanit. Assoc. S. Afr. 2023, 4, 1. [Google Scholar] [CrossRef]
  14. Maimaiti, M.; Liu, Y.; Luan, H.; Sun, M. Data augmentation for low-resource languages NMT guided by constrained sampling. Int. J. Intell. Syst. 2021, 37, 30–51. [Google Scholar] [CrossRef]
  15. Robinson, N.; Hogan, C.; Fulda, N.; Mortensen, D.R. Data-adaptive Transfer Learning for Translation: A Case Study in Haitian and Jamaican. In Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022), Gyeongju, Republic of Korea, 16 October 2022; pp. 35–42. [Google Scholar]
  16. Dong, J. Transfer Learning-Based Neural Machine Translation for Low-Resource Languages. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023. [Google Scholar] [CrossRef]
  17. Ko, W.; El-Kishky, A.; Renduchintala, A.; Chaudhary, V.; Goyal, N.; Guzman, F.; Fung, P.; Koehn, P.; Diab, M. Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; Volume 1, pp. 802–812. [Google Scholar]
  18. Gunnam, V. Tackling Low-Resource Languages: Efficient Transfer Learning Techniques for Multilingual NLP. Int. J. Res. Publ. Semin. 2022, 13, 354–359. [Google Scholar]
  19. Vu, H.; Bui, N.D. On the scalability of data augmentation techniques for low-resource machine translation between Chinese and Vietnamese. J. Inf. Telecommun. 2023, 7, 241–253. [Google Scholar] [CrossRef]
  20. Zhang, W.; Dai, L.; Liu, J.; Wang, S. Improving Many-to-Many Neural Machine Translation via Selective and Aligned Online Data Augmentation. Appl. Sci. 2023, 13, 3946. [Google Scholar] [CrossRef]
  21. Srihith, I.D.; Donald, A.D.; Srinivas, T.A.S.; Thippanna, G.; Anjali, D. Empowering Privacy-Preserving Machine Learning: A Comprehensive Survey on Federated Learning. Int. J. Adv. Res. Sci. Commun. Technol. 2023, 3, 133–144. [Google Scholar] [CrossRef]
  22. Chouhan, J.S.; Bhatt, A.K.; Anand, N. Federated Learning; Privacy Preserving Machine Learning for Decentralized Data. J. Propuls. Technol. 2023, 44, 167–169. [Google Scholar]
  23. Wang, H.; Wang, Q.; Ding, Y.; Tang, S.; Wang, Y. Privacy-preserving federated learning based on partial low-quality data. J. Cloud Comput. Adv. Syst. Appl. 2024, 13, 62. [Google Scholar] [CrossRef]
  24. Asad, M.; Shaukat, S.; Javanmardi, E.; Nakazato, J.; Tsukada, M. A Comprehensive Survey on Privacy-Preserving Techniques in Federated Recommendation Systems. Appl. Sci. 2023, 13, 6201. [Google Scholar] [CrossRef]
  25. Zoph, B.; Yuret, D.; May, J.; Knight, K. Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1568–1575. [Google Scholar]
  26. Smith, B.; Brown, C. SLaSh: A Novel Approach for Fine—Tuning Transformers in Low—Resource Machine Translation. In Proceedings of the 20th Conference on Machine Translation Technologies, Rhodes Island, Greece, 4–6 July 2022; pp. 123–135. [Google Scholar]
  27. Lankford, S.; Afli, H.; Way, A. adaptMLLM: Fine-Tuning Multilingual Language Models on Low-Resource Languages with Integrated LLM Playgrounds. Information 2023, 14, 638. [Google Scholar] [CrossRef]
  28. Gheini, M.; Ren, X.; May, J. Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 5 November 2021; pp. 1754–1765. [Google Scholar]
  29. Gupta, U.; Galstyan, A.; Steeg, G.V. Jointly Reparametrized Multi-Layer Adaptation for Efficient and Private Tuning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, 9–14 July 2023; pp. 12612–12629. [Google Scholar]
  30. Chen, Y. A concise analysis of low-rank adaptation. Appl. Comput. Eng. 2024, 42, 76–82. [Google Scholar] [CrossRef]
  31. Weng, R.; Yu, H.; Luo, W.; Zhang, M. Deep Fusing Pre-trained Models into Neural Machine Translation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11468–11476. [Google Scholar] [CrossRef]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  33. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL’02), Stroudsburg, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  34. Goyal, N.; Gao, C.; Chaudhary, V.; Chen, P.-J.; Wenzek, G.; Ju, D.; Krishnan, S.; Ranzato, M.; Guzmán, F.; Fan, A. The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Trans. Assoc. Comput. Linguist. 2022, 10, 522–538. [Google Scholar] [CrossRef]
  35. Lin, C. Looking for a Few Good Metrics: ROUGE and its Evaluation. In Proceedings of the 4th NTCIR Workshops, Tokyo, Japan, 2–4 June 2004. [Google Scholar]
  36. Lin, C. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004; pp. 74–81. [Google Scholar]
  37. Chatzikoumi, E. How to evaluate machine translation—A review of automated and human metrics. Nat. Lang. Eng. 2019, 26, 137–161. [Google Scholar] [CrossRef]
  38. Shukla, M.B.; Chavada, B. A Comparative Study and Analysis of Evaluation Matrices in Machine Translation. In Proceedings of the 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 13–15 March 2019; pp. 1236–1239. [Google Scholar]
  39. Wu, F.; Fan, A.; Baevski, A.; Dauphin, Y.N.; Auli, M. Pay Less Attention with Lightweight and Dynamic Convolutions. arXiv 2019, arXiv:1901.10430. [Google Scholar]
  40. Paulus, R.; Xiong, C.; Socher, R. A Deep Reinforced Model for Abstractive Summarization. arXiv 2017, arXiv:1705.04304. [Google Scholar]
  41. Nallapati, R.; Zhou, B.; Santos, C.D.; Gulcehre, C.; Xiang, B. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), Berlin, Germany, 11–12 August 2016. [Google Scholar]
  42. Agarwal, P.; Nunes, L.; Blunt, J. Retrieval Practice Consistently Benefits Student Learning: A Systematic Review of Applied Research in Schools and Classrooms. Educ. Psychol. Rev. 2021, 33, 1409–1453. [Google Scholar] [CrossRef]
  43. Aniruddha, R.; Ray, P.; Maheshwari, A.; Sarkar, S.; Goyal, P. Enhancing Low-Resource NMT with a Multilingual Encoder and Knowledge Distillation: A Case Study. In Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), Bangkok, Thailand, 15 August 2024; pp. 64–73. [Google Scholar]
  44. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451. [Google Scholar]
  45. Zhu, J.; Lv, C.; Wang, X.; Wu, M.; Liu, W.; Li, T.; Ling, Z.; Zhang, C.; Zheng, X.; Huang, X. Promoting Data and Model Privacy in Federated Learning through Quantized LoRA. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 10501–10512. [Google Scholar]
Figure 1. Example of backtranslation as a data augmentation technique.
Figure 2. Conceptual framework for low-resource machine translation.
Figure 3. Evaluation metrics heatmap.
Figure 4. Scalability of the model.
Figure 5. Learning curve of performance.
Table 1. Model configuration.
Category | Details
Model | T5-Small (Text-to-Text Transfer Transformer)
Number of Encoder Layers | 6
Number of Decoder Layers | 6
Hidden Dimension | 512
Attention Heads | 8 per layer
Total Parameters | Approximately 60 million
Optimization Algorithm | AdamW optimizer, with a weight decay of 0.01 for regularization
Gradient Clipping | Applied with a maximum norm of 1.0 to prevent exploding gradients during training
Tokenizer | Sentence-Piece tokenizer (shared vocabulary across source and target languages)
Pretrained Model | T5-Small (Text-to-Text Transfer Transformer)

Table 2. Performance comparison between pretrained mT5 and fine-tuned English-Twi model.
Metrics | Baseline (%) | Fine-Tuned Model (%) | Improvement (%)
SpBLEU | 2.16 | 71.30 | 69.14
ROUGE-1 | 15.23 | 65.24 | 50.01
ROUGE-2 | 7.18 | 60.22 | 53.04
ROUGE-L | 14.22 | 62.12 | 47.90
WER (%) | 183.16 | 68.32 | 114.84

Table 3. Normalized WER analysis.
Model | Raw WER | Normalized WER
Baseline | 183.16% | 92.4%
Proposed | 68.32% | 27.1%

Table 4. Statistical validation.
Metric | Proposed vs. Baseline | p-Value | 95% CI (Gain)
spBLEU | +7.2 points | <0.001 | [5.8, 8.5]
ROUGE-L | +11.4% F1 | 0.0023 | [9.1%, 13.7%]
WER | −154.9% (absolute) | <0.0001 | [−162.3%, −147.5%]

Table 5. Comparison of translation results.
Input | Reference | Baseline | Fine-Tuned
I am learning machine translation. | Merehwehwɛ sɛ mɛyɛ asɛmfua ho nsɛm. | Mfe3′ ne na wcb3′tumi ahoroc ahoroc. | Merehwehwɛ sɛ mɛyɛ asɛm nsɛm.
How are you? | Wote sɛn? | Wodeɛn na wo yɛɛ? | Wote sɛn?
This is a beautiful day. | Ɛyɛ da a ɛyɛ fɛ. | Ɛyɛ na 3′yɛ′ na 3′yɛ′ na 3′yɛ′. | Ɛyɛ da a ɛyɛ fɛ.
Can you help me with this task? | Wobɛtumi aboa me wɔ dwumadie yi mu? | Wɔbɛyɛ sɛ wɔboa adwuma yi wɔ. | Wobɛtumi aboa me wɔ dwumadie yi mu?
Machine translation is useful. | Asɛmfua ho dikan yɛ ho wɔ asɛm foforɔ. | Asɛmfua yɛɛ nea ɛyɛ dikan dodo. | Asɛmfua ho dikan yɛ ho wɔ asɛm foforɔ.
I love learning new things. | Mepɛ sɛ me sua nsɛm foforɔ. | Mepɛ sɛ me sua biribi foforɔ. | Mepɛ sɛ me sua nsɛm foforɔ.
Where is the market? | Ɛhe na ɛyɛ dɔnkɔ? | Ɛheɛ ɛwɔ na yɛ dɔnkɔ? | Ɛhe na ɛwɔ dɔnkɔ?
She is my sister. | Ɔyɛ me nua baa. | Ɔyɛɛ me nua wɔ baa. | Ɔyɛ me nua baa.
What time is it now? | Dɛn bereɛ a ɛyɛ mpɔtam wɔ seesei ara? | Berɛ no dɛn yɛɛ mpɔtam wɔ. | Dɛn bereɛ a ɛyɛ mpɔtam wɔ seesei ara?
Do you understand this? | Wote ase wɔ asɛm yi mu? | Wɔte sɛn mu sɛ wɔ yɛɛ yɛ yi? | Wote ase wɔ asɛm yi mu?

Table 6. Impact of gradient weighting.
Gradient Weighting Strategy | SpBLEU (%) | ROUGE-1 (%) | ROUGE-2 (%) | ROUGE-L (%) | WER (%)
Equal Weighting (Baseline mT5) | 2.16 | 15.23 | 7.18 | 14.22 | 183.16
Static Weighting (Fixed 1:2 HR-LR Ratio) | 12.50 | 40.12 | 22.31 | 37.80 | 98.45
Dynamic Weighting (Optimized Per Round) | 25.67 | 65.24 | 50.22 | 62.12 | 68.32

Table 7. Effect of personalization on translation accuracy.
Input Sentence | Baseline mT5 Translation | Fine-Tuned (No Personalization) | Fine-Tuned (With Personalization) | Reference
"The elders are speaking in proverbs." | "Mpanyimfo no rekasa wɔ proverbs mu." (Mixing English) | "Mpanyimfo no reka nsɛm a ɛyɛ dodoɔ." (Fluent, but incorrect word) | "Mpanyimfo no reka abebu sɛm." (Accurate, fluent) | "Mpanyimfo no reka abebu sɛm."
"Tomorrow, I will travel to Accra." | "Ɔkyena, me bɛyɛ atena Accra." (Incorrect structure) | "Ɔkyena, me bɛkɔ Accra." (Acceptable) | "Ɔkyena, me rekɔ Accra." (Correct tense and syntax) | "Ɔkyena, me rekɔ Accra."
"She loves to cook for her family." | "Ɔdɔ de no yɛ aduan ama abusua no." (Unnatural phrasing) | "Ɔpɛ de yɛ aduan ma abusua no." (Fluent, but wrong verb usage) | "Ɔdɔ sɛ ɔbɛyɛ aduan ama abusua no." (Correct and natural) | "Ɔdɔ sɛ ɔbɛyɛ aduan ama abusua no."

Table 8. Examples of errors in translation.
Input Sentence | Fine-Tuned Translation | Reference Translation | Observed Error
The cat is under the table. | ɛnɔma no wɔ sika no ase. | Pɔnkɔ no wɔ pono no ase. | Wrong subject used.
I will eat tomorrow. | Me bɛyɛ adidi nnɛ. | Me bɛyɛ adidi ɔkyena. | Wrong tense used.
Where is the teacher? | Wɔ he na mpɔtam sika wɔ? | Ɛhe na kyerɛkyerɛfoɔ no wɔ? | Unnatural phrasing for "teacher".
This book is interesting. | Ɛhɔ nneɛma no yɛ fɛfɛɛfɛ. | Abakɔsɛm yi yɛ anika dodo. | Overgeneralized vocabulary.
The child is running fast. | Abɔfra no retu kwan dada. | Abɔfra no retu kwan ntɛm ara. | Missing adverb translation.

Table 9. Comparison with existing studies.
Study | Methodology/Approach | SpBLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | WER | Limitations
Fan et al., 2019 [39] | Introduces a neural summarization model that customizes output based on user preferences for length, style, and entities, with default settings applied when no input is given. | N/A | 39.06 | 15.38 | 35.77 | N/A | Performance heavily depends on user input for optimal summary customization.
Paulus et al., 2017 [40] | Introduces a neural network model with intra-attention that combines supervised word prediction with reinforcement learning (RL) to improve summarization quality. | N/A | 38.30 | 14.81 | 35.49 | N/A | Reinforcement learning-based training is computationally intensive and complex.
Nallapati et al., 2016 [41] | Uses attentional encoder-decoder RNNs for abstractive summarization, improving keyword modeling and rare word handling. | 29.7 | 35.46 | 13.30 | 32.65 | N/A | Struggles with accurately generating rare or highly specialized words in certain contexts.
Agarwal et al., 2020 [42] | Trains deeper Transformer and Bi-RNN encoders for machine translation, adjusting the attention mechanism to improve optimization and BLEU scores. | 0.7 | N/A | N/A | N/A | N/A | Deeper models are harder to optimize and require careful tuning to avoid performance degradation.
Our Study | Leverages dynamic dataset aggregation and cross-lingual learning to optimize translation for low-resource languages by adapting to high-resource language data and incorporating federated-like training methods. | 71.30 | 65.24 | 60.22 | 62.12 | 68.32 | Needs further improvement on rare words and phrases.

Table 10. Comparison with GPT-3.5 and GPT-4 using dataset-derived prompts.
Model | SpBLEU | ROUGE-L | WER (%) | TFS (%)
Fine-tuned mT5 (ours) | 71.30 | 62.12 | 68.32 | 82
GPT-3.5 (zero-shot) | 38.41 | 45.23 | 121.67 | 54
GPT-3.5 (few-shot) | 52.17 | 58.34 | 89.25 | 63
GPT-4 (zero-shot) | 47.82 | 53.41 | 97.83 | 68
GPT-4 (few-shot) | 59.36 | 61.22 | 75.41 | 74
