Transformer-Based Re-Ranking Model for Enhancing Contextual and Syntactic Translation in Low-Resource Neural Machine Translation

Javed, Arifa; Zan, Hongying; Mamyrbayev, Orken; Abdullah, Muhammad; Ahmed, Kanwal; Oralbekova, Dina; Dinara, Kassymova; Akhmediyarova, Ainur

doi:10.3390/electronics14020243


by Arifa Javed ¹, Hongying Zan ¹,*, Orken Mamyrbayev ², Muhammad Abdullah ¹, Kanwal Ahmed ³, Dina Oralbekova ², Kassymova Dinara ⁴ and Ainur Akhmediyarova ⁵,*

¹ School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

² Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan

³ School of Software, Henan University, Kaifeng 475001, China

⁴ Academy of Logistics and Transport, Almaty 050010, Kazakhstan

⁵ Institute of Automation and Information Technologies, Satbayev University, Almaty 050013, Kazakhstan

* Authors to whom correspondence should be addressed.

Electronics 2025, 14(2), 243; https://doi.org/10.3390/electronics14020243

Submission received: 20 November 2024 / Revised: 28 December 2024 / Accepted: 3 January 2025 / Published: 8 January 2025


Abstract

Neural machine translation (NMT) plays a vital role in modern communication by bridging language barriers and enabling effective information exchange across diverse linguistic communities. Due to the limited availability of data in low-resource languages, NMT faces significant translation challenges. Data sparsity limits NMT models’ ability to learn, generalize, and produce accurate translations, which leads to low coherence and poor context awareness. This paper proposes a transformer-based approach incorporating an encoder–decoder structure, bilingual curriculum learning, and contrastive re-ranking mechanisms. Our approach enriches the training dataset using back-translation and enhances the model’s contextual learning through BERT embeddings. An incomplete-trust (in-trust) loss function is introduced to replace the traditional cross-entropy loss during training. The proposed model effectively handles out-of-vocabulary words and integrates named entity recognition techniques to maintain semantic accuracy. Additionally, the self-attention layers in the transformer architecture enhance the model’s syntactic analysis capabilities, which enables better context awareness and more accurate translations. Extensive experiments are performed on a diverse Chinese–Urdu parallel corpus, developed using human effort and publicly available datasets such as OPUS, WMT, and WiLi. The proposed model demonstrates a BLEU score improvement of 1.80% for Zh→Ur and 2.22% for Ur→Zh compared to the highest-performing comparative model. This significant enhancement indicates better translation quality and accuracy.

Keywords:

neural machine translation; syntactic analysis; transformer; BERT; re-ranking; M2M model


1. Introduction

Language is a fundamental tool for communication and plays a significant role in preserving cultural heritage. It is a powerful medium for fostering connections and understanding between nations, particularly in global initiatives like the ‘Belt and Road’ initiative [1]. Effective communication between different language groups is essential for facilitating economic and cultural exchanges. However, overcoming language barriers remains a significant challenge, especially in low-resource languages, where translation accuracy and fluency are fundamental. In this context, machine translation (MT) becomes an essential tool for bridging these language gaps.

Machine translation (MT) is a field at the intersection of linguistics, computer science, artificial intelligence, and specifically natural language processing (NLP) [2]. The primary aim of MT is to automatically translate text from one source language to a target language, thereby enhancing communication and mutual understanding. Early MT approaches relied on rule-based methods, where linguistic rules were manually created to translate text [3]. For instance, rule-based systems like SYSTRAN use predefined grammar rules and dictionaries to map phrases between languages. While effective for simple translations, these systems struggled with complex sentence structures, idiomatic expressions, and contextual variations.

With the advent of statistical methods [4], MT systems evolved to use statistical models trained on large bilingual corpora. These models could learn translation patterns from data, improving accuracy by capturing more subtle language nuances. Statistical models relied on feature engineering, where features like word alignments and phrase pairings were used to improve translation [5]. However, these methods still struggled to capture context fully and to provide fluent translations, particularly with complex syntactic structures, such as varying word order or sentence construction, and idiomatic expressions like “break a leg”, which could be misinterpreted if translated literally.

The introduction of neural machine translation (NMT) marked a significant advancement in the field of machine translation [6,7]. NMT uses artificial neural networks, particularly architectures like recurrent neural networks (RNNs) [8] and transformer-based models [9], to translate text in an end-to-end manner. Unlike previous models, which relied on predefined rules or statistical data, NMT learns translation patterns directly from vast amounts of data, producing more fluent, contextually accurate, and natural translations [10]. This shift enables the generation of translations that resemble human language. NMT models prioritize fluency and natural language usage, ensuring that translations are structured as humans would express the sentiment. By focusing on context, NMT ensures that the translation is grammatically correct and idiomatic, much like a human-generated translation. Specifically, context awareness in NMT refers to the model’s ability to understand and incorporate the surrounding context of a word, phrase, or sentence. Rather than translating isolated words, a context-aware system takes into account the broader sentence or passage to generate a more coherent and accurate translation. For example, when translating the word ‘bank’, the model can determine whether it refers to a financial institution or the side of a river based on the surrounding text. These improvements are primarily due to deep learning techniques, allowing the system to learn the intricate relationships between words and phrases within a given context, resulting in more natural and fluent translations akin to those produced by humans.

Despite the success of NMT for high-resource language pairs like English–French [6], challenges remain for low-resource language pairs such as Chinese–Urdu. Low-resource NMT must contend with scarce parallel corpora and linguistic resources, as well as limited computational power [11]. Even though large populations speak both Chinese and Urdu, high-quality parallel corpora for this pair are limited [12]. This data sparsity presents a challenge for NMT models, as they have fewer examples to learn from, particularly regarding rare words, idiomatic expressions, and complex sentence structures. The limited data also restricts the model’s ability to generalize linguistic patterns, making it difficult for NMT systems to handle domain-specific contexts [13].

A key issue in NMT for low-resource languages is handling out-of-vocabulary (OOV) words which were not encountered during the training phase. These OOV words often present significant syntactic and linguistic challenges [14], especially when translating between languages with very different linguistic structures. For instance, Chinese and Urdu have very different word order and morphology, which makes it difficult for the model to handle certain words effectively. To address this, named entity recognition (NER) [15] plays a vital role in improving the contextual accuracy of translations. NER helps identify and classify essential terms, such as people, organizations, and locations, ensuring these elements are correctly translated. NER capabilities also assist in syntactic parsing, which is the process of analyzing the grammatical structure of a sentence to understand how different words relate to each other. This parsing allows the model to handle complex sentence structures better and ensures the translated text is syntactically correct.

In response to these challenges, our research aims to improve NMT performance for Chinese↔Urdu translation by addressing data scarcity and linguistic complexity. We leverage pre-trained models and curated datasets to enhance the model’s training process. Our approach builds on the transformer-based M2M-100 model [16], which is trained on a “many-to-many” dataset covering 7.5 billion sentences across 100 languages. This model enhances translation quality by improving context awareness and coherence while addressing issues such as OOV words. Furthermore, integrating NER within the translation pipeline ensures that critical named entities are accurately translated, preventing errors and omissions. Back-translation, which translates text from Urdu to Chinese and vice versa to create synthetic parallel corpora, further enriches the training data. These techniques collectively optimize the model’s performance, enabling it to handle diverse linguistic patterns and to produce high-quality translations for low-resource language pairs like Chinese and Urdu.

Contributions

The key contributions of the proposed model are as follows.

By employing back-translation, the model effectively combats the challenge of data scarcity and sparsity in low-resource language pairs.
This paper utilizes the incomplete-trust (in-trust) loss function to replace cross-entropy loss. This loss function enhances model robustness by reducing the impact of noisy data during training, thereby helping to prevent overfitting and improving generalization.
A contrastive re-ranking approach is employed to refine the translation output by evaluating and selecting the most accurate candidate translation from multiple candidate translations.
The transformer-based architecture of M2M with BERT embeddings and attention mechanisms allows the model to maintain a higher level of context awareness throughout the translation process. The model’s training includes focused layers on semantic parsing and understanding the relationships between words and phrases.
Incorporating NER tagging within the translation workflow ensures that the model prioritizes and accurately translates critical named entities or specific terms.

The proposed model aims to overcome the primary challenges described above and to produce more accurate, fluent, and contextually relevant translations across diverse linguistic and domain-specific settings. The remainder of this article is organized as follows. Section 2 presents a critical review of state-of-the-art NMT paradigms. Section 3 introduces the research methodology and its implementation process. Section 4 describes the experimental structure. Section 5 presents the outcomes of the proposed model and its effectiveness over existing models through comparative analysis. Finally, Section 6 provides concluding remarks and discusses potential future research directions.

2. Literature Review

Machine translation systems are generally categorized based on their underlying architecture. These categories include rule-based machine translation (RBMT), corpus-based machine translation (CBMT), hybrid approaches, and neural machine translation (NMT) [3,5,6], each offering distinct methods for overcoming translation challenges. RBMT is one of the earliest MT approaches, relying heavily on linguistic rules to translate text between languages. There are two key strategies within RBMT: the transfer-based approach and the interlingua-based approach. The transfer approach translates the source language into the target language by applying syntactic and semantic rules and bilingual dictionaries. In contrast, the interlingua approach generates an intermediate semantic representation that can be translated into any language, bypassing the need for direct source-to-target mapping [3]. RBMT systems require extensive knowledge of source and target languages. The need for comprehensive rule creation often limits them, making them less scalable for language pairs with limited linguistic resources [17].

CBMT emerged to overcome the limitations of rule-based systems. Liu et al. [18] utilized this approach on large bilingual corpora to learn translation patterns automatically. CBMT systems rely on statistical and probabilistic models, which map phrases in the source language to their corresponding translations in the target language. The shift from rule-based to corpus-based systems allowed for greater scalability, particularly for languages with extensive text corpora available [13]. However, CBMT still requires significant amounts of parallel data to ensure high-quality translations, and it struggles in low-resource language pairs where such data are scarce.

Hybrid approaches aim to combine the strengths of both RBMT and CBMT. These systems integrate linguistic knowledge, such as rules and dictionaries, with data-driven methods to enhance translation quality [5]. Hybrid models benefit low-resource language pairs or specialized domains where purely rule-based or data-driven methods may fail to capture all nuances [19]. These approaches attempt to leverage the best of both worlds—rule-based precision and corpus-based flexibility.

The most significant breakthrough in MT in recent years has been the advent of neural machine translation (NMT) [6]. NMT uses deep learning techniques, particularly artificial neural networks (ANNs), to learn translations directly from data without the need for predefined linguistic rules [7]. Unlike RBMT and CBMT, which require explicit mappings or rules, NMT systems learn a sequence-to-sequence translation process from large parallel corpora, making them more adaptable to various languages and domains. A key feature of NMT is its ability to generate translations that account for the context of entire sentences rather than just word-to-word translations [20].

One of the early successes of NMT was in the development of the sequence-to-sequence (Seq2Seq) model, which relies on recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) networks [21]. These models performed well on translation tasks where previous methods struggled, particularly with long-range dependencies in sentences [22]. The utilization of the attention mechanism further improved the Seq2Seq model by allowing the system to focus on the most relevant parts of the source sentence during translation, improving the fluency and accuracy of the output [23].

The introduction of transformer-based models by Vaswani et al. marked a significant turning point in NMT [9]. Unlike RNNs and LSTMs [24], which process data sequentially, transformers leverage a self-attention mechanism that allows them to process the entire input sequence simultaneously. This enables parallel computation, significantly speeding up training and inference. Transformers excel at handling languages with complex syntactic structures and long-range dependencies, making them particularly well suited for language pairs like Chinese and Urdu, which differ significantly in both syntax and semantics.

Several studies have demonstrated the effectiveness of transformer models in low-resource languages. For example, multilingual BERT has been successfully adapted for sentiment analysis in Urdu [25], showing how transformer models can process and understand underrepresented languages. Similarly, Malik et al. [26] used transformers to detect threatening language in Urdu, emphasizing their ability to maintain semantic integrity in sensitive-content areas. The ability of transformers to manage diverse languages, even those with limited resources, has contributed to their dominance in the NMT landscape.

Transformer models have also enabled the development of multilingual and cross-lingual models, which can simultaneously perform translation tasks across many languages. Liu et al. [27] demonstrated the effectiveness of the XGLUE benchmark for evaluating named entity recognition (NER), Part-of-Speech (POS) tagging, and news classification across different languages. Another notable approach is multilingual machine translation, which leverages transformer models trained on many languages. Guerreiro et al. [28] applied the M2M model for hallucination detection in translations, showing how large multilingual models can improve translation quality even in challenging cases. These models, evaluated on multilingual benchmarks such as FLORES-101 [29], have proven effective at providing accurate translations.

Despite the success of transformer models, low-resource languages continue to present significant challenges in NMT. One of the primary obstacles is data sparsity, which limits the availability of parallel corpora necessary for training high-quality models. Khan et al. [30] explored the use of OpenNMT for translating Chinese to Urdu, highlighting issues such as the difficulty of handling non-universal language pairs and named entity recognition (NER). They noted that a lack of data and insufficient context awareness were significant barriers to improving translation accuracy. To address these challenges, data augmentation methods have been proposed to expand the available datasets artificially. For example, Zeeshan et al. [31] used a Seq2Seq model with attention mechanisms to improve translation quality in Chinese–Urdu NMT, focusing on enhancing context awareness. Despite these advancements, the fixed-length context vector used in Seq2Seq models often fails to capture longer sequences, resulting in information loss during translation. Subword tokenization has emerged as a crucial technique in modern NMT systems to address some of these challenges. This approach allows models to break down words into smaller, more manageable units (subwords), which is particularly helpful when dealing with rare or unseen words. By reducing the vocabulary size, subword tokenization ensures the model can handle various word combinations, including those in languages with complex word formation rules or rich morphology [32]. This method improves the handling of unseen words and enhances the overall performance of NMT systems by facilitating better generalization, especially in low-resource settings where large corpora may not be available [33].

Zeeshan et al. [34] focused on developing an electronic dictionary for Chinese–Urdu translation using both LSTM and transformer architectures. Their study showed improvements in translation accuracy, though challenges persist due to the limitations of the available domain-specific language data. Similarly, Chan et al. [35] proposed incorporating Part-of-Speech (POS) sequence prediction into transformer models to refine translations. The system can generate more accurate translations by first predicting the target language’s POS sequence. However, this method can introduce errors if the POS tagging is inaccurate or incomplete, highlighting the need for high-quality, well-annotated data.

This literature review highlights several approaches and challenges in neural machine translation (NMT), particularly for low-resource language pairs, as summarized in Table 1. Traditional rule-based and corpus-based methods rely heavily on large datasets and linguistic rules, which are often inadequate for low-resource settings. Recent advancements in transformer-based models and techniques, such as back-translation and the integration of pre-trained language models, have shown promise in addressing some of these issues. Despite these advancements, there remains a significant research gap in developing NMT models that can perform effectively with minimal data while maintaining high translation quality.

3. Materials and Methods

This section outlines the materials and methods used in this study. The dataset preparation, text direction, data preprocessing with tokenization, and proposed model architecture are explained. Figure 1 illustrates the research workflow.

3.1. Preliminaries

The neural machine translation (NMT) model, denoted by $ϕ$, transforms a sentence s from the source language into a sentence t in the target language. The model is trained on a parallel corpus $D = \{(s_{i}, t_{i})\}_{i = 1}^{M}$ of M sentence pairs to maximize the log-likelihood of predicting t given s, with each pair $(s_{i}, t_{i})$ assumed to be independent and identically distributed. The optimization objective is formulated as

$$\max_{ϕ} \sum_{(s_{i}, t_{i}) \in D} \log p_{ϕ} (t_{i} \mid s_{i})$$

(1)

The model employs an encoder–decoder architecture. The encoder processes the source sentence into a sequence of hidden states. The decoder then constructs the target words based on these hidden states and the previously generated target words.

3.2. Dataset Preparation and Validation

Due to data scarcity in low-resource languages, a diverse Chinese and Urdu parallel corpus was developed through human effort and web crawlers (ParaCrawl, Bitextor, Common Crawl, and OpenNMT). The text was thoroughly reviewed and validated by experts in the native language and NLP. To ensure accuracy and reliability, we utilized various translators, including Google Translate (https://translate.google.com) and Baidu Translate (https://www.baidu.com). The checks focused on completeness, accuracy, uniform formats, naming conventions, and the removal of duplicates. Text consistency checks addressed spelling, grammar, and terminology. Sampling validation ensured accurate population representation for better model generalization and bias detection. Text alignment and statistical analysis were performed to evaluate linguistic diversity and coverage across domains. For each word $w_{i}$ in the dataset, the frequency $f_{i}$ was calculated using Equation (2):

$$f_{i} = \frac{\text{Number of occurrences of } w_{i}}{\text{Total number of words}}$$

(2)

This analysis helped identify common and rare words across the dataset. To measure the degree of alignment between the source and target texts, the correlation between sentence lengths in both languages was analyzed using the Pearson correlation coefficient [6]. The Type–Token Ratio (TTR) was computed to assess vocabulary richness; it compares the number of unique words (types) to the total number of words (tokens). Lastly, expert reviewers fluent in both languages assessed a random sample of the dataset for accuracy and naturalness. We also analyzed the frequency distribution of words and phrases to understand the dataset’s linguistic diversity.
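The statistics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the whitespace tokenization, the toy sentences, and the function names `relative_frequencies`, `type_token_ratio`, and `pearson` are all hypothetical, and real Chinese or Urdu text would need a proper word segmenter.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(tokens):
    """Equation (2): occurrences of each word divided by the total word count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def type_token_ratio(tokens):
    """Number of unique words (types) divided by total words (tokens)."""
    return len(set(tokens)) / len(tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy, whitespace-tokenized placeholder sentences (not corpus data).
source_sentences = ["a b c", "a b", "a b c d e"]
target_sentences = ["x y z w", "x y", "x y z w v u"]

tokens = " ".join(source_sentences).split()
freqs = relative_frequencies(tokens)
ttr = type_token_ratio(tokens)

# Sentence-length correlation between the two sides of the corpus.
src_lengths = [len(s.split()) for s in source_sentences]
tgt_lengths = [len(t.split()) for t in target_sentences]
length_corr = pearson(src_lengths, tgt_lengths)
```

A length correlation close to 1 suggests that longer source sentences tend to pair with longer target sentences, which is one rough signal of well-aligned pairs.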

In addition, we incorporated publicly available datasets.

The OPUS dataset is a collection of parallel corpora for various languages, including Chinese and Urdu, widely used for machine translation and linguistic research [41]. The dataset is available at (http://opus.nlpl.eu/).
The Workshop on Machine Translation (WMT) provides a benchmark in machine translation. It includes a variety of parallel corpora for different language pairs and is used in annual machine translation competitions [42].
The WiLi 18 benchmark dataset of short text extracts from Wikipedia contains 1000 paragraphs for each of 235 languages, a total of 235,000 paragraphs. After data selection and preprocessing, we selected the same 45 paragraphs in Chinese and Urdu, using English as a pivot language to match them [43].

These datasets were compiled and formatted in CSV format. Table 2 shows the details of the datasets. We consolidated these datasets into a single comprehensive dataset in a unified format to facilitate experiments on large corpora. The experiments were conducted both dataset-wise and on the consolidated dataset to ensure thorough analysis and validation. The sizes of the splits are denoted as $|D_{train}|$ for the training set, $|D_{test}|$ for the testing set, and $|D_{val}|$ for the validation set, in a ratio of 70%, 15%, and 15%, respectively, where D represents the dataset. These sizes were calculated using the lengths of the training, testing, and validation data, respectively, as follows:

$$|D_{train}|, |D_{test}|, |D_{val}| = \mathrm{len}(\text{train\_data}), \mathrm{len}(\text{test\_data}), \mathrm{len}(\text{val\_data}).$$
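As a rough illustration of the 70%/15%/15% split, the following sketch shuffles a list of sentence pairs and slices it. The function name `split_dataset`, the fixed seed, and the placeholder pairs are assumptions made for the example; the source does not say how the split was drawn (for instance, whether it was shuffled or stratified by dataset).

```python
import random


def split_dataset(pairs, train_frac=0.70, test_frac=0.15, seed=0):
    """Shuffle sentence pairs and split them into train/test/validation parts."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train_data = shuffled[:n_train]
    test_data = shuffled[n_train:n_train + n_test]
    val_data = shuffled[n_train + n_test:]  # remainder, about 15%
    return train_data, test_data, val_data


# Placeholder pairs standing in for (Chinese, Urdu) sentences.
pairs = [(f"zh_{i}", f"ur_{i}") for i in range(100)]
train_data, test_data, val_data = split_dataset(pairs)
sizes = (len(train_data), len(test_data), len(val_data))
```

Assigning the remainder to the validation set guarantees that every pair lands in exactly one split even when the corpus size is not divisible by the fractions.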

3.3. Text Direction Analysis

Bidirectional Chinese–Urdu machine translation involves text direction analysis and its linguistic impact. Chinese and Urdu belong to the Sino-Tibetan and Indo-European language families, respectively. Chinese is written horizontally from left to right or vertically from top to bottom, whereas Urdu is written from right to left. This fundamental difference in text direction affects syntactic and semantic processing in natural language processing (NLP) tasks. A probabilistic autoregressive NMT model is used to calculate the conditional probabilities of sentences [44] given a parallel sentence pair $(c, u)$, where c represents the Chinese sentence and u the Urdu sentence. The primary task is to determine which side of the pair is the original and which is the translation. This is achieved by comparing the conditional translation probability $P (u | c)$ under an NMT model $M_{C \to U}$ with the conditional translation probability $P (c | u)$ under a model $M_{U \to C}$ operating in the opposite direction. We assume that NMT models assign translations higher conditional probabilities than originals, so if $P (u | c) > P (c | u)$, we predict that u is the translation and c is the original, i.e., the original translation direction is $C \to U$. Equation (3) illustrates this process, where $P (u | c)$ is obtained as a product of the individual token probabilities.

$$P (u | c) = \prod_{j = 1}^{| u |} p (u_{j} | u_{< j}, c)$$

(3)

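The average token-level log-probability, which normalizes for sentence length, is given in Equation (4).

$$\log P_{tok} (u | c) = \frac{1}{| u |} \log P (u | c) = \frac{1}{| u |} \sum_{j = 1}^{| u |} \log p (u_{j} | u_{< j}, c)$$

(4)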

To detect the original translation direction (OTD), $P_{tok} (u | c)$ and $P_{tok} (c | u)$ are compared, as shown in Equation (5):

$$OTD = \begin{cases} C \to U, & \text{if } P_{tok} (u | c) > P_{tok} (c | u) \\ U \to C, & \text{otherwise} \end{cases}$$

(5)
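The decision rule of Equation (5) can be sketched as follows, assuming the per-token log-probabilities have already been obtained from the two translation models. The numbers below are made up for illustration, and the function names `avg_token_logprob` and `detect_direction` are hypothetical. The comparison uses the average token-level log-probability described in the text, so that sentences of different lengths are scored on the same scale.

```python
def avg_token_logprob(token_logprobs):
    """Length-normalized log-probability of a sequence."""
    return sum(token_logprobs) / len(token_logprobs)


def detect_direction(logprobs_u_given_c, logprobs_c_given_u):
    """Equation (5): choose the direction whose target side scores higher."""
    score_c_to_u = avg_token_logprob(logprobs_u_given_c)
    score_u_to_c = avg_token_logprob(logprobs_c_given_u)
    return "C->U" if score_c_to_u > score_u_to_c else "U->C"


# Made-up per-token log-probabilities, standing in for the outputs of
# M_{C->U} (scoring the Urdu tokens) and M_{U->C} (scoring the Chinese tokens).
logp_u_given_c = [-0.2, -0.4, -0.3]
logp_c_given_u = [-0.9, -1.1, -0.7, -1.3]
direction = detect_direction(logp_u_given_c, logp_c_given_u)
```

Working in log space avoids numerical underflow from multiplying many small token probabilities, as in Equation (3).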

3.4. Text Pre-Processing

This process removes special characters, HTML tags, punctuation, missing values, and extraneous whitespace. It also standardizes text case and corrects common misspellings. The main steps include replacing &amp with & in the data and converting any Traditional Chinese character that appears in the target sentence to Simplified Chinese, so that the translation source is consistent. Consistency checks are performed on the custom dataset using FastAlign to verify its validity. Inspired by back-translation [45], we use the Google API to provide alternative translations, which helps generate ground truth results. Specifically, the original training datasets were augmented by substituting semantically similar phrases identified with word vectors. As a result, the augmented training datasets were 10 times larger than the original data. For Chinese, the Stanford NLP tool was employed due to its robust handling of logographic script. For Urdu, we utilized a custom-trained model capable of recognizing entities in the extended Arabic script.
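A few of the cleaning steps listed above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function names `clean_text` and `clean_pairs` are hypothetical, and punctuation removal, spelling correction, Traditional-to-Simplified conversion, and the FastAlign check are deliberately left out because the source does not specify how they were done.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Unescape HTML entities, drop HTML tags, and collapse whitespace."""
    text = html.unescape(text)    # e.g. "&amp;" becomes "&"
    text = TAG_RE.sub(" ", text)  # remove HTML tags
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_pairs(pairs):
    """Clean both sides; drop pairs with a missing side and exact duplicates."""
    seen, cleaned = set(), []
    for src, tgt in pairs:
        pair = (clean_text(src or ""), clean_text(tgt or ""))
        if all(pair) and pair not in seen:
            seen.add(pair)
            cleaned.append(pair)
    return cleaned


raw = [("<p>A &amp; B</p>  ", "x"), ("A & B", "x"), (None, "y")]
cleaned = clean_pairs(raw)
```

Deduplicating after cleaning, rather than before, catches pairs that differ only in markup or spacing.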

3.5. Integrating Named Entity Recognition (NER) with BERT Encoder

We integrated NER into the BERT encoder to enhance the model’s capability to handle context-specific entities. This integration is implemented through a multi-step process to accurately identify and prioritize critical entities within source and target sentences. Initially, a pre-trained NER model is applied to the input sentences in both languages to detect entities such as names, locations, and organizations. Once these entities are identified, they are enclosed with special tokens, specifically ‘<NER_START>’ and ‘<NER_END>’, to explicitly delineate their boundaries. This tagging process ensures that the BERT encoder recognizes and differentiates these entities from the surrounding text. The tagged sentences are then fed into the BERT encoder, which processes them to generate enriched contextual embeddings that emphasize the identified entities. By doing so, the encoder is better equipped to maintain the semantic integrity of these entities during translation, ensuring that they are accurately and consistently translated across language pairs. This method improves the translation quality of entity-rich sentences and enhances the generated translations’ overall semantic coherence. During the training phase, the model learns to associate the special NER tokens with their corresponding entity types, further refining its ability to handle diverse and complex entity structures within the translation tasks.
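The tagging step can be illustrated with a small sketch that wraps entity spans in the two boundary tokens named above. The character spans are written by hand here as a stand-in for the output of an NER model, and the function name `tag_entities` is hypothetical; the source does not describe the exact span format or how spacing around the tokens was handled.

```python
NER_START, NER_END = "<NER_START>", "<NER_END>"


def tag_entities(sentence, spans):
    """Wrap each (start, end) character span in the NER boundary tokens.

    `spans` is assumed to be the non-overlapping output of some NER model.
    """
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        pieces.append(sentence[cursor:start])  # text before the entity
        pieces.append(f"{NER_START} {sentence[start:end]} {NER_END}")
        cursor = end
    pieces.append(sentence[cursor:])           # text after the last entity
    return "".join(pieces)


sentence = "Zan works at Zhengzhou University in China."
spans = [(0, 3), (13, 33), (37, 42)]  # hypothetical NER output
tagged_sentence = tag_entities(sentence, spans)
```

For the encoder to treat the markers as single units, they would also have to be registered in the tokenizer's vocabulary. In Hugging Face Transformers, for instance, this is typically done with `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings`; whether the authors used this route is not stated.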

3.6. In-Trust Loss Function

We utilized the incomplete-trust (in-trust) loss function, a modification of the traditional cross-entropy loss, to address noisy data within the parallel corpus. Noise in the training data can adversely affect the model’s performance, leading to overfitting on erroneous examples and degrading translation quality. Figure 2 depicts selected noisy samples from our dataset and their English descriptions for a better understanding. Noise is present in both the source and target segments of the training corpus. Noise control aims to improve model robustness by minimizing the impact of noisy data. Regularization techniques are employed to filter out noisy data samples. For a dataset $D$ with potentially noisy pairs $(c_{i}, u_{i})$, we define $D_{clean}$ as a subset of $D$ that is free of significant noise; the loss over this subset is shown in Equation (6).

$$L_{clean} (θ) = \sum_{(c_{i}, u_{i}) \in D_{clean}} L (θ \mid c_{i}, u_{i})$$

(6)

Once the model overfits the noisy samples, the translation performance of the final model is degraded. However, previous work [46] suggests that noisy data may also carry valuable information. The incomplete-trust (in-trust) loss is therefore used as a substitute for the original cross-entropy loss function to decrease the disparity between synthetic and real examples affected by noise. The loss function quantifies the discrepancy between the predicted translation $\hat{t}$ and the actual target translation t. This loss function is expressed as follows:

$$L_{In-trust} (θ) = - \sum_{(c, t) \in D} [α_{t} \log P (t \mid c; θ) + (1 - α_{t}) \log (1 - P (t \mid c; θ))]$$

(7)

where the equation elements are defined as follows:

$L_{In-trust} (θ)$ is the incomplete-trust loss.
$(c, t)$ are the source and target translation pairs in the training dataset $D$ .
$P (t ∣ c; θ)$ is the predicted probability of the target translation t given the source sentence c and model parameters $θ$ .
$α_{t}$ is the trust factor for the target translation t, which adjusts the weight of each term in the loss function to account for the noise in the data.

This formulation ensures that the model trusts cleaner data while learning from noisy examples, thereby reducing the impact of overfitting on noisy samples.
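To make Equation (7) concrete, the sketch below evaluates it on precomputed sequence probabilities. It is a plain-Python illustration rather than a training-ready implementation: the function name `in_trust_loss`, the example probabilities, and the `eps` clamp (added only to keep the logarithms finite) are assumptions that do not come from the source.

```python
import math


def in_trust_loss(probs, alphas, eps=1e-12):
    """Equation (7) evaluated on precomputed probabilities P(t | c; theta).

    probs[i]  : model probability of the i-th target given its source
    alphas[i] : trust factor of the i-th sample, in [0, 1]
    eps       : clamp that keeps log() finite (an added numerical safeguard)
    """
    total = 0.0
    for p, alpha in zip(probs, alphas):
        p = min(max(p, eps), 1.0 - eps)
        total += alpha * math.log(p) + (1.0 - alpha) * math.log(1.0 - p)
    return -total


probs = [0.9, 0.2]  # made-up values: one confident sample, one doubtful
full_trust = in_trust_loss(probs, [1.0, 1.0])     # reduces to cross-entropy
partial_trust = in_trust_loss(probs, [1.0, 0.3])  # second sample distrusted
```

With every trust factor set to 1, the second term vanishes and the expression reduces to the usual negative log-likelihood. Lowering the trust factor of the low-probability sample shrinks its contribution, which is the behavior the formulation is meant to provide.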

Estimating the Trust Factor ( $α_{t}$ )

The trust factor $α_{t}$ plays a pivotal role in balancing the influence of clean and noisy data within the in-trust loss function. To estimate $α_{t}$, we employ a dynamic approach based on the confidence level of each training sample. The implementation involves several key steps:

Each training sample is initially assigned a baseline trust factor $α_{t}$, typically set to 0.5, indicating an equal weighting between trusting and distrusting the data. After each training epoch, the model evaluates the loss $L (c, t)$ for each sample on a validation set. Samples that consistently yield lower loss values, indicating higher confidence, have their $α_{t}$ increased (e.g., to 0.7), amplifying their influence on the loss function. Conversely, samples with higher loss values undergo a decrease in $α_{t}$ (e.g., down to 0.3), reducing their impact to mitigate the effect of potential noise.

To formalize the adjustment of $α_{t}$, we introduce the following update rule:

α_{t}^{(new)} = \begin{cases} α_{t} + Δα & \text{if } L (c, t) < τ \\ α_{t} - Δα & \text{if } L (c, t) \geq τ \end{cases}

(8)

where the equation elements are defined as follows:

$L (c, t)$ is the loss for sample $(c, t)$ .
$τ$ is a predefined loss threshold.
$Δ α$ is the increment/decrement value (e.g., 0.2).

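Because repeated increments or decrements could push $α_{t}$ outside the valid range, the update is combined with the normalization step listed in the procedure that follows, which keeps $α_{t}$ within [0,1] and maintains stability during training. Integrating the in-trust loss function into our training process enhances the model’s robustness against noisy data, leading to improved translation performance.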

The implementation of the in-trust loss function involves the following steps:

Initialization: Assign an initial trust factor $α_{t} = 0.5$ to all training samples in $D$ .
Training Iteration: For each epoch, compute the in-trust loss $L_{In-trust} (θ)$ using Equation (7).
Model Update: Perform backpropagation and update the model parameters $θ$ accordingly.
Trust Factor Adjustment: After each epoch, evaluate the loss $L (c, t)$ for each sample on a validation set and adjust $α_{t}$ using Equation (8).
Normalization: Ensure that $α_{t}$ remains within the valid range [0,1].

This adaptive mechanism allows the model to prioritize learning from cleaner data while extracting useful information from noisy samples.
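The trust factor adjustment and normalization steps can be sketched as follows. This is an illustration under stated assumptions: the threshold `tau`, the step `delta`, and the per-sample losses are placeholder values, and clipping is used here as one simple way to keep $α_{t}$ in [0,1].

```python
def update_trust(alpha, loss, tau, delta=0.2):
    """Equation (8), followed by clipping to the valid range [0, 1]."""
    new_alpha = alpha + delta if loss < tau else alpha - delta
    return min(1.0, max(0.0, new_alpha))

def update_all(alphas, losses, tau, delta=0.2):
    """Apply the update to every sample after an epoch."""
    return [update_trust(a, l, tau, delta) for a, l in zip(alphas, losses)]

alphas = [0.5, 0.5, 0.9]   # current trust factors
losses = [0.3, 2.1, 0.1]   # hypothetical per-sample losses
print(update_all(alphas, losses, tau=1.0))
```

The third sample shows why the clipping matters: 0.9 + 0.2 would exceed 1 without it.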

3.7. Converting Chinese Text to Pinyin

The conversion of Chinese characters into Pinyin turns them into the Latin alphabet, aiding NMT models that cannot handle Chinese scripts directly. Using Pinyin as an intermediary language offers a novel solution to avoid using English as a middle language. This improves translation efficiency by aligning Chinese phonetics directly with Urdu, reduces ambiguities, and enhances model performance. Research by [47] supports the effectiveness of phonetic representations in translation models. To address homophones in Pinyin, we use deep contextual analysis, enhanced tokenization, and post-processing corrections for accurate translations.

To demonstrate how Pinyin serves as an intermediary in the translation process, we present a snapshot of our dataset in Figure 3, which shows the alignment of Chinese, Pinyin, and Urdu.

Algorithm 1 utilizes the pinyin library to convert each Chinese character into its corresponding Pinyin representation. The text is then split into words based on spaces and rejoined. Although the procedure was initially designed to allow reordering of these words, the current implementation maintains the original order. To apply this conversion across the dataset containing Chinese text, the convert_to_pinyin function is mapped to each sentence in the dataset. This conversion facilitates the alignment of Chinese phonetics directly with Urdu, enhancing translation accuracy and efficiency.

Algorithm 1 Convert Chinese Characters to Pinyin and Maintain Word Order

procedure convert_to_pinyin($sentence$)
    $pinyin\_result \leftarrow pinyin.get (sentence, “strip”, “ ”)$ ▹ Convert the Chinese characters in the sentence to Pinyin, keeping spaces between syllables.
    $words \leftarrow split (pinyin\_result, “ ”)$ ▹ Split the Pinyin result into a list of words (syllables).
    $final\_sentence \leftarrow join (words, “ ”)$ ▹ Rejoin the words (syllables) to form the final Pinyin sentence.
    return $final\_sentence$ ▹ Return the final Pinyin sentence, ready for further processing.
end procedure
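The control flow of Algorithm 1 can be illustrated with a self-contained sketch. To keep the example runnable without external packages, a tiny hand-written lookup table (`TOY_PINYIN`, a hypothetical stand-in) replaces the call to the pinyin library; a real pipeline would use the library's conversion instead.

```python
# Toy character-to-Pinyin table; a stand-in for a real conversion library.
TOY_PINYIN = {"你": "ni", "好": "hao", "世": "shi", "界": "jie"}

def convert_to_pinyin(sentence, table=TOY_PINYIN):
    """Mirror Algorithm 1: convert, split on spaces, rejoin in order."""
    # Characters missing from the table are passed through unchanged.
    syllables = [table.get(ch, ch) for ch in sentence if not ch.isspace()]
    pinyin_result = " ".join(syllables)
    words = pinyin_result.split(" ")
    return " ".join(words)

sentences = ["你好", "你好世界"]
print([convert_to_pinyin(s) for s in sentences])
```

Mapping the function over a list of sentences, as in the last two lines, corresponds to applying the conversion across the dataset.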

3.8. Tokenization and OOV Management

The tokenizer transforms a sentence $s$ into a sequence of tokens $\{t_{1}, t_{2}, \dots, t_{n}\}$, represented as $Tokenize (s) = [t_{1}, t_{2}, \dots, t_{n}]$. The SentencePiece approach for tokenization is applied to effectively manage languages with large vocabularies and different scripts without requiring pre-segmented text. The SentencePiece model, denoted as $SP$, segments a given input sequence $s$ into a sequence of subword units: $SP (s) = \{sw_{1}, sw_{2}, \dots, sw_{m}\}$, where $\{sw_{1}, sw_{2}, \dots, sw_{m}\}$ are the subword tokens derived from the input string $s$. This approach is effective for managing out-of-vocabulary (OOV) words, as it breaks down rare words into known subwords or morphemes.

The encoding function $E$ converts the sequence of tokens into a sequence of integer IDs, defined as $E (t_{i}) = id_{i}$ for each token $t_{i}$, where $id_{i}$ is the integer ID corresponding to the token $t_{i}$. Once the tokens are converted into integer IDs, the function $T$ transforms them into a tensor format suitable for neural network processing: $X = T (\{id_{1}, id_{2}, \dots, id_{n}\})$, where $X$ is the tensor that will be input into the NMT model. This tensor encapsulates the sequence of integer IDs in a format that the neural network can efficiently process. The probability of a segmentation of $s$ is modeled as the product of the probabilities of its subwords:

P (s) = \prod_{i = 1}^{m} p (sw_{i})

(9)

The tokenization $SP (s)$ that maximizes the log probability of the segmented sentence is chosen, as represented in the following equation:

\max \sum_{i = 1}^{m} \log p (sw_{i})

(10)
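To make Equations (9) and (10) concrete, the following sketch finds the segmentation with the highest total log probability using dynamic programming. The vocabulary and its probabilities are invented for illustration, and `best_segmentation` is a hypothetical helper, not the SentencePiece API.

```python
import math

def best_segmentation(text, log_probs):
    """Return (pieces, score) maximizing the sum of subword log-probs."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: top score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:  # walk the back-pointers from the end of the string
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Toy vocabulary with made-up probabilities.
vocab = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02}
log_probs = {piece: math.log(p) for piece, p in vocab.items()}
print(best_segmentation("unrelated", log_probs))
```

Here "un" + "related" wins over "un" + "rel" + "ated" because the product of two larger probabilities exceeds the product of three smaller ones.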

Each subword $sw_{i}$ is mapped to an integer ID $id_{i}$, which the proposed model will use. The tokenized sequences are then fed into the proposed NMT model, where an encoder encodes them and a decoder subsequently decodes them into the target language. This process forms the backbone of understanding and generating translations, as represented in the following equation, where $u$ is the source sentence, $c$ is the target sentence, and $c_{< j}$ denotes the target tokens generated before position $j$:

P (c | u) = \prod_{j = 1}^{| c |} p (c_{j} | c_{< j}, u)

(11)

3.9. Bilingual Curriculum Learning

The bilingual curriculum learning strategy is employed to improve the efficiency and performance of our NMT model. Curriculum learning organizes training data from simple to complex, enhancing learning dynamics. Our setup’s curriculum is based on sentence complexity and translation difficulty between Chinese and Urdu. The process involves evaluating sentence pairs by length, complexity, and rare words. We sort the dataset into tiers, starting with simple sentences and progressively adding more complex ones. Training is performed in phases: Phase 1 focuses on easy sentences, Phase 2 introduces moderate complexity, and Phase 3 fine-tunes the model on complex sentences. This approach improves convergence, enhances generalization, and reduces overfitting.
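A minimal sketch of the tiering step is given below. The difficulty score (token count plus a fixed penalty per rare word), the penalty weight, and the toy sentence pairs are assumptions made for illustration; the paper does not specify the exact scoring function.

```python
def difficulty(src, tgt, rare_words):
    """Hypothetical score: total token count plus a penalty per rare word."""
    tokens = src.split() + tgt.split()
    rare = sum(1 for tok in tokens if tok in rare_words)
    return len(tokens) + 5 * rare

def build_tiers(pairs, rare_words, n_tiers=3):
    """Sort sentence pairs from easy to hard and cut them into tiers."""
    if not pairs:
        return []
    ranked = sorted(pairs, key=lambda p: difficulty(p[0], p[1], rare_words))
    size = -(-len(ranked) // n_tiers)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

pairs = [
    ("a b c d e f", "u v w x y z"),
    ("a", "u"),
    ("a b c", "u v w"),
    ("a b", "u v"),
    ("a b c d", "u v w x"),
    ("a rare", "u"),
]
tiers = build_tiers(pairs, rare_words={"rare"})
for phase, tier in enumerate(tiers, start=1):
    print(f"Phase {phase}: {len(tier)} pairs")
```

Each tier would then be passed to the training loop in order, so that Phase 1 sees the easiest pairs first.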

3.10. Proposed Model

The proposed model is based on M2M100 [16], a pre-trained multilingual transformer encoder–decoder (seq-to-seq) model with an attention mechanism. The architecture of this model is depicted in Figure 4. Its training on a variety of linguistic contexts makes it well suited for generating diverse translation candidates. Initially, multiple translation options are produced using the M2M100 variants with enhancements in contextual analysis and multi-head attention.

The encoder processes the input in the source language (Chinese or Urdu) by converting each token into an embedding vector, $x_{i}$, and adding positional encodings, $p_{i}$, to incorporate the positional information of the tokens in the sequence:

e_{i} = x_{i} + p_{i}

3.10.1. BERT Embedding Integration

The encoder layers are initialized with BERT’s pre-trained weights to enhance the encoder’s ability to better understand and represent input text. After initialization, we fine-tune the BERT-augmented encoder layers on our datasets to adjust the pre-trained weights to the specifics of translation tasks. The dimensions of BERT embeddings ($d_{bert}$) differ from the expected dimensions of the M2M100 model’s encoder embedding ($d_{m2m}$). A linear transformation is applied to match the dimensions:

e_{bert}^{'} = W_{transform} e_{bert} + b_{transform}

where $W_{transform}$ and $b_{transform}$ are the weights and biases of the transformation layer, respectively. This ensures that the transformed BERT embeddings ($e_{bert}^{'}$) are compatible with the M2M100 encoder.
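The dimension-matching step is an ordinary affine map. The sketch below applies it to a toy vector in plain Python; the sizes (3 and 2) and the weight values are placeholders chosen for readability, whereas the actual $d_{bert}$ and $d_{m2m}$ are much larger and the weights are learned.

```python
def project(e_bert, W, b):
    """Compute e' = W e + b, with W of shape (d_m2m, d_bert)."""
    return [sum(w * x for w, x in zip(row, e_bert)) + bias
            for row, bias in zip(W, b)]

e_bert = [1.0, 2.0, 3.0]        # toy BERT embedding, d_bert = 3
W = [[1.0, 0.0, 0.0],           # toy weights, d_m2m = 2
     [0.0, 1.0, 1.0]]
b = [0.5, -1.0]
print(project(e_bert, W, b))
```

The output has one entry per row of `W`, so its length equals the target embedding size.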

3.10.2. Attention Mechanism

Two key modifications are introduced to the attention mechanisms: synchronous bidirectional attention and dynamic weight attention. Synchronous bidirectional attention enables the model to attend to both past and future tokens simultaneously within each layer, enhancing its understanding of context and improving its ability to capture long-range dependencies. This modification is achieved by adjusting the attention masks, which were originally unidirectional, to allow bidirectional context. Specifically, we modify the attention mask so the model can attend to all tokens in the sequence, regardless of their position. This bidirectional attention mask ensures that each token can attend to preceding and upcoming tokens in the sequence, enabling a more comprehensive understanding of the input. The attention mask matrix $M_{bidirectional}$ is set to $M_{bidirectional} = 1$ for all token pairs in the sequence.

This modification allows the model to leverage the full context of the input sequence, thereby improving translation accuracy, especially for sentences where relationships between distant tokens are critical. This leads to better handling of syntactic and semantic structures, resulting in more accurate translations.

The second modification, the refined attention mechanism, dynamically adjusts attention weights based on the importance of tokens. Token importance is identified through NER and semantic analysis. NER is applied at the token level to recognize key entities, while semantic analysis assesses the role and relevance of each token in the sentence. The importance of a token $t$ is computed as $I (t) = NER (t) \cdot SemanticScore (t)$, where $NER (t)$ is a score indicating whether token $t$ is a named entity and $SemanticScore (t)$ measures its contextual relevance. Tokens with higher importance scores are given higher priority in the attention mechanism. The adjusted attention matrix $A^{'}$ is computed by multiplying the standard attention matrix $A$ with an importance modifier $ImportanceMod (I (t))$: $A^{'} = A \cdot ImportanceMod (I (t))$.

This dynamic adjustment prioritizes critical tokens—such as named entities and key concepts—enhancing the model’s ability to generate semantically accurate translations.

Additionally, attention weights are recalculated during each attention operation, making them context-dependent and allowing the model to adapt to the significance of different tokens as the input sequence is processed. The attention mechanism with dynamic weighting is expressed as

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}} + A^{'}) V

(12)
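Equation (12) is scaled dot-product attention with an additive term inside the softmax. The sketch below implements that form in plain Python for small matrices; the `bias` argument plays the role of $A^{'}$, and the toy values are assumptions chosen so the effect of the bias is easy to see.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, bias):
    """softmax(Q K^T / sqrt(d_k) + bias) V for list-of-lists matrices."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + bias[i][j]
                  for j, k in enumerate(K)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V, bias=[[0.0, 0.0]]))   # uniform weights
print(attention(Q, K, V, bias=[[0.0, 50.0]]))  # second token dominates
```

A large positive bias on a column pushes nearly all of the attention weight onto that token, which is how a high importance score would raise a named entity's priority.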

We employ a fallback mechanism to address named entities not present in the training data. In the case of an unseen named entity, we replace it with a predefined entity from a bilingual dictionary or an entity mapping list. If no direct match is found, the model relies on contextual clues from the surrounding tokens to infer the meaning of the entity. This ensures that the translation remains accurate and fluid, even in the presence of novel or rare entities. The fallback mechanism ensures the model can handle a wide range of named entities, improving its robustness, particularly for languages with limited resources.

3.10.3. Pre-Processing and Post-Processing

Applying layer normalization after the multi-head attention and feed-forward layers stabilizes the learning process and significantly enhances model performance. Dropout layers prevent overfitting by randomly setting some activations to zero during training. Implementing residual connections around the multi-head attention and feed-forward layers helps facilitate gradient flow and enhance the model’s ability to capture complex dependencies. This ensures that the output dimensions of the attention and feed-forward layers match the input dimensions. Equation (13) describes the process:

z = LayerNorm (x + Dropout (MultiHead (Q, K, V)))

(13)

The decoder generates the target language output (Urdu/Chinese) one token at a time. The masked self-attention layer prevents attending to future tokens in the output sequence using a masking mechanism as follows:

MaskedAttention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}} + M) V

(14)

where $M$ is the mask matrix that prevents the model from looking at future tokens. The encoder–decoder attention layer takes its queries from the previous decoder layer and its keys and values from the encoder output, allowing each position in the decoder to attend to all positions in the input sequence:

Attention (Q_{dec}, K_{enc}, V_{enc}) = softmax (\frac{Q_{dec} K_{enc}^{T}}{\sqrt{d_{k}}}) V_{enc}

(15)

The Feed-Forward Neural Network transforms the representation after attention integration, with layer normalization and dropout applied to ensure the dimensions match for subsequent layers:

FFN (x) = max (0, x W_{1} + b_{1}) W_{2} + b_{2}

(16)

The proposed model is fine-tuned on our datasets to adapt it to our translation tasks. This process helps the model grasp the source language’s full semantic scope. The proposed model can produce translations that preserve longer texts’ coherence and intended meaning. Algorithm 2 and Table 3 provide an abstraction of the proposed model. The proposed model generates translations with two variants of M2M100, 418 M and 12 B, while the 1.2 B variant is used for re-ranking.

3.11. Re-Ranking Candidates

Re-ranking in neural machine translation (NMT) involves generating multiple candidate translations for a given input sentence and subsequently re-evaluating these candidates to select the best one. This process helps enhance translation accuracy and diversity by refining the initial outputs. The proposed model, as described in Section 3.10, is used to generate a set of candidate translations $C = \{T_{1}, T_{2}, \dots, T_{n}\}$, where each $T_{i}$ represents a candidate translation. For each candidate, a set of features is extracted to inform the re-ranking decision. These features include the initial translation score, length, word alignments, translation errors, and coherence.

The initial translation score is directly obtained from the primary NMT model’s output probabilities. The translation length is simply the number of tokens in the candidate translation, which provides a measure of completeness and conciseness. Word alignments are evaluated using tools like FastAlign to assess the mapping quality between source and target words. Translation errors, such as grammatical mistakes, mistranslations, or missing entities, are identified using error detection models or rule-based checks. Translation coherence is evaluated by measuring the semantic consistency of the candidate translation with reference sentences or ensuring its overall logical flow.

Algorithm 2 Translation Model with In-Trust Loss and Tokenization

Input: Dataset $D$, Clean Subset $D_{clean}$, Epochs $epochs$, Learning Rate $lr$
Output: Trained Model
Data Preprocessing:
for each $(c_{i}, u_{i}) \in D$ do
    if $(c_{i}, u_{i})$ is clean then
        Add $(c_{i}, u_{i})$ to $D_{clean}$
    end if
end for
Initialize M2M100Tokenizer, M2M100 model, BERT model, and SentencePiece model $SP$
function InTrustLoss($predictions, targets, α_{t}$)
    return $- \sum (α_{t} \cdot \log (predictions) + (1 - α_{t}) \cdot \log (1 - predictions))$
end function
Initialize Adam optimizer with learning rate $lr$
for $epoch \leftarrow 1$ to $epochs$ do
    for each $(source, target) \in D$ do
        optimizer.zero_grad()
        NER and OOV Management:
        Perform NER on $source$ to recognize named entities
        Handle OOV words in $source$ using subword units
        $source\_ids \leftarrow TokenizeAndEncode (source)$
        $target\_ids \leftarrow TokenizeAndEncode (target)$
        $outputs \leftarrow M2M100\_model (input\_ids = source\_ids, labels = target\_ids)$
        $predictions \leftarrow softmax (outputs.logits, \dim = -1)$
        $loss \leftarrow$ InTrustLoss$(predictions, target\_ids, Tensor ([0.5]))$
        $loss.backward ()$
        optimizer.step()
        Print “Epoch”, $epoch$, “Loss:”, $loss.item ()$
    end for
end for
Re-ranking Process:
for each input sentence $c$ do
    Generate candidate translations $C = \{T_{1}, T_{2}, \dots, T_{n}\}$
    for each candidate $T_{i}$ in $C$ do
        Extract features for $T_{i}$ (e.g., initial score, length, alignment, coherence)
        Assign re-ranked score $S (T_{i})$ using M2M100-1.2B model
    end for
    Select best translation $T^{*} = \arg \max_{T_{i} \in C} S (T_{i})$
end for
return Trained Model

These features are then fed into a re-ranking neural network, which processes the information and produces a re-ranked score $S (T_{i})$ for each candidate $T_{i}$. The candidate with the highest re-ranked score is selected as the final translation output.
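The selection step can be illustrated with a deliberately simple scorer. In the sketch below a weighted sum of features stands in for the neural re-ranker; the feature names, weights, and candidate values are hypothetical and serve only to show how per-candidate scores lead to a final choice.

```python
def rerank(candidates, weights):
    """Return the (text, features) pair with the highest weighted score.

    A linear scorer is used here as a stand-in for the neural re-ranker.
    """
    def score(features):
        return sum(weights[name] * value for name, value in features.items())
    return max(candidates, key=lambda cand: score(cand[1]))

# Hypothetical weights: reward model score and coherence, penalize errors.
weights = {"model_score": 1.0, "coherence": 0.5, "error_count": -1.0}
candidates = [
    ("candidate A", {"model_score": -1.2, "coherence": 0.9, "error_count": 0}),
    ("candidate B", {"model_score": -0.8, "coherence": 0.4, "error_count": 1}),
]
best_text, _ = rerank(candidates, weights)
print(best_text)
```

Candidate B has the better initial model score, yet candidate A is selected once coherence and detected errors are taken into account, which is the purpose of re-ranking.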

Contrastive Re-Ranking Method

A contrastive re-ranking method is employed with the M2M100-1.2B model variant, as illustrated in Figure 5. This variant has a strong capability to evaluate and rank translations across different languages. Its training on a broad range of language pairs provides a fine-grained understanding of language that can be beneficial for re-ranking based on subtleties in translation quality. Positive samples come from high-quality translations in the bilingual corpus, and negative samples are drawn from Diverse Beam Search [48] outputs. The re-ranker evaluates each candidate based on the extracted features and assigns a new score. The re-ranker is implemented as a neural network combining features to predict the final translation quality. The candidate with the highest score is selected as the best translation. Let $S (T_{i})$ be the re-ranked score of candidate $T_{i}$. The best translation $T^{*}$ is given by

T^{*} = \arg \max_{T_{i} \in C} S (T_{i})

(17)

In this setup, $h_{x}$ represents the hidden features of the input text, and $h_{T_{j}}$ are the features for the target samples, with $h_{T}^{+}$ and $h_{T_{j}}^{-}$ denoting positive and negative sample features, respectively. A non-linear projection layer is applied on top of M2M100 to refine these features. The contrastive loss over the two types of features, with temperature $τ$, is computed as follows:

L = - \log \frac{e^{sim (h_{x}, h_{T}^{+}) / τ}}{\sum_{j = 1}^{n} e^{sim (h_{x}, h_{T}^{+}) / τ} + e^{sim (h_{x}, h_{T_{j}}^{-}) / τ}}

(18)
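A small numeric sketch of this contrastive objective follows. It assumes cosine similarity for $sim$ and uses the common InfoNCE grouping in which the positive term appears once in the denominator alongside the sum over negatives; both choices, as well as the toy vectors, are assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(h_x, h_pos, h_negs, tau=0.1):
    """-log( e^{s+/tau} / (e^{s+/tau} + sum_j e^{s_j-/tau}) )."""
    pos = math.exp(cosine(h_x, h_pos) / tau)
    neg = sum(math.exp(cosine(h_x, h) / tau) for h in h_negs)
    return -math.log(pos / (pos + neg))

h_x = [1.0, 0.0]
# Positive aligned with the source features: small loss.
print(contrastive_loss(h_x, [1.0, 0.0], [[0.0, 1.0]]))
# Roles swapped, positive orthogonal to the source: large loss.
print(contrastive_loss(h_x, [0.0, 1.0], [[1.0, 0.0]]))
```

Lowering the temperature sharpens the contrast, so the same similarity gap produces a larger difference between the two losses.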

This approach ensures that the selected translation is diverse and high-quality.

4. Experimental Setup

4.1. Computational Environment

Python served as the primary programming language because of its extensive support for machine learning and natural language processing tasks. The experiments evaluate a bidirectional translation system on the Chinese–Urdu parallel corpus. All experiments were carried out using the PyTorch framework on a cloud server with a high-performance NVIDIA RTX 4090 GPU, which provides the necessary computational power for training large-scale translation models. The implementation of the proposed NMT model with all enhancements is detailed in Algorithm 2.

4.2. Performance Metrics

Standard metrics for translation quality are used to evaluate the performance of machine translation systems. BLEU focuses on the precision of n-grams: it counts how many n-grams in the machine translation match the reference translation and then applies a brevity penalty so that overly short translations are not favored. ChrF++ is helpful for languages with difficult word segmentation, such as Chinese and Urdu, and is sensitive to the text’s lexical and morphological properties. METEOR compares the translation to the reference by aligning them at the word level and evaluates the alignment using criteria such as synonymy and stemming. ROUGE-L evaluates recall and precision using the longest common subsequence between a candidate and a reference translation. It measures how much of the word order is shared by both texts, which makes it particularly effective at assessing fluency and the overall structure of the content. TER measures the number of edits required to change a system-generated translation into one of the reference translations.
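The two ingredients of BLEU described above, clipped n-gram precision and the brevity penalty, can be sketched for a single sentence pair as follows. This simplified version uses up to bigrams and no smoothing, so it is meant for intuition only; reported scores should come from an established evaluation toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, ref, n):
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / total

def brevity_penalty(cand_len, ref_len):
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

def sentence_bleu(cand, ref, max_n=2):
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(cand), len(ref)) * math.exp(log_avg)

ref = "the cat sat".split()
print(sentence_bleu("the cat sat".split(), ref))  # exact match
print(sentence_bleu("the cat".split(), ref))      # perfect precision, too short
```

The second call shows the penalty at work: every n-gram of the shorter candidate appears in the reference, yet the score drops because the candidate is shorter than the reference.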

4.3. Hyper-Parameter Tuning

To compare hyperparameter configurations for training, we define two sets of parameters: one for the fine-tuned approach and one for the pre-trained model. Table 4 summarizes the hyperparameters used in both scenarios.

5. Results

Table 5 presents the performance of the proposed translation model on different datasets and the combined dataset for translations between Chinese (Zh) and Urdu (Ur) in both directions. The proposed model demonstrates the highest overall performance across both translation directions on combined datasets. The model performs strongly in both translation directions, with a BLEU score of 68.21% for Zh→Ur and 69.37% for Ur→Zh. These results indicate that the combined datasets provide a more comprehensive training set and superior translation quality than the individual datasets. For the OPUS dataset, the model achieves a BLEU score of 54% (Zh→Ur) with ROUGE-L scores of 50.92%. Despite its moderate size, the OPUS dataset performs strongly due to its diverse and well-curated content and balanced data distribution. For the WMT dataset, the model performs slightly worse, with BLEU scores of 40.52% (Zh→Ur) and 39.74% (Ur→Zh). As the largest dataset, WMT provides valuable exposure to various linguistic structures. Though smaller, the Wili + Custom dataset still contributes significantly, with BLEU scores of 22% (Zh→Ur) and 23% (Ur→Zh). The Wili dataset’s specialized content enhances the model’s ability to handle specific terms, while the custom dataset offers highly relevant, tailored content for particular translation scenarios. The proposed model significantly improves Chinese–Urdu translation quality through effective pre-processing, careful adjustments, and strategic dataset selection.

5.1. Ablation Study

An ablation study was conducted to evaluate the impact of various components on the model’s performance. The study was carried out using the combined dataset, and the performance was measured using BLEU and TER across multiple runs for each variant, as shown in Table 6. The baseline models, M2M100_418 and M2M100_12B, were first assessed with the traditional cross-entropy (CE) loss function.

Incorporating BERT word embeddings led to an improvement of 0.39 in BLEU and a reduction of 0.96 in TER, highlighting the effectiveness of BERT’s contextual embeddings in improving translation quality, particularly in low-resource settings. Substituting the CE loss with the proposed in-trust loss function resulted in a more substantial performance boost, with BLEU increasing by 0.56 and TER decreasing by 0.0109. This demonstrates that the in-trust loss function is more robust to inaccuracies in the training data. The use of bilingual curriculum learning contributed to an additional improvement of 0.14 in BLEU and a reduction of 1.13 in TER, indicating its role in enhancing efficiency when dealing with low-resource corpora. Lastly, the application of contrastive re-ranking improved BLEU by 0.22 and reduced TER by 0.62, suggesting that this method aids in selecting the most accurate and contextually appropriate translations by increasing the diversity of candidates.

The performance of the proposed components was also evaluated across both translation directions (Zh→Ur and Ur→Zh), as presented in Table 7. While the improvements were consistent, the impact was slightly more pronounced in the Zh→Ur direction. In both cases, the in-trust loss function effectively reduced TER, underscoring its value in noisy, low-resource scenarios.

A comparison of convergence rates between the in-trust loss and the traditional CE loss function is shown in Figure 6 to further validate the benefits of the in-trust loss function. The in-trust loss function exhibited faster convergence during the early stages of training and smoother behavior in later stages, indicating a more stable and effective learning process, particularly in low-resource settings.

5.2. Syntactic Analysis

We compared syntactic complexity using the average sentence length and the average number of clauses per sentence in the reference and translated texts, as shown in Figure 7. The blue bars represent the reference texts, and the orange bars represent the translations produced by the proposed model. Both metrics are closely matched, with the translations showing a slight increase in average sentence length from 31.74 to 32.32 words and a nearly identical number of clauses per sentence (0.91 to 0.92). These results suggest that the model maintains the syntactic richness of the source text and preserves linguistic structures without oversimplifying or unnecessarily complicating the translations.

The length distribution analysis underscores the proposed model’s effectiveness in handling translations with a high degree of fidelity to the source text’s length and syntactic complexity. Figure 8 indicates that the model adeptly maintains a balance in translation length, with a preference for producing outputs that closely mirror the source in most cases.

NMT models often struggle with translating longer sentences. To examine this, sentences are grouped by length, with a BLEU score calculated for each group, as depicted in the left image of Figure 9. The right image of Figure 9 illustrates the BLEU scores alongside the average translation length for each group. While the standard transformer and its variant show commendable performance on shorter sentences, their efficacy diminishes with increased sentence length. The proposed model addresses this shortfall by leveraging past and future contextual information. The integration of synchronous bidirectional attention markedly enhances translation accuracy across all sentence-length groups.

5.3. OOV and NER Analysis

The proposed model significantly improves the handling of out-of-vocabulary (OOV) terms in Chinese–Urdu bidirectional translation. The M2M100 model variants were fine-tuned on a corpus with diverse Chinese and Urdu sentences. This process helped the model adapt better to the linguistic characteristics of both languages. BERT embeddings are known for their deep contextual understanding, which allows the model to capture subtler meanings and associations within and across languages, particularly for OOV terms. The results are shown in Figure 10, which presents the expected results alongside the outputs of both models. For phrases like “High-energy particles”, the pre-trained model generated generic translations like “Charged particles”. In contrast, the fine-tuned model provided more contextually accurate translations, such as “Particles with high energy”. Similarly, for “Digital currency”, the pre-trained model produced “Number exchange”, whereas the fine-tuned model improved the output to “Cryptocurrency”. Another example is “Artificial intelligence”, where the pre-trained model’s output was “Artificial brain”, but the fine-tuned model generated a more appropriate translation, “Machine intelligence”. In Urdu-to-Chinese translation, phrases like “Purpose of life” were translated by the pre-trained model as “Pursuit of life”, while the fine-tuned model refined it to “Life’s goal”. These examples demonstrate that the fine-tuned model consistently produces more accurate, nuanced, and contextually relevant translations, effectively addressing the limitations of the pre-trained model in handling complex and rare phrases.

Figure 11 presents the NER results on the test data of the Chinese–Urdu parallel corpus, showing which types of entities are recognized and which are missed. It highlights the types of entities (such as person, location, organization, date, event) correctly identified, partially identified, or missed by different model variants.

5.4. Translation Error Analysis

Figure 12 illustrates the range of translation errors encountered, which posed significant challenges to maintaining the integrity and accuracy of the translated content. These errors include addition errors, where unnecessary elements were introduced into the translation, omissions of key parts of the source text, ambiguous translations with multiple possible interpretations, and grammatical inaccuracies that disrupt the linguistic structure of the output. Contextual errors were particularly prominent, accounting for 2.5% of all identified mistakes, where translations failed to accurately reflect the situational context of the source material. Similarly, grammatical errors constituted 2% of the total errors.

To address these challenges, the proposed model integrates a refined attention mechanism and BERT embeddings, effectively capturing the most relevant portions of the input text and minimizing the risk of introducing extraneous elements. Furthermore, in-trust loss, contrastive re-ranking, and bilingual curriculum learning significantly enhance the model’s contextual understanding, grammatical precision, and overall translation quality. As a result, the proposed model achieves substantial reductions across all error categories, as seen in Figure 12, particularly in spelling errors (reduced from 4% to 0.5%) and contextual errors (reduced from 2.5% to 1.1%). These improvements ensure greater fluency, accuracy, and readability of the translations, making them more reliable and contextually appropriate for real-world applications.

Figure 13 presents a baseline (BLEU) score and an improved scenario for different strategies: Longer Context, Domain-Specific, Cross-Lingual, and Post-Processing. The results show that each strategy leads to a notable increase in BLEU scores. This demonstrates their effectiveness in enhancing translation accuracy. Notably, the Cross-Lingual strategy yielded the highest improvement and increased the BLEU score from 0.74 to 0.86, indicating its significant impact on translation quality.

Performance Analysis with OOV Words and NER

The proposed model significantly improved over the baseline in translating OOV terms accurately and contextually in both Chinese-to-Urdu and Urdu-to-Chinese translations, as shown in Figure 14. Subword tokenization, contextual BERT embeddings, and back-off strategies, such as copying the OOV word directly into the output or replacing it with a placeholder that can later be addressed through post-processing, allow the model to handle words that are not in the training vocabulary. In particular, OOV words are decomposed into known subword units, which can then be recombined during translation. This ensures the translations remain accurate and meaningful even when encountering new or rare words. The performance of the proposed model on NER is measured using three key metrics: precision, recall, and F1 score, as shown in Figure 14. Pre-processing with NER, context-aware translation, and post-processing adjustments ensure that named entities are accurately translated and preserved, maintaining the integrity of the information and reducing the risk of errors. The precision metric evaluates how accurately the model identifies entities. A higher precision means that when the model predicts an entity, it is correct more often. However, high precision can sometimes be achieved at the cost of missing some actual entities (lower recall). In the graph, improvements in precision across models show that the model makes fewer false-positive errors. Recall measures the model’s ability to identify all relevant entities in the dataset. The increase in recall values across models indicates that the enhancements help the model capture more true entities without missing many. The F1 score combines precision and recall into a single measure that balances the two. It is particularly valuable when assessing a model’s overall performance, as identifying and correctly classifying entities are equally important.
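For reference, the three NER metrics can be computed from sets of predicted and gold entities as in the sketch below. Entities are represented as (text, type) pairs and must match exactly to count as correct; the example entities are invented for illustration.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 over exact-match (text, type) entities."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: one entity is mistyped and one is missed entirely.
predicted = [("Lahore", "LOC"), ("Ali", "PER"), ("UN", "LOC")]
gold = [("Lahore", "LOC"), ("Ali", "PER"), ("UN", "ORG"), ("Eid", "EVENT")]
print(prf1(predicted, gold))
```

Two of three predictions are correct (precision 2/3) while two of four gold entities are found (recall 1/2), so the F1 score of 4/7 sits between the two.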

The training and validation loss curves in Figure 15 indicate that the proposed model is well optimized. The training loss decreases steadily across the 40 epochs, which shows that the model fits the training data progressively better. The smooth downward trajectory, without sharp fluctuations, implies that the learning process is stable. The validation loss also declines, which suggests that performance on unseen data improves as training proceeds. Because both losses fall smoothly and consistently, without sudden increases or instability, training shows no sign of overfitting or underfitting.

5.5. Comparative Analysis with Baseline Models

The LSTM-based NMT model [49] is implemented via OpenNMT and provides a baseline for evaluating the effectiveness of our proposed Chinese–Urdu translation model. LSTM networks excel at handling sequential data and capturing long-term dependencies, which makes them valuable in translation tasks.

mBART [27] adopts the BART architecture and is pre-trained as a seq2seq denoising auto-encoder on large-scale monolingual corpora. Its pre-training on diverse, multilingual corpora, including Chinese–Urdu, makes it a strong baseline for evaluating the proposed Chinese–Urdu translation model. mBART is fine-tuned on the Chinese–Urdu test dataset, enabling a direct comparison with the proposed model’s performance.

GPT-2 [50] is a transformer-based language model developed by OpenAI and known for its strong text generation and translation capabilities. Despite being a general-purpose model, GPT-2’s ability to generate coherent and contextually relevant text makes it a valuable baseline for translation tasks. Fine-tuning GPT-2 on the Chinese–Urdu dataset allows for a comparison with the proposed model.

LLaMA 7B [51] is a smaller and resource-efficient version of the LLaMA model with 7 billion parameters. It balances performance and computational demands, making it a suitable baseline for tasks requiring strong translation capabilities. Fine-tuning LLaMA 7B on the Chinese–Urdu dataset enables a meaningful comparison with our proposed model.

The Google Translate API (https://cloud.google.com/translate/docs/reference/libraries/v3/python) offers strong and reliable multilingual NMT capability. We implement translation using a JSON file that contains (1) the text to be translated (query), (2) the source language, and (3) the target language. By calling this open API, we obtain translations that allow an established translation system to be compared with the proposed model in a low-resource context.
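A request body with the three fields listed above might be assembled as in the following sketch. The field names `q`, `source`, and `target` and the language codes `zh-CN` and `ur` are assumptions chosen for illustration; the paper does not give its exact JSON schema, and the helper `build_translation_request` is hypothetical. No network call is made.

```python
import json


def build_translation_request(text, source_lang, target_lang):
    """Serialize one translation request as JSON.

    Assumed field names: "q" (query text), "source", and "target".
    """
    body = {"q": text, "source": source_lang, "target": target_lang}
    # ensure_ascii=False keeps Chinese and Urdu characters readable in the file.
    return json.dumps(body, ensure_ascii=False)


payload = build_translation_request("你好，世界", "zh-CN", "ur")
print(payload)
```

Keeping the request as plain JSON makes it easy to store the same queries on disk and replay them against any system under comparison.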

The performance comparison in Table 8 reveals that the proposed Chinese–Urdu translation model consistently outperforms the baselines in both translation directions (Zh→Ur and Ur→Zh). It achieves the highest BLEU, METEOR, and chrF++ scores while maintaining the lowest TER, which indicates superior translation accuracy and fluency. LLaMA 7B delivers strong performance, closely trailing the proposed model despite being smaller and more resource-efficient, which suggests that its architecture captures linguistic nuances effectively. It nevertheless struggles with certain language-specific intricacies and cultural contexts in the Chinese–Urdu pair, which keeps it from fully matching the proposed model. mBART, with its robust multilingual pre-training, performs well but also struggles with the specifics of the Chinese–Urdu language pair, leading to slightly lower scores. GPT-2 falls short in capturing complex linguistic structures and contextual nuances, resulting in lower performance. The LSTM-based model (OpenNMT) highlights the limitations of traditional RNN architectures in translation tasks, especially in low-resource settings, and obtains the lowest scores. The Google Translate API, although widely used, underperforms the more specialized, fine-tuned models, particularly in low-resource scenarios.
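Because TER is the one metric in this comparison for which lower values are better, a small sketch may help make its meaning concrete. The version below is a simplification: it counts only word-level insertions, deletions, and substitutions and omits the phrase-shift operation of full TER, so its values can be higher than those of a standard implementation. The function names are hypothetical and this is not the scoring code used in the paper.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis, reference):
    """Edits needed to turn the hypothesis into the reference, per reference word.

    Simplification: no shift operation, whitespace tokenization.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return word_edit_distance(hyp, ref) / len(ref)


print(simplified_ter("the cat sat on mat", "the cat sat on the mat"))
```

A score of 0 means the hypothesis already matches the reference, and larger values mean more post-editing would be required.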

Comparison with Existing Studies

As detailed in Table 9, the comparative analysis covers several studies on Chinese–Urdu neural machine translation (NMT), with a focus on the challenges of low-resource language pairs and issues related to hallucinations. Chen et al. developed a Chinese–Urdu NMT model integrating POS sequence prediction with the transformer architecture, achieving a BLEU score of 0.36 [35]. Zeeshan et al. implemented the OpenNMT framework using LSTM and RNN-based models, attaining a lower BLEU score of 0.18 [34]. A further study by Zeeshan et al. compared LSTM with transformer models and demonstrated the superior performance of the transformer, which significantly improved BLEU scores from 0.077 to 0.52, compared to 0.41 for LSTM [30]. A seq2seq NMT system for bidirectional Chinese–Urdu translation, built on a hybrid RNN with LSTM cells, obtained a BLEU score of 0.42 [31].

The proposed model stands out with the highest BLEU score of 0.69 among the studies focused on Chinese–Urdu low-resource language pairs, underscoring its effectiveness in mitigating semantic ambiguities and enhancing translation quality.

5.6. Limitation

The proposed NMT model demonstrates significant advancements in translation quality for the Chinese–Urdu language pair, as measured by multiple evaluation metrics, but several areas remain open for refinement. The reliance on enriched datasets, such as the custom Chinese–Urdu corpus, and the computational demands of the model limit its scalability and applicability to other low-resource languages and resource-constrained environments. Additionally, while multiple automatic metrics provide a more comprehensive assessment of translation quality, complementary assessments, such as human evaluation, could offer deeper insight into the cultural and contextual appropriateness of the translations. Moreover, refining components such as the contrastive re-ranking mechanism and the incomplete-trust loss function so that they adapt across different language pairs could significantly increase the model’s generalizability. The model’s handling of noisy data and long, complex sentence structures also needs further refinement.

6. Conclusions

This study presents a novel approach to enhancing neural machine translation (NMT) for low-resource languages, with a specific focus on Chinese–Urdu translation. We use transformer-based M2M100 variants and integrate advanced techniques, including bilingual curriculum learning, contrastive re-ranking, a refined attention mechanism, and the in-trust loss function. These additions help address critical challenges such as out-of-vocabulary (OOV) words and named entity recognition (NER), improving both translation quality and contextual accuracy. Experiments were conducted on both private and public datasets, and back-translation played a pivotal role in enriching the training dataset and enabling the model to generalize effectively from limited data. The results show that our model outperforms baseline models and those used in existing studies. Specifically, the model achieved a BLEU score of 0.68 for Chinese-to-Urdu and 0.69 for Urdu-to-Chinese translation on the combined dataset. For the OPUS dataset, we obtained a BLEU score of 0.54. Additionally, the ablation study demonstrates the incremental improvements brought by each component, confirming the model’s robustness in handling the challenges posed by data scarcity and linguistic complexities. Future work will focus on refining document-level translation, aiming to improve translation quality for longer texts and enhance contextual understanding across diverse domains.

Author Contributions

All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. BR24993166).

Data Availability Statement

We utilized publicly available datasets and appropriately cited them within the article. Upon completion of the ongoing research, the private dataset will be made available on a reputable platform.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Code Availability

This is ongoing research with future expansion plans. The code will be made available on a reputable public platform. In the meantime, the code developed for this article is available upon reasonable request from the corresponding author.

References

1. Setiawan, I. The Role of Language in Preserving Cultural Heritage and Religious Beliefs: A Case Study on Oral Traditions in the Indigenous Sasak Community of Lombok, Indonesia. Pak. J. Life Soc. Sci. 2025.
2. Ramírez, J.G.C. Natural Language Processing Advancements: Breaking Barriers in Human-Computer Interaction. J. Artif. Intell. Gen. Sci. (JAIGS) 2024, 3, 31–39.
3. Ameur, M.S.H.; Meziane, F.; Guessoum, A. Arabic machine translation: A survey of the latest trends and challenges. Comput. Sci. Rev. 2020, 38, 100305.
4. Mishra, R. A Comparative Analysis of Statistical and Neural Machine Translation Models. Integr. J. Sci. Technol. 2024, 1, 2.
5. Abidin, Z.; Junaidi, A.; Wamiliana. Text Stemming and Lemmatization of Regional Languages in Indonesia: A Systematic Literature Review. J. Inf. Syst. Eng. Bus. Intell. 2024, 10, 217–231.
6. Buttar, P.K.; Sachan, M.K. A review of the approaches to neural machine translation. In Natural Language Processing and Information Retrieval; CRC Press: Boca Raton, FL, USA, 2023; pp. 78–109.
7. Li, B.; Weng, Y.; Xia, F.; Deng, H. Towards better Chinese-centric neural machine translation for low-resource languages. Comput. Speech Lang. 2024, 84, 101566.
8. Lankford, S.; Afli, H.; Way, A. Human evaluation of English–Irish transformer-based NMT. Information 2022, 13, 309.
9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
10. Kalyan, K.S.; Rajasekharan, A.; Sangeetha, S. Ammus: A survey of transformer-based pretrained models in natural language processing. arXiv 2021, arXiv:2108.05542.
11. Ranathunga, S.; Lee, E.S.A.; Prifti Skenduli, M.; Shekhar, R.; Alam, M.; Kaur, R. Neural machine translation for low-resource languages: A survey. ACM Comput. Surv. 2023, 55, 1–37.
12. Shukla, A.; Mishra, R. Unraveling the Complexities of Neural Machine Translation. Integr. J. Sci. Technol. 2024, 1.
13. Moslem, Y. Language Modelling Approaches to Adaptive Machine Translation. arXiv 2024, arXiv:2401.14559.
14. Chen, K.; Chen, B.; Gao, D.; Dai, H.; Jiang, W.; Ning, W.; Yu, S.; Yang, L.; Cai, X. General2Specialized LLMs Translation for E-commerce. arXiv 2024, arXiv:2403.03689.
15. Shaukat, A.; Sadiq, A.H.B. Probing Translation Loss in the Urdu Translation of Alchemist. Harf-o-Sukhan 2024, 8, 300–306.
16. Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V.; et al. Beyond english-centric multilingual machine translation. J. Mach. Learn. Res. 2021, 22, 1–48.
17. Fakih, A.; Ghassemiazghandi, M.; Fakih, A.H.; Singh, M.K. Evaluation of Instagram’s Neural Machine Translation for Literary Texts: An MQM-Based Analysis. Gema Online J. Lang. Stud. 2024, 24.
18. Liu, Y.; Ma, Y.; Zhou, S.; Luo, X. A Survey of Research and Application of NLP-based Machine Translation. In Proceedings of the 2024 6th International Conference on Natural Language Processing (ICNLP), Xi’an, China, 22–24 March 2024; pp. 315–319.
19. Lu, J.; Yin, F. Research on Improving the Quality of Japanese Chinese Machine Translation Based on Deep Learning; IOS Press: Amsterdam, The Netherlands, 2024.
20. Zhou, M.; Duan, N.; Liu, S.; Shum, H.Y. Progress in neural NLP: Modeling, learning, and reasoning. Engineering 2020, 6, 275–290.
21. Hailu, F. Tigrigna-English Bidirectional Machine Translation Using Deep Learning. Ph.D. Thesis, St. Mary’s University, San Antonio, TX, USA, 2024.
22. Ephrem, M. Development of Bidirectional Amharic-Tigrinya Machine Translation Using Recurrent Neural Networks. Ph.D. Thesis, St. Mary’s University, San Antonio, TX, USA, 2024.
23. Chen, Z.; Han, B.; Wang, S.; Qian, Y. Attention-based encoder-decoder end-to-end neural diarization with embedding enhancer. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1636–1649.
24. Vathsala, M.; Holi, G. RNN based machine translation and transliteration for Twitter data. Int. J. Speech Technol. 2020, 23, 499–504.
25. Ashraf, M.R.; Jana, Y.; Umer, Q.; Jaffar, M.A.; Chung, S.; Ramay, W.Y. BERT Based Sentiment Analysis for Low-resourced Languages: A Case Study of Urdu Language. IEEE Access 2023, 11, 110245–110259.
26. Malik, M.S.I.; Cheema, U.; Ignatov, D.I. Contextual Embeddings based on Fine-tuned Urdu-BERT for Urdu threatening content and target identification. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 101606.
27. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742.
28. Guerreiro, N.M.; Voita, E.; Martins, A.F. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. arXiv 2022, arXiv:2208.05309.
29. Goyal, N.; Gao, C.; Chaudhary, V.; Chen, P.J.; Wenzek, G.; Ju, D.; Krishnan, S.; Ranzato, M.; Guzmán, F.; Fan, A. The flores-101 evaluation benchmark for low-resource and multilingual machine translation. Trans. Assoc. Comput. Linguist. 2022, 10, 522–538.
30. Khan, Z.; Zakira, M.; Slamu, W.; Slam, N. A study of neural machine translation from Chinese to Urdu. J. Auton. Intell. 2020, 2, 29–36.
31. Zeeshan, J.; Zakira, M.; Niaz, M. A seq to seq machine translation from Urdu to Chinese. J. Auton. Intell. 2021, 4, 1–5.
32. Liew, S.R.C.; Law, N.F. Use of subword tokenization for domain generation algorithm classification. Cybersecurity 2023, 6, 49.
33. Karthikeyan, M.; Mary Anita, E. Text classification; language-independent tokenization; sub word tokenization. Intell. Autom. Soft Comput. 2023, 35.
34. Zeeshan, Z.A.; Jawad, M.Z. Research on Chinese-Urdu machine translation based on deep learning. J. Auton. Intell. 2020, 3, 34–44.
35. Chen, H.; Wang, J.; Muhammad, N.U.H. Chinese-Urdu neural machine translation interacting POS sequence prediction in Urdu language. Comput. Eng. Sci. 2024, 46, 518.
36. Ortiz-Garces, I.; Govea, J.; Andrade, R.O.; Villegas-Ch, W. Optimizing Chatbot Effectiveness through Advanced Syntactic Analysis: A Comprehensive Study in Natural Language Processing. Appl. Sci. 2024, 14, 1737.
37. Qiu, J.; Li, S. A multi-encoder model for automatic code comment generation. In Proceedings of the Fourth International Conference on Sensors and Information Technology (ICSI 2024), Xiamen, China, 5–7 January 2024; SPIE: Bellingham, WA, USA, 2024; Volume 13107, pp. 78–85.
38. Li, G.; Zhao, X.; Wang, X. Quantum self-attention neural networks for text classification. Sci. China Inf. Sci. 2024, 67, 1–13.
39. Bensalah, N.; Ayad, H.; Adib, A.; Ibn El Farouk, A. CRAN: An hybrid CNN-RNN attention-based model for Arabic machine translation. In Networking, Intelligent Systems and Security: Proceedings of NISS 2021; Springer: Singapore, 2022; pp. 87–102.
40. Dowling, M. An Investigation of English-Irish Machine Translation and Associated Resources. Ph.D. Thesis, Dublin City University, Dublin, Ireland, 2022.
41. Tiedemann, J.; Aulamo, M.; Bakshandaeva, D.; Boggia, M.; Grönroos, S.A.; Nieminen, T.; Raganato, A.; Scherrer, Y.; Vazquez, R.; Virpioja, S. Democratizing neural machine translation with OPUS-MT. Lang. Resour. Eval. 2023, 58, 713–755.
42. Kocmi, T.; Avramidis, E.; Bawden, R.; Bojar, O.; Dvorkovich, A.; Federmann, C.; Fishel, M.; Freitag, M.; Gowda, T.; Grundkiewicz, R.; et al. Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; Koehn, P., Haddow, B., Kocmi, T., Monz, C., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 1–42.
43. Thoma, M. The WiLI benchmark dataset for written language identification. arXiv 2018, arXiv:1801.07779.
44. Wastl, M.; Vamvas, J.; Sennrich, R. Machine Translation Models are Zero-Shot Detectors of Translation Direction. arXiv 2024, arXiv:2401.06769.
45. Yan, Y.; Song, J.; Fu, B.; Ye, N.; Shi, X. Automatic Reference-Free Fine-Grained Machine Translation Error Detection via Named Entity Recognition and Back-Translation. In International Conference on Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2024; pp. 306–317.
46. Huang, X.; Chen, Y.; Wu, S.; Zhao, J.; Xie, Y.; Sun, W. Named Entity Recognition via Noise Aware Training Mechanism with Data Filter. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2021; pp. 4791–4803.
47. Yang, J.; Wu, S.; Zhang, D.; Li, Z.; Zhou, M. Improved neural machine translation with Chinese phonologic features. In Proceedings of the Natural Language Processing and Chinese Computing: 7th CCF International Conference, NLPCC 2018, Hohhot, China, 26–30 August 2018; Proceedings, Part I 7. Springer: Berlin/Heidelberg, Germany, 2018; pp. 303–315.
48. Hotate, K.; Kaneko, M.; Komachi, M. Generating diverse corrections with local beam search for grammatical error correction. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 2132–2137.
49. Khan, A.; Sarfaraz, A. RNN-LSTM-GRU based language transformation. Soft Comput. 2019, 23, 13007–13024.
50. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
51. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288.

Figure 1. Proposed Chinese–Urdu neural machine translation model workflow.

Figure 2. Noisy data samples in bilingual translation.

Figure 3. A snapshot of the dataset showing Chinese, Pinyin, and Urdu translations. Pinyin is an intermediary language to improve translation accuracy between Chinese and Urdu, aligning phonetic representations directly with the target language.

Figure 4. Overview of proposed transformer-based Seq2Seq M2M100 encoder-decoder architecture [16].

Figure 5. Overview of the re-ranking process.

Figure 6. Loss convergence comparison between cross-entropy and proposed in-trust loss.

Figure 7. Comparison of syntactic complexity.

Figure 8. Distribution of length ratios.

Figure 9. Performance analysis of the translation on test data concerning sentence length.

Figure 10. Translation evaluation with OOV words.

Figure 11. Named entity recognition evaluation.

Figure 12. Distribution of translation errors.

Figure 13. Translation quality with different aspects of phrases.

Figure 14. Comparative analysis of OOV words and NER in translation quality metrics.

Figure 15. Training and validation loss.

Table 1. Literature review summary.

Ref.	Datasets	Model and Contribution	Limitations
[6]	Flores, WMT, OPUS	Review of approaches in NMT	Does not address linguistic diversity and noise in data
[19]	OPUS	Research on improving MT quality	Requires large parallel corpora and struggles with data sparsity
[13]	OPUS, TICO 19	Corpus-based MT using algorithms and models	Struggles with handling OOV effectively
[36]	SemEval	Optimizing chatbot effectiveness	Limited syntactic and semantic understanding
[37]	Tibetan–Chinese	NMT for Tibetan–Chinese translation using RNN	Inadequate handling of complex sentences and long range dependency
[38]	Yelp, IMDB, Amazon	Quantum self-attention networks for text classification	Complex implementation and not specifically designed for low-resource language pairs
[22]	Amharic and Tigrinya	Bidirectional Amharic-Tigrinya MT using RNN and LSTM	Limited applicability to other low-resource language pairs
[21]	Tigrinya–English	Bidirectional Tigrigna–English MT using deep learning	Requires significant amounts of training data and has limited syntactic understanding
[39]	OPUS	Hybrid CNN-RNN attention-based model for Arabic MT	Focuses on Arabic and lacks comprehensive context awareness
[40]	ZH-CN SMS chat	Investigation of MT for Chinese–English SMS chat translation	Focuses on a narrow application area and lacks OOV and NER
[25]	Urdu text	Multilingual BERT for sentiment analysis in Urdu	Limited to sentiment analysis and has inadequate handling of complex sentences
[26]	Urdu text	Threatening language detection in Urdu	Lacks syntactic and semantic depth
[28]	FLORES-101	Use of M2M model for hallucination detection	Lacks context awareness
[16]	Many-to-Many	Beyond English-centric MT with M2M100 model	No generalization for low-resource languages (Chinese, Urdu)

Table 2. Chinese–Urdu corpus data.

Corpus	Sentences	Zh Tokens	Ur Tokens	Training	Testing	Validation
OPUS	493,042	1,189,539	899,376	345,129	73,956	73,957
WMT	608,405	1,348,494	1,106,496	425,883	91,260	91,262
WiLi	7938	33,357	56,894	5556	1190	1192
Custom	56,332	130,569	104,742	39,432	8449	8451
Total	1,165,717	2,701,959	2,167,508	816,000	174,855	174,862
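
The per-corpus split sizes in Table 2 are consistent with a 70/15/15 division into training, testing, and validation sets. That ratio is inferred here from the counts and is an assumption; this section of the paper does not state it. The following minimal sketch, with a hypothetical helper `split_sizes`, shows how such counts can be derived with integer arithmetic, assigning any rounding remainder to the validation set.

```python
def split_sizes(n_sentences, train_pct=70, test_pct=15):
    """Return (train, test, validation) sizes for a corpus.

    Assumption: train and test sizes are rounded down and the
    remainder goes to validation.
    """
    train = n_sentences * train_pct // 100
    test = n_sentences * test_pct // 100
    validation = n_sentences - train - test
    return train, test, validation


# Sentence counts taken from Table 2.
corpora = {"OPUS": 493_042, "WMT": 608_405, "WiLi": 7_938, "Custom": 56_332}
splits = {name: split_sizes(n) for name, n in corpora.items()}

for name, (train, test, val) in splits.items():
    print(f"{name}: train={train} test={test} validation={val}")
```

Using integer arithmetic avoids floating-point rounding surprises and guarantees that the three parts always sum to the corpus size.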

Table 3. Key components of the M2M100 model implementation.

Component	Attribute	Value
Shared Embeddings	Dimension	1024
	Padding Index	1
BERT Embedding Integration	Initialization	Pre-Trained BERT Weights
	Dimension Matching	Linear Transformation to 1024
Encoder	Layer Count	12
	Attention Mechanism	Multi-Head Attention
	Attention Heads	8
Decoder	Layer Count	12
	Includes	Cross-Attention
	Attention Heads	8
Self-Attention	Projections	8 × 1024
Multi-Head Attention	Projections	8 × 1024
	Head Dimension	128
Feed-Forward Networks	Input/Output	1024/4096, 4096/1024
Layer Normalization	Epsilon	1 × 10⁻⁵
Dropout	Rate	0.1
Residual Connections	Implemented Around	Multi-Head Attention and Feed-Forward Layers
Language Model Head	Output Size	128,112

Table 4. Comparison of generic and fine-tuned hyperparameter settings for enhanced M2M-100 model.

Parameter	Generic Settings	Fine-Tuned Settings
Number of Training Epochs	4	40
Training Batch Size per GPU	16	24
Save Steps	2	2
Evaluation Strategy	Epoch	Epoch
Learning Rate	2 × 10⁻⁵	3 × 10⁻⁵
Optimizer	AdamW	AdamW
Beam Search Size	5	8
Dropout Rate	0.1	0.2
Gradient Clipping	1.0	0.8
Warm-up Steps	500	1400
Attention Dropout	0.1	0.15
Weight Decay	0.01	0.03
BERT Embedding Usage	No	Yes
Multi-Head Attention Heads	8	12
Re-Ranking Strategy	None	Applied
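
Table 4 lists the fine-tuned learning rate (3 × 10⁻⁵) and the number of warm-up steps (1400) but not the shape of the schedule. The sketch below assumes a linear warm-up followed by a constant rate, which is only one plausible reading; the function `warmup_lr` is hypothetical, and any decay after warm-up is deliberately left out because the table does not specify one.

```python
def warmup_lr(step, peak_lr=3e-5, warmup_steps=1400):
    """Learning rate at a given optimizer step.

    Assumption: the rate grows linearly from 0 to peak_lr over
    warmup_steps and then stays constant.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


for step in (0, 700, 1400, 5000):
    print(step, warmup_lr(step))
```

Warm-up keeps early updates small while the optimizer statistics are still unreliable, which tends to matter more when fine-tuning a large pre-trained model on limited data.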

Table 5. Model performance on datasets and combined dataset (%).

Corpus	Zh→Ur				Ur→Zh
	BLEU	METEOR	chrF++	ROUGE-L	BLEU	METEOR	chrF++	ROUGE-L
Combined Dataset	68.21	55.34	75.11	71.19	69.37	53.51	74.33	72.11
OPUS	54.23	48.67	62.11	50.92	53.15	47.89	59.34	51.05
WMT	40.52	43.12	54.89	48.34	39.74	41.62	62.34	46.71
WiLi + Custom	22.87	32.45	27.63	43.12	23.11	25.89	38.72	31.45

Table 6. Step-by-step experiments using different methods and their impact on final performance, where higher BLEU scores (↑) indicate better translation quality, and lower TER scores (↓) represent fewer translation errors.

Method	BLEU-Avg ↑ (%)	TER-Avg ↓ (%)
M2M100_418 (CE loss)	63.24	51.91
M2M100_12B (CE loss)	64.52	44.79
w/BERT word embedding	64.91 (+0.39)	43.83 (−0.96)
w/In-trust loss	65.47 (+0.56)	42.74 (−1.09)
w/Bilingual curriculum learning	65.61 (+0.14)	41.61 (−1.13)
w/Contrastive re-ranking	65.83 (+0.22)	40.99 (−0.62)
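
The parenthesized values in Table 6 are differences from the preceding row, starting from the M2M100_12B baseline. As a quick arithmetic check, the sketch below recomputes those stepwise differences and the cumulative change from the table values; the variable and function names are hypothetical.

```python
# (method, BLEU-Avg %, TER-Avg %) copied from Table 6, from the 12B baseline onward.
rows = [
    ("M2M100_12B (CE loss)", 64.52, 44.79),
    ("w/BERT word embedding", 64.91, 43.83),
    ("w/In-trust loss", 65.47, 42.74),
    ("w/Bilingual curriculum learning", 65.61, 41.61),
    ("w/Contrastive re-ranking", 65.83, 40.99),
]


def stepwise_deltas(rows):
    """Difference of each row from the row before it, rounded to 2 decimals."""
    deltas = []
    for (_, bleu0, ter0), (name, bleu1, ter1) in zip(rows, rows[1:]):
        deltas.append((name, round(bleu1 - bleu0, 2), round(ter1 - ter0, 2)))
    return deltas


deltas = stepwise_deltas(rows)
total_bleu = round(rows[-1][1] - rows[0][1], 2)
total_ter = round(rows[-1][2] - rows[0][2], 2)

for name, d_bleu, d_ter in deltas:
    print(f"{name}: BLEU {d_bleu:+.2f}, TER {d_ter:+.2f}")
print(f"cumulative: BLEU {total_bleu:+.2f}, TER {total_ter:+.2f}")
```

Summed over the four components, the average BLEU rises by 1.31 points and the average TER falls by 3.80 points relative to the 12B baseline.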

Table 7. Ablation study with each proposed component evaluated bidirectionally, where higher BLEU scores (↑) indicate better translation quality, and lower TER scores (↓) represent fewer translation errors.

Method	Zh→Ur		Ur→Zh
	BLEU (%)↑	TER (%)↓	BLEU (%)↑	TER (%)↓
M2M100_418 (CE loss)	63.24	43.42	64.42	59.77
M2M100_12B (CE loss)	65.54	43.42	66.01	59.77
w/BERT word embedding	66.15	42.16	67.37	58.29
w/In-trust-loss	66.46	41.78	67.73	57.81
w/Bilingual curriculum learning	67.01	42.34	68.10	58.31
w/Contrastive re-ranking	67.17	42.08	68.29	57.92

Table 8. Performance comparison of proposed model and baselines.

Model	Zh→Ur				Ur→Zh
	BLEU (%)	METEOR (%)	chrF++ (%)	TER (%)	BLEU (%)	METEOR (%)	chrF++ (%)	TER (%)
OpenNMT	64.8	51.3	71.1	35.6	63.5	49.7	67.2	46.2
mBART	65.2	52.8	72.5	68.2	64.5	51.1	66.8	64.8
GPT-2	66.4	56.9	70.2	69.1	66.8	53.2	69.7	65.7
LLaMA 7B	66.8	60.2	74.0	66.8	67.5	59.7	73.2	66.3
Google Translate API	65.7	51.1	69.0	44.5	66.0	54.5	68.5	35.0
Proposed Model	68.2	55.3	75.1	71.1	69.3	53.5	74.3	72.1

Table 9. Comparative analysis with state-of-the-art models.

Ref	Year	Model	Language Pair	BLEU Score
[31]	2021	Open NMT, LSTM, and RNN	Chinese ↔ Urdu	0.18
[30]	2020	NMT and LSTM	Chinese ↔ Urdu	0.42
[34]	2020	LSTM	Chinese ↔ Urdu	0.41
[34]	2020	Transformer	Chinese ↔ Urdu	0.52
[35]	2024	Transformer for POS	Chinese ↔ Urdu	0.36
Proposed Method		M2M100 with in-trust loss and re-ranking	Chinese → Urdu	0.68
		M2M100 with in-trust loss and re-ranking	Urdu → Chinese	0.69

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

