Article

Transformer-Based Re-Ranking Model for Enhancing Contextual and Syntactic Translation in Low-Resource Neural Machine Translation

1 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
2 Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
3 School of Software, Henan University, Kaifeng 475001, China
4 Academy of Logistics and Transport, Almaty 050010, Kazakhstan
5 Institute of Automation and Information Technologies, Satbayev University, Almaty 050013, Kazakhstan
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(2), 243; https://doi.org/10.3390/electronics14020243
Submission received: 20 November 2024 / Revised: 28 December 2024 / Accepted: 3 January 2025 / Published: 8 January 2025

Abstract:
Neural machine translation (NMT) plays a vital role in modern communication by bridging language barriers and enabling effective information exchange across diverse linguistic communities. Due to the limited availability of data in low-resource languages, NMT faces significant translation challenges. Data sparsity limits NMT models’ ability to learn, generalize, and produce accurate translations, which leads to low coherence and poor context awareness. This paper proposes a transformer-based approach incorporating an encoder–decoder structure, bilingual curriculum learning, and contrastive re-ranking mechanisms. Our approach enriches the training dataset using back-translation and enhances the model’s contextual learning through BERT embeddings. An incomplete-trust (in-trust) loss function is introduced to replace the traditional cross-entropy loss during training. The proposed model effectively handles out-of-vocabulary words and integrates named entity recognition techniques to maintain semantic accuracy. Additionally, the self-attention layers in the transformer architecture enhance the model’s syntactic analysis capabilities, which enables better context awareness and more accurate translations. Extensive experiments are performed on a diverse Chinese–Urdu parallel corpus, developed using human effort and publicly available datasets such as OPUS, WMT, and WiLi. The proposed model demonstrates a BLEU score improvement of 1.80% for Zh→Ur and 2.22% for Ur→Zh compared to the highest-performing comparative model. This significant enhancement indicates better translation quality and accuracy.

Graphical Abstract

1. Introduction

Language is a fundamental tool for communication and plays a significant role in preserving cultural heritage. It is a powerful medium for fostering connections and understanding between nations, particularly in global initiatives such as the Belt and Road Initiative [1]. Effective communication between different language groups is essential for facilitating economic and cultural exchanges. However, overcoming language barriers remains a significant challenge, especially for low-resource languages, where achieving accurate and fluent translation is particularly difficult. In this context, machine translation (MT) becomes an essential tool for bridging these language gaps.
Machine translation (MT) is a field at the intersection of linguistics, computer science, artificial intelligence, and specifically natural language processing (NLP) [2]. The primary aim of MT is to automatically translate text from one source language to a target language, thereby enhancing communication and mutual understanding. Early MT approaches relied on rule-based methods, where linguistic rules were manually created to translate text [3]. For instance, rule-based systems like SYSTRAN use predefined grammar rules and dictionaries to map phrases between languages. While effective for simple translations, these systems struggled with complex sentence structures, idiomatic expressions, and contextual variations.
With the advent of statistical methods [4], MT systems evolved to use statistical models trained on large bilingual corpora. These models could learn translation patterns from data, improving accuracy by capturing more subtle language nuances. Statistical models relied on feature engineering, where features like word alignments and phrase pairings were used to improve translation [5]. However, these methods still struggled to capture context fully and to provide fluent translations, particularly with complex syntactic structures, such as varying word order or sentence construction, and idiomatic expressions like “break a leg”, which could be misinterpreted if translated literally.
The introduction of neural machine translation (NMT) marked a significant advancement in the field of machine translation [6,7]. NMT uses artificial neural networks, particularly architectures like recurrent neural networks (RNNs) [8] and transformer-based models [9], to translate text in an end-to-end manner. Unlike previous models, which relied on predefined rules or statistical data, NMT learns translation patterns directly from vast amounts of data, producing more fluent, contextually accurate, and natural translations [10]. This shift enables the generation of translations that resemble human language. NMT models prioritize fluency and natural language usage, ensuring that translations are structured as humans would express the sentiment. By focusing on context, NMT ensures that the translation is grammatically correct and idiomatic, much like a human-generated translation. Specifically, context awareness in NMT refers to the model’s ability to understand and incorporate the surrounding context of a word, phrase, or sentence. Rather than translating isolated words, a context-aware system takes into account the broader sentence or passage to generate a more coherent and accurate translation. For example, when translating the word ‘bank’, the model can determine whether it refers to a financial institution or the side of a river based on the surrounding text. These improvements are primarily due to deep learning techniques, allowing the system to learn the intricate relationships between words and phrases within a given context, resulting in more natural and fluent translations akin to those produced by humans.
Despite the success of NMT for high-resource language pairs like English–French [6], challenges remain for low-resource languages such as Chinese and Urdu. Low-resource NMT deals with the scarcity of parallel corpora, linguistic resources, and reduced computational power [11]. Even though large populations speak both Chinese and Urdu, high-quality parallel corpora for these languages are limited [12]. This data sparsity presents a challenge for NMT models, as they have fewer examples to learn from, particularly regarding rare words, idiomatic expressions, and complex sentence structures. The limited data also restricts the model’s ability to generalize linguistic patterns, making it difficult for NMT systems to handle domain-specific contexts [13].
A key issue in NMT for low-resource languages is handling out-of-vocabulary (OOV) words which were not encountered during the training phase. These OOV words often present significant syntactic and linguistic challenges [14], especially when translating between languages with very different linguistic structures. For instance, Chinese and Urdu have very different word order and morphology, which makes it difficult for the model to handle certain words effectively. To address this, named entity recognition (NER) [15] plays a vital role in improving the contextual accuracy of translations. NER helps identify and classify essential terms, such as people, organizations, and locations, ensuring these elements are correctly translated. NER capabilities also assist in syntactic parsing, which is the process of analyzing the grammatical structure of a sentence to understand how different words relate to each other. This parsing allows the model to handle complex sentence structures better and ensures the translated text is syntactically correct.
In response to these challenges, our research aims to improve NMT performance for Chinese↔Urdu translations by addressing data scarcity and linguistic complexity. We leverage pre-trained models and curated datasets to enhance the model’s training process. The transformer-based M2M-100 model [16], trained on a “many-to-many” dataset covering 7.5 billion sentences across 100 languages, is adopted as the backbone. This model enhances translation quality by improving context awareness and coherence while addressing issues such as OOV words. Furthermore, integrating NER within the translation pipeline ensures that critical named entities are accurately translated, preventing errors and omissions. Back-translation, which translates text from Urdu to Chinese and vice versa to create synthetic parallel corpora, further enriches the training data. These techniques collectively optimize the model’s performance, enabling it to handle diverse linguistic patterns and improve its ability to produce high-quality translations for low-resource language pairs like Chinese and Urdu.

Contributions

The key contributions of the proposed model are as follows.
  • By employing back-translation, the model effectively combats the challenge of data scarcity and sparsity in low-resource language pairs.
  • This paper utilizes the incomplete-trust (in-trust) loss function to replace cross-entropy loss. This loss function enhances model robustness by reducing the impact of noisy data during training, thereby helping to prevent overfitting and improving generalization.
  • A contrastive re-ranking approach is employed to refine the translation output by evaluating and selecting the most accurate candidate translation from multiple candidate translations.
  • The transformer-based architecture of M2M with BERT embeddings and attention mechanisms allows the model to maintain a higher level of context awareness throughout the translation process. The model’s training includes focused layers on semantic parsing and understanding the relationships between words and phrases.
  • Incorporating NER tagging within the translation workflow ensures that the model prioritizes and accurately translates critical named entities or specific terms.
The proposed model aims to overcome the above-described primary challenges and ensures more accurate, fluent, and contextually relevant translations across diverse linguistic and domain-specific settings. The remaining article is organized in the following way. Section 2 discusses a critical literature review of state-of-the-art paradigms of NMT. Section 3 introduces the research methodology and its implementation process. Section 4 describes the experimental structure. Section 5 presents the outcomes of the proposed model and its effectiveness over the existing model with comparative analysis. Finally, Section 6 provides concluding remarks and discusses potential future research directions.

2. Literature Review

Machine translation systems are generally categorized based on their underlying architecture. These categories include rule-based machine translation (RBMT), corpus-based machine translation (CBMT), hybrid approaches, and neural machine translation (NMT) [3,5,6], each offering distinct methods for overcoming translation challenges. RBMT is one of the earliest MT approaches, relying heavily on linguistic rules to translate text between languages. There are two key strategies within RBMT: the transfer-based approach and the interlingua-based approach. The transfer approach translates the source language into the target language by applying syntactic and semantic rules and bilingual dictionaries. In contrast, the interlingua approach generates an intermediate semantic representation that can be translated into any language, bypassing the need for direct source-to-target mapping [3]. RBMT systems require extensive knowledge of source and target languages. The need for comprehensive rule creation often limits them, making them less scalable for language pairs with limited linguistic resources [17].
CBMT emerged to overcome the limitations of rule-based systems. Liu et al. [18] utilized this approach on large bilingual corpora to learn translation patterns automatically. CBMT systems rely on statistical and probabilistic models, which map phrases in the source language to their corresponding translations in the target language. The shift from rule-based to corpus-based systems allowed for greater scalability, particularly for languages with extensive text corpora available [13]. However, CBMT still requires significant amounts of parallel data to ensure high-quality translations, and it struggles in low-resource language pairs where such data are scarce.
Hybrid approaches aim to combine the strengths of both RBMT and CBMT. These systems integrate linguistic knowledge, such as rules and dictionaries, with data-driven methods to enhance translation quality [5]. Hybrid models benefit low-resource language pairs or specialized domains where purely rule-based or data-driven methods may fail to capture all nuances [19]. These approaches attempt to leverage the best of both worlds—rule-based precision and corpus-based flexibility.
The most significant breakthrough in MT in recent years has been the advent of neural machine translation (NMT) [6]. NMT uses deep learning techniques, particularly artificial neural networks (ANNs), to learn translations directly from data without the need for predefined linguistic rules [7]. Unlike RBMT and CBMT, which require explicit mappings or rules, NMT systems learn a sequence-to-sequence translation process from large parallel corpora, making them more adaptable to various languages and domains. A key feature of NMT is its ability to generate translations that account for the context of entire sentences rather than just word-to-word translations [20].
One of the early successes of NMT was in the development of the sequence-to-sequence (Seq2Seq) model, which relies on recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) networks [21]. These models performed well on translation tasks where previous methods struggled, particularly with long-range dependencies in sentences [22]. The utilization of the attention mechanism further improved the Seq2Seq model by allowing the system to focus on the most relevant parts of the source sentence during translation, improving the fluency and accuracy of the output [23].
The introduction of transformer-based models by Vaswani et al. marked a significant turning point in NMT [9]. Unlike RNNs and LSTMs [24], which process data sequentially, transformers leverage a self-attention mechanism that allows them to process the entire input sequence simultaneously. This enables parallel computation, significantly speeding up training and inference times. Transformers excel at handling languages with complex syntactic structures and long-range dependencies, making them particularly well suited for languages like Chinese and Urdu, which have significant differences in both syntax and semantics.
Several studies have demonstrated the effectiveness of transformer models in low-resource languages. For example, multilingual BERT has been successfully adapted for sentiment analysis in Urdu [25], showing how transformer models can process and understand underrepresented languages. Similarly, Malik et al. [26] used transformers to detect threatening language in Urdu, emphasizing their ability to maintain semantic integrity in sensitive-content areas. The ability of transformers to manage diverse languages, even those with limited resources, has contributed to their dominance in the NMT landscape.
Transformer models have also enabled the development of multilingual and cross-lingual models, which can simultaneously perform translation tasks across many languages. Liu et al. [27] demonstrated the effectiveness of the XGLUE benchmark for evaluating named entity recognition (NER), Part-of-Speech (POS) tagging, and news classification across different languages. Another notable approach is multilingual machine translation, which leverages transformer models trained in various languages. Guerreiro et al. [28] applied the M2M model for hallucination detection in translations, showing how large multilingual models can improve translation quality even in challenging cases. These models, trained on massive datasets like FLORES-101 [29], have effectively provided accurate translations.
Despite the success of transformer models, low-resource languages continue to present significant challenges in NMT. One of the primary obstacles is data sparsity, which limits the availability of parallel corpora necessary for training high-quality models. Khan et al. [30] explored the use of OpenNMT for translating Chinese to Urdu, highlighting issues such as the difficulty of handling non-universal language pairs and named entity recognition (NER). They noted that a lack of data and insufficient context awareness were significant barriers to improving translation accuracy. To address these challenges, data augmentation methods have been proposed to expand the available datasets artificially. For example, Zeeshan et al. [31] used a Seq2Seq model with attention mechanisms to improve translation quality in Chinese–Urdu NMT, focusing on enhancing context awareness. Despite these advancements, the fixed-length context vector used in Seq2Seq models often fails to capture longer sequences, resulting in information loss during translation. Subword tokenization has emerged as a crucial technique in modern NMT systems to address some of these challenges. This approach allows models to break down words into smaller, more manageable units (subwords), which is particularly helpful when dealing with rare or unseen words. By reducing the vocabulary size, subword tokenization ensures the model can handle various word combinations, including those in languages with complex word formation rules or rich morphology [32]. This method improves the handling of unseen words and enhances the overall performance of NMT systems by facilitating better generalization, especially in low-resource settings where large corpora may not be available [33].
Zeeshan et al. [34] focused on developing an electronic dictionary for Chinese–Urdu translation using both LSTM and transformer architectures. Their study showed improvements in translation accuracy, though challenges persist due to the limitations of the available domain-specific language data. Similarly, Chan et al. [35] proposed incorporating Part-of-Speech (POS) sequence prediction into transformer models to refine translations. The system can generate more accurate translations by first predicting the target language’s POS sequence. However, this method can introduce errors if the POS tagging is inaccurate or incomplete, highlighting the need for high-quality, well-annotated data.
This literature review highlights several approaches and challenges in neural machine translation (NMT), particularly for low-resource language pairs, as summarized in Table 1. Traditional rule-based and corpus-based methods rely heavily on large datasets and linguistic rules, which are often inadequate for low-resource settings. Recent advancements in transformer-based models and techniques, such as back-translation and the integration of pre-trained language models, have shown significance in addressing some of these issues. Despite these advancements, there is a significant research gap in developing NMT models that can perform effectively with minimal data while maintaining high translation quality.

3. Materials and Methods

This section outlines the materials and methods used in this study. The dataset preparation, text direction, data preprocessing with tokenization, and proposed model architecture are explained. Figure 1 illustrates the research workflow.

3.1. Preliminaries

The neural machine translation (NMT) model, denoted by $\phi$, transforms a sentence $s$ from the source language into a sentence $t$ in the target language. The model utilizes a parallel training corpus $D = \{(s_i, t_i)\}_{i=1}^{M}$ of $M$ sentence pairs to optimize the log-likelihood of accurately predicting $t$ given $s$, with each pair $(s_i, t_i)$ assumed to be independent and identically distributed. The optimization objective is formulated as
$$\max_{\phi} \sum_{(s_i, t_i) \in D} \log p_{\phi}(t_i \mid s_i)$$
The model employs an encoder–decoder architecture. The encoder processes the source sentence into a sequence of hidden states. The decoder then constructs the target words based on these hidden states and the previously generated target words.

3.2. Dataset Preparation and Validation

Due to data scarcity in low-resource languages, a diverse Chinese and Urdu parallel corpus was developed through human effort and web crawlers (ParaCrawl, Bitextor, Common Crawl, and OpenNMT). The text was thoroughly reviewed and validated by experts in the native language and NLP. To ensure accuracy and reliability, we utilized various translators, including Google Translate (https://translate.google.com) and Baidu Translate (https://www.baidu.com). The checks focused on completeness, accuracy, uniform formats, naming conventions, and the removal of duplicates. Text consistency checks addressed spelling, grammar, and terminology. Sampling validation ensured accurate population representation for better model generalization and bias detection. Text alignment and statistical analysis were performed to evaluate linguistic diversity and coverage across domains. For each word $w_i$ in the dataset, the frequency $f_i$ was calculated using Equation (2):
$$f_i = \frac{\text{Number of occurrences of } w_i}{\text{Total number of words}}$$
This analysis helped identify common and rare words across the dataset. To measure the degree of alignment between the source and target texts, the correlation between sentence lengths in both languages was analyzed using the Pearson correlation coefficient [6]. The Type–Token Ratio (TTR) was computed to assess vocabulary richness; this ratio compares the number of unique words (types) to the total number of words (tokens). Lastly, we used expert, fluent reviewers in both languages to assess a random sample of the dataset for accuracy and naturalness. We also analyzed the frequency distribution of words and phrases to understand the dataset’s linguistic diversity.
In addition, we incorporated publicly available datasets.
  • The OPUS dataset is a collection of parallel corpora for various languages, including Chinese and Urdu, widely used for machine translation and linguistic research [41]. The dataset is available at (http://opus.nlpl.eu/).
  • The Workshop on Machine Translation (WMT) provides a benchmark in machine translation. It includes a variety of parallel corpora for different language pairs and is used in annual machine translation competitions [42].
  • The WiLi-2018 benchmark dataset of short text extracts from Wikipedia contains 1000 paragraphs for each of 235 languages, a total of 235,000 paragraphs. After data selection and preprocessing, we selected the same 45 paragraphs in Chinese and Urdu, using English as a pivot language [43].
These datasets were compiled and formatted in CSV format. Table 2 shows the details of the datasets. We consolidated these datasets into a single comprehensive dataset in a unified format to facilitate experiments on large corpora. The experiments were conducted dataset-wise and on the consolidated dataset to ensure thorough analysis and validation. The sizes of the datasets are denoted as $|D_{\text{train}}|$ for the training set, $|D_{\text{test}}|$ for the testing set, and $|D_{\text{val}}|$ for the validation set, split in a ratio of 70%, 15%, and 15%, respectively, where $D$ represents the dataset. These sizes were calculated using the lengths of the training, testing, and validation data, respectively, as follows:
$$|D_{\text{train}}|,\ |D_{\text{test}}|,\ |D_{\text{val}}| = \mathrm{len}(train\_data),\ \mathrm{len}(test\_data),\ \mathrm{len}(val\_data)$$
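A minimal sketch of this 70/15/15 split, assuming the consolidated corpus is already loaded as a list of sentence pairs (variable names are illustrative, not taken from the paper's code), is as follows:

```python
import random

def split_dataset(pairs, seed=42, train_ratio=0.70, test_ratio=0.15):
    """Split a list of (source, target) pairs into 70/15/15 train/test/validation sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = int(n * train_ratio)
    n_test = int(n * test_ratio)
    train_data = pairs[:n_train]
    test_data = pairs[n_train:n_train + n_test]
    val_data = pairs[n_train + n_test:]          # remaining ~15%
    return train_data, test_data, val_data

# |D_train|, |D_test|, |D_val| as in the equation above:
# train_data, test_data, val_data = split_dataset(corpus_pairs)
# sizes = len(train_data), len(test_data), len(val_data)
```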

3.3. Text Direction Analysis

Bidirectional Chinese–Urdu machine translation involves text direction analysis and its linguistic impact. Chinese and Urdu belong to the Sino-Tibetan and Indo-European language families, respectively. Chinese is written horizontally from left to right or vertically from top to bottom, whereas Urdu is written from right to left. This fundamental difference in text direction affects syntactic and semantic processing in natural language processing (NLP) tasks. A probabilistic autoregressive NMT model is used to calculate the conditional probabilities of sentences [44] given a parallel sentence pair $(c, u)$, where $c$ represents Chinese and $u$ represents Urdu. The primary task is to determine the original and translated side between language $C$ and language $U$. This is achieved by comparing the conditional translation probability $P(u \mid c)$ under an NMT model $M_{C \rightarrow U}$ with the conditional translation probability $P(c \mid u)$ under a model $M_{U \rightarrow C}$ operating in the opposite direction. We assume that NMT models assign higher conditional probabilities to translations than to originals, so if $P(u \mid c) > P(c \mid u)$, we predict that $u$ is the translation and $c$ is the original, i.e., the original translation direction is $C \rightarrow U$. Equation (3) illustrates this process, where $P(u \mid c)$ is obtained as a product of the individual token probabilities.
$$P(u \mid c) = \prod_{j=1}^{|u|} p(u_j \mid u_{<j}, c)$$
The average token-level log probability is represented in Equation (4):
$$P_{\text{tok}}(u \mid c) = \frac{\log P(u \mid c)}{|u|}$$
To detect the original translation direction (OTD), as shown in Equation (5), $P_{\text{tok}}(u \mid c)$ and $P_{\text{tok}}(c \mid u)$ are compared:
$$\mathrm{OTD} = \begin{cases} C \rightarrow U, & \text{if } P_{\text{tok}}(u \mid c) > P_{\text{tok}}(c \mid u) \\ U \rightarrow C, & \text{otherwise} \end{cases}$$
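A minimal sketch of this direction test, using a Hugging Face M2M100 checkpoint as an assumed scoring model (the `text_target` tokenizer argument requires a recent transformers version, and the built-in cross-entropy loss is used as an approximation of the average per-token log-probability), is as follows:

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

name = "facebook/m2m100_418M"        # assumed scoring checkpoint
tok = M2M100Tokenizer.from_pretrained(name)
model = M2M100ForConditionalGeneration.from_pretrained(name).eval()

def avg_token_logprob(src_text, src_lang, tgt_text, tgt_lang):
    """Approximate P_tok(target | source): average log-probability per target token."""
    tok.src_lang, tok.tgt_lang = src_lang, tgt_lang
    enc = tok(src_text, text_target=tgt_text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)            # loss = mean cross-entropy over target tokens
    return -out.loss.item()

def detect_direction(zh_sentence, ur_sentence):
    """Compare P_tok(u|c) and P_tok(c|u) as in Equation (5)."""
    p_u_given_c = avg_token_logprob(zh_sentence, "zh", ur_sentence, "ur")
    p_c_given_u = avg_token_logprob(ur_sentence, "ur", zh_sentence, "zh")
    return "C->U" if p_u_given_c > p_c_given_u else "U->C"
```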

3.4. Text Pre-Processing

This process removes special characters, HTML tags, punctuation, missing values, and extraneous whitespace. It also standardizes letter casing and corrects common misspellings. The main steps include replacing “&amp;” with “&” in the data. If a traditional Chinese character appears in the target sentence, it is converted to Simplified Chinese to ensure that the translation source is consistent. Consistency checks are performed on the custom dataset using FastAlign to verify validity. More precisely, inspired by back-translation [45], we use the Google API to provide alternative translations, helping generate ground-truth results. Specifically, the original training data were augmented by replacing phrases with semantically similar ones identified through word vectors; as a result, the augmented training data were 10 times larger than the original data. For Chinese, the Stanford NLP tool was employed due to its robust handling of logographic script. We utilized a custom-trained Urdu model capable of recognizing entities in the extended Arabic script.
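A minimal sketch of the basic cleaning steps (HTML entities, tags, stray characters, and whitespace) is shown below; the regular expressions are illustrative, and the Traditional-to-Simplified conversion and FastAlign checks are not reproduced here:

```python
import html
import re

def clean_text(text: str) -> str:
    """Minimal cleaning sketch for corpus sentences."""
    text = html.unescape(text)                  # e.g., turn '&amp;' back into '&'
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"[\u200b\ufeff]", "", text)  # remove zero-width/BOM characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse extraneous whitespace
    return text

# Example:
# clean_text("Research &amp; development <b>团队</b>  ")  ->  "Research & development 团队"
```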

3.5. Integrating Named Entity Recognition (NER) with BERT Encoder

We integrated NER into the BERT encoder to enhance the model’s capability to handle context-specific entities. This integration is implemented through a multi-step process to accurately identify and prioritize critical entities within source and target sentences. Initially, a pre-trained NER model is applied to the input sentences in both languages to detect entities such as names, locations, and organizations. Once these entities are identified, they are enclosed with special tokens, specifically ‘<NER_START>’ and ‘<NER_END>’, to explicitly delineate their boundaries. This tagging process ensures that the BERT encoder recognizes and differentiates these entities from the surrounding text. The tagged sentences are then fed into the BERT encoder, which processes them to generate enriched contextual embeddings that emphasize the identified entities. By doing so, the encoder is better equipped to maintain the semantic integrity of these entities during translation, ensuring that they are accurately and consistently translated across language pairs. This method improves the translation quality of entity-rich sentences and enhances the generated translations’ overall semantic coherence. During the training phase, the model learns to associate the special NER tokens with their corresponding entity types, further refining its ability to handle diverse and complex entity structures within the translation tasks.
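A minimal sketch of the entity-tagging step is given below; the span format produced by the NER model and the special-token registration calls (Hugging Face tokenizer API) are assumptions, and the actual Chinese and Urdu NER models are not reproduced here:

```python
NER_START, NER_END = "<NER_START>", "<NER_END>"

def tag_entities(text, spans):
    """Wrap each detected entity with <NER_START> ... <NER_END> markers.
    `spans` is a list of (start, end) character offsets, assumed sorted and non-overlapping."""
    out, prev = [], 0
    for start, end in spans:
        out.append(text[prev:start])
        out.append(f"{NER_START} {text[start:end]} {NER_END}")
        prev = end
    out.append(text[prev:])
    return "".join(out)

# Register the markers as special tokens so the BERT encoder treats them atomically
# (tokenizer/model here denote the pre-trained BERT tokenizer and encoder):
# tokenizer.add_special_tokens({"additional_special_tokens": [NER_START, NER_END]})
# model.resize_token_embeddings(len(tokenizer))
```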

3.6. In-Trust Loss Function

We utilized the incomplete-trust (in-trust) loss function to address noisy data within the parallel corpus, replacing the traditional cross-entropy loss. Noise in the training data can adversely affect the model’s performance, leading to overfitting on erroneous examples and degrading translation quality. Figure 2 depicts selected noisy samples from our dataset and their English descriptions for better understanding. Noise is present in both the source and target segments of the training corpus. Noise control aims to improve model robustness by minimizing the impact of noisy data. Regularization techniques are employed to filter out noisy data samples. For a dataset $D$ with potentially noisy pairs $(c_i, u_i)$, we define $D_{\text{clean}}$ as a subset of $D$ that is free of significant noise, as shown in Equation (6).
$$\mathcal{L}_{\text{clean}}(\theta) = \sum_{(c_i, u_i) \in D_{\text{clean}}} \mathcal{L}(\theta;\, c_i, u_i)$$
If the model overfits to these noisy samples, the translation performance of the final model is degraded. However, inspired by previous work [46], it is suggested that noisy datasets may also offer valuable insights. The incomplete-trust (in-trust) loss is therefore proposed as a substitute for the original cross-entropy loss function to decrease the disparity between synthetic and real examples affected by noise. The loss function quantifies the discrepancy between the predicted translation $\hat{t}$ and the actual target translation $t$. This new loss function is expressed as follows:
$$\mathcal{L}_{\text{In-trust}}(\theta) = -\sum_{(c, t) \in D} \Big[\, \alpha_t \log P(t \mid c; \theta) + (1 - \alpha_t) \log\big(1 - P(t \mid c; \theta)\big) \Big]$$
where the equation elements are defined as follows:
  • $\mathcal{L}_{\text{In-trust}}(\theta)$ is the incomplete-trust loss.
  • $(c, t)$ are the source and target translation pairs in the training dataset $D$.
  • $P(t \mid c; \theta)$ is the predicted probability of the target translation $t$ given the source sentence $c$ and model parameters $\theta$.
  • $\alpha_t$ is the trust factor for the target translation $t$, which adjusts the weight of each term in the loss function to account for the noise in the data.
This formulation ensures that the model trusts cleaner data while learning from noisy examples, thereby reducing the impact of overfitting on noisy samples.

Estimating the Trust Factor ( α t )

The trust factor $\alpha_t$ plays a pivotal role in balancing the influence of clean and noisy data within the in-trust loss function. To estimate $\alpha_t$, we employ a dynamic approach based on the confidence level of each training sample. The implementation involves several key steps:
Each training sample is initially assigned a baseline trust factor $\alpha_t$, typically set to 0.5, indicating an equal weighting between trusting and distrusting the data. After each training epoch, the model evaluates the loss $\mathcal{L}(c, t)$ for each sample on a validation set. Samples that consistently yield lower loss values (indicating higher confidence) have their $\alpha_t$ increased (e.g., to 0.7), amplifying their influence on the loss function. Conversely, samples with higher loss values undergo a decrease in $\alpha_t$ (e.g., down to 0.3), reducing their impact to mitigate the effect of potential noise.
To formalize the adjustment of $\alpha_t$, we introduce the following update rule:
$$\alpha_t^{(\text{new})} = \begin{cases} \alpha_t + \Delta\alpha, & \text{if } \mathcal{L}(c, t) < \tau \\ \alpha_t - \Delta\alpha, & \text{if } \mathcal{L}(c, t) \geq \tau \end{cases}$$
where the equation elements are defined as follows:
  • $\mathcal{L}(c, t)$ is the loss for sample $(c, t)$.
  • $\tau$ is a predefined loss threshold.
  • $\Delta\alpha$ is the increment/decrement value (e.g., 0.2).
This update rule ensures that $\alpha_t$ remains within the $[0, 1]$ range, maintaining stability during training. Integrating the in-trust loss function into our training process enhances the model’s robustness against noisy data, leading to improved translation performance.
The implementation of the in-trust loss function involves the following steps:
  • Initialization: Assign an initial trust factor $\alpha_t = 0.5$ to all training samples in $D$.
  • Training Iteration: For each epoch, compute the in-trust loss $\mathcal{L}_{\text{In-trust}}(\theta)$ using Equation (7).
  • Model Update: Perform backpropagation and update the model parameters $\theta$ accordingly.
  • Trust Factor Adjustment: After each epoch, evaluate the loss $\mathcal{L}(c, t)$ for each sample on a validation set and adjust $\alpha_t$ using Equation (8).
  • Normalization: Ensure that $\alpha_t$ remains within the valid range $[0, 1]$.
This adaptive mechanism allows the model to prioritize learning from cleaner data while extracting useful information from noisy samples.
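A minimal PyTorch sketch of the in-trust loss (Equation (7)) and the trust-factor update (Equation (8)) is given below; the tensor shapes, the ignore-index convention, and the function names are assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def in_trust_loss(logits, target_ids, alpha, ignore_index=-100, eps=1e-7):
    """Incomplete-trust loss sketch.
    logits: (batch, seq_len, vocab); target_ids: (batch, seq_len);
    alpha: per-sample trust factors, shape (batch,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    mask = (target_ids != ignore_index).float()
    safe_targets = target_ids.clamp(min=0)
    # log P(t|c) of the gold token at every position
    tgt_logp = log_probs.gather(-1, safe_targets.unsqueeze(-1)).squeeze(-1)
    tgt_p = tgt_logp.exp().clamp(eps, 1 - eps)
    a = alpha.unsqueeze(-1)                      # broadcast over sequence positions
    token_loss = -(a * tgt_p.log() + (1 - a) * (1 - tgt_p).log())
    return (token_loss * mask).sum() / mask.sum()

def update_trust(alpha, sample_loss, tau, delta=0.2):
    """Trust-factor update: raise alpha for low-loss samples, lower it otherwise,
    clipped to the [0, 1] range."""
    alpha = torch.where(sample_loss < tau, alpha + delta, alpha - delta)
    return alpha.clamp(0.0, 1.0)
```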

3.7. Converting Chinese Text to Pinyin

The conversion of Chinese characters into Pinyin turns them into the Latin alphabet, aiding NMT models that cannot handle Chinese scripts directly. Using Pinyin as an intermediary language offers a novel solution to avoid using English as a middle language. This improves translation efficiency by aligning Chinese phonetics directly with Urdu, reduces ambiguities, and enhances model performance. Research by [47] supports the effectiveness of phonetic representations in translation models. To address homophones in Pinyin, we use deep contextual analysis, enhanced tokenization, and post-processing corrections for accurate translations.
To demonstrate how Pinyin serves as an intermediary in the translation process, we present a snapshot of our dataset in Figure 3, which shows the alignment of Chinese, Pinyin, and Urdu.
Algorithm 1 utilizes the pinyin library to translate each Chinese character into its corresponding Pinyin representation. The text is then split into words based on spaces. Although the procedure was initially designed to allow inverting these words, the current implementation maintains the original order. To apply this conversion across the dataset containing Chinese text, the convert_to_pinyin function is mapped to each sentence in the dataset. This conversion facilitates the alignment of Chinese phonetics directly with Urdu, enhancing translation accuracy and efficiency.
Algorithm 1 Convert Chinese Characters to Pinyin and Maintain Word Order
1: procedure convert_to_pinyin(sentence)
2:     pinyin_result ← pinyin.get(sentence, format="strip", delimiter=" ")    ▹ Convert the Chinese characters in the sentence to Pinyin, keeping spaces between syllables.
3:     words ← split(pinyin_result, " ")    ▹ Split the Pinyin result into a list of words (syllables).
4:     final_sentence ← join(words, " ")    ▹ Rejoin the words (syllables) to form the final Pinyin sentence.
5:     return final_sentence    ▹ Return the final Pinyin sentence, ready for further processing.
6: end procedure
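For reference, the sketch below reproduces the logic of Algorithm 1 in Python using the pypinyin package as an assumed alternative to the pinyin library named above:

```python
from pypinyin import lazy_pinyin  # assumed alternative to the pinyin library in Algorithm 1

def convert_to_pinyin(sentence: str) -> str:
    """Convert Chinese characters to space-separated Pinyin syllables,
    keeping the original word order (no inversion is applied)."""
    syllables = lazy_pinyin(sentence)     # e.g., ['ni', 'hao', 'shi', 'jie']
    return " ".join(syllables)

# Applied column-wise to the dataset, e.g.:
# dataset["pinyin"] = dataset["chinese"].map(convert_to_pinyin)
print(convert_to_pinyin("你好世界"))  # -> "ni hao shi jie"
```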

3.8. Tokenization and OOV Management

The tokenizer transforms a sentence $s$ into a sequence of tokens $\{t_1, t_2, \ldots, t_n\}$, represented as $\mathrm{Tokenize}(s) = [t_1, t_2, \ldots, t_n]$. The SentencePiece approach for tokenization is applied to effectively manage languages with large vocabularies and different scripts without requiring pre-segmented text. The SentencePiece model, denoted as $SP$, segments a given input sequence $s$ into a sequence of subword units: $SP(s) = \{sw_1, sw_2, \ldots, sw_m\}$, where $\{sw_1, sw_2, \ldots, sw_m\}$ are the subword tokens derived from the input string $s$. This approach is effective for managing out-of-vocabulary (OOV) words, as it breaks down rare words into known subwords or morphemes.
The encoding function $E$ converts the sequence of tokens into a sequence of integer IDs, defined as $E(t_i) = id_i$ for each token $t_i$, where $id_i$ is the integer ID corresponding to the token $t_i$. Once the tokens are converted into integer IDs, they are transformed into a tensor format suitable for neural network processing by the function $T$: $X = T(\{id_1, id_2, \ldots, id_n\})$, where $X$ is the tensor that will be input into the NMT model. Under the SentencePiece unigram model, the probability of a segmented sentence is the product of its subword probabilities:
$$P(s) = \prod_{i=1}^{m} p(sw_i)$$
The tokenization $SP(s)$ that maximizes the log probability of the segmented sentence is chosen, as represented in the following equation:
$$\max \sum_{i=1}^{m} \log p(sw_i)$$
Each subword $sw_i$ is mapped to an integer ID $id_i$, which the proposed model will use. The tokenized sequences are then fed into the proposed NMT model, where the encoder encodes them and the decoder subsequently generates the target language output. This process forms the backbone of understanding and generating translations, as represented in the following equation:
$$P(c \mid u) = \prod_{j=1}^{|c|} p(c_j \mid c_{<j}, u)$$
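A minimal SentencePiece sketch of this tokenization step is shown below; the corpus file name, vocabulary size, and model prefix are illustrative assumptions:

```python
import sentencepiece as spm

# Train a shared subword model on the combined Chinese-Urdu text
# (file name and vocabulary size are illustrative, not taken from the paper).
spm.SentencePieceTrainer.train(
    input="zh_ur_corpus.txt", model_prefix="zh_ur_sp",
    vocab_size=32000, character_coverage=0.9995, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="zh_ur_sp.model")

sentence = "مصنوعی ذہانت تیزی سے ترقی کر رہی ہے"
subwords = sp.encode(sentence, out_type=str)   # SP(s) = {sw_1, ..., sw_m}
ids = sp.encode(sentence, out_type=int)        # E(t_i) = id_i
# `ids` can then be wrapped in a tensor X = T({id_1, ..., id_n}) for the NMT model.
```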

3.9. Bilingual Curriculum Learning

The bilingual curriculum learning strategy is employed to improve the efficiency and performance of our NMT model. Curriculum learning organizes training data from simple to complex, enhancing learning dynamics. Our setup’s curriculum is based on sentence complexity and translation difficulty between Chinese and Urdu. The process involves evaluating sentence pairs by length, complexity, and rare words. We sort the dataset into tiers, starting with simple sentences and progressively adding more complex ones. Training is performed in phases: Phase 1 focuses on easy sentences, Phase 2 introduces moderate complexity, and Phase 3 fine-tunes the model on complex sentences. This approach improves convergence, enhances generalization, and reduces overfitting.
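The sketch below illustrates one way such a curriculum could be constructed, scoring sentence pairs by length and rare-word count and splitting them into three training phases; the scoring weights and function names are illustrative assumptions rather than the paper's exact criteria:

```python
def difficulty(pair, rare_words):
    """Rough difficulty proxy: combined length plus a penalty per rare word.
    (Whitespace splitting is a simplification, especially for Chinese.)"""
    src, tgt = pair
    length = len(src.split()) + len(tgt.split())
    rare = sum(w in rare_words for w in src.split() + tgt.split())
    return length + 5 * rare

def build_curriculum(pairs, rare_words, n_tiers=3):
    """Sort pairs from easy to hard and split them into curriculum tiers/phases."""
    ranked = sorted(pairs, key=lambda p: difficulty(p, rare_words))
    tier_size = max(1, len(ranked) // n_tiers)
    tiers = []
    for i in range(n_tiers):
        start = i * tier_size
        end = None if i == n_tiers - 1 else (i + 1) * tier_size
        tiers.append(ranked[start:end])
    return tiers

# phase1, phase2, phase3 = build_curriculum(train_pairs, rare_vocab)
# Training then proceeds phase by phase, from easy to complex sentences.
```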

3.10. Proposed Model

The proposed model builds on the transformer-based M2M100 [16], a pre-trained multilingual encoder–decoder (sequence-to-sequence) model with an attention mechanism. The architecture of this model is depicted in Figure 4. It is well suited for generating diverse translation candidates due to its background in handling a variety of linguistic contexts. Initially, multiple translation options are produced using the M2M100 variants with enhancements in contextual analysis and multi-head attention.
The encoder processes the input in the source language (Chinese or Urdu) by converting each token into an embedding vector, $x_i$, and adding positional encodings, $p_i$, to incorporate the positional information of the tokens in the sequence:
$$e_i = x_i + p_i$$

3.10.1. BERT Embedding Integration

The encoder layers are initialized with BERT’s pre-trained weights to enhance the encoder’s ability to understand and represent the input text. After initialization, we fine-tune the BERT-augmented encoder layers on our datasets to adjust the pre-trained weights to the specifics of the translation tasks. The dimensions of the BERT embeddings ($d_{\text{bert}}$) differ from the expected dimensions of the M2M model’s encoder embeddings ($d_{\text{m2m}}$). A linear transformation is applied to match the dimensions:
$$e^{\prime}_{\text{bert}} = W_{\text{transform}}\, e_{\text{bert}} + b_{\text{transform}}$$
where $W_{\text{transform}}$ and $b_{\text{transform}}$ are the weights and biases of the transformation layer, respectively. This ensures that the transformed BERT embeddings ($e^{\prime}_{\text{bert}}$) are compatible with the M2M100 encoder.
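A minimal sketch of this dimension-matching step, with typical (assumed) dimensions $d_{\text{bert}} = 768$ and $d_{\text{m2m}} = 1024$, is as follows:

```python
import torch
import torch.nn as nn

class BertToM2MProjection(nn.Module):
    """Linear bridge e'_bert = W * e_bert + b between the BERT embedding size
    and the M2M100 encoder size (dimensions are typical values, assumed here)."""
    def __init__(self, d_bert: int = 768, d_m2m: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_bert, d_m2m)

    def forward(self, bert_embeddings: torch.Tensor) -> torch.Tensor:
        # bert_embeddings: (batch, seq_len, d_bert) -> (batch, seq_len, d_m2m)
        return self.proj(bert_embeddings)

# projected = BertToM2MProjection()(bert_hidden_states)
```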

3.10.2. Attention Mechanism

Some key modifications are introduced to the attention mechanisms, namely synchronous bidirectional attention and dynamic weight attention. Synchronous bidirectional attention enables the model to attend to both past and future tokens simultaneously within each layer, enhancing its understanding of context and improving its ability to capture long-range dependencies. This modification is achieved by adjusting the attention masks, which were originally unidirectional, to allow bidirectional context. Specifically, we modify the attention mask so the model can attend to all tokens in the sequence, regardless of their position. This bidirectional attention mask ensures that each token can attend to both preceding and upcoming tokens in the sequence, enabling a more comprehensive understanding of the input. The attention mask $M_{\text{bidirectional}}$ is set to 1 for all token pairs in the sequence.
This modification allows the model to leverage the full context of the input sequence, thereby improving translation accuracy, especially for sentences where relationships between distant tokens are critical. This leads to better handling of syntactic and semantic structures, resulting in more accurate translations.
The second modification, the refined attention mechanism, dynamically adjusts attention weights based on the importance of tokens. Token importance is identified through NER and semantic analysis: NER is applied at the token level to recognize key entities, while semantic analysis assesses the role and relevance of each token in the sentence. The importance of a token $t$ is computed as $I(t) = \mathrm{NER}(t) \cdot \mathrm{SemanticScore}(t)$, where $\mathrm{NER}(t)$ is a score indicating whether token $t$ is a named entity and $\mathrm{SemanticScore}(t)$ measures its contextual relevance.
Tokens with higher importance scores are given higher priority in the attention mechanism. The adjusted attention matrix $A^{\prime}$ is computed by multiplying the standard attention matrix $A$ with an importance modifier $\mathrm{ImportanceMod}(I(t))$: $A^{\prime} = A \cdot \mathrm{ImportanceMod}(I(t))$.
This dynamic adjustment prioritizes critical tokens—such as named entities and key concepts—enhancing the model’s ability to generate semantically accurate translations.
Additionally, attention weights are recalculated during each attention operation, making them context-dependent and allowing the model to adapt to the significance of different tokens as the input sequence is processed. The attention mechanism with dynamic weighting is expressed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + A^{\prime}\right)V$$
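A single-head PyTorch sketch of this dynamically weighted attention is shown below; the importance scores $I(t)$ are assumed to be precomputed per key position, and the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def dynamic_weight_attention(Q, K, V, importance):
    """Scaled dot-product attention with an additive importance bias:
    softmax(QK^T / sqrt(d_k) + A') V.
    Q, K, V: (batch, seq_len, d_k); importance: (batch, seq_len) scores I(t),
    e.g., NER(t) * SemanticScore(t), assumed precomputed."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5   # (batch, q_len, k_len)
    bias = importance.unsqueeze(1)                               # broadcast over query positions
    weights = F.softmax(scores + bias, dim=-1)
    return torch.matmul(weights, V)
```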
We employ a fallback mechanism to address named entities not present in the training data. In the case of an unseen named entity, we replace it with a predefined entity from a bilingual dictionary or an entity mapping list. If no direct match is found, the model relies on contextual clues from the surrounding tokens to infer the meaning of the entity. This ensures that the translation remains accurate and fluid, even in the presence of novel or rare entities. The fallback mechanism ensures the model can handle a wide range of named entities, improving its robustness, particularly for languages with limited resources.

3.10.3. Pre-Processing and Post-Processing

Applying layer normalization after the multi-head attention and feed-forward layers stabilizes the learning process and significantly enhances model performance. Dropout layers prevent overfitting by randomly setting some activations to zero during training. Implementing residual connections around the multi-head attention and feed-forward layers helps facilitate gradient flow and enhance the model’s ability to capture complex dependencies. This ensures that the output dimensions of the attention and feed-forward layers match the input dimensions. Equation (13) describes the process:
$$z = \mathrm{LayerNorm}\big(x + \mathrm{Dropout}(\mathrm{MultiHead}(Q, K, V))\big)$$
The decoder generates the target language output (Urdu/Chinese) one token at a time. The masked self-attention layer prevents attending to future tokens in the output sequence using a masking mechanism as follows:
$$\mathrm{MaskedAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + M\right)V$$
where M is the mask matrix that prevents the model from looking at future tokens. The encoder–decoder attention layer handles the queries from the previous decoder layer, with keys and values from the encoder output, allowing each position in the decoder to attend to all positions in the input sequence as:
$$\mathrm{Attention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) = \mathrm{softmax}\!\left(\frac{Q_{\text{dec}} K_{\text{enc}}^{T}}{\sqrt{d_k}}\right)V_{\text{enc}}$$
The Feed-Forward Neural Network transforms the representation after attention integration, with layer normalization and dropout applied to ensure the dimensions match for subsequent layers:
$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$
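A compact PyTorch sketch of one decoder layer, combining the masked self-attention, encoder–decoder attention, and feed-forward equations above with residual connections, dropout, and layer normalization, is given below; the hyperparameters and class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch of one decoder layer: masked self-attention, encoder-decoder attention,
    and a position-wise FFN, each wrapped with residual + dropout + LayerNorm."""
    def __init__(self, d_model=1024, n_heads=16, d_ff=4096, p=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=p, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=p, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p)

    def forward(self, x, enc_out, causal_mask):
        # Masked self-attention: future positions are blocked by causal_mask (M).
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self.drop(a))
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
        a, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + self.drop(a))
        # Position-wise feed-forward network with residual connection.
        return self.norm3(x + self.drop(self.ffn(x)))

# causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
```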
The proposed model is fine-tuned on our datasets to adapt it to our translation tasks. This process helps the model grasp the source language’s full semantic scope. The proposed model can produce translations that preserve longer texts’ coherence and intended meaning. Algorithm 2 and Table 3 provide an abstraction of the proposed model. The proposed model generates translations with two variants of M2M100, 418 M and 12 B, while the 1.2 B variant is used for re-ranking.

3.11. Re-Ranking Candidates

Re-ranking in neural machine translation (NMT) involves generating multiple candidate translations for a given input sentence and subsequently re-evaluating these candidates to select the best one. This process helps enhance translation accuracy and diversity by refining the initial outputs. The proposed model, as described in Section 3.10, is used to generate a set of candidate translations $C = \{T_1, T_2, \ldots, T_n\}$, where each $T_i$ represents a candidate translation. For each candidate, a set of features is extracted to inform the re-ranking decision. These features include the initial translation score, length, word alignments, translation errors, and coherence.
The initial translation score is directly obtained from the primary NMT model’s output probabilities. The translation length is simply the number of tokens in the candidate translation, which provides a measure of completeness and conciseness. Word alignments are evaluated using tools like FastAlign to assess the mapping quality between source and target words. Translation errors, such as grammatical mistakes, mistranslations, or missing entities, are identified using error detection models or rule-based checks. Translation coherence is evaluated by measuring the semantic consistency of the candidate translation with reference sentences or ensuring its overall logical flow.
Algorithm 2 Translation Model with In-Trust Loss and Tokenization
1:  Input: Dataset D, Clean Subset D_clean, Epochs epochs, Learning Rate lr
2:  Output: Trained Model
3:  Data Preprocessing:
4:  for each (c_i, u_i) ∈ D do
5:      if (c_i, u_i) is clean then
6:          Add (c_i, u_i) to D_clean
7:      end if
8:  end for
9:  Initialize M2M100Tokenizer, M2M100 model, BERT model, and SentencePiece model SP
10: function InTrustLoss(predictions, targets, α_t)
11:     return −(α_t · log(predictions) + (1 − α_t) · log(1 − predictions))
12: end function
13: Initialize Adam optimizer with learning rate lr
14: for epoch ← 1 to epochs do
15:     for each (source, target) ∈ D do
16:         optimizer.zero_grad()
17:         NER and OOV Management:
18:         Perform NER on source to recognize named entities
19:         Handle OOV words in source using subword units
20:         source_ids ← TokenizeAndEncode(source)
21:         target_ids ← TokenizeAndEncode(target)
22:         outputs ← M2M100_model(input_ids = source_ids, labels = target_ids)
23:         predictions ← softmax(outputs.logits, dim = −1)
24:         loss ← InTrustLoss(predictions, target_ids, Tensor([0.5]))
25:         loss.backward()
26:         optimizer.step()
27:         Print "Epoch:", epoch, "Loss:", loss.item()
28:     end for
29: end for
30: Re-ranking Process:
31: for each input sentence c do
32:     Generate candidate translations C = {T_1, T_2, …, T_n}
33:     for each candidate T_i ∈ C do
34:         Extract features for T_i (e.g., initial score, length, alignment, coherence)
35:         Assign re-ranked score S(T_i) using the M2M100-1.2B model
36:     end for
37:     Select best translation T* = argmax_{T_i ∈ C} S(T_i)
38: end for
39: return Trained Model
These features are then fed into a re-ranking neural network, which processes the information and produces a re-ranked score $S(T_i)$ for each candidate $T_i$. The candidate with the highest re-ranked score is selected as the final translation output, ensuring the best possible translation is chosen.
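A minimal sketch of such a feature-based re-ranking network is shown below; the feature set, layer sizes, and variable names are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class ReRanker(nn.Module):
    """Maps a per-candidate feature vector (initial model score, length,
    alignment quality, error count, coherence) to a scalar score S(T_i)."""
    def __init__(self, n_features: int = 5, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (n_candidates, n_features) -> scores: (n_candidates,)
        return self.net(features).squeeze(-1)

# scores = ReRanker()(candidate_features)
# best = candidates[scores.argmax()]        # T* = argmax_i S(T_i)
```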

Contrastive Re-Ranking Method

A contrastive re-ranking method is employed with the M2M100-1.2B model variant, as illustrated in Figure 5. This variant has a strong capability to evaluate and rank translations across different languages: training on a broad range of language pairs provides a fine-grained understanding of language that is beneficial for re-ranking based on subtleties in translation quality. Positive samples come from high-quality translations in the bilingual corpus, and negative samples are drawn from Diverse Beam Search [48] outputs. The re-ranker evaluates each candidate based on the extracted features and assigns a new score. It is implemented as a neural network that combines these features to predict the final translation quality, and the candidate with the highest score is selected as the best translation. Let $S(T_i)$ be the re-ranked score of candidate $T_i$. The best translation $T^{*}$ is given by
$$T^{*} = \arg\max_{T_i \in C} S(T_i)$$
In this setup, $h_x$ represents the hidden features of the input text and $h_{T_j}$ the features of the target samples, with $h_{T^{+}}$ and $h_{T_j^{-}}$ denoting positive and negative sample features, respectively. A non-linear projection layer is applied on top of M2M100 to refine these features. The contrastive objective over the positive and negative features, with temperature $\tau$, is
$$\mathcal{L} = -\log \frac{e^{\,\mathrm{sim}(h_x, h_{T^{+}})/\tau}}{e^{\,\mathrm{sim}(h_x, h_{T^{+}})/\tau} + \sum_{j=1}^{n} e^{\,\mathrm{sim}(h_x, h_{T_j^{-}})/\tau}}$$
This approach ensures that the selected translation is diverse and high-quality.
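A minimal PyTorch sketch of this contrastive objective is given below, assuming cosine similarity for $\mathrm{sim}(\cdot,\cdot)$ and precomputed projected features; the names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_rerank_loss(h_x, h_pos, h_negs, tau=0.1):
    """InfoNCE-style contrastive loss: pull the source representation h_x toward
    the positive translation h_pos and away from negative candidates h_negs
    (e.g., Diverse Beam Search outputs).
    h_x, h_pos: (d,); h_negs: (n, d); tau is the temperature."""
    sim_pos = F.cosine_similarity(h_x, h_pos, dim=0) / tau              # sim(h_x, h_T+)
    sim_negs = F.cosine_similarity(h_x.unsqueeze(0), h_negs, dim=1) / tau
    logits = torch.cat([sim_pos.unsqueeze(0), sim_negs])                # positive first
    # -log( exp(sim+) / (exp(sim+) + sum_j exp(sim_j^-)) )
    return -F.log_softmax(logits, dim=0)[0]
```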

4. Experimental Setup

4.1. Computational Environment

Python served as the primary programming language due to its extensive support for machine learning and natural language processing tasks. All experiments for evaluating the bidirectional Chinese–Urdu translation system on the parallel corpus were carried out using the PyTorch framework. The experiments were performed on a cloud server with a high-performance NVIDIA RTX 4090 GPU, which provides the necessary computational power for training large-scale translation models. The implementation of the proposed NMT model with all enhancements is detailed in Algorithm 2.

4.2. Performance Metrics

Standard metrics for translation quality are used to evaluate the performance of machine translation systems. BLEU focuses on the precision of n-grams, which counts how many n-grams in the machine translation match the reference translation. Then, it adjusts the overall length to avoid favoring overly short translations. ChrF++ is helpful for languages with difficult word segmentation, such as Chinese and Urdu. It is sensitive to the text’s lexical and morphological properties. METEOR compares the translation to the reference by aligning them at the word level. It evaluates the alignment using various parameters such as synonymy and stemming. ROUGE-L evaluates recall and precision using the longest common subsequence between a candidate and a reference translation. It measures how many identical word sequences appear in both texts, which makes it particularly effective at assessing fluency and the overall structure of the content. TER measures the number of edits required to change a system-generated translation into one of the reference translations.
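As an illustration, BLEU, chrF++, and TER can be computed with the sacrebleu toolkit (an assumed choice of evaluation library; METEOR and ROUGE-L require separate packages and are omitted here):

```python
import sacrebleu  # assumed evaluation toolkit

hypotheses = ["مصنوعی ذہانت تیزی سے ترقی کر رہی ہے"]
references = [["مصنوعی ذہانت بہت تیزی سے ترقی کر رہی ہے"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)                  # n-gram precision + brevity penalty
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)    # chrF++ (character + word n-grams)
ter = sacrebleu.corpus_ter(hypotheses, references)                    # edit distance to the reference

print(f"BLEU={bleu.score:.2f}  chrF++={chrf.score:.2f}  TER={ter.score:.2f}")
```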

4.3. Hyper-Parameter Tuning

To fine-tune and compare different hyperparameters for training the model, we define two sets of parameters: one for the fine-tuned approach and one for the pre-trained model. Below is a description of each, followed by Table 4, which summarizes the hyperparameters used in both scenarios.

5. Results

Table 5 presents the performance of the proposed translation model on different datasets and the combined dataset for translations between Chinese (Zh) and Urdu (Ur) in both directions. The proposed model demonstrates the highest overall performance across both translation directions on combined datasets. The model performs strongly in both translation directions, with a BLEU score of 68.21% for Zh→Ur and 69.37% for Ur→Zh. These results indicate that the combined datasets provide a more comprehensive training set and superior translation quality than the individual datasets. For the OPUS dataset, the model achieves a BLEU score of 54% (Zh→Ur) with ROUGE-L scores of 50.92%. Despite its moderate size, the OPUS dataset performs strongly due to its diverse and well-curated content and balanced data distribution. For the WMT dataset, the model performs slightly worse, with BLEU scores of 40.52% (Zh→Ur) and 39.74% (Ur→Zh). As the largest dataset, WMT provides valuable exposure to various linguistic structures. Though smaller, the Wili + Custom dataset still contributes significantly, with BLEU scores of 22% (Zh→Ur) and 23% (Ur→Zh). The Wili dataset’s specialized content enhances the model’s ability to handle specific terms, while the custom dataset offers highly relevant, tailored content for particular translation scenarios. The proposed model significantly improves Chinese–Urdu translation quality through effective pre-processing, careful adjustments, and strategic dataset selection.

5.1. Ablation Study

An ablation study was conducted to evaluate the impact of various components on the model’s performance. The study was carried out using the combined dataset, and the performance was measured using BLEU and TER across multiple runs for each variant, as shown in Table 6. The baseline models, M2M100_418 and M2M100_12B, were first assessed with the traditional cross-entropy (CE) loss function.
Incorporating BERT word embeddings led to an improvement of 0.39 in BLEU and a reduction of 0.96 in TER, highlighting the effectiveness of BERT’s contextual embeddings in improving translation quality, particularly in low-resource settings. Substituting the CE loss with the proposed in-trust loss function resulted in a more substantial performance boost, with BLEU increasing by 0.56 and TER decreasing by 0.0109. This demonstrates that the in-trust loss function is more robust in response to inaccuracies in the training data. The use of bilingual curriculum learning contributed to an additional improvement of 0.14 in BLEU and a reduction of 1.13 in TER, indicating its role in enhancing efficiency when dealing with low-resource corpora. Lastly, the application of contrastive re-ranking improved BLEU by 0.22. It reduced TER by 0.62, suggesting that this method aids in selecting the most accurate and contextually appropriate translations by increasing the diversity of candidates.
The performance of the proposed components was also evaluated across both translation directions (Zh→Ur and Ur→Zh), as presented in Table 7. While the improvements were consistent, the impact was slightly more pronounced in the Zh→Ur direction. In both cases, the in-trust loss function effectively reduced TER, underscoring its value in noisy, low-resource scenarios.
A comparison of convergence rates between the in-trust loss and the traditional CE loss function is shown in Figure 6 to further validate the benefits of the in-trust loss function. The in-trust loss function exhibited faster convergence during the early stages of training and smoother behavior in later stages, indicating a more stable and effective learning process, particularly in low-resource settings.

5.2. Syntactic Analysis

Our syntactic complexity analysis compared the average sentence length and the average number of clauses per sentence between the reference and translated texts, as shown in Figure 7. The blue bars represent the reference texts, and the orange bars represent the translations produced by the proposed model. Both metrics are closely matched: the translations show a slight increase in average sentence length from 31.74 to 32.32 words and a nearly identical average number of clauses per sentence (0.91 vs. 0.92). These results suggest that the model maintains the syntactic richness of the source text and preserves linguistic structures without oversimplification or unnecessary complication in the translations.
The length distribution analysis underscores the proposed model’s effectiveness in handling translations with a high degree of fidelity to the source text’s length and syntactic complexity. Figure 8 indicates that the model adeptly maintains a balance in translation length, with a preference for producing outputs that closely mirror the source in most cases.
NMT models often struggle with translating longer sentences. To examine this, test sentences are grouped by length and a BLEU score is calculated for each group, as depicted in the left panel of Figure 9. The right panel of Figure 9 plots the BLEU scores alongside the average translation length for each group. While the standard transformer and its variant perform well on shorter sentences, their effectiveness diminishes as sentence length increases. The proposed model addresses this shortfall by leveraging both past and future contextual information; the integration of synchronous bidirectional attention markedly enhances translation accuracy across all sentence-length groups.
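The per-length evaluation in Figure 9 can be reproduced with a few lines of scripting. The sketch below, assuming sacreBLEU and whitespace-delimited source tokens (Chinese sources would need segmentation first, and Chinese references the tokenize="zh" option), groups test sentences into length buckets and computes corpus BLEU per bucket; the bucket edges are arbitrary.

```python
import sacrebleu

def bleu_by_length(sources, hypotheses, references, edges=(10, 20, 30, 40)):
    """Group test sentences by source length and report corpus BLEU per group."""
    buckets = {}
    for src, hyp, ref in zip(sources, hypotheses, references):
        n = len(src.split())                                       # crude length proxy
        key = next((f"<= {e}" for e in edges if n <= e), f"> {edges[-1]}")
        hyps, refs = buckets.setdefault(key, ([], []))
        hyps.append(hyp)
        refs.append(ref)
    return {k: round(sacrebleu.corpus_bleu(h, [r]).score, 2)
            for k, (h, r) in buckets.items()}
```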

5.3. OOV and NER Analysis

The proposed model significantly improves the handling of out-of-vocabulary (OOV) terms in bidirectional Chinese–Urdu translation. The M2M100 variants were fine-tuned on a corpus containing diverse Chinese and Urdu sentences, which helped the model adapt to the linguistic characteristics of both languages. BERT embeddings provide deep contextual understanding, allowing the model to capture subtle meanings and associations within and across languages, particularly for OOV terms. Figure 10 shows the expected translations alongside the outputs of both models. For phrases like “High-energy particles”, the pre-trained model generated generic translations such as “Charged particles”, whereas the fine-tuned model produced the more contextually accurate “Particles with high energy”. Similarly, for “Digital currency”, the pre-trained model output “Number exchange”, whereas the fine-tuned model improved the output to “Cryptocurrency”. For “Artificial intelligence”, the pre-trained model produced “Artificial brain”, while the fine-tuned model generated the more appropriate “Machine intelligence”. In Urdu-to-Chinese translation, the phrase “Purpose of life” was translated by the pre-trained model as “Pursuit of life”, which the fine-tuned model refined to “Life’s goal”. The figure demonstrates that the fine-tuned model consistently produces more accurate, nuanced, and contextually relevant translations, addressing the pre-trained model’s limitations in handling complex and rare phrases.
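As a minimal illustration of the inference setup described above, the sketch below loads an M2M100 checkpoint with the Hugging Face transformers library and forces Urdu as the output language; the local fine-tuned checkpoint path is hypothetical, and the public facebook/m2m100_418M weights can be substituted to reproduce the pre-trained behaviour. Inspecting the tokenizer output shows how a rare phrase is decomposed into known subword units rather than mapped to an unknown token.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

ckpt = "./m2m100-zh-ur-finetuned"            # hypothetical fine-tuned checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(ckpt)
model = M2M100ForConditionalGeneration.from_pretrained(ckpt)

tokenizer.src_lang = "zh"
source = "高能粒子"                           # "high-energy particles"
print(tokenizer.tokenize(source))            # subword pieces used for a rare phrase

inputs = tokenizer(source, return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("ur"),   # force Urdu output
    num_beams=8,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```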
Figure 11 summarizes the NER evaluation on the Chinese–Urdu parallel test data, showing which entity types (person, location, organization, date, and event) were correctly identified, partially identified, or missed by the different model variants.

5.4. Translation Error Analysis

Figure 12 illustrates the range of translation errors encountered, which posed significant challenges to maintaining the integrity and accuracy of the translated content. These errors include addition errors, where unnecessary elements were introduced into the translation; omissions of key parts of the source text; ambiguous translations with multiple possible interpretations; and grammatical inaccuracies that disrupt the linguistic structure of the output. Contextual errors, in which the translation fails to reflect the situational context of the source material, accounted for 2.5% of all identified mistakes, while grammatical errors constituted a further 2%.
To address these challenges, the proposed model integrates a refined attention mechanism and BERT embeddings, effectively capturing the most relevant portions of the input text and minimizing the risk of introducing extraneous elements. Furthermore, in-trust loss, contrastive re-ranking, and bilingual curriculum learning significantly enhance the model’s contextual understanding, grammatical precision, and overall translation quality. As a result, the proposed model achieves substantial reductions across all error categories, as seen in Figure 12, particularly in spelling errors (reduced from 4% to 0.5%) and contextual errors (reduced from 2.5% to 1.1%). These improvements ensure greater fluency, accuracy, and readability of the translations, making them more reliable and contextually appropriate for real-world applications.
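To make the candidate re-ranking step concrete, the sketch below reuses the model and tokenizer loaded in the earlier sketch: it generates several beam candidates and re-ranks them with a weighted mix of the model's own sequence score and a cross-lingual similarity to the source. The LaBSE sentence encoder and the weighting factor lam are illustrative stand-ins, not a description of the paper's exact contrastive scorer.

```python
import torch
from sentence_transformers import SentenceTransformer

sim_model = SentenceTransformer("sentence-transformers/LaBSE")   # illustrative similarity scorer

def rerank_translate(model, tokenizer, source, num_candidates=8, lam=0.6):
    tokenizer.src_lang = "zh"
    inputs = tokenizer(source, return_tensors="pt")
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.get_lang_id("ur"),
        num_beams=num_candidates,
        num_return_sequences=num_candidates,
        output_scores=True,
        return_dict_in_generate=True,
    )
    candidates = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)
    model_scores = out.sequences_scores.cpu()                # length-normalised log-probs
    embs = sim_model.encode([source] + candidates, convert_to_tensor=True).cpu()
    sims = torch.nn.functional.cosine_similarity(embs[0:1], embs[1:])
    final = lam * model_scores + (1.0 - lam) * sims          # combined re-ranking score
    return candidates[int(final.argmax())]
```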
Figure 13 presents baseline and improved BLEU scores for four strategies: Longer Context, Domain-Specific, Cross-Lingual, and Post-Processing. Each strategy leads to a notable increase in BLEU, demonstrating its effectiveness in enhancing translation accuracy. Notably, the Cross-Lingual strategy yielded the largest gain, raising the BLEU score from 0.74 to 0.86, indicating its significant impact on translation quality.

Performance Analysis with OOV Words and NER

The proposed model significantly improves over the baseline in translating OOV terms accurately and contextually in both Chinese-to-Urdu and Urdu-to-Chinese translation, as shown in Figure 14. Subword tokenization, contextual BERT embeddings, and back-off strategies, such as copying the OOV word directly into the output or replacing it with a placeholder that is resolved during post-processing, allow the model to handle words absent from the training vocabulary. Decomposing OOV words into known subword units that are recombined during translation keeps the output accurate and meaningful even when new or rare words are encountered. The model's NER performance is measured with three key metrics, precision, recall, and F1 score, also shown in Figure 14. Pre-processing with NER, context-aware translation, and post-processing adjustments ensure that named entities are accurately translated and preserved, maintaining the integrity of the information and reducing the risk of errors. Precision evaluates how accurately the model identifies entities: higher precision means that when the model predicts an entity, it is correct more often, although high precision can come at the cost of missing some actual entities (lower recall). The improvements in precision across models show that the model makes fewer false-positive errors. Recall measures the model's ability to identify all relevant entities in the dataset; the increase in recall across models indicates that the enhancements help the model capture more true entities while missing fewer. The F1 score combines precision and recall into a single balanced measure, which is particularly valuable for assessing overall performance when identifying and correctly classifying entities are equally important.
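For clarity, precision, recall, and F1 in the NER evaluation can be computed over sets of (surface form, entity type) pairs, as in the small sketch below; the example entities are invented for illustration, and partial matches count as misses under this strict scheme.

```python
def entity_prf(gold, predicted):
    """Strict precision/recall/F1 over sets of (text, type) entity tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                              # exact matches only
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented example: gold vs. model-extracted entities from one translated sentence
gold = [("北京", "LOC"), ("2024年", "DATE")]
pred = [("北京", "LOC"), ("2024", "DATE")]
print(entity_prf(gold, pred))   # (0.5, 0.5, 0.5): the partially matched date counts as a miss
```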
The training and validation loss curves in Figure 15 show that the proposed model is well optimized and learns effectively from the data. The training loss decreases steadily across the 40 epochs, indicating that the model progressively fits the training data, and the smooth downward trajectory without sharp fluctuations implies a stable learning process. The decline in validation loss shows that performance on unseen data also improves as training proceeds. The consistent decrease in both loss curves, without sudden increases or instability, indicates that training proceeds smoothly without overfitting or underfitting.

5.5. Comparative Analysis with Baseline Models

The LSTM-based NMT model [49] is implemented via OpenNMT and provides a baseline for evaluating the effectiveness of our proposed Chinese–Urdu translation model. LSTM networks excel at handling sequential data and capturing long-term dependencies, which makes them valuable in translation tasks.
mBART [27] adopts the BART architecture and serves as a strong baseline for evaluating the proposed Chinese–Urdu translation model because it is pre-trained on diverse multilingual corpora that include Chinese and Urdu. It uses a sequence-to-sequence denoising auto-encoding pre-training objective over large-scale monolingual corpora. mBART is fine-tuned on the Chinese–Urdu dataset, enabling a direct comparison with the proposed model's performance.
GPT-2 [50] is a transformer-based language model developed by OpenAI and known for its strong text generation and translation capabilities. Despite being a general-purpose model, GPT-2’s ability to generate coherent and contextually relevant text makes it a valuable baseline for translation tasks. Fine-tuning GPT-2 on the Chinese–Urdu dataset allows for a comparison with the proposed model.
LLaMA 7B [51] is a smaller and resource-efficient version of the LLaMA model with 7 billion parameters. It balances performance and computational demands, making it a suitable baseline for tasks requiring strong translation capabilities. Fine-tuning LLaMA 7B on the Chinese–Urdu dataset enables a meaningful comparison with our proposed model.
The Google Translate API (https://cloud.google.com/translate/docs/reference/libraries/v3/python) offers robust multilingual NMT capability. We submit translation requests as JSON containing (1) the text to be translated (query), (2) the source language, and (3) the target language. Calling this public API allows us to compare an established commercial translation system with our approach in a low-resource context.
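A minimal sketch of such a request is shown below, using the public v2 REST endpoint of the Cloud Translation service with an API key; the key is a placeholder, and the v3 Python client referenced above can be used instead with the same three fields.

```python
import requests

API_KEY = "YOUR_API_KEY"                       # placeholder credential
URL = "https://translation.googleapis.com/language/translate/v2"

payload = {
    "q": "高能粒子在磁场中偏转。",                # (1) text to be translated (query)
    "source": "zh",                             # (2) source language
    "target": "ur",                             # (3) target language
    "format": "text",
}
resp = requests.post(URL, params={"key": API_KEY}, json=payload, timeout=30)
print(resp.json()["data"]["translations"][0]["translatedText"])
```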
The performance comparison in Table 8 shows that the proposed Chinese–Urdu translation model consistently outperforms the baselines in both translation directions (Zh→Ur and Ur→Zh), achieving the highest BLEU, METEOR, and chrF++ scores while maintaining the lowest TER, which indicates superior translation accuracy and fluency. LLaMA 7B delivers strong performance and trails the proposed model closely despite being smaller and more resource-efficient, suggesting that its architecture effectively captures linguistic nuances; however, it struggles with certain language-specific intricacies and cultural contexts in the Chinese–Urdu pair, which slightly lowers its overall performance. mBART, with its robust multilingual pre-training, performs well but also struggles with the specific intricacies of the Chinese–Urdu language pair, leading to slightly lower scores. GPT-2, while effective, falls short in capturing complex linguistic structures and contextual nuances. The LSTM-based model (OpenNMT) highlights the limitations of traditional recurrent architectures in translation tasks, especially in low-resource settings, and yields the lowest scores. The Google Translate API, although widely used, underperforms compared with the more specialized, fine-tuned models in this low-resource scenario.
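The automatic metrics in Table 8 can be computed with standard tooling; the sketch below uses sacreBLEU (version 2 or later for the word_order argument) for corpus-level BLEU, chrF++, and TER, with METEOR omitted because it requires a separate package such as nltk.

```python
import sacrebleu

def score_direction(hypotheses, references):
    """Corpus-level BLEU, chrF++ and TER for one translation direction."""
    refs = [references]                                               # single reference set
    return {
        "BLEU": sacrebleu.corpus_bleu(hypotheses, refs).score,
        "chrF++": sacrebleu.corpus_chrf(hypotheses, refs, word_order=2).score,
        "TER": sacrebleu.corpus_ter(hypotheses, refs).score,
    }
```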

Comparison with Existing Studies

As detailed in Table 9, the comparative analysis covers several studies on Chinese–Urdu neural machine translation (NMT) that focus on the challenges of low-resource language pairs and issues related to hallucination. Chen et al. developed a Chinese–Urdu NMT model integrating POS sequence prediction with the transformer architecture, achieving a BLEU score of 0.36 [35]. Zeeshan et al. implemented the OpenNMT framework using LSTM and RNN-based models, attaining a lower BLEU score of 0.18 [34]. A further study by Zeshan et al. compared LSTM with transformer models and demonstrated the superiority of the transformer, which significantly improved the BLEU score from 0.077 to 0.52, compared with 0.41 for LSTM [30]. A seq2seq NMT system for bidirectional Chinese–Urdu translation, built as a hybrid RNN with LSTM cells, achieved a BLEU score of 0.42 [31].
The proposed model stands out with the highest BLEU score of 0.69 among the studies focused on Chinese–Urdu low-resource language pairs, underscoring its effectiveness in mitigating semantic ambiguities and enhancing translation quality.

5.6. Limitation

The proposed NMT model demonstrates significant advances in translation quality for the Chinese–Urdu language pair, as confirmed across multiple evaluation metrics, yet several areas remain open for refinement. The reliance on enriched datasets, such as the custom Chinese–Urdu corpus, and the model's computational demands limit its scalability and its applicability to other low-resource languages and resource-constrained environments. Additionally, while using multiple automatic metrics provides a comprehensive assessment of translation quality, further exploration of complementary evaluation, such as human judgments, could offer deeper insight into the cultural and contextual appropriateness of the translations. Moreover, adapting components such as the contrastive re-ranking mechanism and the incomplete-trust loss function to other language pairs could significantly increase the model's generalizability. These limitations also highlight the need for further refinement in handling noisy data and long, complex sentence structures.

6. Conclusions

This study presents a novel approach to enhancing neural machine translation (NMT) for low-resource languages, with a specific focus on Chinese–Urdu translation. We use transformer-based M2M100 variants and integrate advanced techniques, including bilingual curriculum learning, contrastive re-ranking, a refined attention mechanism, and the in-trust loss function. These additions address critical challenges such as out-of-vocabulary (OOV) words and named entity recognition (NER), improving both translation quality and contextual accuracy. Experiments were conducted on both private and public datasets, and the back-translation technique played a pivotal role in enriching the training data and enabling the model to generalize effectively from limited data. The results show that our model outperforms the baseline models and those reported in existing studies. Specifically, the model achieved a BLEU score of 0.68 for Chinese-to-Urdu and 0.69 for Urdu-to-Chinese translation on the combined dataset, and 0.54 on the OPUS dataset. The ablation study further demonstrates the incremental improvements contributed by each component, confirming the model's robustness against data scarcity and linguistic complexity. Future work will focus on document-level translation, aiming to improve translation quality for longer texts and enhance contextual understanding across diverse domains.

Author Contributions

All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. BR24993166).

Data Availability Statement

We utilized publicly available datasets and appropriately cited them within the article. Upon completion of the ongoing research, the private dataset will be made available on a reputable platform.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Code Availability

This is ongoing research with future expansion plans. The code will be made available on a reputable public platform. In the meantime, the code developed for this article is available upon reasonable request from the corresponding author.


References

  1. Setiawan, I. The Role of Language in Preserving Cultural Heritage and Religious Beliefs: A Case Study on Oral Traditions in the Indigenous Sasak Community of Lombok, Indonesia. Pak. J. Life Soc. Sci. 2025. [Google Scholar] [CrossRef]
  2. Ramírez, J.G.C. Natural Language Processing Advancements: Breaking Barriers in Human-Computer Interaction. J. Artif. Intell. Gen. Sci. (JAIGS) 2024, 3, 31–39. [Google Scholar]
  3. Ameur, M.S.H.; Meziane, F.; Guessoum, A. Arabic machine translation: A survey of the latest trends and challenges. Comput. Sci. Rev. 2020, 38, 100305. [Google Scholar] [CrossRef]
  4. Mishra, R. A Comparative Analysis of Statistical and Neural Machine Translation Models. Integr. J. Sci. Technol. 2024, 1, 2. [Google Scholar]
  5. Abidin, Z.; Junaidi, A. Wamiliana Text Stemming and Lemmatization of Regional Languages in Indonesia: A Systematic Literature Review. J. Inf. Syst. Eng. Bus. Intell. 2024, 10, 217–231. [Google Scholar] [CrossRef]
  6. Buttar, P.K.; Sachan, M.K. A review of the approaches to neural machine translation. In Natural Language Processing and Information Retrieval; CRC Press: Boca Raton, FL, USA, 2023; pp. 78–109. [Google Scholar]
  7. Li, B.; Weng, Y.; Xia, F.; Deng, H. Towards better Chinese-centric neural machine translation for low-resource languages. Comput. Speech Lang. 2024, 84, 101566. [Google Scholar] [CrossRef]
  8. Lankford, S.; Afli, H.; Way, A. Human evaluation of English–Irish transformer-based NMT. Information 2022, 13, 309. [Google Scholar] [CrossRef]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  10. Kalyan, K.S.; Rajasekharan, A.; Sangeetha, S. Ammus: A survey of transformer-based pretrained models in natural language processing. arXiv 2021, arXiv:2108.05542. [Google Scholar]
  11. Ranathunga, S.; Lee, E.S.A.; Prifti Skenduli, M.; Shekhar, R.; Alam, M.; Kaur, R. Neural machine translation for low-resource languages: A survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
  12. Shukla, A.; Mishra, R. Unraveling the Complexities of Neural Machine Translation. Integr. J. Sci. Technol. 2024, 1. [Google Scholar]
  13. Moslem, Y. Language Modelling Approaches to Adaptive Machine Translation. arXiv 2024, arXiv:2401.14559. [Google Scholar]
  14. Chen, K.; Chen, B.; Gao, D.; Dai, H.; Jiang, W.; Ning, W.; Yu, S.; Yang, L.; Cai, X. General2Specialized LLMs Translation for E-commerce. arXiv 2024, arXiv:2403.03689. [Google Scholar]
  15. Shaukat, A.; Sadiq, A.H.B. Probing Translation Loss in the Urdu Translation of Alchemist. Harf-o-Sukhan 2024, 8, 300–306. [Google Scholar]
  16. Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V.; et al. Beyond english-centric multilingual machine translation. J. Mach. Learn. Res. 2021, 22, 1–48. [Google Scholar]
  17. Fakih, A.; Ghassemiazghandi, M.; Fakih, A.H.; Singh, M.K. Evaluation of Instagram’s Neural Machine Translation for Literary Texts: An MQM-Based Analysis. Gema Online J. Lang. Stud. 2024, 24. [Google Scholar] [CrossRef]
  18. Liu, Y.; Ma, Y.; Zhou, S.; Luo, X. A Survey of Research and Application of NLP-based Machine Translation. In Proceedings of the 2024 6th International Conference on Natural Language Processing (ICNLP), Xi’an, China, 22–24 March 2024; pp. 315–319. [Google Scholar]
  19. Lu, J.; Yin, F. Research on Improving the Quality of Japanese Chinese Machine Translation Based on Deep Learning; IOS Press: Amsterdam, The Netherlands, 2024. [Google Scholar]
  20. Zhou, M.; Duan, N.; Liu, S.; Shum, H.Y. Progress in neural NLP: Modeling, learning, and reasoning. Engineering 2020, 6, 275–290. [Google Scholar] [CrossRef]
  21. Hailu, F. Tigrigna-English Bidirectional Machine Translation Using Deep Learning. Ph.D. Thesis, St. Mary’s University, San Antonio, TX, USA, 2024. [Google Scholar]
  22. Ephrem, M. Development of Bidirectional Amharic-Tigrinya Machine Translation Using Recurrent Neural Networks. Ph.D. Thesis, St. Mary’s University, San Antonio, TX, USA, 2024. [Google Scholar]
  23. Chen, Z.; Han, B.; Wang, S.; Qian, Y. Attention-based encoder-decoder end-to-end neural diarization with embedding enhancer. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1636–1649. [Google Scholar] [CrossRef]
  24. Vathsala, M.; Holi, G. RNN based machine translation and transliteration for Twitter data. Int. J. Speech Technol. 2020, 23, 499–504. [Google Scholar] [CrossRef]
  25. Ashraf, M.R.; Jana, Y.; Umer, Q.; Jaffar, M.A.; Chung, S.; Ramay, W.Y. BERT Based Sentiment Analysis for Low-resourced Languages: A Case Study of Urdu Language. IEEE Access 2023, 11, 110245–110259. [Google Scholar] [CrossRef]
  26. Malik, M.S.I.; Cheema, U.; Ignatov, D.I. Contextual Embeddings based on Fine-tuned Urdu-BERT for Urdu threatening content and target identification. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 101606. [Google Scholar] [CrossRef]
  27. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
  28. Guerreiro, N.M.; Voita, E.; Martins, A.F. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. arXiv 2022, arXiv:2208.05309. [Google Scholar]
  29. Goyal, N.; Gao, C.; Chaudhary, V.; Chen, P.J.; Wenzek, G.; Ju, D.; Krishnan, S.; Ranzato, M.; Guzmán, F.; Fan, A. The flores-101 evaluation benchmark for low-resource and multilingual machine translation. Trans. Assoc. Comput. Linguist. 2022, 10, 522–538. [Google Scholar] [CrossRef]
  30. Khan, Z.; Zakira, M.; Slamu, W.; Slam, N. A study of neural machine translation from Chinese to Urdu. J. Auton. Intell. 2020, 2, 29–36. [Google Scholar]
  31. Zeeshan, J.; Zakira, M.; Niaz, M. A seq to seq machine translation from Urdu to Chinese. J. Auton. Intell. 2021, 4, 1–5. [Google Scholar]
  32. Liew, S.R.C.; Law, N.F. Use of subword tokenization for domain generation algorithm classification. Cybersecurity 2023, 6, 49. [Google Scholar] [CrossRef]
  33. Karthikeyan, M.; Mary Anita, E. Text classification; language-independent tokenization; sub word tokenization. Intell. Autom. Soft Comput. 2023, 35. [Google Scholar] [CrossRef]
  34. Zeeshan, Z.A.; Jawad, M.Z. Research on Chinese-Urdu machine translation based on deep learning. J. Auton. Intell. 2020, 3, 34–44. [Google Scholar]
  35. Chen, H.; Wang, J.; Muhammad, N.U.H. Chinese-Urdu neural machine translation interacting POS sequence prediction in Urdu language. Comput. Eng. Sci. 2024, 46, 518. [Google Scholar]
  36. Ortiz-Garces, I.; Govea, J.; Andrade, R.O.; Villegas-Ch, W. Optimizing Chatbot Effectiveness through Advanced Syntactic Analysis: A Comprehensive Study in Natural Language Processing. Appl. Sci. 2024, 14, 1737. [Google Scholar] [CrossRef]
  37. Qiu, J.; Li, S. A multi-encoder model for automatic code comment generation. In Proceedings of the Fourth International Conference on Sensors and Information Technology (ICSI 2024), Xiamen, China, 5–7 January 2024; SPIE: Bellingham, WA, USA, 2024; Volume 13107, pp. 78–85. [Google Scholar]
  38. Li, G.; Zhao, X.; Wang, X. Quantum self-attention neural networks for text classification. Sci. China Inf. Sci. 2024, 67, 1–13. [Google Scholar] [CrossRef]
  39. Bensalah, N.; Ayad, H.; Adib, A.; Ibn El Farouk, A. CRAN: An hybrid CNN-RNN attention-based model for Arabic machine translation. In Networking, Intelligent Systems and Security: Proceedings of NISS 2021; Springer: Singapore, 2022; pp. 87–102. [Google Scholar]
  40. Dowling, M. An Investigation of English-Irish Machine Translation and Associated Resources. Ph.D. Thesis, Dublin City University, Dublin, Ireland, 2022. [Google Scholar]
  41. Tiedemann, J.; Aulamo, M.; Bakshandaeva, D.; Boggia, M.; Grönroos, S.A.; Nieminen, T.; Raganato, A.; Scherrer, Y.; Vazquez, R.; Virpioja, S. Democratizing neural machine translation with OPUS-MT. Lang. Resour. Eval. 2023, 58, 713–755. [Google Scholar] [CrossRef]
  42. Kocmi, T.; Avramidis, E.; Bawden, R.; Bojar, O.; Dvorkovich, A.; Federmann, C.; Fishel, M.; Freitag, M.; Gowda, T.; Grundkiewicz, R.; et al. Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; Koehn, P., Haddow, B., Kocmi, T., Monz, C., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 1–42. [Google Scholar] [CrossRef]
  43. Thoma, M. The WiLI benchmark dataset for written language identification. arXiv 2018, arXiv:1801.07779. [Google Scholar]
  44. Wastl, M.; Vamvas, J.; Sennrich, R. Machine Translation Models are Zero-Shot Detectors of Translation Direction. arXiv 2024, arXiv:2401.06769. [Google Scholar]
  45. Yan, Y.; Song, J.; Fu, B.; Ye, N.; Shi, X. Automatic Reference-Free Fine-Grained Machine Translation Error Detection via Named Entity Recognition and Back-Translation. In International Conference on Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2024; pp. 306–317. [Google Scholar]
  46. Huang, X.; Chen, Y.; Wu, S.; Zhao, J.; Xie, Y.; Sun, W. Named Entity Recognition via Noise Aware Training Mechanism with Data Filter. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2021; pp. 4791–4803. [Google Scholar] [CrossRef]
  47. Yang, J.; Wu, S.; Zhang, D.; Li, Z.; Zhou, M. Improved neural machine translation with Chinese phonologic features. In Proceedings of the Natural Language Processing and Chinese Computing: 7th CCF International Conference, NLPCC 2018, Hohhot, China, 26–30 August 2018; Proceedings, Part I 7. Springer: Berlin/Heidelberg, Germany, 2018; pp. 303–315. [Google Scholar]
  48. Hotate, K.; Kaneko, M.; Komachi, M. Generating diverse corrections with local beam search for grammatical error correction. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 2132–2137. [Google Scholar]
  49. Khan, A.; Sarfaraz, A. RNN-LSTM-GRU based language transformation. Soft Comput. 2019, 23, 13007–13024. [Google Scholar] [CrossRef]
  50. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  51. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
Figure 1. Proposed Chinese–Urdu neural machine translation model workflow.
Figure 2. Noisy data samples in bilingual translation.
Figure 3. A snapshot of the dataset showing Chinese, Pinyin, and Urdu translations. Pinyin is an intermediary language to improve translation accuracy between Chinese and Urdu, aligning phonetic representations directly with the target language.
Figure 4. Overview of proposed transformer-based Seq2Seq M2M100 encoder-decoder architecture [16].
Figure 5. Overview of re-ranking process.
Figure 6. Loss convergence comparison between cross-entropy and proposed in-trust loss.
Figure 7. Comparison of syntactic complexity.
Figure 8. Distribution of length ratios.
Figure 9. Performance analysis of the translation on test data concerning sentence length.
Figure 10. Translation evaluation with OOV words.
Figure 11. Named entity recognition evaluation.
Figure 12. Distribution of translation errors.
Figure 13. Translation quality with different aspects of phrases.
Figure 14. Comparative analysis of OOV words and NER in translation quality metrics.
Figure 15. Training and validation loss.
Table 1. Literature review summary.

Ref. | Datasets | Model and Contribution | Limitations
[6] | Flores, WMT, OPUS | Review of approaches in NMT | Does not address linguistic diversity and noise in data
[19] | OPUS | Research on improving MT quality | Requires large parallel corpora and struggles with data sparsity
[13] | OPUS, TICO-19 | Corpus-based MT using algorithms and models | Struggles with handling OOV effectively
[36] | SemEval | Optimizing chatbot effectiveness | Limited syntactic and semantic understanding
[37] | Tibetan–Chinese | NMT for Tibetan–Chinese translation using RNN | Inadequate handling of complex sentences and long-range dependency
[38] | Yelp, IMDB, Amazon | Quantum self-attention networks for text classification | Complex implementation and not specifically designed for low-resource language pairs
[22] | Amharic and Tigrinya | Bidirectional Amharic–Tigrinya MT using RNN and LSTM | Limited applicability to other low-resource language pairs
[21] | Tigrinya–English | Bidirectional Tigrigna–English MT using deep learning | Requires significant amounts of training data and has limited syntactic understanding
[39] | OPUS | Hybrid CNN-RNN attention-based model for Arabic MT | Focuses on Arabic and lacks comprehensive context awareness
[40] | ZH-CN SMS chat | Investigation of MT for Chinese–English SMS chat translation | Focuses on a narrow application area and lacks OOV and NER handling
[25] | Urdu text | Multilingual BERT for sentiment analysis in Urdu | Limited to sentiment analysis and inadequate handling of complex sentences
[26] | Urdu text | Threatening language detection in Urdu | Lacks syntactic and semantic depth
[28] | FLORES-101 | Use of M2M model for hallucination detection | Lacks context awareness
[16] | Many-to-Many | Beyond English-centric MT with M2M100 model | No generalization for low-resource languages (Chinese, Urdu)
Table 2. Chinese–Urdu corpus data.

Corpus | Sentences | Zh Tokens | Ur Tokens | Training | Testing | Validation
OPUS | 493,042 | 1,189,539 | 899,376 | 345,129 | 73,956 | 73,957
WMT | 608,405 | 1,348,494 | 1,106,496 | 425,883 | 91,260 | 91,262
WiLi | 7938 | 33,357 | 56,894 | 5556 | 1190 | 1192
Custom | 56,332 | 130,569 | 104,742 | 39,432 | 8449 | 8451
Total | 1,165,717 | 2,701,959 | 2,167,508 | 815,000 | 174,855 | 174,862
Table 3. Key components of the M2M100 model implementation.

Component | Attribute | Value
Shared Embeddings | Dimension | 1024
Shared Embeddings | Padding Index | 1
BERT Embedding Integration | Initialization | Pre-Trained BERT Weights
BERT Embedding Integration | Dimension Matching | Linear Transformation to 1024
Encoder | Layer Count | 12
Encoder | Attention Mechanism | Multi-Head Attention
Encoder | Attention Heads | 8
Decoder | Layer Count | 12
Decoder | Includes | Cross-Attention
Decoder | Attention Heads | 8
Self-Attention | Projections | 8 × 1024
Multi-Head Attention | Projections | 8 × 1024
Multi-Head Attention | Head Dimension | 128
Feed-Forward Networks | Input/Output | 1024/4096, 4096/1024
Layer Normalization | Epsilon | 1 × 10−5
Dropout | Rate | 0.1
Residual Connections | Implemented Around | Multi-Head Attention and Feed-Forward Layers
Language Model Head | Output Size | 128,112
Table 4. Comparison of generic and fine-tuned hyperparameter settings for enhanced M2M-100 model.

Parameter | Generic Settings | Fine-Tuned Settings
Number of Training Epochs | 4 | 40
Training Batch Size per GPU | 16 | 24
Save Steps | 2 | 2
Evaluation Strategy | Epoch | Epoch
Learning Rate | 2 × 10−5 | 3 × 10−5
Optimizer | AdamW | AdamW
Beam Search Size | 5 | 8
Dropout Rate | 0.1 | 0.2
Gradient Clipping | 1.0 | 0.8
Warm-up Steps | 500 | 1400
Attention Dropout | 0.1 | 0.15
Weight Decay | 0.01 | 0.03
BERT Embedding Usage | No | Yes
Multi-Head Attention Heads | 8 | 12
Re-Ranking Strategy | None | Applied
Table 5. Model performance on datasets and combined dataset (%).

Corpus | Zh→Ur (BLEU / METEOR / chrF++ / ROUGE-L) | Ur→Zh (BLEU / METEOR / chrF++ / ROUGE-L)
Combined Dataset | 68.21 / 55.34 / 75.11 / 71.19 | 69.37 / 53.51 / 74.33 / 72.11
OPUS | 54.23 / 48.67 / 62.11 / 50.92 | 53.15 / 47.89 / 59.34 / 51.05
WMT | 40.52 / 43.12 / 54.89 / 48.34 | 39.74 / 41.62 / 62.34 / 46.71
WiLi + Custom | 22.87 / 32.45 / 27.63 / 43.12 | 23.11 / 25.89 / 38.72 / 31.45
Table 6. Step-by-step experiments using different methods and their impact on final performance, where higher BLEU scores (↑) indicate better translation quality, and lower TER scores (↓) represent fewer translation errors.

Method | BLEU-Avg ↑ (%) | TER-Avg ↓ (%)
M2M100_418 (CE loss) | 63.24 | 51.91
M2M100_12B (CE loss) | 64.52 | 44.79
w/BERT word embedding | 64.91 (+0.39) | 43.83 (−0.96)
w/In-trust loss | 65.47 (+0.56) | 42.74 (−1.09)
w/Bilingual curriculum learning | 65.61 (+0.14) | 41.61 (−1.13)
w/Contrastive re-ranking | 65.83 (+0.22) | 40.99 (−0.62)
Table 7. Ablation study with each proposed component evaluated bidirectionally, where higher BLEU scores (↑) indicate better translation quality, and lower TER scores (↓) represent fewer translation errors.

Method | Zh→Ur (BLEU ↑ / TER ↓, %) | Ur→Zh (BLEU ↑ / TER ↓, %)
M2M100_418 (CE loss) | 63.24 / 43.42 | 64.42 / 59.77
M2M100_12B (CE loss) | 65.54 / 43.42 | 66.01 / 59.77
w/BERT word embedding | 66.15 / 42.16 | 67.37 / 58.29
w/In-trust loss | 66.46 / 41.78 | 67.73 / 57.81
w/Bilingual curriculum learning | 67.01 / 42.34 | 68.10 / 58.31
w/Contrastive re-ranking | 67.17 / 42.08 | 68.29 / 57.92
Table 8. Performance comparison of proposed model and baselines.

Model | Zh→Ur (BLEU / METEOR / chrF++ / TER, %) | Ur→Zh (BLEU / METEOR / chrF++ / TER, %)
OpenNMT | 64.8 / 51.3 / 71.1 / 35.6 | 63.5 / 49.7 / 67.2 / 46.2
mBART | 65.2 / 52.8 / 72.5 / 68.2 | 64.5 / 51.1 / 66.8 / 64.8
GPT-2 | 66.4 / 56.9 / 70.2 / 69.1 | 66.8 / 53.2 / 69.7 / 65.7
LLaMA 7B | 66.8 / 60.2 / 74.0 / 66.8 | 67.5 / 59.7 / 73.2 / 66.3
Google Translate API | 65.7 / 51.1 / 69.0 / 44.5 | 66.0 / 54.5 / 68.5 / 35.0
Proposed Model | 68.2 / 55.3 / 75.1 / 71.1 | 69.3 / 53.5 / 74.3 / 72.1
Table 9. Comparative analysis with state-of-the-art models.

Ref | Year | Model | Language Pair | BLEU Score
[31] | 2021 | Open NMT, LSTM, and RNN | Chinese ↔ Urdu | 0.18
[30] | 2020 | NMT and LSTM | Chinese ↔ Urdu | 0.42
[34] | 2020 | LSTM | Chinese ↔ Urdu | 0.41
[34] | 2020 | Transformer | Chinese ↔ Urdu | 0.52
[35] | 2024 | Transformer for POS | Chinese ↔ Urdu | 0.36
Proposed Method | — | M2M100 with in-trust loss and re-ranking | Chinese → Urdu | 0.68
Proposed Method | — | M2M100 with in-trust loss and re-ranking | Urdu → Chinese | 0.69
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
