1. Introduction
Language is a fundamental tool for communication and plays a significant role in preserving cultural heritage. It is a powerful medium for fostering connections and understanding between nations, particularly in global initiatives like the ‘Belt and Road’ initiative [
1]. Effective communication between different language groups is essential for facilitating economic and cultural exchanges. However, overcoming language barriers remains a significant challenge, especially in low-resource languages, where translation accuracy and fluency are fundamental. In this context, machine translation (MT) becomes an essential tool for bridging these language gaps.
Machine translation (MT) is a field at the intersection of linguistics, computer science, artificial intelligence, and specifically natural language processing (NLP) [
2]. The primary aim of MT is to automatically translate text from one source language to a target language, thereby enhancing communication and mutual understanding. Early MT approaches relied on rule-based methods, where linguistic rules were manually created to translate text [
3]. For instance, rule-based systems like SYSTRAN use predefined grammar rules and dictionaries to map phrases between languages. While effective for simple translations, these systems struggled with complex sentence structures, idiomatic expressions, and contextual variations.
With the advent of statistical methods [
4], MT systems evolved to use statistical models trained on large bilingual corpora. These models could learn translation patterns from data, improving accuracy by capturing more subtle language nuances. Statistical models relied on feature engineering, where features like word alignments and phrase pairings were used to improve translation [
5]. However, these methods still struggled to capture context fully and to provide fluent translations, particularly with complex syntactic structures, such as varying word order or sentence construction, and idiomatic expressions like “break a leg”, which could be misinterpreted if translated literally.
The introduction of neural machine translation (NMT) marked a significant advancement in the field of machine translation [
6,
7]. NMT uses artificial neural networks, particularly architectures like recurrent neural networks (RNNs) [
8] and transformer-based models [
9], to translate text in an end-to-end manner. Unlike previous models, which relied on predefined rules or statistical data, NMT learns translation patterns directly from vast amounts of data, producing more fluent, contextually accurate, and natural translations [
10]. Rather than translating isolated words, NMT models take the surrounding sentence or passage into account, which tends to yield output that is grammatical, idiomatic, and closer to how a human would phrase the same content. For example, when translating the word ‘bank’, a context-aware model can use the surrounding text to decide whether it refers to a financial institution or the side of a river. These improvements stem primarily from deep learning techniques, which allow the system to learn the intricate relationships between words and phrases within a given context.
Despite the success of NMT for high-resource language pairs like English–French [6], challenges remain for low-resource language pairs such as Chinese–Urdu. Low-resource NMT must cope with scarce parallel corpora, limited linguistic resources, and constrained computational power [11]. Even though large populations speak both Chinese and Urdu, high-quality parallel corpora for this pair are limited [12]. This data sparsity presents a challenge for NMT models, as they have fewer examples to learn from, particularly regarding rare words, idiomatic expressions, and complex sentence structures. The limited data also restricts the model’s ability to generalize linguistic patterns, making it difficult for NMT systems to handle domain-specific contexts [13].
A key issue in NMT for low-resource languages is handling out-of-vocabulary (OOV) words which were not encountered during the training phase. These OOV words often present significant syntactic and linguistic challenges [
14], especially when translating between languages with very different linguistic structures. For instance, Chinese and Urdu have very different word order and morphology, which makes it difficult for the model to handle certain words effectively. To address this, named entity recognition (NER) [
15] plays a vital role in improving the contextual accuracy of translations. NER helps identify and classify essential terms, such as people, organizations, and locations, ensuring these elements are correctly translated. NER capabilities also assist in syntactic parsing, which is the process of analyzing the grammatical structure of a sentence to understand how different words relate to each other. This parsing allows the model to handle complex sentence structures better and ensures the translated text is syntactically correct.
In response to these challenges, our research aims to improve NMT performance for Chinese↔Urdu translation by addressing data scarcity and linguistic complexity. We leverage pre-trained models and curated datasets to enhance the model’s training process. Our approach builds on the transformer-based M2M-100 model [16], which is trained on a “many-to-many” dataset covering 7.5 billion sentences across 100 languages. This model enhances translation quality by improving context awareness and coherence while addressing issues such as OOV words. Furthermore, integrating NER within the translation pipeline helps ensure that critical named entities are accurately translated, reducing errors and omissions. Back-translation, in which text is translated from Urdu to Chinese and vice versa to create synthetic parallel corpora, further enriches the training data. These techniques collectively optimize the model’s performance, enabling it to handle diverse linguistic patterns and improve its ability to produce high-quality translations for low-resource language pairs like Chinese and Urdu.
Contributions
The key contributions of the proposed model are as follows.
By employing back-translation, the model effectively combats the challenge of data scarcity and sparsity in low-resource language pairs.
This paper utilizes the incomplete-trust (in-trust) loss function to replace cross-entropy loss. This loss function enhances model robustness by reducing the impact of noisy data during training, thereby helping to prevent overfitting and improving generalization.
A contrastive re-ranking approach is employed to refine the translation output by evaluating and selecting the most accurate candidate translation from multiple candidate translations.
The transformer-based architecture of M2M with BERT embeddings and attention mechanisms allows the model to maintain a higher level of context awareness throughout the translation process. The model’s training includes focused layers on semantic parsing and understanding the relationships between words and phrases.
Incorporating NER tagging within the translation workflow ensures that the model prioritizes and accurately translates critical named entities or specific terms.
The proposed model aims to overcome the primary challenges described above and to deliver more accurate, fluent, and contextually relevant translations across diverse linguistic and domain-specific settings. The remainder of the article is organized as follows. Section 2 presents a critical review of state-of-the-art NMT paradigms. Section 3 introduces the research methodology and its implementation process. Section 4 describes the experimental structure. Section 5 presents the outcomes of the proposed model and compares its effectiveness with that of the existing model. Finally, Section 6 provides concluding remarks and discusses potential future research directions.
2. Literature Review
Machine translation systems are generally categorized based on their underlying architecture. These categories include rule-based machine translation (RBMT), corpus-based machine translation (CBMT), hybrid approaches, and neural machine translation (NMT) [
3,
5,
6], each offering distinct methods for overcoming translation challenges. RBMT is one of the earliest MT approaches, relying heavily on linguistic rules to translate text between languages. There are two key strategies within RBMT: the transfer-based approach and the interlingua-based approach. The transfer approach translates the source language into the target language by applying syntactic and semantic rules and bilingual dictionaries. In contrast, the interlingua approach generates an intermediate semantic representation that can be translated into any language, bypassing the need for direct source-to-target mapping [
3]. RBMT systems require extensive knowledge of source and target languages. The need for comprehensive rule creation often limits them, making them less scalable for language pairs with limited linguistic resources [
17].
CBMT emerged to overcome the limitations of rule-based systems. Liu et al. [
18] utilized this approach on large bilingual corpora to learn translation patterns automatically. CBMT systems rely on statistical and probabilistic models, which map phrases in the source language to their corresponding translations in the target language. The shift from rule-based to corpus-based systems allowed for greater scalability, particularly for languages with extensive text corpora available [
13]. However, CBMT still requires significant amounts of parallel data to ensure high-quality translations, and it struggles in low-resource language pairs where such data are scarce.
Hybrid approaches aim to combine the strengths of both RBMT and CBMT. These systems integrate linguistic knowledge, such as rules and dictionaries, with data-driven methods to enhance translation quality [
5]. Hybrid models benefit low-resource language pairs or specialized domains where purely rule-based or data-driven methods may fail to capture all nuances [
19]. These approaches attempt to leverage the best of both worlds—rule-based precision and corpus-based flexibility.
The most significant breakthrough in MT in recent years has been the advent of neural machine translation (NMT) [
6]. NMT uses deep learning techniques, particularly artificial neural networks (ANNs), to learn translations directly from data without the need for predefined linguistic rules [
7]. Unlike RBMT and CBMT, which require explicit mappings or rules, NMT systems learn a sequence-to-sequence translation process from large parallel corpora, making them more adaptable to various languages and domains. A key feature of NMT is its ability to generate translations that account for the context of entire sentences rather than just word-to-word translations [
20].
One of the early successes of NMT was in the development of the sequence-to-sequence (Seq2Seq) model, which relies on recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) networks [
21]. These models performed well on translation tasks where previous methods struggled, particularly with long-range dependencies in sentences [
22]. The utilization of the attention mechanism further improved the Seq2Seq model by allowing the system to focus on the most relevant parts of the source sentence during translation, improving the fluency and accuracy of the output [
23].
The introduction of transformer-based models by Vaswani et al. marked a significant turning point in NMT [
9]. Unlike RNNs and LSTMs [
24], which process data sequentially, transformers leverage a self-attention mechanism that allows them to process the entire input sequence simultaneously. This enables parallel computation, significantly speeding up training and inference times. Transformers excel at handling languages with complex syntactic structures and long-range dependencies, making them particularly well suited for languages like Chinese and Urdu, which have significant differences in both syntax and semantics.
Several studies have demonstrated the effectiveness of transformer models in low-resource languages. For example, multilingual BERT has been successfully adapted for sentiment analysis in Urdu [
25], showing how transformer models can process and understand underrepresented languages. Similarly, Malik et al. [
26] used transformers to detect threatening language in Urdu, emphasizing their ability to maintain semantic integrity in sensitive-content areas. The ability of transformers to manage diverse languages, even those with limited resources, has contributed to their dominance in the NMT landscape.
Transformer models have also enabled the development of multilingual and cross-lingual models, which can simultaneously perform translation tasks across many languages. Liu et al. [
27] demonstrated the effectiveness of the XGLUE benchmark for evaluating named entity recognition (NER), Part-of-Speech (POS) tagging, and news classification across different languages. Another notable approach is multilingual machine translation, which leverages transformer models trained in various languages. Guerreiro et al. [
28] applied the M2M model for hallucination detection in translations, showing how large multilingual models can improve translation quality even in challenging cases. These models, evaluated on multilingual benchmarks such as FLORES-101 [29], have proven effective at producing accurate translations.
Despite the success of transformer models, low-resource languages continue to present significant challenges in NMT. One of the primary obstacles is data sparsity, which limits the availability of parallel corpora necessary for training high-quality models. Khan et al. [
30] explored the use of OpenNMT for translating Chinese to Urdu, highlighting issues such as the difficulty of handling non-universal language pairs and named entity recognition (NER). They noted that a lack of data and insufficient context awareness were significant barriers to improving translation accuracy. To address these challenges, data augmentation methods have been proposed to expand the available datasets artificially. For example, Zeeshan et al. [
31] used a Seq2Seq model with attention mechanisms to improve translation quality in Chinese–Urdu NMT, focusing on enhancing context awareness. Despite these advancements, the fixed-length context vector used in Seq2Seq models often fails to capture longer sequences, resulting in information loss during translation. Subword tokenization has emerged as a crucial technique in modern NMT systems to address some of these challenges. This approach allows models to break down words into smaller, more manageable units (subwords), which is particularly helpful when dealing with rare or unseen words. By reducing the vocabulary size, subword tokenization ensures the model can handle various word combinations, including those in languages with complex word formation rules or rich morphology [
32]. This method improves the handling of unseen words and enhances the overall performance of NMT systems by facilitating better generalization, especially in low-resource settings where large corpora may not be available [
33].
Zeeshan et al. [
34] focused on developing an electronic dictionary for Chinese–Urdu translation using both LSTM and transformer architectures. Their study showed improvements in translation accuracy, though challenges persist due to the limitations of the available domain-specific language data. Similarly, Chan et al. [
35] proposed incorporating Part-of-Speech (POS) sequence prediction into transformer models to refine translations. The system can generate more accurate translations by first predicting the target language’s POS sequence. However, this method can introduce errors if the POS tagging is inaccurate or incomplete, highlighting the need for high-quality, well-annotated data.
This literature review highlights several approaches and challenges in neural machine translation (NMT), particularly for low-resource language pairs, as summarized in Table 1. Traditional rule-based and corpus-based methods rely heavily on hand-crafted linguistic rules and large datasets, respectively, both of which are often unavailable in low-resource settings. Recent advancements in transformer-based models and techniques, such as back-translation and the integration of pre-trained language models, have shown promise in addressing some of these issues. Despite these advancements, there remains a significant research gap in developing NMT models that can perform effectively with minimal data while maintaining high translation quality.
3. Materials and Methods
This section outlines the materials and methods used in this study. The dataset preparation, text direction, data preprocessing with tokenization, and proposed model architecture are explained.
Figure 1 illustrates the research workflow.
3.1. Preliminaries
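The neural machine translation (NMT) model, denoted by $M$, transforms a sentence $s$ from the source language into a sentence $t$ in the target language. The model $M$ utilizes a parallel training corpus $D = \{(s_i, t_i)\}_{i=1}^{N}$ to optimize the log-likelihood of accurately predicting $t$ given $s$, with each pair $(s_i, t_i)$ assumed to be independent and identically distributed. The optimization objective is formulated as
$$\theta^{*} = \arg\max_{\theta} \sum_{(s, t) \in D} \log P(t \mid s; \theta) \quad (1)$$
where $\theta$ denotes the parameters of $M$.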
The model employs an encoder–decoder architecture. The encoder processes the source sentence into a sequence of hidden states. The decoder then constructs the target words based on these hidden states and the previously generated target words.
3.2. Dataset Preparation and Validation
Due to data scarcity in low-resource languages, a diverse Chinese and Urdu parallel corpus was developed through human effort and web crawlers (ParaCrawl, Bitextor, Common Crawl, and OpenNMT). The text was thoroughly reviewed and validated by experts in the native language and NLP. To ensure accuracy and reliability, we utilized various translators, including Google Translate (
https://translate.google.com) and Baidu Translate (
https://www.baidu.com). The checks focused on completeness, accuracy, uniform formats, naming conventions, and the removal of duplicates. Text consistency checks addressed spelling, grammar, and terminology. Sampling validation ensured accurate population representation for better model generalization and bias detection. Text alignment and statistical analysis were performed to evaluate linguistic diversity and coverage across domains. For each word $w$ in the dataset, the frequency $f(w)$ was calculated using Equation (2):
$$f(w) = \frac{\mathrm{count}(w)}{\sum_{w'} \mathrm{count}(w')} \quad (2)$$
where $\mathrm{count}(w)$ is the number of occurrences of $w$ in the dataset.
This analysis helped identify common and rare words across the dataset. To measure the degree of alignment between the source and target texts, the correlation between sentence lengths in both languages was analyzed using the Pearson correlation coefficient [6]. The Type–Token Ratio (TTR) was computed to assess vocabulary richness; it is the number of unique words (types) divided by the total number of words (tokens). Lastly, reviewers fluent in both languages assessed a random sample of the dataset for accuracy and naturalness. We analyzed the frequency distribution of words or phrases to understand the dataset’s linguistic diversity.
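The corpus statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sentences are taken to be already segmented into space-separated tokens (which raw Chinese text is not), word frequency is taken to be relative frequency, and all function names are hypothetical rather than taken from the study's code.

```python
from collections import Counter
from math import sqrt


def word_frequencies(sentences):
    """Relative frequency of every space-separated token in the corpus."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def type_token_ratio(sentences):
    """Number of unique tokens (types) divided by the total number of tokens."""
    tokens = [tok for s in sentences for tok in s.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def length_correlation(src_sentences, tgt_sentences):
    """Correlation between source and target sentence lengths (in tokens)."""
    src_len = [len(s.split()) for s in src_sentences]
    tgt_len = [len(t.split()) for t in tgt_sentences]
    return pearson(src_len, tgt_len)
```

A length correlation close to 1 suggests that long source sentences are paired with long target sentences, which is a cheap sanity check on alignment rather than proof of it.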
In addition, we incorporated publicly available datasets.
The OPUS dataset is a collection of parallel corpora for various languages, including Chinese and Urdu, widely used for machine translation and linguistic research [
41]. The dataset is available at (
http://opus.nlpl.eu/).
The Workshop on Machine Translation (WMT) provides a benchmark in machine translation. It includes a variety of parallel corpora for different language pairs and is used in annual machine translation competitions [
42].
The WiLI-2018 benchmark dataset of short text extracts from Wikipedia contains 1000 paragraphs for each of 235 languages, a total of 235,000 paragraphs. After data selection and preprocessing, we selected the same 45 paragraphs in Chinese and Urdu, using English as the intermediate language [43].
These datasets were compiled and formatted in CSV format.
Table 2 shows the details of the datasets. We consolidated these datasets into a single comprehensive dataset in a unified format to facilitate experiments on large corpora. The experiments were conducted dataset-wise and on the consolidated dataset to ensure thorough analysis and validation. The sizes of the datasets are denoted as $|D_{\text{train}}|$ for the training set, $|D_{\text{test}}|$ for the testing set, and $|D_{\text{val}}|$ for the validation set, taken as fractions $r_{\text{train}}$, $r_{\text{test}}$, and $r_{\text{val}}$ of the data, respectively, where $D$ represents the dataset. These sizes were calculated using the lengths of the training, testing and validation data, respectively, as follows:
$$|D_{\text{train}}| = r_{\text{train}} \, |D|, \qquad |D_{\text{test}}| = r_{\text{test}} \, |D|, \qquad |D_{\text{val}}| = r_{\text{val}} \, |D|$$
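A minimal sketch of such a three-way split is shown below. The default ratios (0.8 and 0.1, with the remainder used for validation), the fixed random seed, and the function name are illustrative assumptions and are not values reported in this study.

```python
import random


def split_dataset(pairs, train_ratio=0.8, test_ratio=0.1, seed=0):
    """Shuffle sentence pairs and split them into train, test, and validation.

    The ratios are hypothetical defaults; whatever is left after the
    training and testing portions becomes the validation set.
    """
    if not 0 < train_ratio + test_ratio < 1:
        raise ValueError("train_ratio + test_ratio must lie strictly between 0 and 1")
    items = list(pairs)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_train = int(train_ratio * len(items))
    n_test = int(test_ratio * len(items))
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]
    return train, test, val
```

Shuffling before slicing avoids a split in which one source corpus dominates a single partition, which matters when several datasets are concatenated.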
3.3. Text Direction Analysis
Bidirectional Chinese–Urdu machine translation involves text direction analysis and its linguistic impact. Chinese and Urdu belong to the Sino-Tibetan and Indo-European language families, respectively. Chinese is written horizontally from left to right or vertically from top to bottom. Urdu is written from right to left. This fundamental difference in text direction affects the syntactic and semantic processing in natural language processing (NLP) tasks. A probabilistic autoregressive NMT model is used to calculate the conditional probabilities of sentences [44] given a parallel sentence pair $(c, u)$, where $c$ represents the Chinese sentence and $u$ represents the Urdu sentence. The primary task is to determine which side is the original and which is the translation between language $C$ and language $U$. This is achieved by comparing the conditional translation probability $P(u \mid c)$ under an NMT model $M_{C \to U}$ with the conditional translation probability $P(c \mid u)$ under a model $M_{U \to C}$ operating in the opposite direction. We assume that NMT models assign translations higher conditional probabilities than originals, so if $P(u \mid c) > P(c \mid u)$, we predict that $u$ is the translation and $c$ is the original. The original translation direction is then $C \to U$. Equation (3) illustrates this process, where we obtain $P(u \mid c)$ as a product of the individual token probabilities.
$$P(u \mid c) = \prod_{j=1}^{|u|} p(u_j \mid u_{<j}, c) \quad (3)$$
The average token-level probabilities $P_{\text{tok}}$ are represented in Equation (4).
$$P_{\text{tok}}(u \mid c) = P(u \mid c)^{1/|u|} \quad (4)$$
To detect the original translation direction (OTD) as shown in Equation (5), $P_{\text{tok}}(u \mid c)$ and $P_{\text{tok}}(c \mid u)$ are compared:
$$\mathrm{OTD} = \begin{cases} C \to U, & \text{if } P_{\text{tok}}(u \mid c) > P_{\text{tok}}(c \mid u) \\ U \to C, & \text{otherwise} \end{cases} \quad (5)$$
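The comparison can be illustrated with a short sketch that starts from per-token log-probabilities, which is how NMT toolkits usually expose sentence scores. The function names and the direction labels are hypothetical, and the log-probabilities are assumed to come from two models scoring the same sentence pair in opposite directions.

```python
from math import exp


def avg_token_prob(token_logprobs):
    """Length-normalised sentence probability.

    exp(mean of token log-probabilities) equals the sentence probability
    raised to the power 1/length, i.e. the geometric mean of token probabilities.
    """
    return exp(sum(token_logprobs) / len(token_logprobs))


def detect_original_direction(logp_u_given_c, logp_c_given_u):
    """Return 'C->U' when the Urdu side looks like the translation, else 'U->C'.

    logp_u_given_c: token log-probs of the Urdu sentence under a Chinese-to-Urdu model.
    logp_c_given_u: token log-probs of the Chinese sentence under an Urdu-to-Chinese model.
    """
    p_cu = avg_token_prob(logp_u_given_c)
    p_uc = avg_token_prob(logp_c_given_u)
    return "C->U" if p_cu > p_uc else "U->C"
```

Normalising by length keeps the comparison fair when the two sides of a pair tokenize into very different numbers of subwords, which is likely for Chinese and Urdu.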
3.4. Text Pre-Processing
This process removes special characters, HTML tags, punctuation, missing values, and extraneous whitespace. It also standardizes text case and corrects common misspellings. The main steps include replacing the HTML entity &amp; with & in the data. If a Traditional Chinese character appears in the target sentence, it is converted to Simplified Chinese to ensure that the translation source is consistent. Consistency checks are performed on the custom dataset using FastAlign to verify its validity. Inspired by back-translation [45], we use the Google API to provide alternative translations, which help generate ground-truth results. In addition, the original training datasets were augmented by replacing phrases with semantically similar ones identified through word vectors. As a result, the augmented training datasets were 10 times larger than the original data. For Chinese, the Stanford NLP tool was employed due to its robust handling of logographic script; for Urdu, we utilized a custom-trained model capable of recognizing entities in the extended Arabic script.
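A minimal sketch of the cleaning steps is given below, using only the Python standard library. It covers a subset of the pipeline (tag removal, HTML entity decoding, whitespace normalization, and dropping empty or duplicate pairs); script conversion, spelling correction, and alignment checks are not shown. The function names are hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches simple HTML tags such as <p> or </div>
WS_RE = re.compile(r"\s+")       # matches any run of whitespace


def clean_text(text):
    """Strip HTML tags, decode entities such as &amp;, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)   # remove tags first so decoded '<' is not mistaken for one
    text = html.unescape(text)     # "&amp;" becomes "&"
    return WS_RE.sub(" ", text).strip()


def clean_pairs(pairs):
    """Clean both sides of each pair; drop missing, empty, and duplicate pairs."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        if src is None or tgt is None:
            continue  # missing value on either side
        item = (clean_text(src), clean_text(tgt))
        if not item[0] or not item[1] or item in seen:
            continue  # empty after cleaning, or an exact duplicate
        seen.add(item)
        cleaned.append(item)
    return cleaned
```

Deduplication is done after cleaning so that pairs differing only in markup or spacing are treated as the same pair.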
3.5. Integrating Named Entity Recognition (NER) with BERT Encoder
We integrated NER into the BERT encoder to enhance the model’s capability to handle context-specific entities. This integration is implemented through a multi-step process to accurately identify and prioritize critical entities within source and target sentences. Initially, a pre-trained NER model is applied to the input sentences in both languages to detect entities such as names, locations, and organizations. Once these entities are identified, they are enclosed with special tokens, specifically ‘<NER_START>’ and ‘<NER_END>’, to explicitly delineate their boundaries. This tagging process ensures that the BERT encoder recognizes and differentiates these entities from the surrounding text. The tagged sentences are then fed into the BERT encoder, which processes them to generate enriched contextual embeddings that emphasize the identified entities. By doing so, the encoder is better equipped to maintain the semantic integrity of these entities during translation, ensuring that they are accurately and consistently translated across language pairs. This method improves the translation quality of entity-rich sentences and enhances the generated translations’ overall semantic coherence. During the training phase, the model learns to associate the special NER tokens with their corresponding entity types, further refining its ability to handle diverse and complex entity structures within the translation tasks.
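The tagging step can be sketched as follows. The entity spans are assumed to come from an upstream NER model as non-overlapping character offsets; the function name and the span format are illustrative assumptions, while the two marker strings are the ones named in the text.

```python
NER_START = "<NER_START>"
NER_END = "<NER_END>"


def tag_entities(sentence, spans):
    """Wrap each entity span in the NER boundary tokens.

    spans: iterable of (start, end) character offsets, end exclusive,
    assumed to be non-overlapping (a hypothetical NER output format).
    """
    parts = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or start >= end or end > len(sentence):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        parts.append(sentence[cursor:start])                       # text before the entity
        parts.append(f"{NER_START} {sentence[start:end]} {NER_END}")  # the marked entity
        cursor = end
    parts.append(sentence[cursor:])                                # trailing text
    return "".join(parts)
```

For the markers to act as single units, they would presumably also need to be registered as special tokens in the encoder's vocabulary so that the tokenizer does not split them into subwords.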
3.6. In-Trust Loss Function
We utilized the incomplete-trust (in-trust) loss function to address noisy data within the parallel corpus and modify the traditional cross-entropy loss. Noise in the training data can adversely affect the model’s performance, leading to overfitting on erroneous examples and degrading translation quality.
Figure 2 depicts selected noisy samples from our dataset and their English descriptions for a better understanding. Noise is present in both the source and target segments of the training corpus. Noise control aims to improve model robustness by minimizing the impact of noisy data. Regularization techniques are employed to filter out noisy data samples. For a dataset $D$ with potentially noisy pairs $(c, t)$, we define $D_{\text{clean}}$ as a subset of $D$ that is free of significant noise, as shown in Equation (6).
$$D_{\text{clean}} = \{(c, t) \in D : (c, t) \text{ is free of significant noise}\} \quad (6)$$
Once the model overfits the noisy samples, the translation performance of the final model is degraded. However, previous work [46] suggests that noisy data may still carry useful information. Incomplete-trust (in-trust) loss is therefore adopted as a substitute for the original cross-entropy loss function to decrease the disparity between synthetic and real examples affected by noise. The loss function quantifies the discrepancy between the predicted translation $\hat{t}$ and the actual target translation $t$. This loss function is expressed as follows:
$$L_{\text{IT}}(\theta) = -\sum_{(c, t) \in D} \alpha_t \log P(t \mid c; \theta) \quad (7)$$
where the equation elements are defined as follows:
$L_{\text{IT}}$ is the incomplete-trust loss.
$(c, t)$ are the source and target translation pairs in the training dataset $D$.
$P(t \mid c; \theta)$ is the predicted probability of the target translation $t$ given the source sentence $c$ and model parameters $\theta$.
$\alpha_t$ is the trust factor for the target translation $t$, which adjusts the weight of each term in the loss function to account for the noise in the data.
This formulation ensures that the model trusts cleaner data while learning from noisy examples, thereby reducing the impact of overfitting on noisy samples.
Estimating the Trust Factor ($\alpha$)
The trust factor $\alpha$ plays a pivotal role in balancing the influence of clean and noisy data within the in-trust loss function. To estimate $\alpha$, we employ a dynamic approach based on the confidence level of each training sample. The implementation involves several key steps:
Each training sample $i$ is initially assigned a baseline trust factor $\alpha_i$, typically set to 0.5, indicating an equal weighting between trusting and distrusting the data. After each training epoch, the model evaluates the loss for each sample on a validation set. Samples that consistently yield lower loss values, indicating higher confidence, have their $\alpha_i$ increased (e.g., to 0.7), amplifying their influence on the loss function. Conversely, samples with higher loss values undergo a decrease in $\alpha_i$ (e.g., down to 0.3), reducing their impact to mitigate the effect of potential noise.
To formalize the adjustment of $\alpha_i$, we introduce the following update rule:
$$\alpha_i \leftarrow \begin{cases} \min(1, \alpha_i + \Delta), & \text{if } \ell_i < \tau \\ \max(0, \alpha_i - \Delta), & \text{otherwise} \end{cases} \quad (8)$$
where the equation elements are defined as follows:
$\ell_i$ is the loss for sample $i$.
$\tau$ is a predefined loss threshold.
$\Delta$ is the increment/decrement value (e.g., 0.2).
This update rule ensures that $\alpha_i$ remains within the [0,1] range, maintaining stability during training. Integrating the in-trust loss function into our training process enhances the model’s robustness against noisy data, leading to improved translation performance.
The implementation of the in-trust loss function involves the following steps:
Initialization: Assign an initial trust factor $\alpha_i$ to all training samples in $D$.
Training Iteration: For each epoch, compute the in-trust loss $L_{\text{IT}}$ using Equation (7).
Model Update: Perform backpropagation and update the model parameters accordingly.
Trust Factor Adjustment: After each epoch, evaluate the loss $\ell_i$ for each sample on a validation set and adjust $\alpha_i$ using Equation (8).
Normalization: Ensure that $\alpha_i$ remains within the valid range [0,1].
This adaptive mechanism allows the model to prioritize learning from cleaner data while extracting useful information from noisy samples.
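The trust-factor schedule described above can be sketched as follows. The update step follows the textual description (baseline 0.5, step 0.2, clipping to [0,1]); the weighted negative log-likelihood is only an assumed form of the in-trust loss, chosen because it matches the listed definitions, and all function names are hypothetical.

```python
def update_trust(alpha, loss, threshold, delta=0.2):
    """Raise trust for low-loss samples, lower it otherwise, clipped to [0, 1]."""
    new_alpha = alpha + delta if loss < threshold else alpha - delta
    return min(1.0, max(0.0, new_alpha))


def update_all(alphas, losses, threshold, delta=0.2):
    """Apply the update rule to every sample after an epoch."""
    return [update_trust(a, l, threshold, delta) for a, l in zip(alphas, losses)]


def trust_weighted_nll(log_probs, alphas):
    """ASSUMED loss form: minus the sum of alpha_i * log P(t_i | c_i).

    log_probs: sentence-level log-probabilities of the reference translations.
    alphas:    per-sample trust factors in [0, 1].
    """
    return -sum(a * lp for a, lp in zip(alphas, log_probs))
```

Starting every sample at 0.5 and calling `update_all` once per epoch means that a sample needs several consecutive low-loss epochs to reach full trust, which is one way to read the requirement that samples "consistently" yield low loss.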
3.7. Converting Chinese Text to Pinyin
The conversion of Chinese characters into Pinyin turns them into the Latin alphabet, aiding NMT models that cannot handle Chinese scripts directly. Using Pinyin as an intermediary language offers a novel solution to avoid using English as a middle language. This improves translation efficiency by aligning Chinese phonetics directly with Urdu, reduces ambiguities, and enhances model performance. Research by [
47] supports the effectiveness of phonetic representations in translation models. To address homophones in Pinyin, we use deep contextual analysis, enhanced tokenization, and post-processing corrections for accurate translations.
To demonstrate how Pinyin serves as an intermediary in the translation process, we present a snapshot of our dataset in
Figure 3, which shows the alignment of Chinese, Pinyin, and Urdu.
Algorithm 1 utilizes the pinyin library to convert each Chinese character into its corresponding Pinyin representation. The text is then split into words based on spaces. Although the procedure was initially designed to allow these words to be reordered, the current implementation maintains the original order. To apply this conversion across the dataset containing Chinese text, the convert_to_pinyin function is mapped to each sentence in the dataset. This conversion facilitates the alignment of Chinese phonetics directly with Urdu, enhancing translation accuracy and efficiency.
Algorithm 1 Convert Chinese Characters to Pinyin and Maintain Word Order
procedure convert_to_pinyin(sentence)
    pinyin_text ← pinyin(sentence) ▹ Convert the Chinese characters in the sentence to Pinyin, keeping spaces between syllables.
    words ← split(pinyin_text) ▹ Split the Pinyin result into a list of words (syllables).
    result ← join(words) ▹ Rejoin the words (syllables) to form the final Pinyin sentence.
    return result ▹ Return the final Pinyin sentence, ready for further processing.
end procedure
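A self-contained Python sketch of Algorithm 1 is shown below. To keep it runnable without external dependencies, the character-to-Pinyin conversion uses a tiny hypothetical lookup table; in practice a Pinyin library would supply this mapping, and tone marks are omitted here as a simplifying assumption.

```python
# Hypothetical miniature lookup table, for illustration only; a real system
# would obtain the mapping from a Pinyin library.
TOY_PINYIN = {"我": "wo", "爱": "ai", "你": "ni", "中": "zhong", "国": "guo"}


def convert_to_pinyin(sentence, lookup=TOY_PINYIN):
    """Convert Chinese characters to space-separated Pinyin, preserving order."""
    # Map each character to its syllable; characters missing from the table pass through.
    syllables = [lookup.get(ch, ch) for ch in sentence if not ch.isspace()]
    pinyin_text = " ".join(syllables)
    words = pinyin_text.split()  # split into syllables on spaces
    return " ".join(words)       # rejoin in the original order


def convert_dataset(sentences, lookup=TOY_PINYIN):
    """Map the conversion over every sentence in a dataset."""
    return [convert_to_pinyin(s, lookup) for s in sentences]
```

Because toneless Pinyin collapses many characters onto the same syllable, a production version would likely keep tone information or rely on context to separate homophones.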
3.8. Tokenization and OOV Management
The tokenizer transforms a sentence $s$ into a sequence of tokens $t_1, t_2, \ldots, t_n$, represented as $\mathrm{Tok}(s) = (t_1, t_2, \ldots, t_n)$. SentencePiece tokenization is applied to manage languages with large vocabularies and different scripts without requiring pre-segmented text. The SentencePiece model, denoted as $\mathrm{SP}$, segments a given input sequence $s$ into a sequence of subword units: $\mathrm{SP}(s) = (t_1, t_2, \ldots, t_n)$, where $t_1, \ldots, t_n$ are the subword tokens derived from the input string $s$. This approach is effective for managing out-of-vocabulary (OOV) words, as it breaks down rare words into known subwords or morphemes.
The encoding function $E$ converts the sequence of tokens into a sequence of integer IDs, defined as $E(t_i) = id_i$ for each token $t_i$, where $id_i$ is the integer ID corresponding to the token $t_i$. Once the tokens are converted into integer IDs, they are transformed into a tensor format suitable for neural network processing by the function $T$: $T(id_1, \ldots, id_n) = X$, where $X$ is the tensor that will be input into the NMT model. This tensor encapsulates the sequence of integer IDs in a format that the neural network can efficiently process as follows:
$$X = T\big(E(t_1), E(t_2), \ldots, E(t_n)\big)$$
The tokenization $t^{*}$ that maximizes the log probability of the segmented sentence is chosen, as represented in the following equation:
$$t^{*} = \arg\max_{t \in \mathcal{S}(s)} \sum_{i=1}^{|t|} \log P(t_i)$$
where $\mathcal{S}(s)$ denotes the set of candidate segmentations of $s$. Each subword $t_i$ is mapped to an integer ID $id_i$, which the proposed model will use. The tokenized sequences are then fed into the proposed NMT model, where an encoder encodes them and a decoder subsequently decodes them into the target language. This process forms the backbone of understanding and generating translations, as represented in the following equation:
$$\hat{y} = \mathrm{Decoder}\big(\mathrm{Encoder}(X)\big)$$
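To make the segmentation criterion concrete, the sketch below picks the subword split with the highest summed log probability and then maps the pieces to integer IDs. It is an illustration only: it uses a Viterbi search over a hypothetical toy vocabulary (`VOCAB_PROBS`, `TOKEN_IDS`) rather than the SentencePiece library, and the probabilities are invented.

```python
import math

# Hypothetical toy unigram vocabulary: subword -> probability.
VOCAB_PROBS = {
    "trans": 0.05, "lat": 0.02, "ion": 0.04,
    "t": 0.01, "r": 0.01, "a": 0.02, "n": 0.01,
    "s": 0.01, "l": 0.01, "i": 0.01, "o": 0.01,
}
# Hypothetical ID assignment: alphabetical order of the subwords.
TOKEN_IDS = {tok: idx for idx, tok in enumerate(sorted(VOCAB_PROBS))}


def segment(word: str) -> list[str]:
    """Return the segmentation with the maximal sum of log probabilities."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last subword
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in VOCAB_PROBS and best[start] > -math.inf:
                score = best[start] + math.log(VOCAB_PROBS[piece])
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


def encode(word: str) -> list[int]:
    """Map each subword of the best segmentation to its integer ID."""
    return [TOKEN_IDS[tok] for tok in segment(word)]


print(segment("translation"), encode("translation"))
print(segment("nation"))  # a word with no whole-word entry falls back to smaller pieces
```

Because single characters are in the vocabulary, any word made of known characters can be segmented, which is the property that lets subword models avoid out-of-vocabulary failures.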
3.9. Bilingual Curriculum Learning
A bilingual curriculum learning strategy is employed to improve the efficiency and performance of our NMT model. Curriculum learning orders the training data from simple to complex so that the model sees easier examples first. In our setup, the curriculum is based on sentence complexity and translation difficulty between Chinese and Urdu: sentence pairs are scored by length, complexity, and the presence of rare words, and the dataset is sorted into tiers, starting with simple sentences and progressively adding more complex ones. Training is performed in three phases: Phase 1 focuses on easy sentences, Phase 2 introduces moderate complexity, and Phase 3 fine-tunes the model on complex sentences. This approach is intended to improve convergence, enhance generalization, and reduce overfitting.
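A minimal sketch of the tiering step is given below. The scoring function `difficulty`, its weights, and the cumulative phase schedule are assumptions made for illustration; the text above specifies only that length, complexity, and rare words are considered and that harder data is added progressively.

```python
from collections import Counter


def difficulty(src, tgt, freq, rare_threshold=2):
    """Hypothetical score: longer pairs and rarer words count as harder."""
    length = max(len(src), len(tgt))
    rare = sum(1 for tok in src + tgt if freq[tok] < rare_threshold)
    return length + 2 * rare


def build_tiers(pairs, n_tiers=3):
    """Sort sentence pairs by difficulty and cut them into n_tiers groups."""
    freq = Counter(tok for src, tgt in pairs for tok in src + tgt)
    ranked = sorted(pairs, key=lambda p: difficulty(p[0], p[1], freq))
    size = max(1, -(-len(ranked) // n_tiers))  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def curriculum_phases(tiers):
    """Assumed schedule: phase k trains on tiers 1..k (cumulative)."""
    seen = []
    for tier in tiers:
        seen = seen + tier
        yield seen


# Toy tokenized sentence pairs (source tokens, target tokens).
pairs = [
    (["a", "b", "c", "d", "e"], ["v", "w", "x", "y", "z"]),
    (["a"], ["v"]),
    (["a", "b"], ["v", "w"]),
    (["a", "b", "c"], ["v", "w", "x"]),
    (["a", "b", "c", "q"], ["v", "w", "x", "r"]),
    (["a", "b"], ["v"]),
]
tiers = build_tiers(pairs)
for phase, data in enumerate(curriculum_phases(tiers), start=1):
    print(f"Phase {phase}: {len(data)} sentence pairs")
```

A non-cumulative schedule, where each phase trains only on its own tier, would fit the description equally well; the choice mainly affects how much the model revisits easy examples.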
3.10. Proposed Model
The proposed model is built on M2M100 [16], a pre-trained multilingual transformer encoder–decoder (sequence-to-sequence) model with an attention mechanism. Its architecture is depicted in
Figure 4. Because M2M100 is trained on a wide variety of linguistic contexts, it is well suited to generating diverse translation candidates. Multiple translation options are first produced using M2M100 variants enhanced with contextual analysis and modified multi-head attention.
The encoder processes the input in the source language (Chinese or Urdu) by converting each token into an embedding vector, $e_i$, and adding a positional encoding, $p_i$, to incorporate the position of the token in the sequence:
$$x_i = e_i + p_i$$
3.10.1. BERT Embedding Integration
The encoder layers are initialized with BERT’s pre-trained weights to improve the encoder’s ability to understand and represent the input text. After initialization, we fine-tune the BERT-augmented encoder layers on our datasets to adapt the pre-trained weights to the translation task. The dimension of the BERT embeddings ($d_{\mathrm{BERT}}$) differs from the dimension expected by the M2M100 encoder embedding ($d_{\mathrm{M2M}}$), so a linear transformation is applied to match the dimensions:
$$E'_{\mathrm{BERT}} = W E_{\mathrm{BERT}} + b$$
where $W$ and $b$ are the weights and biases of the transformation layer, respectively. This ensures that the transformed BERT embeddings ($E'_{\mathrm{BERT}}$) are compatible with the M2M100 encoder.
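The dimension-matching step amounts to a single affine map per token embedding. The sketch below shows it in plain Python; the sizes 768 and 1024 are illustrative assumptions, and the randomly initialized `W` and zero `b` are placeholders for parameters that would be learned during fine-tuning.

```python
import random


def make_projection(d_bert, d_m2m, seed=0):
    """Create placeholder parameters W (d_m2m x d_bert) and b (d_m2m)."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_bert)] for _ in range(d_m2m)]
    b = [0.0] * d_m2m
    return W, b


def project(e_bert, W, b):
    """Compute W e + b for one token embedding."""
    return [
        sum(w * x for w, x in zip(row, e_bert)) + bias
        for row, bias in zip(W, b)
    ]


# Illustrative sizes (assumed): 768-dimensional input, 1024-dimensional output.
W, b = make_projection(d_bert=768, d_m2m=1024)
token_embedding = [0.01] * 768
projected = project(token_embedding, W, b)
print(len(token_embedding), "->", len(projected))
```

In a deep learning framework this would normally be a single linear layer applied to the whole batch at once; the explicit loops here only expose the shapes involved.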
3.10.2. Attention Mechanism
Two key modifications are introduced to the attention mechanism: synchronous bidirectional attention and dynamic weight attention. Synchronous bidirectional attention enables the model to attend to both past and future tokens simultaneously within each layer, which improves its understanding of context and its ability to capture long-range dependencies. This is achieved by adjusting the attention masks, which were originally unidirectional, so that each token can attend to every other token in the sequence, whether preceding or upcoming. Concretely, the mask entries are set to $\mathrm{mask}_{ij} = 1$ for all token pairs $(i, j)$ in the sequence.
This modification allows the model to leverage the full context of the input sequence, thereby improving translation accuracy, especially for sentences where relationships between distant tokens are critical. This leads to better handling of syntactic and semantic structures, resulting in more accurate translations.
The second modification, the refined attention mechanism, dynamically adjusts attention weights based on the importance of tokens. Token importance is identified through NER and semantic analysis. NER is applied at the token level to recognize key entities, while semantic analysis assesses the role and relevance of each token in the sentence. The importance of a token $t$ is computed as $I(t) = f\big(\mathrm{NER}(t), \mathrm{Sem}(t)\big)$,
where $\mathrm{NER}(t)$ is a score indicating whether token $t$ is a named entity and $\mathrm{Sem}(t)$ measures its contextual relevance. Tokens with higher importance scores are given higher priority in the attention mechanism. The adjusted attention matrix is computed by multiplying the standard attention matrix $A$ with an importance modifier $I_m$: $A' = A \cdot I_m$.
This dynamic adjustment prioritizes critical tokens, such as named entities and key concepts, enhancing the model’s ability to generate semantically accurate translations.
Additionally, attention weights are recalculated during each attention operation, making them context-dependent and allowing the model to adapt to the significance of different tokens as the input sequence is processed. The attention mechanism with dynamic weighting is expressed as
$$\mathrm{Attention}(Q, K, V) = \left(\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \cdot I_m\right) V$$
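The importance-weighted adjustment can be sketched as follows. Several details are assumptions made only for illustration: the importance score is taken to be a weighted sum of an NER score and a semantic-relevance score, the modifier is applied per key token, and each row is renormalized after the multiplication so that the weights still sum to one.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def importance(ner_score, sem_score, w_ner=0.5, w_sem=0.5):
    """Hypothetical importance: 1 plus a weighted sum of the two scores."""
    return 1.0 + w_ner * ner_score + w_sem * sem_score


def dynamic_attention(scores, importances):
    """Scale standard attention weights by per-key importance.

    scores: n x n scaled dot products; importances: one modifier per key.
    Rows are renormalized after scaling (an assumption of this sketch).
    """
    weights = []
    for row in scores:
        attn = softmax(row)
        adjusted = [a * imp for a, imp in zip(attn, importances)]
        total = sum(adjusted)
        weights.append([a / total for a in adjusted])
    return weights


# Three tokens; the middle one is a named entity with high relevance.
mods = [importance(0, 0), importance(1, 1), importance(0, 0)]
print(dynamic_attention([[0.0, 0.0, 0.0]], mods))
```

With uniform scores, the entity token receives twice the weight of its neighbours, which shows how the modifier shifts attention without changing the underlying dot products.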
We employ a fallback mechanism to handle named entities that do not appear in the training data. An unseen named entity is first looked up in a bilingual dictionary or an entity mapping list and, if a match is found, replaced with the predefined entry. If no direct match is found, the model relies on contextual clues from the surrounding tokens to infer the meaning of the entity. This keeps the translation accurate and fluent in the presence of novel or rare entities and improves robustness, particularly for low-resource languages.
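The two-stage fallback can be sketched as a lookup followed by a pass-through. The mapping `ENTITY_MAP` and the function `apply_fallback` are hypothetical; the sketch shows one reading of the mechanism, in which mapped entities are substituted before translation and unmatched ones are left for the model to resolve from context.

```python
# Hypothetical bilingual entity mapping (Chinese -> Urdu).
ENTITY_MAP = {"北京": "بیجنگ", "巴基斯坦": "پاکستان"}


def apply_fallback(tokens, entities, entity_map=ENTITY_MAP):
    """Replace mapped entities; report the ones left to contextual inference.

    tokens: source tokens; entities: set of tokens tagged as named entities.
    Returns (tokens after substitution, entities with no dictionary match).
    """
    resolved, unresolved = [], []
    for tok in tokens:
        if tok in entities and tok in entity_map:
            resolved.append(entity_map[tok])  # stage 1: dictionary match
        else:
            if tok in entities:
                unresolved.append(tok)        # stage 2: leave to the model
            resolved.append(tok)
    return resolved, unresolved


tokens = ["我", "去", "北京", "和", "拉合尔"]
resolved, unresolved = apply_fallback(tokens, entities={"北京", "拉合尔"})
print(resolved, unresolved)
```

Returning the unresolved entities separately makes it easy to log them and extend the mapping list over time.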
3.10.3. Pre-Processing and Post-Processing
Layer normalization is applied after the multi-head attention and feed-forward layers to stabilize training. Dropout layers reduce overfitting by randomly setting some activations to zero during training. Residual connections around the multi-head attention and feed-forward layers facilitate gradient flow and help the model capture complex dependencies; they require the output dimensions of these layers to match the input dimensions. Equation (13) describes the process:
$$\mathrm{Output} = \mathrm{LayerNorm}\big(x + \mathrm{Dropout}(\mathrm{Sublayer}(x))\big) \tag{13}$$
where $x$ is the input to the sublayer and $\mathrm{Sublayer}$ is either the multi-head attention or the feed-forward network.
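The decoder generates the target language output (Urdu/Chinese) one token at a time. The masked self-attention layer prevents attending to future tokens in the output sequence using a masking mechanism as follows:
$$\mathrm{MaskedAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right) V$$
where $M$ is the mask matrix that prevents the model from looking at future tokens ($M_{ij} = -\infty$ for $j > i$ and $0$ otherwise). The encoder–decoder attention layer takes its queries from the previous decoder layer and its keys and values from the encoder output, allowing each position in the decoder to attend to all positions in the input sequence:
$$\mathrm{Attention}(Q_{\mathrm{dec}}, K_{\mathrm{enc}}, V_{\mathrm{enc}}) = \mathrm{softmax}\!\left(\frac{Q_{\mathrm{dec}} K_{\mathrm{enc}}^{\top}}{\sqrt{d_k}}\right) V_{\mathrm{enc}}$$
The feed-forward neural network transforms the representation after attention integration, with layer normalization and dropout applied so that the dimensions match for subsequent layers:
$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2, \qquad \mathrm{Output} = \mathrm{LayerNorm}\big(x + \mathrm{Dropout}(\mathrm{FFN}(x))\big)$$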
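The masking step in the decoder’s self-attention can be illustrated with a small example. The sketch assumes the common additive formulation, in which masked positions receive a score of negative infinity before the softmax; the function names are hypothetical.

```python
import math


def causal_mask(n):
    """Additive mask: 0 where j <= i (visible), -inf where j > i (future)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax of scores + mask; masked positions get weight 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        shifted = [s + m for s, m in zip(score_row, mask_row)]
        top = max(shifted)  # finite, because the diagonal is never masked
        exps = [math.exp(v - top) for v in shifted]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With uniform scores, each position spreads its attention evenly over
# itself and the positions before it, and ignores later positions.
scores = [[0.0] * 3 for _ in range(3)]
for row in masked_softmax(scores, causal_mask(3)):
    print([round(w, 3) for w in row])
```

The encoder–decoder attention uses no such mask, since every decoder position is allowed to see the full source sentence.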
The proposed model is fine-tuned on our datasets to adapt it to our translation tasks. Fine-tuning helps the model capture the semantics of the source language and produce translations that preserve the coherence and intended meaning of longer texts. Algorithm 2 and
Table 3 provide an abstraction of the proposed model. The proposed model generates translations with two variants of M2M100, 418M and 12B, while the 1.2B variant is used for re-ranking.
3.11. Re-Ranking Candidates
Re-ranking in neural machine translation (NMT) involves generating multiple candidate translations for a given input sentence and subsequently re-evaluating these candidates to select the best one. This process helps enhance translation accuracy and diversity by refining the initial outputs. The proposed model, as described in
Section 3.10, is used to generate a set of candidate translations $C = \{c_1, c_2, \ldots, c_n\}$, where each $c_i$ represents a candidate translation. For each candidate, a set of features is extracted to inform the re-ranking decision. These features include the initial translation score, length, word alignments, translation errors, and coherence.
The initial translation score is obtained directly from the primary NMT model’s output probabilities. The translation length is the number of tokens in the candidate translation, which provides a measure of completeness and conciseness. Word alignments are evaluated using tools like FastAlign to assess the mapping quality between source and target words. Translation errors, such as grammatical mistakes, mistranslations, or missing entities, are identified using error detection models or rule-based checks. Translation coherence is evaluated by measuring the semantic consistency of the candidate translation with reference sentences or by checking its overall logical flow.
Algorithm 2 Translation Model with In-Trust Loss and Tokenization
Input: Dataset $D$, Clean Subset $D_{\mathrm{clean}}$, Epochs $N$, Learning Rate $\eta$
Output: Trained Model
Data Preprocessing:
for each $(x, y) \in D$ do
    if $(x, y)$ is clean then
        Add $(x, y)$ to $D_{\mathrm{clean}}$
    end if
end for
Initialize M2M100Tokenizer, M2M100 model, BERT model, and SentencePiece model
function InTrustLoss(…)
    return …
end function
Initialize Adam optimizer with learning rate $\eta$
for epoch = 1 to $N$ do
    for each training pair $(x, y)$ do
        optimizer.zero_grad()
        NER and OOV Management:
        Perform NER on $x$ to recognize named entities
        Handle OOV words in $x$ using subword units
        …
        loss ← InTrustLoss(…)
        loss.backward()
        optimizer.step()
        Print “Epoch”, epoch, “Loss:”, loss
    end for
end for
Re-ranking Process:
for each input sentence c do
    Generate candidate translations $C = \{c_1, c_2, \ldots, c_n\}$
    for each candidate $c_i$ in C do
        Extract features for $c_i$ (e.g., initial score, length, alignment, coherence)
        Assign re-ranked score using M2M100-1.2B model
    end for
    Select best translation
end for
return Trained Model
These features are then fed into a re-ranking neural network, which produces a re-ranked score for each candidate $c_i$. The candidate with the highest re-ranked score is selected as the final translation output.
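The feature-based selection can be sketched as follows. To keep the example short, a hand-weighted linear scorer stands in for the re-ranking neural network; the `Candidate` fields, the weights in `WEIGHTS`, and the length penalty are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    model_score: float  # log-probability from the primary NMT model
    length: int         # number of tokens in the candidate
    alignment: float    # alignment quality in [0, 1]
    errors: int         # number of detected translation errors
    coherence: float    # coherence score in [0, 1]


# Hypothetical feature weights; a trained re-ranker would learn these.
WEIGHTS = {"model": 1.0, "alignment": 2.0, "coherence": 2.0,
           "errors": 1.0, "length": 1.0}


def rerank_score(c: Candidate, src_len: int) -> float:
    """Linear stand-in for the re-ranking network."""
    length_penalty = abs(c.length - src_len) / max(src_len, 1)
    return (WEIGHTS["model"] * c.model_score
            + WEIGHTS["alignment"] * c.alignment
            + WEIGHTS["coherence"] * c.coherence
            - WEIGHTS["errors"] * c.errors
            - WEIGHTS["length"] * length_penalty)


def select_best(candidates, src_len):
    """Return the candidate with the highest re-ranked score."""
    return max(candidates, key=lambda c: rerank_score(c, src_len))


candidates = [
    Candidate("A", model_score=-1.0, length=5, alignment=0.6, errors=2, coherence=0.5),
    Candidate("B", model_score=-1.5, length=6, alignment=0.9, errors=0, coherence=0.9),
]
best = select_best(candidates, src_len=5)
print(best.text)
```

In this toy case the candidate with the lower model score is selected because it has fewer errors and better alignment and coherence, which is the situation re-ranking is meant to catch.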
Contrastive Re-Ranking Method
A contrastive re-ranking method is employed with the M2M100-1.2B model variant, as illustrated in
Figure 5. Because this variant is trained on a broad range of language pairs, it can evaluate and rank translations across languages and is sensitive to subtle differences in translation quality. Positive samples come from high-quality translations in the bilingual corpus, and negative samples are drawn from Diverse Beam Search [48] outputs. The re-ranker, implemented as a neural network that combines the extracted features to predict translation quality, assigns a new score to each candidate, and the candidate with the highest score is selected as the best translation. Let $r_i$ be the re-ranked score of candidate $c_i$. The best translation $c^{*}$ is given by
$$c^{*} = \arg\max_{c_i \in C} r_i$$
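In this setup, $h_x$ represents the hidden features of the input text, and $h_y$ are the features for the target samples, with $h_y^{+}$ and $h_y^{-}$ denoting positive and negative sample features, respectively. A non-linear projection layer is applied on top of M2M100 to refine these features. The calculation of the two types of features with temperature $\tau$ is as follows:
$$s^{+} = \frac{\mathrm{sim}(h_x, h_y^{+})}{\tau}, \qquad s^{-} = \frac{\mathrm{sim}(h_x, h_y^{-})}{\tau}$$
where $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity function between the projected features.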
This approach favors a final translation that is high-quality while being selected from a diverse pool of candidates.
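For illustration, a contrastive objective over such features can be sketched as below. The specific form is an assumption: the sketch uses cosine similarity and an InfoNCE-style loss with one positive and several negatives, which is a common choice for temperature-scaled contrastive training but is not confirmed as the exact loss used here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(h_x, h_pos, h_negs, tau=0.1):
    """InfoNCE-style loss (assumed form): -log softmax of the positive pair.

    h_x: source features; h_pos: positive target features;
    h_negs: list of negative target features; tau: temperature.
    """
    pos = cosine(h_x, h_pos) / tau
    logits = [pos] + [cosine(h_x, h_neg) / tau for h_neg in h_negs]
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - pos


# The loss is small when the positive is closer to the source than the negative.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(good < bad)
```

A lower temperature sharpens the contrast between positive and negative pairs, so small differences in similarity translate into larger differences in the loss.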