4.4.1. Comparative Experiments
To further verify the effectiveness of the proposed ILFDN-Transformer model in the Mongolian–Chinese neural machine translation task, we conduct comparative experiments against four representative baseline models: LGP-NMT [32], Transparent [33], LaSyn [34], and GNMT [35]. All models are evaluated on the same dataset under consistent experimental settings. The BLEU score is adopted as the main evaluation metric to assess translation quality. The detailed performance results are presented in Table 4.
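Since BLEU is the main metric throughout this section, the computation it performs can be sketched as follows. This is a simplified sentence-level variant with uniform n-gram weights, floor smoothing, and a brevity penalty; actual evaluations typically use a standard toolkit such as sacreBLEU, and the smoothing constant here is an illustrative choice, not the paper's exact configuration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())        # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor-smooth zeros
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(
        1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores 1.0; shorter or partially matching candidates score strictly between 0 and 1.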
As shown in Table 4, the proposed ILFDN-Transformer model achieves the highest BLEU score, significantly outperforming all the compared models. This indicates that ILFDN-Transformer accurately captures long-distance dependencies and implicit structural information in the source language, thereby improving both the accuracy and fluency of the translation.
The relatively low BLEU score of the Transparent model may be attributed to its limited ability to integrate source language information and effectively capture contextual dependencies. In contrast, the GNMT model performs better than the Transparent model, suggesting its enhanced capacity for contextual modeling and local feature extraction. The BLEU scores of the LaSyn and LGP-NMT models are relatively close, and both demonstrate comparable improvements over the baseline. However, their performance still falls short of ILFDN-Transformer, possibly due to instability in their feature learning processes, which negatively affects the final translation quality.
Notably, compared to the baseline Transformer model, ILFDN-Transformer achieves a BLEU improvement of 3.53 points, clearly validating the effectiveness of the proposed enhancements: the BART-based implicit semantic feature augmentation, the mBERT-guided deliberation decoding network, and the weighted hybrid positional encoding mechanism. These innovations significantly strengthen the model’s ability to capture deep semantic information and long-range dependencies, refine semantic reasoning during decoding, and ultimately enhance translation accuracy and fluency.
4.4.2. Ablation Experiments
To assess the contribution of each module to the overall performance of the model, we conducted a series of ablation studies. First, we compared the performance of mBERT with other pre-trained models in capturing implicit linguistic features. Second, we examined the impact of different knowledge distillation strategies on the model’s generalization ability and translation quality. Finally, we progressively added the key components of our model—including the BART-based implicit semantic enhancement mechanism, the deliberation decoding structure, and the weighted hybrid positional encoding—to evaluate their individual and combined contributions. These experiments provide theoretical support for further model optimization.
(1) Comparison between mBERT and Other Pre-trained Models
To assess the effectiveness of different pre-trained models in guiding the deliberation decoding process, we conducted comparative experiments using four models: ELMo [36], BERT [37], GPT-2 [38], and mBERT. Table 5 presents their respective impacts on translation performance within this framework.
The results indicate that mBERT achieves a significantly higher BLEU score compared to the other models. This demonstrates that its multilingual pre-training strategy enables it to capture richer syntactic and semantic representations, thereby providing stronger guidance for the second-stage decoder and improving overall translation accuracy. ELMo performs relatively poorly in this task, suggesting that its BiLSTM-based architecture lacks the capacity to supply sufficient linguistic knowledge for effective decoding guidance. Although BERT leverages a bidirectional Transformer structure and can model contextual relationships more comprehensively via bidirectional attention, its training on monolingual data limits its capacity to support the deliberation decoder effectively. GPT-2, despite its strong performance in monolingual language modeling, adopts a unidirectional architecture that restricts its ability to capture bidirectional semantic dependencies. As a result, its guidance effect in the deliberation process is inferior to that of mBERT. By comparing the performance of various pre-trained models in the nudging module, we observe that mBERT demonstrates greater advantages than other models in terms of semantic alignment and translation quality. This indirectly confirms the effectiveness and compatibility of the heterogeneous model combination we adopt—monolingual BART and multilingual mBERT. In summary, considering its structural strengths in multilingual adaptability and bidirectional encoding, along with its validated performance in this task, we select mBERT as the knowledge source for the nudging module.
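The guidance role of the pre-trained model in the second-stage (nudging) decoder can be illustrated with a toy fusion step: the decoder state attends over the teacher model's representations of the first-pass draft, and the resulting context is mixed back in through a gate. The dot-product attention and the scalar `gate` here are illustrative assumptions for exposition, not the paper's exact architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def guided_state(dec_state, teacher_states, gate=0.5):
    """Fuse a second-pass decoder state with attention over teacher
    (e.g. mBERT) representations of the first-pass draft.
    gate = 0 ignores the teacher; gate = 1 uses only teacher context."""
    weights = softmax([dot(dec_state, t) for t in teacher_states])
    context = [sum(w * t[i] for w, t in zip(weights, teacher_states))
               for i in range(len(dec_state))]
    return [(1 - gate) * d + gate * c for d, c in zip(dec_state, context)]
```

With `gate=0` the decoder state passes through unchanged; with `gate=1` the state is replaced entirely by the attended teacher context.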
(2) Impact of Different Distillation Strategies on the Model
To evaluate the impact of different knowledge distillation strategies within the BART-based implicit linguistic feature enhancement mechanism, we designed a series of comparative experiments. In addition to the feature-based distillation (FD) strategy adopted in this paper, we also explored target-based distillation (LD), gradient-aligned distillation (GKD), and task-aware layer-wise distillation (TED).
Table 6 reports the effect of each knowledge distillation strategy on the model's translation performance.
The results demonstrate that feature-based distillation (FD) yields the best BLEU score. This is because FD acts directly on the hidden layers, enabling the student model to inherit the teacher model's feature representation capability and thus acquire rich semantic information. In contrast, target-based distillation (LD) performs the worst, likely because it provides only soft-target supervision at the output layer, without leveraging the deeper semantic representations of the teacher model. Gradient-aligned distillation (GKD) improves the student model's performance by aligning gradient directions during training; although it outperforms LD, it remains inferior to TED and FD because its transfer of semantic knowledge is indirect. Task-aware layer-wise distillation (TED) enhances feature learning by aligning intermediate representations at each layer, but its effectiveness is hindered by potential noise introduced during layer-wise alignment, which may reduce stability and limit further gains relative to FD. To illustrate the impact of the different distillation strategies more intuitively, we plot the accuracy and loss curves of the four methods in Figure 5 and Figure 6, respectively.
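The FD objective described above can be sketched as a mean squared error between student and teacher hidden states, with the teacher states treated as fixed targets. Projection layers for matching hidden dimensions, layer selection, and loss weighting are omitted; this is a sketch of the general FD objective, not the paper's exact formulation.

```python
def fd_loss(student_hidden, teacher_hidden):
    """Feature-based distillation loss: elementwise MSE between
    student and teacher hidden-state matrices (lists of vectors).
    Assumes both have the same shape; the teacher is not updated."""
    assert len(student_hidden) == len(teacher_hidden)
    total, n = 0.0, 0
    for s_vec, t_vec in zip(student_hidden, teacher_hidden):
        assert len(s_vec) == len(t_vec)
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            n += 1
    return total / n
```

In training, this term would be added to the usual cross-entropy translation loss so the student both fits the data and mimics the teacher's representations.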
(3) Impact of Different Modules on the Model
To evaluate the contributions of the BART-based implicit linguistic feature enhancement mechanism, the mBERT-integrated deliberation decoding mechanism, and the hybrid positional encoding mechanism to overall model performance, we conducted a series of ablation studies. By progressively introducing each of these core components, we assessed their respective impacts on translation accuracy and semantic understanding. The changes in BLEU scores at each stage are presented in Table 7.
Starting from the baseline Transformer model, which scores 41.88 BLEU, incorporating the weighted hybrid positional encoding (MPE) raised the score to 42.91, an improvement of 1.03 points. This demonstrates that the MPE mechanism enables the model to encode and exploit positional information more effectively. Building on this, adding the mBERT-guided nudge decoding module further boosted the BLEU score to 44.29, a gain of 1.38 points over the previous stage (2.41 points over the baseline). This result confirms that the nudging decoder significantly improves the model's capacity to capture long-range dependencies and semantic nuances in the source language. Finally, integrating the BART-based implicit linguistic feature enhancement mechanism yielded a BLEU score of 45.41, a total improvement of 3.53 points over the baseline Transformer. This highlights the effectiveness of incorporating implicit linguistic priors and validates their role in enhancing translation quality. Collectively, these results show that each proposed module contributes meaningfully to performance, and that the synergistic integration of the modules significantly optimizes overall translation performance.
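The idea behind a weighted hybrid positional encoding can be sketched as a convex mix of the standard fixed sinusoidal table and a learned table. The random initialisation of the learned component and the scalar mixing weight `alpha` are illustrative assumptions; in the actual model the learned encodings and weighting would be trained.

```python
import math
import random

def sinusoidal_pe(max_len, d_model):
    """Standard fixed sinusoidal positional encoding table."""
    pe = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def hybrid_pe(max_len, d_model, alpha=0.5, seed=0):
    """Weighted mix of fixed sinusoidal and (here randomly initialised)
    learned positional encodings: alpha * fixed + (1 - alpha) * learned."""
    rng = random.Random(seed)
    fixed = sinusoidal_pe(max_len, d_model)
    learned = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)]
               for _ in range(max_len)]
    return [[alpha * f + (1 - alpha) * l for f, l in zip(frow, lrow)]
            for frow, lrow in zip(fixed, learned)]
```

Setting `alpha=1.0` recovers the purely sinusoidal encoding, while `alpha=0.0` uses only the learned table; intermediate values blend absolute-position regularity with task-adapted flexibility.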
Considering that automatic evaluation metrics alone may not fully capture model performance in terms of semantic expression and linguistic naturalness, this paper further incorporates a manual subjective evaluation of the four models above, focusing on two key dimensions: semantic adequacy and linguistic fluency. We randomly selected 200 sentence pairs from the test set and invited three linguistics graduate students proficient in both Mongolian and Chinese to independently score each translation on a 5-point scale. The averaged scores are summarized in Table 8.
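The aggregation of the manual scores can be sketched as a simple mean over all raters and sentences per dimension. The data layout (a dict of per-rater score lists) and the dimension names are illustrative; the paper specifies only that three raters' 5-point scores were averaged.

```python
def average_scores(ratings):
    """Average 5-point scores over all raters and sentences.
    ratings: {rater_id: {"adequacy": [...], "fluency": [...]}}"""
    out = {}
    for dim in ("adequacy", "fluency"):
        vals = [s for r in ratings.values() for s in r[dim]]
        out[dim] = sum(vals) / len(vals)
    return out
```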
From the results, the Transformer model exhibits more semantic omissions and incoherent expressions, resulting in relatively lower scores. After incorporating the MPE module, semantic coverage improves; however, issues with word order and syntax remain. With the introduction of the nudging mechanism, the model shows significant enhancements in expression hierarchy and semantic completeness. The ILFDN-Transformer model achieves the highest scores in both semantic fidelity and linguistic naturalness, demonstrating superior language generation capabilities and robustness. These manual evaluation results further validate the effectiveness of each module in improving translation quality, particularly highlighting that the synergistic integration of all three modules can substantially optimize the model’s overall performance.
To more intuitively illustrate the practical impact of the ILFDN-Transformer model in real-world translation scenarios, we randomly selected three sample translations from the test set for qualitative analysis. By comparing the outputs of different ablation configurations on the same source sentences, we clearly demonstrate the positive effects of implicit linguistic features, the nudging decoder, and hybrid positional encoding on translation quality. The examples are shown in Figure 7.
Furthermore, we conduct a more fine-grained analysis of the translation examples in Figure 8, systematically illustrating the specific contributions of each module in addressing key translation challenges. The hybrid positional encoding significantly improves word order and structural representation; the nudging mechanism performs strongly in capturing semantic nuances and restoring omitted information; and the implicit linguistic feature encoding further enhances overall semantic consistency and contextual coherence. The synergistic effect of all modules leads to notable improvements in translation accuracy, clarity of expression, and language fluency, thereby validating the effectiveness of the proposed approach.