1. Introduction
In recent years, machine translation has emerged as a prominent research domain within the field of Natural Language Processing (NLP). Thanks to advances in computer science, technology, and hardware resources, machine translation has made great strides. Deep-learning-based neural machine translation (NMT) methods have superseded statistical machine translation, emerging as the prevailing approach in the field. As NMT has evolved, the fundamental encoder–decoder framework has remained unchanged, while the backbone network has progressed from recurrent neural networks (RNNs) to Convolutional Neural Networks (CNNs), and ultimately to the Transformer model.
In the early stages, RNNs were widely used as the backbone network. Lei et al. demonstrated that using independent recurrent neural networks can greatly enhance the performance of machine translation models [1,2]. Sutskever et al. of the Google team built on this work and proposed an RNN–RNN model [3], which became the now-common sequence-to-sequence model. Baliyan et al. combined Long Short-Term Memory (LSTM) with RNNs to complement language translation with sentiment analysis capability [4]. In this model, knowledge-based context vectors were integrated to facilitate the mapping of multilingual vocabulary, while an RNN was employed to ensure good results. Wang et al. designed a bidirectional Gated Recurrent Unit (GRU) model for English translation analysis based on the RNN model, which made full use of word vectors in the construction of language sequences [5]. Later, recurrent neural networks were gradually replaced by other backbone networks. Kalchbrenner and Blunsom proposed applying CNNs in a recurrent continuous translation model, which improves the ability to learn the mapping between continuous representations of the implicit links between phrases and sentences during translation [6]. Wang et al. proposed a generative adversarial network-based neural machine translation model that uses adversarial thinking to take the order of emotional directions into account, making the translation results more human-like [7].
The field of machine translation witnessed a groundbreaking shift with the advent of the Transformer model, which ushered in a new era by integrating self-attention, thereby enhancing the efficacy and precision of the translation process. Hu et al. [8] introduced a Transformer-based NMT model designed specifically for important-information fusion; notably, this model demonstrated exceptional performance in translating long sentences. To address the fact that traditional attention mechanisms could not fully utilize the hidden information of target words, Li et al. proposed a novel enhanced attention mechanism that incorporates hidden details from target words into both RNN-based and self-attention-based translation models; theoretical analysis demonstrated the facilitative role of these hidden details in improving translation prediction [9]. To improve word representations and translation performance, Wang et al. incorporated part-of-speech sequence information [10]. While this approach improved translation performance, it remains unclear whether the artificially added information enhances the model's encoding ability or merely serves as noise that bolsters robustness, so the approach lacks interpretability.
Transformer models remain the most widely used backbone networks for NMT systems. We list some key reasons for the prevalence of the Transformer architecture in contemporary NMT frameworks:
Efficiency and Scalability: Transformer models are characterized by a streamlined architecture that necessitates fewer parameters, thereby reducing the computational power required.
Enhanced Computational Throughput: The Transformer architecture’s ability to process inputs in parallel marks a significant departure from the sequential processing inherent to recurrent neural network (RNN)-based NMT systems. This parallelization facilitates a remarkable improvement in computational efficiency, allowing for the utilization of larger datasets within the same computing constraints.
Superior Language Representation: The Transformer model is able to overcome the limitations associated with encoding long-distance dependencies—a notable challenge for RNNs. Through the Self-Attention (SA) mechanism, Transformers can flexibly and efficiently capture relationships between any elements in the input sequence, regardless of their positional distance.
However, the Transformer model has its limitations. Although multi-head semantic analysis captures inter-word relationships well, it ignores inter-word structural information to some extent. Therefore, in recent years, a series of Large Language Models (LLMs) based on the Transformer architecture, such as OpenAI's GPT and Google's BERT, have been applied increasingly widely in the field of machine translation [11]. To improve the model's ability to use in-context learning, Li et al. proposed a demonstration-aware Large Translation Model (LTM) based on mixed demonstration types [12]. It determines the demonstration type of a training sample by randomly selecting sentence pairs from the training set as sentence-level demonstrations or continuous context text as document-level demonstrations. Zhu et al. proposed a robust approach that enables LLMs to achieve robust translation with In-Context Learning (ICL) [13]. This method adopts a multi-view strategy that considers both sentence-level and word-level information to capture the relationships between words and sentences effectively, so as to select demonstrations that effectively avoid noise.
Besides the various backbone models used in NMT, different input language granularities and text characterization methods have been explored.
Table 1 summarizes some word embedding methods of NMT systems.
As shown in Table 1, NMT systems adopt different combinations of language granularity and text characterization methods. However, all of these text characterizations use static word vectors. Among them, the Random method used by Vaswani et al. [16] initializes a word embedding matrix randomly and trains it jointly with the model. After training, each row of the word embedding matrix corresponds to a fixed word; the resulting vector representations are therefore essentially static word vectors. Static word vectors are generally pre-trained on a corpus to obtain distributed representations of words. The notable characteristic of this approach is that pre-trained word vectors can be reused across diverse downstream NLP tasks without additional training, thereby enhancing efficiency. Nevertheless, real language environments involve polysemy, where a single word may carry multiple semantic nuances and grammatical interpretations. Static word vectors cannot effectively represent such differences, resulting in semantic and grammatical deviations.
Moreover, in terms of language granularity, most representative NMT systems, whether character-based, word-based, or subword-based, use single-granularity input. This can limit the encoder's representation capacity and the model's performance, since the ability to accurately interpret the input sentence directly affects the NMT system's overall effectiveness. In particular, East Asian languages like Chinese do not use spaces as a natural division between words, unlike alphabet-based languages. Therefore, when dealing with NLP tasks on a Chinese corpus, the corpus must first be segmented. Different word segmentation tools produce segmentation results of different granularities, which leads to performance differences. Morishita et al. [17] used a hierarchical network to fuse multiple subword granularities as input. Su et al. [18] employed a word lattice to combine different levels of word granularity for improved Chinese–English translation.
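To make the granularity issue concrete, consider how a short Chinese sentence can be split at the character level versus one possible word level. The snippet below is purely illustrative and uses the open-source jieba segmenter as a stand-in; the segmenters actually trained in this work follow the MSR, PKU, and CTB standards and may produce different boundaries.

```python
# Illustrative only: jieba stands in for the MSR/PKU/CTB-standard segmenters
# used in this work; different tools and standards yield different boundaries.
import jieba

sentence = "我爱自然语言处理"   # "I love natural language processing"

chars = list(sentence)          # character-level granularity
words = jieba.lcut(sentence)    # one possible word-level granularity

print(chars)  # ['我', '爱', '自', '然', '语', '言', '处', '理']
print(words)  # e.g., ['我', '爱', '自然语言', '处理']
```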
To address the challenges of word representation in Transformers for Chinese, we introduce the Dynamic Multi-Granularity Translation System (DMGTS), a novel modification of the Transformer model that incorporates multi-granularity position encoding and self-attention mechanisms. Specifically, four different word segmentation methods are applied to the inputs to obtain four levels of granularity. A Directed Acyclic Graph (DAG) is utilized to transform the multi-granularity inputs into position encodings, and ELMo [19] is employed to generate dynamic word embeddings. These dynamic embeddings are fed into a modified Transformer in which multi-granularity self-attention mechanisms replace the conventional self-attention layers. Our extensive evaluation of the DMGTS on benchmark datasets reveals substantial improvements over methods with single-granularity input and static word embedding.
2. Dynamic Multi-Granularity Translation System
In this section, we introduce the DMGTS, which modifies the word embedding of the traditional Transformer; its multi-granularity position encoding and multi-granularity self-attention mechanisms are described below. The architecture of the DMGTS is shown in Figure 1. The DMGTS comprises four main components: pre-trained multi-granularity word segmentation, dynamic word embedding, multi-granularity relative position encoding, and an encoder–decoder in which the multi-granularity self-attention mechanism is introduced.
Specifically, the DMGTS processes the text inputs as follows:
1. The text input sequence is segmented into multiple granularities, including character-level granularity and three word-level granularities; the details of multi-granularity word segmentation are given later. These granularities are then modeled using a Directed Acyclic Graph (DAG) fusion approach.
2. We use the DAG to transform the multi-granularity representation of the input into a granularity sequence together with the corresponding head and tail position representations.
3. The input is passed through the pre-trained ELMo model to obtain dynamic word vectors.
4. We convert the head and tail position representation sequences into a multi-granularity relative position representation, and further convert it into a relative position encoding via the trigonometric function [20,21,22].
5. The dynamic text feature vectors are fed into the model encoder. The multi-granularity self-attention layer integrates the positional encoding obtained in step 4 into the calculation of the attention weights; specifically, the model considers the current position and the relative distance to the attended position when computing attention. The resulting text feature vector is fed to the position feed-forward layer, and the results are then passed through the subsequent encoder layers from bottom to top.
6. The encoder–decoder attention layer uses the output of the last encoder layer as its input. Following word embedding and positional encoding, the decoder receives the previously translated output and passes it through the self-attention layer and the position feed-forward layer across its six layers.
7. The text feature vector generated by the decoder is fed into the output layer to produce the final output.
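The following pseudocode-style sketch summarizes the seven steps above in one place; every helper name (segment_multi_granularity, build_dag_positions, and so on) is a hypothetical placeholder for the components detailed in Sections 2.1–2.4, not an actual API.

```python
# Hypothetical end-to-end sketch of the DMGTS forward pass; all helper names
# are placeholders for the components described in Sections 2.1-2.4.
def dmgts_forward(sentence, model):
    # Step 1: character-level plus three word-level (MSR/PKU/CTB) segmentations.
    segments = segment_multi_granularity(sentence)               # placeholder
    # Step 2: DAG fusion yields the granularity sequence and, for each unit,
    # its head- and tail-node positions.
    units, heads, tails = build_dag_positions(segments)          # placeholder
    # Step 3: dynamic (contextual) word vectors from pre-trained ELMo.
    x = elmo_embed(units)                                        # placeholder
    # Step 4: multi-granularity relative position encoding from head/tail
    # positions via the trigonometric function (Section 2.3).
    rel_pos = multi_granularity_position_encoding(heads, tails)  # placeholder
    # Steps 5-7: encoder with multi-granularity self-attention, six-layer
    # decoder, and the linear + softmax output layer.
    memory = model.encode(x, rel_pos)                            # placeholder
    return model.decode_and_project(memory)                      # placeholder
```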
2.1. Pre-Trained Multi-Granularity Word Segmentation
Similar to the work of Su et al. [18], we use the open-source toolkit from Stanford University to train three word segmentation tools on datasets following the MSR, PKU, and CTB segmentation standards, respectively, and use these three different segmenters to segment the machine translation corpus and obtain different word granularities.
At the beginning of the training process, the data are partitioned into three distinct sets: the training set, the validation set, and the test set, with proportions of 70%, 10%, and 20%, respectively. Five-fold cross-validation is employed during training. To assess word segmentation performance, the F1 score (F1), recall (Recall), and precision (Precision) are computed using Equation (1) upon completion of training.
Table 2 and Figure 2 display the performance metrics of the three word segmentation tools obtained through training.
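For clarity, precision, recall, and F1 for word segmentation relate as in the standard definitions sketched below; this sketch assumes the paper's Equation (1) follows the usual formulation based on counts of correctly recovered words.

```python
# Standard precision/recall/F1 for word segmentation, computed from counts of
# correctly segmented words; assumed to match the paper's Equation (1).
def segmentation_prf(num_correct: int, num_predicted: int, num_gold: int):
    precision = num_correct / num_predicted
    recall = num_correct / num_gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 92 of 100 predicted words match the 98 gold-standard words.
print(segmentation_prf(92, 100, 98))  # approx. (0.920, 0.939, 0.929)
```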
2.2. Dynamic Word Embedding
In the original Transformer model, the word embedding module uses a random initialization method: a word embedding matrix of size V × d_model is initialized randomly, where V is the vocabulary size and d_model is the model dimension, and its trainable flag is set to True. After training is completed, each word in the vocabulary corresponds to a fixed vector representation, i.e., a static word vector. Instead, we explore building the Transformer model with different word embedding modules to evaluate the influence of dynamic and static word embeddings.
Word2vec and GloVe are chosen as the representative static embedding methods. To match the Transformer model's input, a fully connected neural network layer is cascaded to transform each vector into a 512-dimensional one. The word2vec and GloVe models are pre-trained in advance on the WMT and NIST Chinese–English datasets. ELMo is selected as the dynamic word vector method; due to its high pre-training cost, we use an open-source ELMo model pre-trained on the large-scale Chinese corpus Chinese Gigaword.
Table 3 provides a comprehensive summary of the statistical information for the three word embedding modules used in our experiments, highlighting their distinct characteristics and pre-training datasets. By comparing these methods, we aim to gain deeper insights into how static and dynamic embeddings influence the overall performance of the Transformer model in machine translation tasks.
Whether the word embedding module uses static or dynamic word vectors, the model is fine-tuned on our specific datasets with a small learning rate. In our experiments, the learning rate is best set to 5 × 10−5.
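A minimal PyTorch sketch of the cascaded fully connected layer and small-learning-rate fine-tuning described above is given below; the 300-dimensional source size and the Adam optimizer are illustrative assumptions, not settings reported in this paper.

```python
import torch
import torch.nn as nn

class ProjectedStaticEmbedding(nn.Module):
    """Pre-trained static word vectors followed by a cascaded fully connected
    layer that maps them to the Transformer input dimension (512)."""

    def __init__(self, pretrained: torch.Tensor, d_model: int = 512):
        super().__init__()
        # freeze=False keeps the embeddings trainable so they can be
        # fine-tuned with a small learning rate, as described in the text.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.proj = nn.Linear(pretrained.size(1), d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(token_ids))

# Usage sketch with the learning rate reported in the text (5e-5).
pretrained_vectors = torch.randn(30000, 300)  # stand-in for word2vec/GloVe
embedding = ProjectedStaticEmbedding(pretrained_vectors)
optimizer = torch.optim.Adam(embedding.parameters(), lr=5e-5)
```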
2.3. Multi-Granularity Position Encoding
2.3.1. Relative Position Encoding
In the development of the Transformer model, there have been two major approaches to position encoding: absolute position encoding and relative position encoding.
1. Absolute Position Encoding
Since the Transformer model processes inputs in parallel, it cannot obtain the position information of the input sequence from the order of processing as RNNs do. Therefore, the original Transformer model [16] used a trigonometric (sinusoidal) position encoding, which is essentially an absolute position encoding that maps each position to a vector using the sin and cos functions. The absolute position encoding obtained with trigonometric functions can reflect the relative distance between input words to a certain extent but cannot represent direction. Therefore, existing studies using the Transformer rarely rely on absolute position encoding alone [21,22,23].
2. Relative Position Encoding
Figure 3a illustrates that the original SA mechanism cannot represent the temporal relationship between words. Therefore, some researchers [24] proposed the Relative Position Representation (RPR). Whereas the absolute position representation is reflected in the sum of the position encoding and the input word embedding, the relative position representation adds a set of trainable embeddings to the SA mechanism, so that the relative distance between the current position and the attended position is taken into account in the calculation [25,26].
After adding the relative position representation, as shown by the numbers in the figure, the two instances of "I" in different positions receive different output encodings. Figure 3b shows the output encoding process of the first "I" and Figure 3c shows that of the second "I". When the Transformer calculates the attention between "I" and "therefore", for the first "I", because "therefore" is the second word to its right, the model uses the relative position embedding for a distance of two to the right; for the second "I", because "therefore" is the first word to its left, the model uses the embedding for a distance of one to the left.
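For reference, the trigonometric (sinusoidal) absolute position encoding discussed above, which is also the form referred to by Equations (4) and (5) later in this paper, can be sketched in its textbook form as follows; this is a generic implementation rather than the paper's own code.

```python
import math
import torch

def sinusoidal_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Textbook sin/cos absolute position encoding (Vaswani et al. [16])."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sin
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cos
    return pe

pe = sinusoidal_position_encoding(max_len=128, d_model=512)  # shape (128, 512)
```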
2.3.2. Multi-Granularity Relative Position
The original Transformer model used a single position representation sequence for absolute position encoding. Similarly, the relative position representation was based on aligning the input sequence with its corresponding positional sequence. However, when processing multi-granularity feature input, two positional representation sequences are required, making it difficult to use the existing relative position representation.
To address this issue, we propose a new Multi-Granularity Position Encoding (MGPE) approach based on relative position encoding, which combines the existing relative position representation with multi-granularity feature input. This method allows for more precise encoding of relative positions and enhances overall performance. Inspired by graph-based relative position encoding [27], we utilize a DAG to handle the multi-granularity features.
Figure 4 shows the position representation of the multi-granularity feature using a DAG, where the two positions are the head and the tail node positions.
Specifically, we propose to use the position of the head node (head) and the position of the tail node (tail) to calculate the relative position of words from four different granularities. These four relative distance matrices are then concatenated to form a comprehensive multi-granularity representation. To further refine the representation, we apply a nonlinear transformation to the concatenated matrix, producing the final Multi-Granularity Position Encoding (MGPE). This approach enables the model to effectively capture positional relationships across different granularities, enhancing the precision of position encoding in multi-granularity NMT systems. The detailed formula for this method is provided in the following section.
For any word, we use its head-node position (head) and tail-node position (tail) to represent its position. The relative position encoding vector between two words can then be calculated under the multi-granularity feature. First, we calculate the relative distance between the two words from four perspectives (head–head, head–tail, tail–head, and tail–tail), as shown in Equations (2a)–(2d). Four relative distance matrices are thereby obtained, as shown in Figure 5.
To obtain the position encoding vector between two words, the four relative distance matrices are first concatenated and then passed through a nonlinear transformation, as described in Equation (3), where the transformation weight is a trainable parameter and the embeddings of the relative distances can be calculated with the absolute position encoding of Equations (4) and (5).
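A minimal sketch of the computation just described is given below. The four head/tail distance combinations standing in for Equations (2a)–(2d), the ReLU nonlinearity, and the way the sinusoidal encoding is applied to distances are all assumptions made for illustration; the paper's exact forms are given by Equations (2a)–(2d) and (3).

```python
import torch
import torch.nn as nn

def multi_granularity_relative_encoding(head, tail, distance_embed, w_r):
    """Sketch of MGPE: four relative distance matrices from DAG head/tail
    positions, concatenated and passed through a nonlinear transformation.

    head, tail:     (n,) tensors of head-/tail-node positions per unit.
    distance_embed: callable mapping an (n, n) signed-distance matrix to
                    sinusoidal embeddings of shape (n, n, d_pe) (assumed).
    w_r:            trainable nn.Linear(4 * d_pe, d_model).
    """
    # Four relative-distance perspectives (assumed forms of Eqs. (2a)-(2d)).
    d_hh = head.unsqueeze(1) - head.unsqueeze(0)  # head_i - head_j
    d_ht = head.unsqueeze(1) - tail.unsqueeze(0)  # head_i - tail_j
    d_th = tail.unsqueeze(1) - head.unsqueeze(0)  # tail_i - head_j
    d_tt = tail.unsqueeze(1) - tail.unsqueeze(0)  # tail_i - tail_j

    # Concatenate the distance embeddings, then apply a nonlinear transform
    # (assumed ReLU) to obtain the multi-granularity position encoding.
    stacked = torch.cat(
        [distance_embed(d) for d in (d_hh, d_ht, d_th, d_tt)], dim=-1
    )
    return torch.relu(w_r(stacked))  # shape (n, n, d_model)
```

Here, distance_embed could, for instance, reuse the sinusoidal encoding sketched in Section 2.3.1, indexed by suitably shifted signed distances.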
2.4. Multi-Granularity Self-Attention
In this section, we introduce the multi-granularity self-attention mechanism, which replaces the conventional self-attention mechanism. We first describe the encoder–decoder structure of our DMGTS. The DMGTS consists of a six-layer encoder, an output layer, and a six-layer decoder. The encoder layers are composed of multi-granularity self-attention and position feed-forward sublayers; the decoder layers are composed of multi-head attention, position feed-forward, and encoder–decoder attention sublayers. Finally, the output layer is composed of a linear transformation and a Softmax layer.
Based on Shaw et al.'s work [28], we integrate the multi-granularity relative position encoding discussed in the previous section into the original model. As a result, the SA mechanism is transformed into a multi-granularity self-attention (MGSA) mechanism.
Equation (6) shows the general form of the SA mechanism with absolute position encoding: the absolute position encodings are added to the inputs, an attention weight is computed from them, and the attention-weighted sum gives the final output of the self-attention layer. The expanded form of the attention weight is shown in Equation (7).
To introduce relative position information into training synchronously, we remove the absolute position terms from Equation (6), replace the second term with a key-side relative position vector, and replace the corresponding term in the attention output with a value-side relative position vector. After this conversion to relative position vectors, we obtain the SA mechanism with relative position encoding given in Equations (8) and (9):
The added key-side and value-side vectors are two relative position factors that can be trained and learned. These factors are closely related to the multi-granularity relative position representation. Specifically, the connection between the multi-granularity relative position encoding and this representation can be expressed mathematically as shown in Equations (10) and (11):
where the absolute position encoding generally takes the trigonometric form of Equations (4) and (5), and a clipping function restricts its output to a certain window range. For example, when the window size is 3, the minimum index is 0 and the maximum index is 6; clipping therefore means taking the relative distance and limiting the corresponding index to the range of 0 to 6. The two factors are trainable and correspond to the relative position weights between words in the key vector and the value vector, respectively.
The multi-granularity SA mechanism builds on Equations (8) and (9) for calculating the key-side and value-side relative position factors. To derive the corresponding multi-granularity form, Equation (4) is used; the calculation method is then given in Equations (12) and (13).
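The sketch below shows single-head self-attention with trainable, window-clipped relative position factors in the spirit of Shaw et al. [28], which is the mechanism Equations (8)–(11) build on; how the multi-granularity encoding of Section 2.3.2 replaces these factors in Equations (12) and (13) is not reproduced here. The window size of 3 (indices 0 to 6) matches the example in the text; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with Shaw-style key- and value-side relative
    position factors, clipped to a window (illustrative sketch only)."""

    def __init__(self, d_model: int, window: int = 3):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.window = window
        # 2 * window + 1 trainable embeddings: relative distances -w..+w are
        # mapped to indices 0..2w (0..6 when the window size is 3).
        self.rel_k = nn.Embedding(2 * window + 1, d_model)  # key-side factor
        self.rel_v = nn.Embedding(2 * window + 1, d_model)  # value-side factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n, d_model)
        n, d = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Clipped relative distances, shifted into the index range 0..2w.
        pos = torch.arange(n)
        rel = (pos.unsqueeze(0) - pos.unsqueeze(1)).clamp(-self.window, self.window)
        idx = rel + self.window                              # (n, n)

        a_k, a_v = self.rel_k(idx), self.rel_v(idx)          # (n, n, d)

        # Attention weights with the key-side relative factor (cf. Eq. (8)).
        scores = (q @ k.T + torch.einsum("id,ijd->ij", q, a_k)) / d ** 0.5
        attn = F.softmax(scores, dim=-1)

        # Output with the value-side relative factor (cf. Eq. (9)).
        return attn @ v + torch.einsum("ij,ijd->id", attn, a_v)
```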
4. Experiments
Our DMGTS was built using the Fairseq toolbox [32], and the machine learning models were developed with PyTorch 2.0.1 on a Linux server. The hardware was a server equipped with 32 GB of memory, an RTX 2080 Ti graphics card, and an Intel i7-9700K CPU. In this study, the control variable method was employed to conduct comparative experiments and verify the impact of each module of the proposed DMGTS.
In our experimental framework, the Bilingual Evaluation Understudy (BLEU) score serves as the universal metric for assessing the quality of the translations produced by our models. BLEU, a widely accepted measure in the field of machine translation, quantifies the correspondence between a machine’s output and that of a human translator, focusing on the precision of n-gram matches across the two texts. This metric enables a quantitative evaluation of the translation’s fidelity and fluency, providing a standardized way to compare the effectiveness of different model configurations and input granularities.
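As a practical note, corpus-level BLEU can be computed with an off-the-shelf implementation such as sacrebleu; the paper does not state which BLEU implementation it uses, so the snippet below is only one common option.

```python
# One common way to compute corpus-level BLEU; sacrebleu is used here purely
# as an example, not necessarily the implementation used in this work.
import sacrebleu

hypotheses = ["the cat sat on the mat", "there is a book on the desk"]
references = [["the cat sat on the mat", "there is a book on the table"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```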
4.1. Multi-Granularity Feature Experiment
The goal of the multi-granularity feature experiment is to evaluate how varying levels of granularity in the input affect the performance of translation tasks, specifically comparing multi-granularity inputs against single-granularity inputs. To ensure the reliability of the results, it is crucial to maintain consistency across all other variables in the experiment as much as possible.
Therefore, we adopted the traditional Transformer model as the backbone and combined it with the different single-granularity word segmentation methods. Given that the source language of the corpus is Chinese, single-granularity inputs are categorized into four distinct types: character-level granularity and three separate word-level granularities. Specifically, the four baseline variants denote the Transformer model combined with character, MSR, PKU, and CTB segmentation, respectively. In this experiment, we employed random initialization for the word embedding of the multi-granularity input to our DMGTS and refer to this variant as the DMGTS with random embedding.
Table 6 and Table 7 show the experimental results of multi-granularity input compared with single-granularity input on the WMT2019 Zh–En dataset and the NIST Chinese–English dataset.
In Table 6, we observe that the DMGTS variant with multi-granularity input outperforms all other models on both the validation set (newstest2018) and the test set (newstest2019). The BLEU scores show a clear advantage for the DMGTS, with scores of 25.47 on the validation set and 30.37 on the test set, a significant improvement over the single-granularity input models. Similarly, Table 7 shows the improvement of the multi-granularity variant over the single-granularity models in terms of BLEU scores. Notably, on the MT06 set, the DMGTS achieves a BLEU score of 41.27, surpassing the single-granularity models by at least 1.87 points.
Across both datasets, the DMGTS consistently outperforms the baseline Transformer models that use single-granularity inputs. This indicates that its approach of integrating multi-granularity information into the Transformer architecture not only addresses the limitations of single granularity but also capitalizes on the rich linguistic information present at multiple levels of language structure.
It is worth noting that among the methods with single-granularity inputs, character-level input clearly outperforms the word-level alternatives. Additionally, when comparing the translation outputs of models trained with the MSR, PKU, and CTB word segmentations, the performance metrics are closely matched.
4.2. Ablation Study on Relative Position Factors
To study the impact of the key-side and value-side relative position factors on model performance, we tested the model with and without each factor. Experiments were conducted on both datasets with our model.
Table 8 presents the impact of the relative position factors on translation quality as measured by BLEU scores. As shown in Table 8, if both factors are eliminated, the BLEU score on the WMT dataset is only 17.32, and the BLEU score on the NIST dataset is only 22.67. When either relative position factor is added, the corresponding BLEU score improves greatly. The inclusion of both positional expressions yields the highest BLEU scores of 30.37 for WMT and 41.25 for NIST. However, the marginal gains from including both factors rather than just one are relatively slight, suggesting that each factor independently possesses a strong capability to encapsulate relative positional information effectively.
4.3. Ablation Study on Word Embeddings
To evaluate the impact of different word embedding methods on the performance of machine translation models, we conducted an ablation study using the DMGTS as the framework with various word embedding modules on both datasets. The compared variants employ random initialization, word2vec, GloVe, and ELMo for their word embeddings, respectively; the first three are static word vectors, and the last uses dynamic word vectors.
Table 9 shows the BLEU scores with the different word embedding modules. The ELMo dynamic word embedding module yields a notable improvement in translation quality, as evidenced by the increased BLEU scores of 31.53 on the WMT dataset and 42.61 on the NIST dataset. These scores are up to 1.16 and 1.55 points higher than those achieved with the static word vector methods. This enhancement substantiates the premise that dynamic word embeddings contribute positively to model performance. Among the static embeddings, the three methods perform closely, indicating that while there is slight variation in their effectiveness, the differences are minimal.
5. Discussion
In addressing the limitations of current NMT systems, our research focuses on overcoming the challenges associated with static word vectors and the constraints of single-granularity inputs. Static word vectors often fail to capture the nuanced differences in semantic grammar, leading to deviations that can affect the accuracy of translations. Furthermore, the prevalent approach in NMT systems of relying on a single level of language granularity—whether character-based, word-based, or subword-based—restricts the encoder’s ability to represent input sentences efficiently and reliably. To bridge these research gaps, we introduced the DMGTS. This innovative adaptation of the Transformer model integrates multi-granularity position encoding and self-attention mechanisms to accommodate multiple levels of language granularity.
Through experiments on two Chinese–English translation datasets, we demonstrate the importance of multi-granularity position encoding and dynamic word embedding for improving the translation quality of the DMGTS. Specifically, with multi-granularity position encoding, the DMGTS achieved BLEU scores of 25.47 on the validation set (newstest2018) and 30.37 on the test set (newstest2019), indicating a significant enhancement in translation accuracy and fluency. On the MT06 set, it attained a BLEU score of 41.27, outstripping single-granularity models by a margin of at least 1.87 points. Additionally, the implementation of the ELMo dynamic word embedding module within the DMGTS further amplifies its translation quality, as evidenced by the improved BLEU scores of 31.53 on the WMT dataset and 42.61 on the NIST dataset, surpassing the results of the static word vector methods by averages of 1.16 and 1.35 points, respectively.
While our experiments validate the effectiveness of the proposed DMGTS in the field of neural machine translation, particularly for Chinese–English translation tasks, there are some areas for future work. One valuable direction is to extend the application of DMGTS to other translation tasks. The effectiveness of multi-granularity position encoding and self-attention mechanisms has been demonstrated within this language pair, but its applicability and performance in other translation tasks remain untested. Language pairs with different syntactic, grammatical, and semantic structures may present unique challenges that the current model configuration might not address effectively. Expanding the testing to include diverse language pairs, such as those involving non-Indo-European languages with varying levels of morphological complexity, would provide a more comprehensive understanding of the model’s versatility and areas for improvement.
Another direction worth exploring is to study multi-granularity position encoding in applications beyond translation. While our DMGTS showcases the potential of integrating multi-level linguistic information in translation tasks, the broader applicability of this concept across other NLP tasks has not been investigated. Tasks such as text summarization, question answering, and natural language inference could potentially benefit from the nuanced understanding and representation of language that multi-granularity approaches offer.
Furthermore, the current DMGTS implementation may encounter limitations related to computational efficiency and resource demands. The integration of multi-granularity inputs and dynamic embeddings, while beneficial for capturing linguistic nuances, increases the model’s complexity and the computational resources required for training and inference, compared to methods with single-granularity input and static word embedding.
6. Conclusions
In this paper, we developed the DMGTS, a novel NMT model that enhances translation accuracy by integrating multi-granularity features with dynamic word vectors. Through comprehensive ablation studies on both WMT and NIST Chinese–English datasets, we demonstrated the significant impact of multi-granularity input on improving translation performance. Additionally, our experiments underscored the critical role of relative position factors in the model’s effectiveness. By comparing the performance of dynamic word vector embedding, as realized through the ELMo model, against traditional static word vector embeddings like word2vec and GloVe, we showcased the superior capability of our model to handle the complexity of language, particularly in capturing the context-sensitive nature of words.
Our findings reveal that the integration of multi-granularity features with dynamic word embeddings substantially outperforms conventional static embedding methods, yielding an average increase of 1.10 and 1.39 BLEU scores on the WMT and NIST Chinese–English translation tasks, respectively. This advancement highlights the limitations of existing NMT models that rely on static embeddings and underlines the benefits of our approach, which leverages ELMo to enhance the encoder network for more context-aware, dynamic representations.