Article

Bilingual–Visual Consistency for Multimodal Neural Machine Translation

1 College of Software Engineering, Zhengzhou University of Light Industry, Zhengzhou 450001, China
2 National Engineering Laboratory for Internet Medical Systems and Applications, Zhengzhou University, Zhengzhou 450052, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(15), 2361; https://doi.org/10.3390/math12152361
Submission received: 31 May 2024 / Revised: 3 July 2024 / Accepted: 22 July 2024 / Published: 29 July 2024

Abstract: Current multimodal neural machine translation (MNMT) approaches primarily focus on ensuring consistency between visual annotations and the source language, often overlooking the broader aspect of multimodal coherence, including target–visual and bilingual–visual alignment. In this paper, we propose a novel approach that effectively leverages target–visual consistency (TVC) and bilingual–visual consistency (BiVC) to improve MNMT performance. Our method leverages visual annotations depicting concepts across bilingual parallel sentences to enhance multimodal coherence in translation. We exploit target–visual harmony by extracting contextual cues from visual annotations during auto-regressive decoding, incorporating vital future context to improve target sentence representation. Additionally, we introduce a consistency loss promoting semantic congruence between bilingual sentence pairs and their visual annotations, fostering a tighter integration of textual and visual modalities. Extensive experiments on diverse multimodal translation datasets empirically demonstrate our approach’s effectiveness. This visually aware, data-driven framework opens exciting opportunities for intelligent learning, adaptive control, and robust distributed optimization of multi-agent systems in uncertain, complex environments. By seamlessly fusing multimodal data and machine learning, our method paves the way for novel control paradigms capable of effectively handling the dynamics and constraints of real-world multi-agent applications.

1. Introduction

The complex and uncertain environments of multi-agent systems, along with inaccurate system dynamics, present significant challenges for effective modeling, control, and optimization. In multimodal neural machine translation (MNMT), the concept of visual annotation, which captures the essence of content in bilingual sentence pairs, has gained much attention [1,2,3,4,5,6,7,8]. Visual annotations are typically represented as images or videos that depict the main concepts and actions described in the corresponding text. This method uses an extra encoder to turn visual annotations into visual representations, effectively conveying the content of the source sentence [9]. These visual representations are then integrated into the decoder along with the source sentence representation [10]. This process enriches the context vector, which evolves over time and contributes to the step-by-step generation of the target translation. The successful integration of visual information has led to the development of the bi-encoder-to-decoder framework, which simultaneously translates the source sentence and its visual annotation into a target sentence with the same meaning, opening up new possibilities in MNMT [11,12].
While MNMT has demonstrated the ability to extract valuable translation cues from visual information, thereby enhancing the context vector’s role in generating the target translation through auto-regressive decoding, current approaches primarily focus on aligning the visual annotation with the source language. This narrow focus fails to fully address the broader concept of multimodal consistency, which encompasses not only source–visual alignment but also target–visual and bilingual–visual alignment. Visual annotation captures the meaning of both the source sentence (e.g., English) and the corresponding target sentence (e.g., German), a phenomenon known as target–visual consistency. Exploiting this consistency allows for the MNMT model to extract valuable contextual information from the visual input, leading to a more informed and accurate translation process. Furthermore, the visual annotation reflects the semantic content of both the source and target sentences, introducing the concept of bilingual–visual consistency. To fully leverage the potential of visual information, MNMT should aim to replicate this bilingual alignment, promoting semantic harmony between the parallel sentences and the visual annotation. Incorporating both target–visual and bilingual–visual consistencies enables MNMT to fully exploit the rich information contained in visual annotations, leading to significant improvements in translation quality. By leveraging these two forms of consistency, MNMT can access a wealth of contextual information, resulting in more accurate, nuanced, and coherent translations.
Despite the potential benefits of incorporating target–visual and bilingual–visual consistencies, current MNMT methods often treat visual data as supplementary information rather than integrating them deeply into the core translation process. This limitation arises because these methods fail to fully exploit the potential of visual information, resulting in suboptimal contextual integration and less coherent translations. To address this issue, there is a need for novel MNMT approaches that deeply integrate visual information into the translation process, leveraging target–visual and bilingual–visual consistencies to improve translation quality and coherence.
To address the limitations of current MNMT approaches and fully leverage the potential of visual information, we propose a novel multimodal consistency approach that effectively utilizes target–visual consistency (TVC) and bilingual–visual consistency (BiVC) derived from visual annotations. For TVC, we employ an attention layer to extract future context from the visual annotation under the supervision of the ground-truth future target textual context, forming a multimodal target context. This extracted feature is then fed into a masked self-attention module to learn a target representation summarizing both past and future context information, enabling the model to capture long-range dependencies and generate more coherent translations. To promote BiVC, we introduce a bilingual–visual consistency loss term to guide the training of MNMT, encouraging semantic agreement between the learned bilingual sentence representations and the visual representation. The main contributions of this work are as follows:
  • New target–visual consistency approach: We propose a new method to leverage future context cues from visual data, addressing the limitations of auto-regressive decoders and enabling the model to generate more accurate and coherent translations.
  • Bilingual–visual consistency: We introduce a new loss term that guides the learning of semantically aligned textual and visual representations, fostering a tighter semantic integration and improving the overall quality of the translations.
  • Performance evaluation: Through extensive experiments on widely used multimodal translation datasets, such as Multi30k English-to-French/German/Czech [13] and Flickr30kEnt-JP Japanese-to-English [14], we demonstrate that our approach achieves significant performance improvements over strong baselines and sets new state-of-the-art results.
By deeply integrating visual information into the translation process, our approach not only enhances the accuracy and fluency of translations but also opens up new possibilities for multimodal communication. This work paves the way for future research that can harness the rich contextual cues provided by visual data, ultimately leading to the development of more advanced and human-like language processing systems capable of understanding and translating complex, multimodal content.

2. Related Work

MNMT encompasses the translation of a target sentence alongside pertinent non-linguistic cues, such as visual information [4,15,16,17]. A notable approach, introduced by [18], involves a latent variable model that intricately intertwines visual information and textual features, forming a robust foundation for MNMT. This pioneering work delved into the complex interplay between visual and textual modalities, showcasing the substantial benefits of incorporating visual cues into the translation process. Another noteworthy contribution, as discussed by [19], explored MNMT by incorporating visual information as an additional spatiotemporal context to facilitate the translation of a source sentence into the target language. Their approach dynamically emphasized key words within the source sentence and integrated essential spatiotemporal cues from images into the decoder, enabling the generation of the target sentence. While their method effectively leveraged visual information to guide the decoding process, it is important to note that they did not directly encode image features or explicitly model the varying importance of different modalities. Furthermore, the work by [12] introduced a multimodal approach based on the transformer architecture [20]. Their approach induced hidden representations of images from the text, guided by image-aware attention mechanisms. This innovative methodology laid the groundwork for a more comprehensive integration of textual and visual information, enhancing the model’s capacity to understand and generate translations that faithfully capture the essence of both modalities. Taken together, these seminal works highlight the growing importance of integrating visual cues into the MNMT framework and propose diverse strategies to effectively harness the complementary nature of textual and visual contents, ultimately leading to significant improvements in translation quality.
In the realm of MNMT, the concept of multimodal consistency revolves around the synchronization of visual and textual information to convey the same underlying semantics. An influential study by [2] integrated global visual features into an encoder–decoder framework, leveraging an attention-based recurrent neural network (RNN). This work laid the foundation for subsequent approaches, such as [17,21,22], which harnessed global visual information to establish simultaneous neural machine translation (NMT). These methods effectively utilize visual cues to complement the incomplete textual modality during the decoding process, demonstrating the potential of multimodal consistency in enhancing the robustness and efficiency of MNMT systems. Furthermore, visual information has been leveraged as a pivot to facilitate the creation of a shared multilingual visual–semantic embedding space in various approaches. For instance, Ref. [23] highlighted the significance of visual information in enhancing alignments within the latent language spaces, emphasizing the shared physical perceptual nature of visual cues across different languages. This insight has important implications for the development of more effective multilingual MNMT models that can better capture cross-lingual semantic correspondences. Additionally, Ref. [24] put forth a technique that employed visual agreement regularization during training to foster bilingual representations by aligning source-to-target and target-to-source models, further underscoring the critical role of multimodal consistency in improving the quality and coherence of translations. Moreover, LSTM networks have been widely used in NMT and have shown robust performance in various tasks. They effectively handle sequential data and maintain long-term dependencies through their gating mechanisms. While LSTMs are effective, they tend to be less efficient in capturing long-range dependencies compared to transformers. Additionally, LSTMs rely on sequential processing, which can be a bottleneck for training speed and scalability. The self-attention mechanism in transformers allows for more flexible and context-aware representations, which are crucial for handling the complexities of multimodal inputs. Transformers are more scalable and efficient for large datasets, a critical factor given the size of our training data.
Drawing inspiration from these advancements, our study harnesses the power of multimodal consistency in two key aspects to push the boundaries of MNMT performance. Firstly, we utilize multimodal consistency to enable our model to capture future contexts from visual cues, introducing a novel approach to enhance target context modeling and generate more accurate and fluent translations. Secondly, we employ this consistency to ensure semantic coherence between bilingual parallel sentences and the anchored visual annotation, proposing a new training objective that encourages the model to learn more robust and semantically aligned representations. By deeply integrating multimodal consistency into the core of our MNMT framework, we aim to unlock the full potential of visual information and set a new state of the art in the field.

3. Background of Multimodal Transformer

In this section, we introduce an advanced multimodal transformer framework for MNMT [12], which has achieved state-of-the-art performance on the Multi30k multimodal translation task. Unlike the classical transformer framework, this model incorporates a multimodal self-attention mechanism to encode both textual and visual information, learning a visually aware representation of the source sentence that serves as input to the decoder for generating the target translation word-by-word.

3.1. Multimodal Self-Attention

Given an input textual sentence of length $J$, represented as $X_{text} = (x_1, \ldots, x_J)$, the traditional self-attention mechanism, denoted as $\mathrm{ATT}_s$, computes a new representation $H_{text} = (h_1, h_2, \ldots, h_J)$. This mechanism dynamically weights the importance of each word within the sentence when computing the representation of each word. $\mathrm{ATT}_s$ projects each word $x_i$ into Query ($x_i W_e^Q$), Key ($x_j W_e^K$), and Value ($x_j W_e^V$) spaces using layer-specific trainable matrices $W_e^Q, W_e^K, W_e^V \in \mathbb{R}^{d_{model} \times d_{model}}$, where $d_{model}$ is the dimension of the word embedding. The attention score for each word pair $(i, j)$ is computed using the scaled dot product:
$$\mathrm{score}_{ij} = \frac{(x_i W_e^Q)(x_j W_e^K)^\top}{\sqrt{d_{model}}}. \quad (1)$$
Next, the new representation $h_i$ for each word $x_i$ is computed as the weighted sum of the value projections:
$$h_i = \sum_{j=1}^{J} \alpha_{ij}\,(x_j W_e^V), \quad (2)$$
where $\alpha_{ij} = \mathrm{softmax}(\mathrm{score}_{ij})$; the softmax ensures that the attention weights sum to 1.
Formally, this process is expressed as
$$h_i = \mathrm{ATT}_s(x_i, X_{text}) = \sum_{j=1}^{J} \mathrm{softmax}\!\left(\frac{(x_i W_e^Q)(x_j W_e^K)^\top}{\sqrt{d_{model}}}\right)(x_j W_e^V). \quad (3)$$
This model enhances focus on relevant parts of the input sequence by assigning higher weights to more significant words based on their contextual relationships.
In contrast to the traditional self-attention mechanism, which processes only the textual modality, the multimodal self-attention mechanism, denoted as MATT, seamlessly integrates visual information into the text processing framework. Guided by image-aware attention, this mechanism adaptively combines textual and visual inputs to enhance the representational power of the model. Formally, the inputs consist of two modalities: text, represented by $X_{text} \in \mathbb{R}^{J \times d_{model}}$, and image, represented by $X_{image} \in \mathbb{R}^{N \times d_{model}}$. These are concatenated into a single input $B_{multi} = [X_{text}; X_{image}] \in \mathbb{R}^{(J+N) \times d_{model}}$, written as $B_{multi} = (b_1, b_2, \ldots, b_M)$ with $M = J + N$. Each multimodal input $b_m$ and each word $x_j$ is projected into Query, Key, and Value spaces. Analogously to $\mathrm{ATT}_s$, the MATT operation is formally expressed as
$$z_m = \mathrm{MATT}_s(b_m, X_{text}) = \sum_{j=1}^{J} \mathrm{softmax}\!\left(\frac{(b_m W_e^Q)(x_j W_e^K)^\top}{\sqrt{d_{model}}}\right)(x_j W_e^V). \quad (4)$$
This yields a visually informed representation $Z_{multi} = (z_1, z_2, \ldots, z_M) \in \mathbb{R}^{(J+N) \times d_{model}}$ that effectively captures the nuances of both textual and visual inputs.
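To make the multimodal self-attention concrete, the following PyTorch sketch implements a single-head version of Equation (4), in which the queries come from the concatenated text and image sequence while the keys and values come from the text alone. The class name, the single-head simplification, and the toy dimensions are our own illustrative assumptions, not the authors' released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalSelfAttention(nn.Module):
    """Single-head sketch of MATT in Equation (4): queries come from the
    concatenated text+image sequence, keys/values come from the text only."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_e^Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_e^K
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W_e^V

    def forward(self, x_text: torch.Tensor, x_image: torch.Tensor) -> torch.Tensor:
        # x_text: (J, d_model), x_image: (N, d_model)
        b_multi = torch.cat([x_text, x_image], dim=0)              # B_multi: (J+N, d_model)
        q = self.w_q(b_multi)                                      # queries from both modalities
        k = self.w_k(x_text)                                       # keys from text only
        v = self.w_v(x_text)                                       # values from text only
        scores = q @ k.transpose(0, 1) / math.sqrt(self.d_model)   # (J+N, J), scaled dot product
        attn = F.softmax(scores, dim=-1)
        return attn @ v                                            # Z_multi: (J+N, d_model)

# Toy usage: 5 source tokens, 3 image regions, d_model = 8
layer = MultimodalSelfAttention(d_model=8)
z_multi = layer(torch.randn(5, 8), torch.randn(3, 8))
print(z_multi.shape)  # torch.Size([8, 8])
```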

3.2. Auto-Regressive Decoder

The auto-regressive decoder in MNMT generates target words sequentially, conditioned on the previously generated words and the multimodal inputs. Given the previously generated target words $y_{<t} = (y_1, y_2, \ldots, y_{t-1})$, the decoder computes the target representation $s_t$ using the target attention mechanism $\mathrm{ATT}_t$, as defined in Equation (5):
$$s_t = \mathrm{ATT}_t(q_t, y_{<t}) = \sum_{k=1}^{t-1} \mathrm{softmax}\!\left(\frac{(q_t W_d^Q)(y_k W_d^K)^\top}{\sqrt{d_{model}}}\right)(y_k W_d^V), \quad (5)$$
where $q_t$ is the target hidden state at time step $t$, and $W_d^Q, W_d^K, W_d^V \in \mathbb{R}^{d_{model} \times d_{model}}$ are trainable parameter matrices specific to the decoder. The representation $s_t$ captures the dependencies among the previously generated target words and guides the decoding process.
Next, the decoder employs the context attention module $\mathrm{ATT}_c$ to compute the context vector $c_t$, which integrates the multimodal context information $Z_{multi}$:
$$c_t = \mathrm{ATT}_c(s_t, Z_{multi}) = \sum_{m=1}^{M} \mathrm{softmax}\!\left(\frac{(s_t W_c^Q)(z_m W_c^K)^\top}{\sqrt{d_{model}}}\right)(z_m W_c^V), \quad (6)$$
where $W_c^Q, W_c^K, W_c^V \in \mathbb{R}^{d_{model} \times d_{model}}$ are additional trainable matrices. The context vector $c_t$ is then passed through a feed-forward neural network to compute the probability distribution over the next target word $\hat{y}_t$:
$$P(\hat{y}_t \mid y_{<t}, X_{text}, X_{image}) = \mathrm{softmax}\big(W_o \tanh(W_w c_t)\big), \quad (7)$$
where $W_o$ and $W_w$ are learnable parameters. This formulation ensures that each target word prediction is conditioned on both the previous target words and the multimodal input $B_{multi}$.
Training the MNMT model $\theta$ involves maximizing the log-likelihood of the correct translation sequence $Y$ given the textual and visual inputs $X_{text}$ and $X_{image}$:
$$\arg\max_{\theta} \sum_{t=1}^{T} \log P(y_t \mid y_{<t}, X_{text}, X_{image}; \theta), \quad (8)$$
where $T$ is the length of the target sequence. This objective is commonly optimized with the cross-entropy loss, ensuring that the model learns to generate accurate translations conditioned on both textual and visual context.
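As a rough illustration of the objective in Equation (8), the snippet below computes the summed negative log-likelihood of a reference sequence under teacher forcing; minimizing it is equivalent to maximizing the log-likelihood above. The function name and the padding convention are assumptions for this sketch, and the logits stand in for whatever implements Equations (5)–(7).

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Negative log-likelihood corresponding to Equation (8), summed over the target sequence.

    logits:     (T, vocab_size) -- decoder outputs W_o * tanh(W_w * c_t), one row per step
    target_ids: (T,)            -- ground-truth target words y_1..y_T
    """
    # cross_entropy applies log-softmax internally, matching Equation (7)
    return F.cross_entropy(logits, target_ids, ignore_index=pad_id, reduction="sum")

# Toy usage: T = 4 target steps, vocabulary of 10 words
logits = torch.randn(4, 10)
targets = torch.tensor([3, 7, 2, 1])
loss = sequence_nll(logits, targets)   # minimizing this maximizes Equation (8)
print(loss.item())
```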

4. Multimodal Consistency-Based MNMT

In this section, we first propose to extract target future context information from the visual annotation using target–visual consistency, thereby enhancing the time-dependent target representation (abbreviated as TVC). We then use bilingual–visual consistency to guide the training of MNMT, encouraging semantic agreement between the learned bilingual sentences and the pivoted visual annotation (abbreviated as BiVC). Figure 1 shows an overview of the proposed multimodal consistency-based MNMT.

4.1. Target–Visual Consistency-Enhanced Target Representation

Auto-regressive decoders in NMT and MNMT are known to struggle with effectively modeling future target context during generation [25,26,27,28,29]. However, as discussed in Section 1, visual annotations encapsulate semantic information of both source and target sentences. We leverage this target–visual consistency to extract prospective target–side contextual cues from the visual input, thereby mitigating the aforementioned decoder limitation.
We first extract a visual feature related to the target future context under the supervision of the ground-truth target future textual context. This is achieved through an attention mechanism over the visual annotation:
$$u_t = \mathrm{ATT}_{image}(q_t, X_{image}) = \sum_{n=1}^{N} \mathrm{softmax}\!\left(\frac{(q_t W_r^Q)(r_n W_r^K)^\top}{\sqrt{d_{model}}}\right)(r_n W_r^V), \quad (9)$$
where $u_t \in \mathbb{R}^{1 \times d_{model}}$ is the extracted visual feature, $r_n$ denotes the $n$-th regional feature of $X_{image}$, $W_r^Q, W_r^K, W_r^V \in \mathbb{R}^{d_{model} \times d_{model}}$ are trainable matrices, and $q_t$ is the query vector based on the previously generated target words. To ensure that $u_t$ captures the desired target future context, we introduce an L1 regularization loss that minimizes the mean absolute error between $u_t$ and the ground-truth future target words $y_t^{future}$:
$$tvloss = \mathrm{L1loss}\big(u_t, \mathrm{Linear}_{future}(y_t^{future})\big), \quad (10)$$
where $\mathrm{Linear}_{future}$ reduces the dimension of $y_t^{future}$ to match that of $u_t$. Both $q_t$ and $u_t$ are then fed into a masked multimodal self-attention module to learn an enriched target representation $\hat{s}_t$:
$$\hat{s}_t = \mathrm{MATT}_t([q_t, u_t], y_{<t}) = \sum_{k=1}^{t-1} \mathrm{softmax}\!\left(\frac{([q_t, u_t] W_d^Q)(y_k W_d^K)^\top}{\sqrt{d_{model}}}\right)(y_k W_d^V). \quad (11)$$
Here, $\hat{s}_t \in \mathbb{R}^{1 \times d_{model}}$ encodes both the target past context from the previously generated words and the target future context from the visual annotation. This enriched representation $\hat{s}_t$ is then used to compute the context vector $c_t$ via cross-attention with the source representation $Z_{multi}$:
$$c_t = \mathrm{ATT}_c(\hat{s}_t, Z_{multi}) = \sum_{m=1}^{M} \mathrm{softmax}\!\left(\frac{(\hat{s}_t W_c^Q)(z_m W_c^K)^\top}{\sqrt{d_{model}}}\right)(z_m W_c^V). \quad (12)$$
Finally, $c_t$ is used to predict the current target word $\hat{y}_t$:
$$P(\hat{y}_t \mid y_{<t}, X_{text}, X_{image}) \propto \exp\big(W_o \tanh(W_w c_t)\big). \quad (13)$$
The training objective is revised to maximize the conditional translation probability while minimizing the target–visual consistency loss $tvloss$:
$$J(\phi) = \arg\max_{\phi} \big\{ celoss(Y \mid X_{text}, X_{image}) - tvloss(Y, X_{image}) \big\}. \quad (14)$$
By explicitly modeling target–visual consistency during training and inference, our approach can effectively leverage future context cues from visual data, overcoming the inherent limitations of auto-regressive decoders and leading to more informed and coherent target translations.
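A minimal sketch of the TVC computation in Equations (9) and (10) is given below, assuming the ground-truth future target words have already been pooled into a single embedding; the module name FutureVisualContext, the pooling assumption, and the toy dimensions are our own. A full implementation would additionally feed the returned u_t into the masked multimodal self-attention of Equation (11).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FutureVisualContext(nn.Module):
    """Extract a future-context feature u_t from the visual regions (Eq. (9))
    and score it against the ground-truth future target context (Eq. (10))."""

    def __init__(self, d_model: int, d_future: int):
        super().__init__()
        self.d_model = d_model
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # W_r^Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)   # W_r^K
        self.w_v = nn.Linear(d_model, d_model, bias=False)   # W_r^V
        # projects the pooled future target embedding to d_model, playing the role of Linear_future
        self.linear_future = nn.Linear(d_future, d_model)

    def forward(self, q_t, x_image, y_future_emb):
        # q_t: (1, d_model), x_image: (N, d_model), y_future_emb: (1, d_future)
        scores = self.w_q(q_t) @ self.w_k(x_image).t() / math.sqrt(self.d_model)  # (1, N)
        u_t = F.softmax(scores, dim=-1) @ self.w_v(x_image)                       # (1, d_model)
        tv_loss = F.l1_loss(u_t, self.linear_future(y_future_emb))                # Eq. (10)
        return u_t, tv_loss

# Toy usage: 3 image regions, d_model = 8, future context pooled into a 16-dim vector
tvc = FutureVisualContext(d_model=8, d_future=16)
u_t, tv_loss = tvc(torch.randn(1, 8), torch.randn(3, 8), torch.randn(1, 16))
```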

4.2. Bilingual–Visual Consistency-Guided MNMT

While traditional text-only NMT models rely on source–target consistency, MNMT introduces an additional dimension, i.e., bilingual–visual consistency. This concept posits that the visual annotation should coherently represent the content described in both source and target sentences. However, existing MNMT approaches often overlook this crucial aspect, potentially underutilizing the rich information present in visual annotations.
To address this limitation, we introduce a bilingual–visual consistency loss term to guide the training of our MNMT model. This encourages semantic agreement between the learned bilingual sentence representations and the pivotal visual annotation representation. Given the source textual representation $Z_{multi} \in \mathbb{R}^{J \times d_{model}}$ and the target context representation $C = (c_1, c_2, \ldots, c_T) \in \mathbb{R}^{T \times d_{model}}$, we first project them to match the dimension of the visual representation $X_{image}$:
$$Z'_{multi} = \mathrm{Linear}_s(Z_{multi}, X_{image}), \quad (15)$$
$$C' = \mathrm{Linear}_s(C, X_{image}). \quad (16)$$
Then, we compute the mean absolute error (L1 loss) between the projected bilingual sentence representations $\{Z'_{multi}, C'\}$ and the pivoted visual representation $X_{image}$:
$$bivloss(X_{text}, Y_{text}, X_{image}) = \mathrm{L1Loss}_{s2i}(Z'_{multi}, X_{image}) + \mathrm{L1Loss}_{t2i}(C', X_{image}), \quad (17)$$
where $\mathrm{L1Loss}_{s2i}$ measures the mean absolute error between $Z'_{multi}$ and $X_{image}$, and $\mathrm{L1Loss}_{t2i}$ measures the mean absolute error between $C'$ and $X_{image}$. The smaller the value of $bivloss(X_{text}, Y_{text}, X_{image})$, the higher the semantic consistency between the learned bilingual sentences and the pivoted visual annotation. To obtain a bilingual–visual consistency-guided MNMT model $\varphi$, the training objective maximizes the conditional translation probability over the training dataset $\{[X_{text}, X_{image}, Y]\}$ as follows:
$$L(\varphi) = \arg\max_{\varphi} \big\{ celoss(Y \mid X_{text}, X_{image}) - bivloss(X_{text}, Y_{text}, X_{image}) \big\}. \quad (18)$$
By explicitly encouraging the alignment of learned bilingual sentence representations with visual annotations during training, our approach effectively captures the underlying semantic relationships across modalities. This bilingual–visual consistency serves as an inductive bias, guiding the model to learn more coherent and semantically aligned representations, thereby enhancing translation performance. The proposed TVC and BiVC components synergize, with TVC leveraging visual data to enrich target representations and BiVC ensuring a tight semantic coupling between bilingual text and visual information. Together, they enable our MNMT approach to fully exploit the complementary strengths of textual and visual modalities, overcoming the limitations of previous methods. Our multimodal consistency framework introduces a novel paradigm for MNMT, moving beyond traditional techniques that treat visual data as merely supplementary. Instead, our approach deeply integrates visual information into the core translation process, enabling more informed, contextually rich, and semantically coherent translations.
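For concreteness, the sketch below computes a bivloss term in the spirit of Equation (17); it mean-pools each sequence before the L1 comparison, which is one plausible way to reconcile the different lengths of Z_multi, C, and X_image and is an assumption on our part rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilingualVisualConsistency(nn.Module):
    """Sketch of Equation (17): L1 distance between projected source/target
    sentence representations and the (pooled) visual representation."""

    def __init__(self, d_model: int, d_image: int):
        super().__init__()
        self.proj_src = nn.Linear(d_model, d_image)   # Linear_s applied to Z_multi
        self.proj_tgt = nn.Linear(d_model, d_image)   # Linear_s applied to C

    def forward(self, z_multi, c_seq, x_image):
        # z_multi: (J, d_model), c_seq: (T, d_model), x_image: (N, d_image)
        # mean-pool each sequence to a single vector (an assumption for this sketch)
        z_vec = self.proj_src(z_multi.mean(dim=0))
        c_vec = self.proj_tgt(c_seq.mean(dim=0))
        img_vec = x_image.mean(dim=0)
        # L1Loss_s2i + L1Loss_t2i
        return F.l1_loss(z_vec, img_vec) + F.l1_loss(c_vec, img_vec)

# Toy usage: 5 source tokens, 6 target steps, 3 image regions
biv = BilingualVisualConsistency(d_model=8, d_image=8)
biv_loss = biv(torch.randn(5, 8), torch.randn(6, 8), torch.randn(3, 8))
```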

5. Experimental Setup

5.1. Dataset and Setup

In this section, we introduce the dataset and evaluation metrics, and provide the detailed experimental settings. We conducted experiments on four language pairs from two widely used multimodal translation datasets, including Multi30k [13] for English-to-German (En-De), English-to-French (En-Fr), and English-to-Czech (En-Cs), and Flickr30kEnt-JP for Japanese-to-English (En-Ja) [14]. The Multi30k dataset contains 29K bilingual parallel sentence pairs with visual annotation, 1K validation instances, and 1K test instances. Flickr30kEnt-JP contains Japanese translations of the first two original English captions for each image in the Flickr30k [30] Entities dataset. We used the Test2017 and Test2016 datasets for the evaluation of the English–German task. Additionally, we used the Test2016 and Test2017 test sets to evaluate the proposed methods on the English-to-Czech and English-to-French tasks, respectively. All sentences were preprocessed by tokenizing and normalizing the punctuation using the Moses Toolkit [31]. To tokenize Japanese, we used the MeCab version 0.996 (http://taku910.github.io/mecab, accessed on 18 February 2023).
For evaluating the translation performance, we used two widely used automatic evaluation metrics, BLEU [32] and METEOR [33]. We employed the transformer [20] as the underlying architecture to design our model. Each encoder and decoder of the model has 6-layer stacked self-attention networks, 8 heads, 1024 hidden units, and 2048 feed-forward filter size. We used the Adam optimizer with a minibatch size of 64. For the learning rate, we used the default configuration of the transformer. Specifically, the size of the word embedding was set to 256 dimensions, and embeddings were learned from scratch. We extracted global image features using ResNet-50. The spatial features were 14 × 14 × 1024-dimensional vectors, which are representations of local spatial regions of the image. We trained the model for 20 epochs and set the warmup steps to 8000. During the training, the attention dropout and residual dropout were p = 0.1. An extra linear layer was utilized to project all visual features into 256 dimensions.
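For reference, the training settings reported above can be collected into a small configuration object; the dataclass below simply mirrors the reported values and is an illustrative summary rather than a configuration file released with the paper.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # transformer architecture (values reported in Section 5.1)
    num_layers: int = 6          # stacked self-attention layers in encoder and decoder
    num_heads: int = 8
    hidden_units: int = 1024
    ffn_size: int = 2048
    embedding_dim: int = 256     # word embeddings learned from scratch
    visual_dim: int = 256        # visual features projected by an extra linear layer
    # optimization
    optimizer: str = "adam"
    batch_size: int = 64
    epochs: int = 20
    warmup_steps: int = 8000
    attention_dropout: float = 0.1
    residual_dropout: float = 0.1

config = TrainingConfig()
print(config)
```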
For text preprocessing, we tokenized the text data using the Moses tokenizer and performed sentence segmentation to ensure consistency in text length. Byte Pair Encoding (BPE) was applied to handle rare words and improve vocabulary efficiency. For image preprocessing, we resized images to a fixed resolution to maintain consistency across the dataset and normalized the pixel values to have zero mean and unit variance. Data augmentation techniques, such as random cropping and horizontal flipping, were used to increase the robustness of the model. In terms of aligning text and images, each sentence was aligned with its corresponding image based on the dataset annotations, ensuring accuracy by cross-referencing with the dataset documentation and performing manual checks on a subset of the data. The main hyperparameters used in our experiments are shown in Table 1 below.
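The text and image preprocessing described above could be sketched roughly as follows using sacremoses and torchvision; the target resolution, crop size, and normalization statistics are illustrative placeholders, since the exact values are not reported here, and BPE would be applied to the tokenized output in a separate step.

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer
from torchvision import transforms

# Text side: punctuation normalization and tokenization with Moses
# (BPE would then be applied to the tokenized text, e.g. with subword-nmt).
normalizer = MosesPunctNormalizer(lang="en")
tokenizer = MosesTokenizer(lang="en")

def preprocess_sentence(sentence: str) -> list[str]:
    return tokenizer.tokenize(normalizer.normalize(sentence), escape=False)

# Image side: fixed resolution, normalization, and light augmentation.
# The resolution and normalization statistics below are illustrative defaults.
image_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

print(preprocess_sentence("Two dogs are playing in the park."))
```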

5.2. Baselines

We compare our proposed approach with the following representative and competitive baselines:
  • DMMT [4]: This method proposes distilling translations to solve the problem where visual information is only used by a second-stage decoder.
  • IMG [2]: This approach uses global features extracted from visual information using a pre-trained convolutional neural network. These global image features are then incorporated into the translation model.
  • SMMT [18]: This method models the interaction between visual and textual features through a latent variable, which is then used in the target-language decoder to predict image features.
  • EMMT [12]: This approach introduces a new attention mechanism to learn the representations of images based on textual information, avoiding the encoding of irrelevant visual information into latent representations.
  • VMMT [24]: This method employs a visual agreement regularized training on source-to-target and target-to-source models to obtain bilingual representations.
These baselines represent diverse approaches in MNMT, ranging from feature incorporation and latent variable modeling to attention mechanisms and regularization techniques. By comparing our method against these competitive baselines, we aim to demonstrate the effectiveness of our multimodal consistency approach in leveraging visual information for improved translation quality.

6. Results and Discussions

6.1. Main Results

Table 2 presents the primary results for our proposed methods and comparison methods on the En-De Test2016 and Test2017 test sets. The findings underscore the significant performance gains achieved by our EMMT+TVC, EMMT+BiVC, and EMMT+TVC+BiVC models over the baseline EMMT model, highlighting the effectiveness of TVC and BiVC in leveraging visual annotation to enhance MNMT performance.
Specifically, our experiments show that the EMMT+TVC model consistently outperformed the EMMT+BiVC model on both Test2016 and Test2017 test sets. This suggests that extracting target future context information from visual annotations contributes more effectively to translation quality than solely enforcing semantic agreement between bilingual sentences and visual annotations. Furthermore, the EMMT+TVC+BiVC model achieved higher BLEU scores compared to both the EMMT+TVC and EMMT+BiVC models on Test2016 and Test2017 test sets. This demonstrates that combining target–visual consistency and bilingual–visual consistency offers synergistic benefits, resulting in additional improvements in MNMT performance.
Our analysis indicates that the superior performance of the proposed method can be attributed to several key factors. The handling of specific linguistic phenomena is significantly improved; our model excels in translating sentences with ambiguous or context-dependent terms, as the visual context helps disambiguate such terms, leading to more accurate translations. For instance, in sentences with polysemous words, the visual context provides additional cues that help the model choose the correct meaning. Furthermore, the proposed method shows uniform performance across various sentence types, including simple declarative sentences, complex sentences with multiple clauses, and sentences with idiomatic expressions. The integration of visual information enhances the model’s contextual understanding, which is particularly beneficial for translating descriptive texts where visual elements play a crucial role. Examples from our experiments show that sentences describing scenes, objects, or actions are translated more accurately when visual context is incorporated.

6.2. Evaluation of Semantic Agreement via Bilingual–Visual Consistency Loss

The bilingual–visual consistency loss (bivloss) is a pivotal element in our proposed approach, designed to promote semantic coherence between bilingual parallel sentences and visual annotations. To assess its impact, we conducted a comprehensive analysis of bivloss scores alongside corresponding BLEU, METEOR, and TER (Translation Edit Rate) scores for the baseline EMMT model, the EMMT+BiVC model, and the EMMT+TVC+BiVC model on the En-De Test2016 and Test2017 test sets. Table 3 presents compelling evidence that integrating the bivloss term significantly enhances model performance. On both the Test2016 and Test2017 sets, the EMMT+BiVC model achieved markedly lower bivloss scores compared to the baseline EMMT model. Specifically, on Test2016, the EMMT+BiVC model recorded a bivloss score of 13.89, surpassing the baseline EMMT’s score of 15.01. Similarly, on Test2017, the EMMT+BiVC model achieved a bivloss score of 11.08, significantly lower than the baseline EMMT’s 16.77. Importantly, these reductions in bivloss were accompanied by improved BLEU and METEOR scores. On Test2016, the EMMT+BiVC model achieved a BLEU score of 39.11, outperforming the baseline EMMT’s score of 38.61, and a METEOR score of 57.8 compared to the baseline’s 56.2. This trend persisted on Test2017, where the EMMT+BiVC model scored 28.53 BLEU and 51.6 METEOR compared to the baseline’s 28.00 and 51.1, respectively.
Moreover, the TER scores provide additional insights into the translation quality. On Test2016, the EMMT+BiVC model achieved a TER of 37.5, improving over the baseline’s 38.7. On Test2017, the TER for the EMMT+BiVC model was 46.7, compared to the baseline’s 47.2. These findings underscore that encouraging bilingual–visual consistency through the bivloss term effectively aligns bilingual sentence representations with visual annotations, resulting in more coherent and higher-quality translations, as evidenced by the enhanced BLEU, METEOR, and TER scores. Furthermore, the EMMT+TVC+BiVC model, incorporating both TVC and BiVC approaches, achieved even lower bivloss scores compared to the EMMT+BiVC model. Specifically, on Test2016, the EMMT+TVC+BiVC model achieved a bivloss score of 6.88, further improving alignment between bilingual sentences and visual annotations. On Test2017, this score reduced to 5.93, indicating substantial progress in enhancing semantic coherence.
While the improvement in bivloss scores for the EMMT+TVC+BiVC model was moderate compared to the gain in BLEU scores over the EMMT+BiVC model, the TVC component played a pivotal role in enhancing translation quality. For instance, on Test2016, despite the bivloss score decreasing from 13.89 to 6.88, the EMMT+TVC+BiVC model achieved a BLEU score of 41.27, surpassing the EMMT+BiVC model’s 39.11 BLEU. A similar trend was observed on Test2017, where the EMMT+TVC+BiVC model’s BLEU score of 29.70 outperformed the EMMT+BiVC model’s 28.53 BLEU, despite a smaller reduction in bivloss (from 11.08 to 5.93). Additionally, the METEOR and TER scores highlight the comprehensive improvement achieved by the EMMT+TVC+BiVC model. On Test2016, the METEOR score increased to 59.2, and the TER improved to 36.1. On Test2017, the METEOR score reached 52.2, and the TER was 44.2. These results highlight that while the bilingual–visual consistency loss effectively aligns textual and visual representations, the target–visual consistency introduced by the TVC component plays a critical role in enhancing overall translation quality. By leveraging visual annotations to extract future target context, the EMMT+TVC+BiVC model mitigates autoregressive decoder limitations, thereby generating more informed and coherent translations.

6.3. Learning Curves of Loss and BLEU Scores for Multimodal Consistency-Based MNMT

To investigate the effect of multimodal consistency on MNMT, we analyze the learning curves of loss scores for both the baseline EMMT and the EMMT+TVC+BiVC models. We focus on the En-De development set for loss curves and on the En-De Test2016 and Test2017 test sets for BLEU score curves.
The baseline EMMT model employs the standard cross-entropy loss, denoted as $celoss_{EMMT}$ in Equation (8). In contrast, the EMMT+TVC+BiVC model integrates additional loss components: a target–visual consistency loss $tvloss_{+TVC+BiVC}$ (Equation (10)) and a bilingual–visual consistency loss $bivloss_{+TVC+BiVC}$ (Equation (17)), alongside the standard cross-entropy loss $celoss_{+TVC+BiVC}$.
Figure 2a illustrates the learning curves of these loss components for the EMMT+TVC+BiVC model on the En-De development set. The $celoss_{+TVC+BiVC}$ curve shows a consistent downward trend, indicating effective optimization of the standard cross-entropy loss throughout training, which is crucial for generating accurate target translations. Notably, both $tvloss_{+TVC+BiVC}$ and $bivloss_{+TVC+BiVC}$ exhibit decreasing trends over time. $tvloss_{+TVC+BiVC}$, responsible for aligning the target-side context with visual annotations, starts higher but steadily decreases as training progresses. Similarly, $bivloss_{+TVC+BiVC}$, aimed at maintaining semantic consistency between bilingual sentence representations and visual data, also shows a steady decline. These converging trends across all three loss components suggest that the EMMT+TVC+BiVC model effectively optimizes both the standard translation objective and the multimodal consistency goals, enhancing overall model coherence and performance.
Figure 2b presents the learning curves of BLEU score, highlighting that the EMMT+TVC+BiVC model consistently outperforms the baseline EMMT model on both En-De Test2016 and Test2017 test sets throughout the training epochs. Starting with higher BLEU scores, the EMMT+TVC+BiVC model demonstrates continuous improvement, underscoring the benefits of multimodal consistency approaches. Importantly, the gap in BLEU scores between the EMMT+TVC+BiVC model and the baseline EMMT model widens over time, indicating that integrating multimodal consistency objectives not only boosts immediate performance but also facilitates more effective model learning. This alignment across textual and visual modalities leads to superior translation quality compared to traditional approaches.

6.4. Ablation Study for Visual Annotation

To better understand the role and importance of visual annotation in our proposed multimodal consistency approaches, we conducted a series of ablation experiments. Specifically, we evaluated the performance of the baseline EMMT model and our EMMT+BiVC, EMMT+TVC, and EMMT+TVC+BiVC models using random image annotations instead of ground-truth image annotations. The results presented in Table 4 provide valuable insights into the significance of visual information in our models.
Firstly, the results clearly demonstrate that all models, including the baseline EMMT and our proposed variants, perform significantly better when using ground-truth image annotations compared to random image annotations. On the En-De Test2016 test set, the BLEU score of the EMMT model drops from 38.61 with ground-truth images to 35.62 with random images. A similar pattern is observed on the En-De Test2017 test set, where the BLEU score decreases from 28.00 to 27.47 when using random images. This indicates that visual annotation is a crucial component contributing to the overall performance of multimodal machine translation models, with the content of the visual annotation playing a pivotal role in alignment with textual context.
Furthermore, we observe that with random visual annotations, our proposed models (EMMT+BiVC, EMMT+TVC, and EMMT+TVC+BiVC) actually perform worse than the baseline EMMT model in terms of both BLEU and METEOR scores. For example, on the En-De Test2016 test set, the EMMT+BiVC model scores 34.59 BLEU with random images, lower than the baseline EMMT’s 35.62 BLEU. A similar trend is seen on the En-De Test2017 test set, where the EMMT+BiVC model’s BLEU of 27.14 is inferior to the baseline EMMT’s 27.47 BLEU. This suggests that when visual annotations are not aligned with textual content, our multimodal consistency approaches, which heavily leverage visual information, extract more noise than useful signal. In contrast, the baseline EMMT model, relying primarily on textual information and using visual input as supplementary information, is less affected by modal mismatches.
However, when ground-truth image annotations are used, the situation reverses. Our proposed models, EMMT+BiVC, EMMT+TVC, and EMMT+TVC+BiVC, consistently outperform the baseline EMMT model across both the En-De Test2016 and Test2017 test sets. Specifically, the EMMT+TVC+BiVC model achieves the highest BLEU scores of 41.27 on Test2016 and 29.70 on Test2017, significantly outperforming the baseline EMMT’s 38.61 and 28.00 BLEU, respectively. This demonstrates that when visual annotations are accurate and well-aligned with textual content, our multimodal consistency approaches effectively leverage this information to enhance translation quality. The target–visual consistency and bilingual–visual consistency components enable our models to extract richer contextual cues from visual data and maintain tighter semantic coherence between textual and visual modalities, resulting in superior translation performance.

6.5. Impact of Different Loss Functions on Performance

We employed the smooth L1 loss as the primary loss function for our proposed multimodal consistency approaches. However, given the importance of selecting an appropriate loss function in deep learning models, we conducted further investigations to explore the impact of using different loss functions within our framework. Table 5 presents the results of our experiments evaluating the performance of our models when trained with various loss functions, including the L2 loss, KLDiv loss, BCEWithLogits loss, HingeEmbedding loss, and the L1 loss used in our main approach.
We observe that the L1 loss consistently yields the most promising performance across the majority of evaluated scenarios. For instance, on the En-De Test2016 test set, the model trained with the L1 loss achieves a BLEU score of 41.27, outperforming the other loss functions by a significant margin. Specifically, the L2 loss-based model scores 39.21 BLEU, the KLDiv loss-based model scores 34.75 BLEU, the BCEWithLogits loss-based model scores 36.31 BLEU, and the HingeEmbedding loss-based model scores 35.78 BLEU. This trend persists on the En-De Test2017 test set, where the L1 loss-based model achieves a BLEU score of 50.46, surpassing the scores of the L2 loss-based model (50.21 BLEU), the KLDiv loss-based model (50.08 BLEU), the BCEWithLogits loss-based model (50.57 BLEU), and the HingeEmbedding loss-based model (50.06 BLEU). The superior performance of the L1 loss-based model is also evident in other language pair tasks. On the En-Fr Test2017 test set, the L1 loss-based model achieves a BLEU score of 32.87, outperforming the L2 loss-based model (31.06 BLEU), the KLDiv loss-based model (31.78 BLEU), the BCEWithLogits loss-based model (31.24 BLEU), and the HingeEmbedding loss-based model (26.14 BLEU).
Similarly, on the En-Cs Test2016 test set, the L1 loss-based model scores 29.70 BLEU, while the L2 loss-based model scores 30.02 BLEU, the KLDiv loss-based model scores 27.13 BLEU, the BCEWithLogits loss-based model scores 27.01 BLEU, and the HingeEmbedding loss-based model scores 27.45 BLEU. These results clearly demonstrate the effectiveness of the L1 loss function in the context of our proposed multimodal consistency approach for machine translation. Known for its robustness and ability to handle outliers, the L1 loss proves particularly suitable for aligning textual and visual representations, as well as capturing target–visual and bilingual–visual consistencies. In contrast, other loss functions such as the L2 loss, KLDiv loss, BCEWithLogits loss, and HingeEmbedding loss do not perform as well in our experiments. The L2 loss, being more sensitive to outliers, may struggle to provide the necessary guidance for the model to learn desired multimodal representations. The KLDiv loss, designed for probabilistic distributions, may not be the optimal choice for tasks involving the alignment of structured textual and visual features. Additionally, the BCEWithLogits loss and HingeEmbedding loss, typically used for classification tasks, appear less suitable for the multimodal translation problem compared to the L1 loss.
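To make the comparison concrete, a small helper such as the one below can switch the consistency criterion between runs; only the distance-style losses are shown with valid feature inputs, since the KLDiv, BCEWithLogits, and HingeEmbedding criteria in Table 5 expect log-probabilities, logits with binary targets, and ±1 labels, respectively. The helper name is our own.

```python
import torch
import torch.nn as nn

def consistency_criterion(name: str) -> nn.Module:
    """Return the regression criterion used for the consistency terms.
    Only distance-style losses are listed; the probabilistic/classification
    losses compared in Table 5 require differently shaped inputs."""
    losses = {
        "l1": nn.L1Loss(),            # mean absolute error, used in our main approach
        "smooth_l1": nn.SmoothL1Loss(),
        "l2": nn.MSELoss(),           # more sensitive to outliers
    }
    return losses[name]

criterion = consistency_criterion("l1")
loss = criterion(torch.randn(4, 256), torch.randn(4, 256))
print(loss.item())
```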

6.6. Universality of Multimodal Consistency

To assess the universality and broader applicability of our proposed multimodal consistency approaches, we conducted experiments across multiple language pairs beyond the English–German (En-De) task, which was the focus of our main analysis. Specifically, we evaluated the performance of the baseline EMMT model and our EMMT+TVC, EMMT+BiVC, and EMMT+TVC+BiVC models on the English–French (En-Fr), English–Czech (En-Cs), and English–Japanese (En-Ja) multimodal translation tasks. The results presented in Table 6 provide valuable insights into the generalizability and effectiveness of our multimodal consistency techniques. Firstly, the results demonstrate that our proposed approaches consistently outperform the baseline EMMT model across all the evaluated language pairs. On the En-Fr task, the EMMT+TVC+BiVC model achieves a BLEU score of 32.87, which is a 1.50 point improvement over the baseline EMMT’s 31.37 BLEU. Similarly, on the En-Cs task, the EMMT+TVC+BiVC model scores 29.70 BLEU compared to 28.00 BLEU for the baseline EMMT. Even on the more distant language pair of En-Ja, the EMMT+TVC+BiVC model outperforms the baseline by 1.08 BLEU points, scoring 45.73 BLEU versus the EMMT’s 44.65 BLEU. These findings suggest that the core principles underlying our multimodal consistency approaches, namely, TVC and BiVC, are universal and effectively applicable to a diverse range of multimodal translation tasks beyond the initial En-De setup.
Interestingly, the magnitude of improvement achieved by our proposed models varies across the different language pairs. The EMMT+TVC+BiVC model shows the most significant BLEU score improvements of 2.04 and 1.50 points on the more linguistically similar En-Fr and En-Cs tasks, respectively. In contrast, the improvement on the more distant En-Ja task is relatively smaller at 1.08 BLEU points. This pattern indicates that multimodal consistency approaches may be particularly beneficial when the target language is more closely related to the source language, leveraging visual annotations to strengthen semantic coherence across bilingual parallel sentences. Even for the more distant language pair of En-Ja, our multimodal consistency techniques still outperform the baseline, highlighting their broad applicability. In addition to BLEU score improvements, we also analyze the bivloss scores for different models and language pairs. Consistent with findings from the En-De experiments, the EMMT+TVC+BiVC model consistently achieves the lowest bivloss scores across the En-Fr, En-Cs, and En-Ja tasks, indicating its effectiveness in aligning learned bilingual sentence representations with visual annotations. For example, on the En-Fr task, the bivloss score for the EMMT+TVC+BiVC model is 6.71, significantly lower than the 14.05 and 14.82 scores for the EMMT+BiVC and baseline EMMT models, respectively. Similar trends are observed in the En-Cs and En-Ja tasks, further supporting the ability of our multimodal consistency approaches to foster tighter semantic integration between textual and visual modalities.

6.7. Impact of Different Dataset Sizes

To evaluate the impact of dataset size on our proposed model, we conducted experiments using different portions of the full dataset, creating subsets of 25%, 50%, 75%, and 100% of the original data. Each subset was used to train the model separately, with the training conditions kept consistent across dataset sizes, and the resulting models were evaluated on the Test2017 test set using the BLEU, METEOR, and TER metrics. The results are summarized in Table 7 below:
When using only 25% of the dataset, our model’s performance was significantly lower across all metrics. This indicates that a smaller dataset limits the model’s ability to learn effectively from the available data, resulting in poorer translation quality. Training with 50% of the dataset showed a noticeable improvement in performance, though it still lagged behind the results achieved with the full dataset. Using 75% of the dataset further improved the model’s performance, bringing it closer to the results obtained with the full dataset. The best performance was achieved with the full dataset, confirming the importance of a larger dataset for training robust and accurate translation models.

6.8. Discussions

Our approach dynamically integrates textual and visual information while maintaining bilingual consistency, making it adaptable across diverse datasets. The flexibility of our self-attention mechanism and visual integration module enables an effective processing of various textual and visual inputs. Future research will involve experiments with additional datasets and languages to validate the robustness and versatility of our approach. Despite the significant improvements in translation quality, our model has some limitations. Visual context ambiguities, low-quality images, and increased computational complexity can adversely affect performance. The model’s generalization to other domains or languages remains to be fully explored. Future work includes extending our approach to incorporate other modalities, such as audio, optimizing the model for real-time translation scenarios, enhancing robustness to ambiguous or irrelevant visual contexts, and adapting the model to different domains and languages. By addressing these limitations and exploring new directions, we can further advance multimodal neural machine translation, making it more versatile and applicable to a wider range of real-world scenarios.

7. Conclusions

In this study, we present a novel multimodal consistency approach that advances the state-of-the-art in MNMT. Our approach synergistically combines two complementary facets: TVC and BiVC. The integration of target–visual consistency enables our MNMT model to extract valuable target-side contextual cues from the visual annotation. By effectively leveraging the future context information, our model can generate more accurate and coherent target translations, overcoming the inherent limitations of autoregressive decoders. Simultaneously, the bilingual–visual consistency acts as a guiding force, steering our MNMT model to maintain a tight semantic alignment between the learned bilingual sentence representations and the corresponding visual annotation. This ensures that the textual and visual modalities are tightly coupled, further enhancing the translation quality. The synergistic combination of these two multimodal consistency components propels our approach beyond the capabilities of prior MNMT techniques.
Extensive empirical evaluations on diverse multimodal translation tasks, including English–German, English–French, English–Czech, and English–Japanese, demonstrate the effectiveness and universality of our approach. Notably, our models achieve new state-of-the-art benchmarks across these language pairs, underscoring the aptitude of our multimodal consistency framework in harnessing the complementary strengths of textual and visual information. This significant performance improvement highlights the pivotal role that multimodal consistency plays in advancing the field of MNMT.
Future work involves further exploration of multimodal consistency within the MNMT framework. We aim to uncover additional dimensions of multimodal coherence and investigate their impact on translation quality. Furthermore, we intend to extend the applicability of our proposed approach to other multimodal language tasks, unlocking its potential across a broader spectrum of real-world applications. By seamlessly integrating textual and visual modalities through the lens of multimodal consistency, our work paves the way for a new paradigm in MNMT. This visually aware, data-driven framework represents a significant advancement, positioning it as a valuable tool for intelligent language understanding and generation in complex, multimodal environments.

Author Contributions

Conceptualization, D.L.; methodology, S.Z.; writing—review & editing, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

The present research was supported by the National Natural Science Foundation of China (Grant No.62276188) and the Natural Science Foundation of Henan Province (Grant No.242300420677).

Data Availability Statement

Data are contained within the article.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Specia, L.; Frank, S.; Sima’an, K.; Elliott, D. A Shared Task on Multimodal Machine Translation and Crosslingual Image Description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, 11–12 August 2016; pp. 543–553. [Google Scholar] [CrossRef]
  2. Calixto, I.; Liu, Q. Incorporating Global Visual Features into Attention-based Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 992–1003. [Google Scholar] [CrossRef]
  3. Hewitt, J.; Ippolito, D.; Callahan, B.; Kriz, R.; Wijaya, D.T.; Callison-Burch, C. Learning Translations via Images with a Massively Multilingual Image Dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2566–2576. [Google Scholar] [CrossRef]
  4. Ive, J.; Madhyastha, P.; Specia, L. Distilling Translations with Visual Awareness. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6525–6538. [Google Scholar] [CrossRef]
  5. Zhang, Z.; Chen, K.; Wang, R.; Utiyama, M.; Sumita, E.; Li, Z.; Zhao, H. Neural Machine Translation with Universal Visual Representation. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  6. Yin, Y.; Meng, F.; Su, J.; Zhou, C.; Yang, Z.; Zhou, J.; Luo, J. A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3025–3035. [Google Scholar] [CrossRef]
  7. Wang, X.; Thomason, J.; Hu, R.; Chen, X.; Anderson, P.; Wu, Q.; Celikyilmaz, A.; Baldridge, J.; Wang, W.Y. (Eds.) Advances in Language and Vision Research, Proceedings of the First Workshop on Advances in Language and Vision Research, Online, 9 July 2020; The Association for Computational Linguistics: Stroudsburg, PA, USA, 2020. [Google Scholar]
  8. Berahmand, K.; Daneshfar, F.; Salehi, E.S.; Li, Y.; Xu, Y. Autoencoders and their applications in machine learning: A survey. Artif. Intell. Rev. 2024, 57, 28. [Google Scholar] [CrossRef]
  9. Zhu, S.; Li, S.; Xiong, D. VisTFC: Vision-guided target-side future context learning for neural machine translation. Expert Syst. Appl. 2024, 249, 123411. [Google Scholar] [CrossRef]
  10. Zhu, S.; Li, S.; Lei, Y.; Xiong, D. PEIT: Bridging the Modality Gap with Pre-trained Models for End-to-End Image Translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 13433–13447. [Google Scholar]
  11. Calixto, I.; Liu, Q.; Campbell, N. Doubly-Attentive Decoder for Multi-modal Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1913–1924. [Google Scholar] [CrossRef]
  12. Yao, S.; Wan, X. Multimodal Transformer for Multimodal Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4346–4350. [Google Scholar] [CrossRef]
  13. Elliott, D.; Frank, S.; Sima’an, K.; Specia, L. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language, Berlin, Germany, 27 June–1 July 2016; pp. 70–74. [Google Scholar]
  14. Nakayama, H.; Tamura, A.; Ninomiya, T. A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4204–4210. [Google Scholar]
  15. Elliott, D.; Kádár, Á. Imagination Improves Multimodal Translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, 27 November–1 December 2017; pp. 130–141. [Google Scholar]
16. Nishihara, T.; Tamura, A.; Ninomiya, T.; Omote, Y.; Nakayama, H. Supervised Visual Attention for Multimodal Neural Machine Translation. In Proceedings of the 28th International Conference on Computational Linguistics, Online, 8–13 December 2020; pp. 4304–4314. [Google Scholar]
17. Imankulova, A.; Kaneko, M.; Hirasawa, T.; Komachi, M. Towards Multimodal Simultaneous Neural Machine Translation. In Proceedings of the Fifth Conference on Machine Translation (WMT), Online, 19–20 November 2020. [Google Scholar]
  18. Calixto, I.; Rios, M.; Aziz, W. Latent Variable Model for Multi-modal Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6392–6405. [Google Scholar]
  19. Wang, X.; Wu, J.; Chen, J.; Li, L.; Wang, Y.F.; Wang, W.Y. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4580–4590. [Google Scholar]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  21. Alinejad, A.; Siahbani, M.; Sarkar, A. Prediction Improves Simultaneous Neural Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3022–3027. [Google Scholar]
  22. Arivazhagan, N.; Cherry, C.; Macherey, W.; Foster, G. Re-translation versus Streaming for Simultaneous Translation. In Proceedings of the 17th International Conference on Spoken Language Translation, Online, 9–10 July 2020; pp. 220–227. [Google Scholar]
  23. Huang, P.Y.; Hu, J.; Chang, X.; Hauptmann, A. Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8226–8237. [Google Scholar]
  24. Yang, P.; Chen, B.; Zhang, P.; Sun, X. Visual agreement regularized training for multi-modal machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9418–9425. [Google Scholar]
25. Zhang, X.; Su, J.; Qin, Y.; Liu, Y.; Ji, R.; Wang, H. Asynchronous Bidirectional Decoding for Neural Machine Translation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, LA, USA, 2–7 February 2018; pp. 5698–5705. [Google Scholar]
  26. Zheng, Z.; Zhou, H.; Huang, S.; Mou, L.; Dai, X.; Chen, J.; Tu, Z. Modeling Past and Future for Neural Machine Translation. Trans. Assoc. Comput. Linguist. 2018, 6, 145–157. [Google Scholar] [CrossRef]
  27. Zhou, L.; Zhang, J.; Zong, C. Synchronous Bidirectional Neural Machine Translation. Trans. Assoc. Comput. Linguist. 2019, 7, 91–105. [Google Scholar] [CrossRef]
  28. Zheng, Z.; Huang, S.; Tu, Z.; Dai, X.Y.; Chen, J. Dynamic Past and Future for Neural Machine Translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 931–941. [Google Scholar]
29. Duan, C.; Chen, K.; Wang, R.; Utiyama, M.; Sumita, E.; Zhu, C.; Zhao, T. Modeling Future Cost for Neural Machine Translation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 770–781. [Google Scholar] [CrossRef]
  30. Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2641–2649. [Google Scholar]
  31. Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; et al. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, 25–27 June 2007; pp. 177–180. [Google Scholar]
32. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  33. Denkowski, M.; Lavie, A. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; pp. 376–380. [Google Scholar]
Figure 1. An overview of our method.
Figure 2. Comparison of the baseline EMMT model and the EMMT+TVC+BiVC model. (a) Learning curves of the loss on the En-De development set; (b) BLEU scores on the En-De Test2016 and Test2017 test sets. Results are averaged over five training runs.
Table 1. Hyperparameter settings.
Hyperparameter | Value | Description
Embedding Size | 512 | Size of the word embeddings.
Hidden Size | 512 | Size of the hidden layers in the network.
No. of Layers | 6 | Number of layers in the encoder and decoder.
Attention Heads | 8 | Number of attention heads.
Dropout Rate | 0.1 | Dropout rate used to prevent overfitting.
Learning Rate | 0.0001 | Initial learning rate for the optimizer.
Batch Size | 64 | Number of samples per batch.
Optimizer | Adam | Optimizer used for training the model.
Weight Decay | 0.01 | Weight decay factor.
Gradient Clipping | 1.0 | Maximum norm for gradient clipping.
Epochs | 50 | Number of training epochs.
BiVC Weight | 0.5 | Weight of the bilingual–visual consistency loss.
TVC Weight | 0.5 | Weight of the target–visual consistency loss.
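For concreteness, the settings in Table 1 can be collected into a single training configuration. The sketch below is a minimal illustration assuming a PyTorch-style setup; the class and field names (e.g., MNMTConfig, bivc_weight, tvc_weight) are ours and do not correspond to any released code.

```python
from dataclasses import dataclass

import torch
from torch.optim import Adam


@dataclass
class MNMTConfig:
    """Illustrative container for the hyperparameters in Table 1 (not the authors' code)."""
    embed_size: int = 512        # word embedding size
    hidden_size: int = 512       # hidden layer size
    num_layers: int = 6          # encoder/decoder layers
    num_heads: int = 8           # attention heads
    dropout: float = 0.1         # dropout rate
    learning_rate: float = 1e-4  # initial learning rate
    batch_size: int = 64         # samples per batch
    weight_decay: float = 0.01   # weight decay factor
    grad_clip_norm: float = 1.0  # maximum gradient norm
    epochs: int = 50             # training epochs
    bivc_weight: float = 0.5     # weight of the bilingual-visual consistency loss
    tvc_weight: float = 0.5      # weight of the target-visual consistency loss


def build_optimizer(model: torch.nn.Module, cfg: MNMTConfig) -> Adam:
    """Adam with the weight decay from Table 1; gradient clipping is applied separately per step."""
    return Adam(model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay)
```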
Table 2. BLEU and METEOR scores for the proposed methods compared to benchmark methods on the Multi30k En-De Test2016 and Test2017 test sets. Results are averaged over five training runs.
Methods | Test2016 BLEU | Test2016 METEOR | Test2017 BLEU | Test2017 METEOR
Only-text NMT | 35.61 | 53.6 | 23.8 | 45.3
Existing MNMT systems
DMMT [4] | 36.9 | 54.5 | – | –
IMG [2] | 37.3 | 55.1 | – | –
SMMT [18] | 37.5 | 55.8 | 26.1 | 49.9
EMMT [12] | 38.5 | 55.7 | – | –
VMMT [24] | – | – | 29.3 | 51.2
Our MNMT systems (±std)
EMMT | 38.61 ± 0.5 | 56.2 ± 0.5 | 28.00 ± 0.6 | 51.1 ± 0.5
 +TVC | 40.71 ± 0.5 | 58.6 ± 0.3 | 29.11 ± 0.5 | 51.9 ± 0.4
 +BiVC | 39.11 ± 0.6 | 57.8 ± 0.4 | 28.53 ± 0.7 | 51.6 ± 0.5
 +TVC+BiVC | 41.27 ± 0.5 | 59.2 ± 0.4 | 29.70 ± 0.6 | 52.2 ± 0.5
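The BLEU scores in Table 2 (and the following tables) follow Papineni et al. [32], with METEOR from Denkowski and Lavie [33]. As a minimal illustration of corpus-level BLEU scoring, the sketch below assumes the sacrebleu toolkit, since the paper does not state which implementation was used; the example sentences are invented.

```python
import sacrebleu

# Invented example: detokenized system outputs and one aligned reference stream.
hypotheses = ["ein Mann fährt ein Fahrrad .", "zwei Hunde spielen im Park ."]
references = [["ein Mann fährt Fahrrad .", "zwei Hunde spielen in einem Park ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```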
Table 3. Semantic agreement metrics between bilingual sentence representations and visual representations on the Multi30k En-De Test2016 and Test2017 test sets. Results are averaged over 5 training runs.
Methods | Test2016 bivloss | Test2016 BLEU | Test2016 METEOR | Test2016 TER | Test2017 bivloss | Test2017 BLEU | Test2017 METEOR | Test2017 TER
EMMT | 15.01 | 38.61 | 56.2 | 38.7 | 16.77 | 28.00 | 51.1 | 47.2
 +TVC | 8.13 | 40.71 | 58.6 | 36.8 | 9.02 | 29.11 | 51.9 | 45.0
 +BiVC | 13.89 | 39.11 | 57.8 | 37.5 | 11.08 | 28.53 | 51.6 | 46.7
 +BiVC+TVC | 6.88 | 41.27 | 59.2 | 36.1 | 5.93 | 29.70 | 52.2 | 44.2
Table 4. Results of an ablation study comparing the baseline EMMT model and our enhanced models using random image annotations on the En-De Test2016 and Test2017 test sets. Results are averaged over 5 training runs.
Methods | Test2016 Truth BLEU | Test2016 Truth METEOR | Test2016 Random BLEU | Test2016 Random METEOR | Test2017 Truth BLEU | Test2017 Truth METEOR | Test2017 Random BLEU | Test2017 Random METEOR
EMMT | 38.61 | 56.2 | 35.62 | 53.24 | 28.00 | 51.1 | 27.47 | 50.9
 +BiVC | 39.11 | 57.8 | 34.59 | 52.11 | 28.53 | 51.6 | 27.14 | 50.41
 +TVC | 40.71 | 58.6 | 35.71 | 52.59 | 29.11 | 51.9 | 26.39 | 49.74
 +BiVC+TVC | 41.27 | 59.2 | 34.77 | 51.78 | 29.70 | 52.2 | 26.78 | 50.26
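In Table 4, the Random Image condition pairs each sentence with an image that does not depict it. One simple way to realize such a condition, assuming precomputed visual features stored in a tensor, is to permute the features across the batch; the paper does not specify its exact sampling scheme, so the sketch below is only an assumption.

```python
import torch


def random_image_condition(image_feats: torch.Tensor) -> torch.Tensor:
    """Shuffle precomputed visual features across the batch (illustrative only).

    image_feats: tensor of shape (batch, feat_dim).
    Returning the features in a random order breaks the sentence-image
    correspondence, approximating the 'Random Image' ablation.
    """
    perm = torch.randperm(image_feats.size(0))
    return image_feats[perm]


# Example with random stand-in features for a batch of four sentences.
feats = torch.randn(4, 2048)
shuffled = random_image_condition(feats)
```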
Table 5. Impact of different loss functions on BLEU scores across test sets.
Loss Function | En-De Test2016 | En-De Test2017 | En-Fr Test2017 | En-Cs Test2016
L1 loss | 41.27 | 29.70 | 50.46 | 32.87
L2 loss | 39.21 | 30.02 | 50.21 | 31.06
KLDiv loss | 34.75 | 27.13 | 50.08 | 31.78
BCEWithLogits loss | 36.31 | 27.01 | 50.57 | 31.24
HingeEmbedding loss | 35.78 | 27.45 | 50.06 | 26.14
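Table 5 compares candidate objectives for the consistency term. The sketch below shows how such alternatives could be swapped in over pooled bilingual and visual feature vectors, assuming PyTorch; the pooling, pairing, and hinge-label choices here are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def consistency_loss(bilingual_repr: torch.Tensor,
                     visual_repr: torch.Tensor,
                     kind: str = "l1") -> torch.Tensor:
    """Distance between pooled bilingual and visual representations of shape (batch, dim).

    The choice of `kind` mirrors the rows of Table 5; the exact pairing of
    inputs is an assumption made for illustration.
    """
    if kind == "l1":
        return F.l1_loss(bilingual_repr, visual_repr)
    if kind == "l2":
        return F.mse_loss(bilingual_repr, visual_repr)
    if kind == "kldiv":
        # KL divergence expects log-probabilities vs. probabilities.
        return F.kl_div(F.log_softmax(bilingual_repr, dim=-1),
                        F.softmax(visual_repr, dim=-1),
                        reduction="batchmean")
    if kind == "bce":
        # BCE-with-logits treats sigmoid-squashed visual features as soft targets.
        return F.binary_cross_entropy_with_logits(bilingual_repr,
                                                  torch.sigmoid(visual_repr))
    if kind == "hinge":
        # Hinge embedding loss over per-pair distances, all pairs labelled similar (+1).
        dist = torch.norm(bilingual_repr - visual_repr, dim=-1)
        return nn.HingeEmbeddingLoss(margin=1.0)(dist, torch.ones_like(dist))
    raise ValueError(f"unknown loss kind: {kind}")


# Example with random stand-ins for pooled text and image features.
text_feats = torch.randn(4, 512)
image_feats = torch.randn(4, 512)
print(consistency_loss(text_feats, image_feats, kind="l1"))
```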
Table 6. Performance of multimodal consistency models across different multimodal language pairs. Results are averaged over 5 training runs.
Methods | En-Fr METEOR | En-Fr BLEU | En-Fr bivloss | En-Cs METEOR | En-Cs BLEU | En-Cs bivloss | En-Ja METEOR | En-Ja BLEU | En-Ja bivloss
EMMT | 67.59 | 48.42 | 13.62 | 52.56 | 31.37 | 16.01 | 60.36 | 44.65 | 14.82
 +TVC | 68.07 | 49.69 | 9.60 | 53.86 | 32.13 | 8.36 | 62.07 | 44.97 | 8.08
 +BiVC | 68.89 | 49.16 | 13.05 | 53.14 | 31.95 | 15.92 | 61.21 | 44.13 | 14.05
 +TVC+BiVC | 70.17 | 50.46 | 5.14 | 54.39 | 32.87 | 6.97 | 62.73 | 45.73 | 6.71
Table 7. BLEU, METEOR, and TER scores for different training dataset sizes.
Dataset Size | BLEU | METEOR | TER
25% | 22.9 | 25.2 | 0.53
50% | 31.5 | 42.3 | 0.49
75% | 33.2 | 48.6 | 0.47
100% | 34.5 | 52.1 | 0.45
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
