Article

Multimodal Machine Translation Based on Enhanced Knowledge Distillation and Feature Fusion

Erlin Tian, Zengchao Zhu, Fangmei Liu, Zuhe Li, Ran Gu and Shuai Zhao
1 School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China
2 School of Software, Zhengzhou University of Light Industry, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3084; https://doi.org/10.3390/electronics13153084
Submission received: 26 June 2024 / Revised: 23 July 2024 / Accepted: 2 August 2024 / Published: 4 August 2024

Abstract

Existing research on multimodal machine translation (MMT) has typically enhanced bilingual translation by introducing additional aligned visual information. However, the requirement for images in multimodal datasets imposes a significant constraint on the development of MMT, because it demands alignment between image, source text, and target text. This limitation is compounded by the fact that aligned images are usually not available during the inference phase of a conventional neural machine translation (NMT) setup. We therefore propose an innovative MMT framework, the DSKP-MMT model, which supports machine translation through enhanced knowledge distillation and feature refinement in the absence of images. Our model first generates multimodal features from the source text. Purified features are then obtained through the multimodal feature generator and the knowledge distillation module, and the features produced by image feature enhancement are subsequently refined further. Finally, the fused image–text features are used as input to the transformer-based machine translation inference task. On the Multi30K test sets, the DSKP-MMT model achieves a BLEU score of 40.42 and a METEOR score of 58.15, demonstrating its ability to improve translation quality and facilitate cross-language communication.

1. Introduction

Machine translation is a key concept in the field of natural language processing (NLP), which aims to automatically convert text from one language to another. As one of the most important and complex branches of NLP, multimodal machine translation (MMT) plays a key role in international communication, culture, sports, entertainment, and media design. With the continuous advancement in deep learning techniques and neural networks, we are striving to address the challenges related to achieving efficient and accurate machine translation. These advances are crucial to enhance cross-language understanding and promote seamless communication across global contexts.
Ahmed et al. [1] introduced a multimodal machine translation method that improves text translation by integrating supplementary visual data. This concept has been reflected in early feature engineering and in the CNN-based network architectures that have become popular in recent years. For instance, multimodal neural machine translation [2], visually aware translation [3], and video-guided machine translation based on two-layer back-translation [4] exemplify this approach. The transformer proposed by Vaswani et al. [5] also performs well in the field of multimodal information fusion, and visual feature fusion remains a critical research direction in machine translation. Although introducing text features into the transformer model to form dependencies can improve translation quality, most existing methods still encounter several issues: (1) In practical translation, language is often accompanied by various contexts, such as the surrounding environment or the speaker's gestures, which play a significant role in enhancing translation accuracy. Relying solely on linguistic semantic features without fully utilizing visual features leaves models struggling to integrate different types of information, so the translation quality remains unsatisfactory. Strengthening the connection between features of different modalities is therefore crucial for enhancing the model's robustness. (2) For low-resource languages, the lack of sufficient multimodal data limits the performance of MMT for these language pairs. Insufficient new information during training leads to poor generalization ability, which is detrimental to practical applications such as the insertion and analysis of important visual information [6].
In the field of NLP, research on large-scale pre-trained models offers reliable methods that have also advanced computer vision. Scholars such as Calixto [7], Ive [8], and Yin [9] have employed the pre-train-then-translate paradigm to enhance the translation capabilities of MMT systems. However, the number and quality of supplementary images present a major obstacle to advancing MMT, since obtaining these valuable resources can be both limited and costly, as exemplified by the Multi30K [10] dataset. Elliott, Kádár, and colleagues [11] introduced a novel multi-task learning approach for MMT that derives visual features through an auxiliary vision-based task, addressing, to some extent, the requirement that images be present in MMT. Zhang [12] proposed an image retrieval method that can find topic-related images in a small-scale dataset, thereby strengthening the connection between different modalities. Additionally, Long et al. [13] utilized generative adversarial networks to obtain virtual visual features. Overall, these image-free frameworks share a common goal of learning a representation of the generated visual features; by combining or constraining these features with textual features, they eliminate the need for actual image data during inference. While these methods perform better on specific benchmark datasets, they still lack sufficient fusion between the different modalities.
To overcome the shortcomings of previous studies, we borrowed ideas from parameter-efficient transfer learning in NLP [14,15,16] and introduced a more cost-effective and efficient method [17,18,19] into computer vision for cross-modal knowledge transfer from images to other modalities. This method transforms image features into text features through an additional module, retaining multimodal knowledge while optimizing speed. Previous studies have identified several possible causes of poor MMT performance: Caglayan et al. [20] speculated that models might learn poor-quality features; Arslan et al. [21] identified insufficient visual distribution coverage as a key problem; Helcl et al. [22] noted that the multimodal feature fusion stage might not be appropriate; and Calixto et al. [23] raised concerns about the lack of training stability. These issues have been fully considered and addressed in our design.
We propose a new DSKP-MMT framework that combines image features with a text fusion module to improve the performance of multimodal machine translation. As depicted in Figure 1, this approach aids in synthesizing visual features to closely correlate with image features. Unlike previous approaches [24] that primarily focused on visual feature generation or relied on late fusion stages, our method introduces a teacher network at the initial input stage, leveraging pretrained weights, while concurrently training the student model from scratch. By employing inter-modal and intra-modal distillation techniques, we aim to produce superior multimodal characteristics, fusing image and text data into cohesive composite features. These features are then utilized to refine and fuse original image features, thereby reducing the gap in representation across modalities and facilitating new multimodal feature creation through knowledge transfer. This approach significantly enhances the efficiency of utilizing visual and textual information comprehensively. Experimental results underscore the effectiveness of DSKP-MMT, showcasing notable improvements in training stability and the quality of final translations. By bridging the gap between modalities and effectively utilizing distilled knowledge, our framework marks a notable breakthrough in multimodal machine translation, offering improved capabilities for managing diverse translation tasks across multiple domains and languages.
We carried out extensive tests using the Multi30K dataset, which is the most frequently utilized dataset in MMT research. The BLEU and METEOR scores of our method were validated in a conventional resource setting and compared with other technical methods through various ablation experiments. The primary contributions of this research are:
  • Visual feature research: We investigated the impact of visual elements within the model and validated the approach of incorporating these elements into MMT. Our specific findings indicate that adding visual features enhances the performance of MMT in text prediction.
  • Multimodal feature fusion: We proposed a method to generate new multimodal features by combining image features with text features using knowledge distillation technology. This method provides new insights into MMT with cross-modal feature fusion.
  • DSKP-MMT model: We studied the application of the DSKP-MMT model combined with a pre-trained model in the MMT system. The experimental results clearly demonstrate the superior performance of this method over the existing image-text fusion MMT framework, thereby affirming the effectiveness of our model in leveraging image information to enhance machine translation.
The subsequent sections of this study are structured as follows: In Section 2, we present a thorough review of relevant literature on MMT. In Section 3, we describe the overall architecture of the model and its various components. Section 4 showcases the model’s effectiveness and evaluates each proposed module extensively through rigorous experiments. Finally, Section 5 provides a detailed summary of the study.

2. Related Works

In this section, we start with a brief overview of current research in multimodal machine translation (MMT) within academia (Section 2.1). Next, we introduce our approach to integrating diverse modalities through knowledge distillation technology (Section 2.2). Finally, we discuss the theoretical robustness and practical feasibility of our feature refinement techniques (Section 2.3). This structured approach aims to highlight our contributions and insights into advancing MMT methodologies.

2.1. MMT Backbone

Positioned at the crossroads of multimodal fusion and neural machine translation (NMT) technologies, MMT has garnered significant interest from researchers in recent years. Current studies primarily focus on integrating visual information effectively into the MMT system. For example, Calixto et al. [23] proposed a dual attention decoder to better leverage visual information, concentrating individually on the source text and visual elements. Ive et al. [8] introduced a method to improve translation quality by using visual features. Specifically, they used image information to improve fuzzy information through the second-level decoder, and effectively used text context and visual context through joint training to generate more accurate and fluent translation results, significantly improving the quality of translation. Yao and Wang [25] developed a multimodal transformer model that improves the integration of visual features derived from text using image-aware attention. Yin et al. [9] employed integrated multimodal images to identify various semantic relationships between linguistic units, enhancing the model’s comprehension and processing of multimodal data. High-quality multimodal data is often scarce and costly, which severely limits the development and practical application of MMT technology. In this work, our task is to obtain more accurate feature information for machine translation. This new MMT method effectively breaks through the limitation of data scarcity.

2.2. Knowledge Distillation

Knowledge distillation (KD) was initially introduced by Buciluǎ et al. [26] and Hinton et al. [27]. It conveys knowledge from a sophisticated, high-capacity teacher model (referred to as T) to a more compact, efficient student model (referred to as S). The objective is for the student model to closely match the teacher model's performance while using fewer resources and less memory, thus simplifying the model and enhancing its efficiency. This technique has been thoroughly researched and applied in numerous domains. Romero et al. [28] expanded the application scope of knowledge distillation by transferring knowledge through intermediate hidden layers. Yim et al. [29] introduced a technique to extract knowledge via inter-layer flow transfer, computed as the inner product between features extracted from two layers. Within multimodal fusion research, Gupta et al. [30] first introduced the transfer of supervision across images from different modalities. Yuan and Peng [31] proposed symmetric distillation networks tailored for text-to-image synthesis tasks. Building upon these developments, we introduce the enhanced knowledge distillation module. To tackle the limited data availability and expensive annotation costs in MMT, this module maximizes the use of knowledge distillation technology to produce multimodal features. Our module not only effectively integrates multimodal information but also enhances translation efficiency in limited-data environments.
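For readers unfamiliar with the mechanics, the sketch below shows classical response-based distillation in the spirit of Hinton et al. [27], blending a hard-label loss with a KL term on temperature-softened outputs. The temperature and weighting values are illustrative assumptions, and the distillation used in this paper (Section 3.2.2) operates on intermediate feature representations rather than on output logits.

```python
import torch.nn.functional as F

def soft_target_kd_loss(student_logits, teacher_logits, targets, temperature=2.0, alpha=0.5):
    """Classical soft-target distillation: combine the hard-label loss with a
    KL divergence between temperature-softened teacher and student outputs."""
    hard_loss = F.cross_entropy(student_logits, targets)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # rescale gradients for the softened targets
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```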

2.3. Image-Text Feature Fusion

Currently, a common approach is to feed fused visual and text features into models that generate translations using visual information. Fisch et al. [32] used an MMT dataset to create a simultaneous interpretation dataset, using user-provided question-answer pairs as supervision to learn visual information requirements. Calixto et al. [33] introduced a method for capturing the interaction between visual and textual features using latent variables, which are integrated into the target-language decoder to predict image features. Ranaldi and Pucci [34] studied how the knowledge learned by transformer-based neural models is structured and how this structured knowledge can be reused to carry out a series of intelligent operations, allowing modern artificial intelligence to inherit and apply what has been learned. They also proposed enhancing the cross-lingual capabilities of instruction-tuned LLMs by building semantically aligned translation trace demonstrations between languages [35]. In our study, the objective is to augment the distilled synthetic image–text features with original image features, thereby further enhancing the interdependence between image and text.
In summary, by integrating and aligning images and texts, we demonstrate how visual data enhances translation outcomes, driven by knowledge distillation techniques. This approach innovatively uses natural language text descriptions for initial visual perception prediction, aligning feature representations on diverse image–text pairs through contrast loss to generate new synthesis features. The refined model excels in fusing visual and textual information effectively. Our DSKP-MMT framework leverages enhanced knowledge distillation techniques and feature refinement to generate effective multimodal features. By doing so, it overcomes the challenges of data scarcity, quality, and annotation costs, ultimately achieving superior machine translation results.

3. Method

This section presents the framework of dual-enhanced knowledge distillation and feature synthesis refinement used to achieve a high-quality machine translation model. We detail its three key components: the image-free MMT backbone in Section 3.1, enhanced multimodal feature generation in Section 3.2, and enhanced feature fusion in Section 3.3. The multimodal student network and the visual teacher network play pivotal roles in improving multimodal feature generation, offering adaptability to datasets that conventionally require images. Finally, the learning objectives of our approach are elaborated in Section 3.4.

3.1. Image-Free MMT Backbone

Given the finally refined source text $X = (x_1, \ldots, x_n)$ (see Section 3.3 and Section 3.4), every token is mapped to a word-embedding vector through text embedding with positional encoding, where $d_w$ is the word-embedding dimension and $t = (Ex_1, \ldots, Ex_n)$ are the word-embedded text features [36]. Unlike recurrent networks, whose computation is constrained by time dependencies, the proposed method can encode the positions of the entire source sentence simultaneously. We then feed the image features, together with the enhanced multimodal features $m$, into the multimodal feature transformer encoder at the multimodal encoder layer. Ultimately, the multimodal features are combined with the image features to reconstruct the new multimodal features into the query vector, formulated as follows:
$\bar{x} = [\,t;\, mW_m\,] \in \mathbb{R}^{(I+P) \times d}$
Here, $I$ represents the original sentence length, while $P$ denotes the dimensionality of the multimodal features. From a nodes-and-graphs perspective, we interpret this modal fusion by treating each source token as a node; every region of the multimodal features can be considered a pseudo-tag that is integrated into the source token map for modality fusion. The key and value vectors retain the textual features $t$, and the multimodal encoder layer is computed as follows:
$c_k = \sum_{i=1}^{I} \tilde{\alpha}_{ki}\,(t_i W_V)$

$\tilde{\alpha}_{ki} = \mathrm{softmax}\!\left(\dfrac{(\bar{x}_k W_Q)(t_i W_K)^{T}}{\sqrt{d}}\right)$
In this study, we directly adopt the encoder and decoder of the transformer [37], built upon a dual-level interactive multimodal-mixup encoder that integrates multiple modalities. This encoder can extract valuable visual features to enhance text-level machine translation. Given the target sentence $Y = (y_1, \ldots, y_n)$, our framework generates the predicted probability of the target word $y_i$ as follows:
$p(y_i \mid y_{<i}, X, m) \propto \exp\!\left(W_h H_j^L + b_h\right)$
where $H_j^L$ denotes the top-layer output of the decoder at the $j$-th decoding step, and $W_h$ and $b_h$ are the trainable weight and bias of the multi-layer perceptron feeding the softmax layer inside the $\exp(\cdot)$ normalization.
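To make the fusion and attention computation above concrete, the following minimal PyTorch sketch implements a single-head version of the multimodal encoder layer: the queries are built from the concatenation of the text features with the projected multimodal features, while the keys and values retain the textual features. The single-head formulation and the layer sizes are simplifying assumptions, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn

class MultimodalEncoderLayer(nn.Module):
    """Single-head sketch of the multimodal fusion attention described above."""

    def __init__(self, d_model: int, d_mm: int):
        super().__init__()
        self.w_m = nn.Linear(d_mm, d_model, bias=False)    # W_m: project m into text space
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_K
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W_V

    def forward(self, t: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # t: (I, d_model) word-embedded text features; m: (P, d_mm) multimodal features
        x_bar = torch.cat([t, self.w_m(m)], dim=0)           # x_bar = [t; m W_m], shape (I+P, d_model)
        q = self.w_q(x_bar)                                  # queries from the fused representation
        k, v = self.w_k(t), self.w_v(t)                      # keys/values keep the textual features
        scores = q @ k.transpose(0, 1) / math.sqrt(k.size(-1))
        alpha = torch.softmax(scores, dim=-1)                # attention weights alpha_ki
        return alpha @ v                                     # context vectors c_k
```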

3.2. Multimodal Feature Generation

In this section, we begin by detailing the framework, notation definitions, and task objectives of our enhanced knowledge distillation approach.

3.2.1. Preliminaries

The improved knowledge distillation framework comprises a multimodal feature generator F, a visual teacher model T, and a multimodal student model S. The interaction and data exchange between the multimodal student model S and the visual teacher model T are reciprocal, facilitating the effective transmission and fusion of information from different modalities. The multimodal feature generator F is responsible for processing and fusing data from different modalities. The visual teacher model T extracts image features through a pre-trained CNN, and the student model S generates the final multimodal representation by learning and integrating these features. The framework is designed to be flexible and can easily be combined with any CNN structure, such as the VGG19 of Simonyan and Zisserman [38] or the AlexNet of Krizhevsky et al. [39], which further strengthens feature extraction and representation. The parameters of model S are denoted $\theta^S$. When the text features $t$ are supplied to S, the hidden representation produced at layer $l$ is denoted $\varphi_l^S(t; \theta_l^S)$. F generates the multimodal features $m$, while S produces inverse features $I_s$ after passing through the S-conv1 layer. Real images and these inverse features are jointly denoted $\{I_s, I_r\} \in \mathbb{R}^{m \times n \times 3}$. When given these features as input, the hidden representation generated by layer $l$ of T is denoted $\varphi_l^T(I)$.
The goal is to directly derive features that integrate multiple modes from the source text, reducing the dependency of machine translation systems on images during pre-training. Employing a teacher–student model architecture, our approach systematically extracts and enhances visual perception capabilities. This method focuses on synergizing the teacher and student models to effectively fuse and transmit multimodal information, even in the absence of images. By enhancing the translation efficacy and robustness of the MMT system, we ensure its adaptability across diverse linguistic contexts and scenarios, thereby advancing the field of multimodal machine translation.
We convert all word-embedding vectors into global context features by simple average pooling. The average pooling method proposed by Zhang et al. [40] can effectively integrate the embedding information of each word to generate a comprehensive global feature. These global text features have been shown to carry overall semantic information and fully reflect the overall meaning of the input text, formulated as follows:
$\bar{t} = \dfrac{1}{I} \sum_{i=1}^{I} E x_i$
Next, the global text features are sequentially passed to the multimodal feature generation process to compute the multimodal features $m$, formulated as follows:

$m = \mathrm{unpool}\!\left(W_t\, \bar{t}\right)$
In this approach, the global text features are first projected into the image space via a fully connected (FC) layer, transforming textual attributes into a format resembling image features. Following this transformation, average unpooling expands the low-dimensional latent vector into a high-dimensional multimodal feature map that matches the dimensions of the teacher model's final convolutional activation. Crucially, the semantic integrity of these multimodal features is established by modeling the global textual context under the supervision of text translation. This methodology guarantees that the resulting multimodal features effectively encapsulate and convey the semantic nuances inherent in the input text.
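A minimal sketch of this generator is shown below: word embeddings are average-pooled into a global text feature, projected by an FC layer into the image feature space, and then expanded to the spatial size of the teacher's last convolutional activation. The 7 × 7 × 2048 target shape follows the ResNet50 teacher used later; the embedding dimension is an illustrative assumption, and the "average unpool" step is implemented here as a simple broadcast over the spatial grid, which is our assumption about the exact unpooling operator.

```python
import torch
import torch.nn as nn

class MultimodalFeatureGenerator(nn.Module):
    """Sketch of F: global text feature -> multimodal feature map."""

    def __init__(self, d_word: int = 512, d_img: int = 2048, spatial: int = 7):
        super().__init__()
        self.fc = nn.Linear(d_word, d_img)   # W_t: project text into the image feature space
        self.spatial = spatial               # matches the teacher's last conv activation (7 x 7)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (I, d_word) word-embedding text features E x_i
        t_bar = emb.mean(dim=0)              # global text feature via average pooling
        latent = self.fc(t_bar)              # low-dimensional latent vector in image space
        # "average unpool": broadcast the latent vector over the spatial grid
        m = latent.view(-1, 1, 1).expand(-1, self.spatial, self.spatial)
        return m                             # (2048, 7, 7) multimodal features
```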

3.2.2. Knowledge Distillation

Our knowledge distillation method transfers visual perception insights from the teacher model to the student model, promoting deep engagement with textual semantics within the multimodal feature generator. To create a comprehensive multimodal feature enriched with information, we establish a novel distillation framework involving one inter-modal and two intra-modal knowledge transfer processes. This method ensures that visual perception information and textual semantics can be fully integrated, thereby generating multimodal features with rich information and high expressiveness. For the knowledge distillation model, we adjusted key structural parameters, as detailed in Table 1.
(1)
The student model S is directed to derive crucial visual representation details directly from the source text, thereby establishing an inter-modal semantic connection between textual content and actual images. For each layer $l$ of the teacher model T, a real image $I_r$ yields a visual representation $\varphi_l^T(I_r)$, alongside the corresponding inverted hidden representation $\varphi_{l+1}^S(\bar{t}; \theta_l^S)$ produced by the student at the subsequent layer $l+1$. The paired representations $\varphi_{l+1}^S(\bar{t}; \theta_l^S)$ and $\varphi_l^T(I_r)$ share identical dimensionalities and encompass comparable latent concepts. The loss is defined as the sum of discrepancies between these two representations, supplemented by an auxiliary regularization term, formulated as follows:
$\mathrm{Loss}_1 = \sum_l \left\| \varphi_l^T(I_r) - \varphi_{l+1}^S(\bar{t}; \theta_l^S) \right\|_2 + \left\| I_r - I_s \right\|_2$
The $L_2$ norm $\|\cdot\|_2$ quantifies the similarity between two vectors. The regularization term $\|I_r - I_s\|_2$ denotes the image-space loss, serving as a critical constraint for S to accurately learn the distribution of real images.
(2)
The constrained student model S acquires visual perception from the image via the inverse features, thereby mitigating the intra-modal disparity between these features and the actual images. In this process, we feed the inverse features $I_s$ into the teacher model T to derive the teacher's pseudo-visual representation $\varphi_l^D(I_s)$. Next, to encourage the student model to thoroughly learn the image distribution, we reduce the gap between the pseudo-visual representation $\varphi_l^D(I_s)$ and the corresponding visual representation $\varphi_l^T(I_r)$. Consequently, $\mathrm{Loss}_2$ is defined as a blend of this difference and the image-space loss, formulated as follows:
$\mathrm{Loss}_2 = \sum_l \left\| \varphi_l^T(I_r) - \varphi_l^D(I_s) \right\|_2 + \left\| I_r - I_s \right\|_2$
(3)
To constrain the student model S and enhance the effective transfer of information from the teacher model T, we progressively close the gap between known and predicted information, thus improving the precision of the feature representations. Specifically, we reduce the difference between the inverted hidden representation $\varphi_{l+1}^S(\bar{t}; \theta_l^S)$ generated by the student model S at layer $l+1$ and the pseudo-visual representation $\varphi_l^D(I_s)$ predicted by the teacher model. $\mathrm{Loss}_3$ is then computed layer by layer, starting from the information successfully converted at the previous level up to the final predicted information, formulated as follows:
$\mathrm{Loss}_3 = \sum_l \left\| \varphi_{l+1}^S(\bar{t}; \theta_l^S) - \varphi_l^D(I_s) \right\|_2 + \left\| I_r - I_s \right\|_2$
Compared with text-to-image (T2I) synthesis research [41,42,43], our study focuses on enhancing text translation through dual visual distillation involving both inter-modal and intra-modal aspects. The multimodal features generated by our method emphasize aligning and fusing text and image, thereby reinforcing the connection from known text information to predicting unknown information. This approach ensures that the generated features precisely capture the text’s semantic content, rather than simply verifying the image’s accuracy.
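The three distillation terms above can be computed layer by layer as in the sketch below. Here `teacher_feats`, `student_feats`, and `pseudo_feats` are assumed to be lists of the layer-wise representations $\varphi_l^T(I_r)$, $\varphi_{l+1}^S(\bar{t}; \theta_l^S)$, and $\varphi_l^D(I_s)$, and the shared image-space regularizer $\|I_r - I_s\|_2$ is added to each term.

```python
import torch


def distillation_losses(teacher_feats, student_feats, pseudo_feats, img_real, img_inv):
    """Sketch of Loss_1, Loss_2, Loss_3.

    teacher_feats[l]  ~ phi_l^T(I_r)        : teacher features of the real image
    student_feats[l]  ~ phi_{l+1}^S(t_bar)  : student's inverted hidden representation
    pseudo_feats[l]   ~ phi_l^D(I_s)        : teacher's pseudo-visual representation of I_s
    img_real, img_inv ~ I_r, I_s            : real image and student-generated inverse features
    """
    reg = torch.norm(img_real - img_inv, p=2)  # shared image-space regularizer ||I_r - I_s||_2

    loss1 = sum(torch.norm(t - s, p=2) for t, s in zip(teacher_feats, student_feats)) + reg
    loss2 = sum(torch.norm(t - p, p=2) for t, p in zip(teacher_feats, pseudo_feats)) + reg
    loss3 = sum(torch.norm(s - p, p=2) for s, p in zip(student_feats, pseudo_feats)) + reg
    return loss1, loss2, loss3
```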

3.3. Feature Refinement

In terms of feature fusion, we introduce additional contrastive learning as enhanced feature refinement in the pre-training stage to further strengthen the constraints between images and texts. Chen et al. [44] enhanced feature fusion between different modalities through cross-modal contrastive learning, yielding better input feature representations. Contrastive learning operates as a self-supervised framework that augments data and improves the effectiveness of visual representations. To further purify the connections among existing features, we perform enhanced contrastive fusion between the original image features $X = (x_1, \ldots, x_n)$ output by the multimodal feature generator and the synthetic image–text features $Y = (y_1, \ldots, y_n)$. The $\mathrm{Loss}_{\mathrm{fuse}}$ loss function is therefore defined as a mix of the feature difference and the image-space loss, formulated as follows:
$\mathrm{Loss}_{\mathrm{fuse}} = \sum_{i,j} \left[ -\,\mathbb{1}(i=j)\,\mathrm{sim}(x_i, y_j) + \mathbb{1}(i \neq j)\,\max\!\left(0,\; m - \mathrm{sim}(x_i, y_j)\right) \right]$
We assert that multimodal fusion is crucial for machine translation, particularly in representing visual features. These advanced techniques can enhance MMT system performance by efficiently integrating visual and textual features, thereby improving the quality of feature representations.
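A minimal sketch of this refinement objective is shown below, assuming cosine similarity for $\mathrm{sim}(\cdot,\cdot)$ and a scalar margin (the $m$ in the hinge term, not the multimodal features); the default margin value and the sign of the positive-pair term are reconstructions rather than values stated in the paper.

```python
import torch
import torch.nn.functional as F


def feature_refinement_loss(x: torch.Tensor, y: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Sketch of Loss_fuse with cosine similarity.

    x: (N, d) original image features from the multimodal feature generator
    y: (N, d) synthetic image-text features
    """
    x_n = F.normalize(x, dim=-1)
    y_n = F.normalize(y, dim=-1)
    sim = x_n @ y_n.t()                                       # pairwise similarities sim(x_i, y_j)
    pos = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

    loss_pos = -sim[pos].sum()                                # pull matched pairs (i == j) together
    loss_neg = torch.clamp(margin - sim[~pos], min=0).sum()   # hinge on mismatched pairs (i != j)
    return loss_pos + loss_neg
```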

3.4. Objective Function

Throughout training, we fine-tune the proposed multimodal translation (MMT) model end-to-end by minimizing the knowledge distillation loss, feature refinement loss, and text translation loss. This approach guarantees the effective fusion and refinement of multimodal features, enhancing the model’s overall performance.
$J(\theta, \theta^S) = J_{\mathrm{trans}}(\theta, \theta^S) + \mathrm{Loss}_1 + \mathrm{Loss}_2 + \mathrm{Loss}_3 + \mathrm{Loss}_{\mathrm{fuse}}$
The translation loss on the training dataset $D$ not only establishes the connection between the source text and the target text but also captures the textual semantics within the multimodal features:

$J_{\mathrm{trans}}(\theta, \theta^S) = \sum_{D} -\log p\!\left(y_j \mid y_{<j}, X, m\right)$
During the experimental stage, the adeptly trained multimodal feature generator and refinement process successfully yield enhanced features integrated into the MMT backbone, thereby obviating the need for image dependencies in the encoder.
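Putting the pieces together, the joint objective is simply the sum of the translation negative log-likelihood and the four auxiliary terms defined above; the sketch below assumes decoder logits and target indices as inputs, and the padding index follows the usual Fairseq convention (an assumption, not a value stated in the paper).

```python
import torch.nn.functional as F


def total_objective(logits, targets, loss1, loss2, loss3, loss_fuse, pad_idx=1):
    """Joint training objective: translation NLL plus the distillation and
    feature refinement terms, minimized end-to-end."""
    trans_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # (batch * tgt_len, vocab)
        targets.view(-1),                  # (batch * tgt_len,)
        ignore_index=pad_idx,              # skip padding positions
    )
    return trans_loss + loss1 + loss2 + loss3 + loss_fuse
```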

4. Experiments and Analysis

Here, we begin by outlining the environment configurations and implementation specifics utilized in the experiments detailed in Section 4.1. Then, we perform ablation studies and related experiments on our proposed method and key components in Section 4.2 and Section 4.3. Finally, we conduct corresponding visualization analysis and case studies in Section 4.4 and Section 4.5. Extensive experimentation demonstrates that our approach attains performance comparable to that of the most advanced machine translation technologies on certain datasets.

4.1. Dataset

We performed several experiments using the Multi30K dataset, widely recognized as a leading human-annotated dataset in multimodal machine translation (MMT). Each text in Multi30K is paired with a JPEG image sourced from the Flickr30k [45] dataset, and the texts are manually translated into German (DE) and French (FR) [46]. Each language pair in Multi30K consists of 29,000 instances in the training set, 1014 instances in the validation set, and 1000 instances in the Test2016 test set. Our evaluation additionally covers the Test2017 test set and the MS COCO test set, each comprising 1000 instances. Byte pair encoding (BPE) with 10,000 merge operations was applied using the official Multi30K script to preprocess both source and target sentences and construct their vocabularies.

4.2. Image Feature Extraction

We employ the ResNet50 architecture [47,48] for image feature extraction within our model. The structure of each module, together with the architecture and parameter configurations for multimodal feature generation, is depicted in Figure 2. ResNet50 processes images through four blocks consisting of 3, 4, 6, and 3 bottleneck layers, respectively. In the final block, after applying average pooling and fully connected layers, a softmax layer is utilized to normalize and adjust the final output to 2048 dimensions. We adapt the last layer of the model to output information related to image features, and the extracted image features are stored in the idx data format.
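A sketch of this extraction step is shown below, using the standard torchvision ResNet50 with the classification head removed so that the 2048-dimensional pooled activation is exposed. The preprocessing constants are the usual ImageNet values, the weights enum assumes a recent torchvision version, and writing the resulting tensors to the idx format used by the paper is left out.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# ResNet50 backbone with the final fc layer dropped -> 2048-d pooled features
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(path: str) -> torch.Tensor:
    """Return a 2048-d feature vector for one image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return extractor(img).flatten(1).squeeze(0)   # shape: (2048,)
```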

4.3. Training Parameters

Our experiments were executed using the transformer architecture implemented within the Fairseq framework [49], an open-source NLP tool by Facebook built on PyTorch. The study comprised two main parts: a multimodal comparison experiment employing the ResNet50 model to extract image features, and a machine translation experiment aimed at enhancing these refined features. This involved configuring six multi-head attention heads, setting the learning rate to 0.004, and applying a dropout rate of 0.3. The Adam optimizer was utilized for optimization, with β1 and β2 parameters set at 0.9 and 0.98, respectively. A batch size of 64 for training, 2000 warmup steps, and a maximum token limit of 1024 were configured on an RTX-Titan server with 24 GB of memory. To ensure robustness and prevent overfitting on small datasets, we employed the tiny transformer configuration. Experimentally, we set a patience parameter of 10 to extend training by 10 epochs once the model’s performance peaked. If no further improvement was observed, training was halted. Evaluation metrics, such as case-insensitive BLEU and METEOR, were used to select the best-performing model based on translation quality.
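As a reference point, the sketch below builds an Adam optimizer and an inverse-square-root warmup schedule matching the hyperparameters listed above (β1 = 0.9, β2 = 0.98, peak learning rate 0.004, 2000 warmup steps). In practice these options are passed to fairseq-train; the `model` argument here is a placeholder and the scheduler shape is an assumption about the exact schedule used.

```python
import torch

def build_optimizer(model, lr=4e-3, warmup=2000):
    """Adam + inverse-sqrt warmup, mirroring the stated training configuration."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.98))

    def inverse_sqrt(step: int) -> float:
        step = max(step, 1)
        # linear warmup to the peak lr, then decay proportional to 1/sqrt(step)
        return min(step / warmup, (warmup / step) ** 0.5)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=inverse_sqrt)
    return opt, sched
```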
Overall, these rigorous experimental protocols and configurations enabled us to validate the efficacy of our approach in enhancing multimodal machine translation, demonstrating significant improvements in both training stability and translation accuracy across various evaluation datasets.

4.4. Main Results

Each model is executed three times, and the results are reported as an average. Based on the above design, we successfully assessed the translation quality for English to German (EN-DE) and English to French (EN-FR).
EN-DE Translation Task: Table 2 presents the performance of various MMT systems using the Multi30K dataset for the EN-DE translation task. By comparing and analyzing with previous data, we draw the following three conclusions:
(1)
MMT systems utilizing image constraints typically surpass those lacking them, demonstrating that incorporating visual data can significantly enhance translation quality.
(2)
Our BLEU and METEOR scores using the DSKP-MMT model significantly outperform the test results of the MMT system without introducing images. This enhancement demonstrates our model’s capability to integrate multimodal features seamlessly during training and utilize them to facilitate translation in testing phases.
(3)
Our DSKP-MMT model not only exceeds the performance of the existing image-constrained MMT system but also rivals the current state-of-the-art image-constrained MMT system (SOTA).
EN-FR Translation Task: From the EN-DE translation task, we proceeded to conduct experiments for the EN-FR translation task as well. The experiments demonstrate that our DSKP-MMT model achieves superior scores on the EN-FR translation task compared to the baselines listed in Table 2, which verifies that our model is also robust and universal in different language scenarios.
In summary, our DSKP-MMT model, as an image-free approach to multimodal machine translation (MMT), demonstrates superior performance compared to other MMT systems, including those incorporating additional image information. The substantial improvements in BLEU and METEOR scores underscore the robust translation capabilities of the DSKP-MMT model. It excels in integrating visual features with text semantics, enhancing text-related visual representations through images, and achieving superior synthetic features through enhanced text translation and visual distillation refinement. By leveraging the richness of information and stability in generating multimodal features, our method proves more resilient in overcoming data constraints.

4.5. Ablation Research

Table 3 presents the ablation results for the EN-DE experiment, illustrating how various modules affect the experimental outcomes. We examine how different variants impact the experimental outcomes from the following perspectives:
(1)
By investigating how different similarity functions impact the assessment of divergence between hidden representations within a module, we determined that the $L_2$ norm produces the optimal results, with the remaining functions ordered as KL-Div > $L_1$ > L > Cosine.
(2)
By comparing the impact of enhanced knowledge distillation at each neural network layer and block, we observed that the best evaluation results were achieved through distillation at the whole-model level. This emphasizes that utilizing information from both initial and final representations, while reinforcing relationships, effectively guides the student model in producing enriched features. This illustrates the superiority of our knowledge distillation module over traditional approaches.
(3)
By studying three image feature extraction methods using different CNNs, we determined that the ResNet50 network structure excelled due to its ability to extract the most robust visual feature information among the deep residual networks. Due to the absence of residual connections and the requirement for a substantial number of training samples, the performance of the VGG19 network structure was subpar. Undoubtedly, the lightweight AlexNet network structure caused the most serious translation performance degradation. This demonstrates that advanced image feature extraction models are well-suited for handling the demands of multi-supervised learning.
(4)
By discussing the impact of different distillation loss strategies on translation performance, we find that removing enhanced knowledge distillation and enhanced feature refinement leads to severe performance degradation. Multimodal features without visual awareness show poor performance in machine translation, as they are derived solely by passing global text features through FC&Avg Unpool. This phenomenon suggests that our enhanced method highlights the significance of the teacher model’s intermediate hidden states in instructing the student model on comprehending text–image relationships.
In summary, our extensive exploration across different aspects (similarity functions, neural network configurations, and loss strategies) has identified the key factors that significantly enhance our model's translation performance. By adopting the $L_2$ norm similarity function, implementing a model-wide distillation strategy, utilizing the ResNet50 network architecture, and employing a comprehensive loss that effectively integrates both visual and textual information, we maximize the effectiveness of multimodal features for translation tasks. These choices not only guarantee robust feature fusion and alignment but also affirm the effectiveness of incorporating extra image information in enhancing the fidelity and accuracy of text translation. This systematic analysis underscores the feasibility and efficacy of our model in advancing multimodal machine translation research.

4.6. Image Matching Analysis

According to Table 4, the text description in the first example is "a man on a four-wheeler is flying through the air." Juxtaposing the images with the textual content, we observed that all details mentioned in the text, such as "man", "four-wheeler", and "the air", are depicted in the images. However, the matched images also contain some information that is not present in the text. For example, in the third example in Table 4, the matched image shows a "dog" that may be trying to grab an object; although this action is not explicitly written in the text, it is reflected in the retrieved image. By analyzing the results of text–image matching, we found that transformer encoding of image features and synthetic image–text features before the final input can significantly improve the accuracy of image matching. These matched images accurately portray the content of the corresponding text descriptions, demonstrating that incorporating additional image information can enhance the efficacy of plain-text translation to some degree. This further validates the feasibility and effectiveness of our model.

4.7. Case Study

From another perspective, we demonstrate that our model performs better at identifying objects in images. This section includes two case studies to assess the translation quality achieved by the proposed method. As shown in Table 5, in specific translation situations the traditional MMT system produces a broadly successful translation, yet nouns and verbs are rendered differently, and the resulting meaning differs. By adding visual information, our model translates the meaning of "einem schutthaufen" more intuitively and accurately. In the EN-FR comparison, our method likewise translates "crane operates" correctly as "grue travaille". Since the visual concepts obtained via object detection convey both object attributes and object-level information, they significantly enhance the accuracy and fluency of the translation. These results show that the model with added visual information achieves higher accuracy and consistency in handling specific translation tasks.

5. Conclusions

This work introduces a novel framework, the DSKP-MMT model, which achieves MMT by integrating visual features with semantic information through strengthened knowledge distillation and feature refinement, validated by extensive experiments. Our DSKP-MMT framework makes three main contributions: (1) through the dual constraints of visual feature extraction and textual semantics, we generate informative multimodal features that, after an enhanced feature refinement stage, support translation in MMT systems; (2) we innovatively use a pre-trained model to guide the translation results, making the knowledge extraction module more flexible and efficient; (3) we found that the visual modality not only acts as a form of regularization but also plays a vital role in establishing correlations between text and image, underscoring its fundamental importance. The ablation results further validate the robustness of the structure of the proposed DSKP-MMT model. In summary, our model better leverages image information for translation purposes. In future work, we aim to integrate image feature information into various models to enhance translation efficacy across diverse languages.

Author Contributions

Conceptualization, E.T.; methodology, E.T. and Z.Z.; software, Z.Z., R.G. and S.Z.; validation, Z.Z.; formal analysis, E.T., F.L. and Z.L.; investigation, E.T., F.L. and Z.L.; resources, E.T. and Z.Z.; data curation, E.T., Z.Z., R.G. and S.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z.; visualization, Z.Z., Z.L., R.G. and S.Z.; supervision, Z.L., R.G. and S.Z.; project administration, E.T. and Z.L.; funding acquisition, E.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Henan Provincial Science and Technology Research Project: 232102211017, 232102211006, 232102210044, 242102211020 and 242102211007, the Zhengzhou University of Light Industry Science and Technology Innovation Team Program Project: 23XNKJTD0205.

Data Availability Statement

The data presented in this study are openly available in [Multi30K] at [https://arxiv.org/abs/1605.00459 (accessed on 2 May 2016)] and [Flickr 30k] at [http://shannon.cs.illinois.edu/DenotationGraph/data/index.html (accessed on 1 February 2014)] reference number [10] and reference number [45].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Weighted Transformer Network for Machine Translation. Available online: https://arxiv.org/abs/1711.02132 (accessed on 6 November 2017).
  2. Laskar, S.R.; Paul, B.; Paudwal, S.; Gautam, P.; Biswas, N.; Pakray, P. Multimodal Neural Machine Translation for English—Assamese Pair. In Proceedings of the 2021 International Conference on Computational Performance Evaluation (ComPE), Shillong, India, 1–3 December 2021. [Google Scholar]
  3. Chen, J.R.; He, T.L.; Zhuo, W.P.; Ma, L.; Ha, S.T.; Chan, S.H.G. Tvconv: Efficient translation variant convolution for layout-aware visual processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022. [Google Scholar]
  4. Chen, S.Y.; Zeng, Y.W.; Cao, D.; Lu, S.F. Video-guided machine translation via dual-level back-translation. Knowl.-Based Syst. 2022, 245, 108598. [Google Scholar] [CrossRef]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  6. Keyframe Segmentation and Positional Encoding for Video-Guided Machine Translation Challenge 2020. Available online: https://arxiv.org/abs/2006.12799 (accessed on 23 June 2020).
  7. Incorporating Global Visual Features into Attention-Based Neural Machine Translation. Available online: https://arxiv.org/abs/1701.06521 (accessed on 23 January 2017).
  8. Distilling Translations with Visual Awareness. Available online: https://arxiv.org/abs/1906.07701 (accessed on 18 June 2019).
  9. A Novel Graph-Based Multi-Modal Fusion Encoder for Neural Machine Translation. Available online: https://arxiv.org/abs/2007.08742 (accessed on 17 July 2020).
  10. Multi30k: Multilingual English-German Image Descriptions. Available online: https://arxiv.org/abs/1605.00459 (accessed on 2 May 2016).
  11. Imagination Improves Multimodal Translation. Available online: https://arxiv.org/abs/1705.04350 (accessed on 7 July 2017).
  12. Zhang, Z.S.; Chen, K.H.; Wang, R.; Utiyama, M.; Sumita, E.; Li, Z.C.; Zhao, H. Neural machine translation with universal visual representation. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 9 September 2019. [Google Scholar]
  13. Generative Imagination Elevates Machine Translation. Available online: https://arxiv.org/abs/2009.09654 (accessed on 13 April 2021).
  14. Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning. Available online: https://arxiv.org/abs/2303.02861 (accessed on 6 March 2023).
  15. Lei, T.; Bai, J.; Brahma, S.; Ainslie, J.; Lee, K.; Zhou, Y.Q.; Du, N.; Zhao, V.; Wu, Y.X.; Li, B.; et al. Conditional adapters: Parameter-efficient transfer learning with fast inference. Adv. Neural Inf. Process. Syst. 2023, 36, 8152–8172. [Google Scholar]
  16. Xin, Y.; Du, J.L.; Wang, Q.; Lin, Z.W.; Yan, K. VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  17. Adaptive Feature Fusion: Enhancing Generalization in Deep Learning Models. Available online: https://arxiv.org/abs/2304.03290 (accessed on 4 April 2023).
  18. Adaptive Ensemble Learning: Boosting Model Performance through Intelligent Feature Fusion in Deep Neural Networks. Available online: https://arxiv.org/abs/2304.02653 (accessed on 4 April 2023).
  19. Wu, Y.S.; Chen, K.; Zhang, T.Y.; Hui, Y.C.; Berg-Kirkpatrick, T.; Dubnov, S. Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023. [Google Scholar]
  20. LIUM-CVC Submissions for WMT17 Multimodal Translation Task. Available online: https://arxiv.org/abs/1707.04481 (accessed on 14 July 2017).
  21. Doubly Attentive Transformer Machine Translation. Available online: https://arxiv.org/abs/1807.11605 (accessed on 30 July 2018).
  22. CUNI System for the WMT18 Multimodal Translation Task. Available online: https://arxiv.org/abs/1811.04697 (accessed on 12 November 2018).
  23. Doubly-Attentive Decoder for Multi-Modal Neural Machine Translation. Available online: https://arxiv.org/abs/1702.01287 (accessed on 4 February 2017).
  24. Shi, X.Y.; Yu, Z.Q. Adding Visual Information to Improve Multimodal Machine Translation for Low-Resource Language. Math. Probl. Eng. 2022, 2022, 5483535. [Google Scholar] [CrossRef]
  25. Yao, S.W.; Wan, X.J. Multimodal transformer for multimodal machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA, 5–10 July 2020. [Google Scholar]
  26. Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006. [Google Scholar]
  27. Distilling the Knowledge in a Neural Network. Available online: https://arxiv.org/abs/1503.02531 (accessed on 9 March 2015).
  28. Fitnets: Hints for Thin Deep Nets. Available online: https://arxiv.org/abs/1412.6550 (accessed on 27 March 2015).
  29. Yim, J.; Joo, D.; Bae, J.; Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  30. Gupta, S.; Hoffman, J.; Malik, J. Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  31. Yuan, M.K.; Peng, Y.X. Text-to-image synthesis via symmetrical distillation networks. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018. [Google Scholar]
  32. Capwap: Captioning with a Purpose. Available online: https://arxiv.org/abs/2011.04264 (accessed on 9 November 2020).
  33. Latent Variable Model for Multi-Modal Translation. Available online: https://arxiv.org/abs/1811.00357 (accessed on 16 May 2019).
  34. Ranaldi, L.; Pucci, G. Knowing Knowledge: Epistemological Study of Knowledge in Transformers. Appl. Sci. 2023, 13, 677. [Google Scholar] [CrossRef]
  35. Ranaldi, L.; Pucci, G. Does the English Matter? Elicit Cross-Lingual Abilities of Large Language Models. In Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), Singapore, 7 December 2023. [Google Scholar]
  36. A Convolutional Encoder Model for Neural Machine Translation. Available online: https://arxiv.org/abs/1611.02344 (accessed on 25 July 2017).
  37. Ye, J.J.; Guo, J.J. Dual-level interactive multimodal-mixup encoder for multi-modal neural machine translation. Appl. Intell. 2022, 52, 14194–14203. [Google Scholar] [CrossRef]
  38. Very Deep Convolutional Networks for Large-Scale Image Recognition. Available online: https://arxiv.org/abs/1409.1556 (accessed on 10 April 2017).
  39. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  40. Zhang, Y.; Jin, R.; Zhou, Z.H. Understanding bag-of-words model: A statistical framework. Int. J. Mach. Learn. Cybern. 2010, 1, 43–52. [Google Scholar] [CrossRef]
  41. Reed, S.; Akata, Z.; Yan, X.C.; Logeswaran, L. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016. [Google Scholar]
  42. Zhang, H.; Xu, T.; Li, H.S.; Zhang, S.T.; Wang, X.G.; Huang, X.L.; Metaxas, D.N. Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision(ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  43. Xu, T.; Zhang, P.C.; Huang, Q.Y.; Zhang, H.; Gan, Z.; Huang, X.L.; He, X.D. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  44. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020. [Google Scholar]
  45. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Available online: http://shannon.cs.illinois.edu/DenotationGraph/data/index.html (accessed on 1 February 2014).
  46. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. Available online: https://arxiv.org/abs/1710.07177 (accessed on 19 October 2017).
  47. Automatic Image Captioning Based on ResNet50 and LSTM with Soft Attention. Available online: https://onlinelibrary.wiley.com/doi/full/10.1155/2020/8909458 (accessed on 21 October 2020).
  48. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  49. Fairseq: A fast, Extensible Toolkit for Sequence Modeling. Available online: https://arxiv.org/abs/1904.01038 (accessed on 1 April 2019).
  50. A Visual Attention Grounding Neural Model for Multimodal Machine Translation. Available online: https://arxiv.org/abs/1808.08266 (accessed on 28 August 2018).
  51. Probing the Need for Visual Context in Multimodal Machine Translation. Available online: https://arxiv.org/abs/1903.08678 (accessed on 2 June 2019).
  52. Lin, H.; Meng, F.D.; Su, J.S.; Yin, Y.j.; Yang, Z.Y.; Ge, Y.B.; Zhou, J.; Luo, J.B. Dynamic context-guided capsule network for multimodal machine translation. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020. [Google Scholar]
  53. Gumbel-Attention for Multi-Modal Machine Translation. Available online: https://arxiv.org/abs/2103.08862 (accessed on 24 July 2022).
  54. Wang, D.X.; Xiong, D.Y. Efficient object-level visual context modeling for multimodal machine translation: Masking irrelevant objects helps grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021. [Google Scholar]
Figure 1. Framework of the DSKP-MMT model.
Figure 2. Final output of the modified ResNet50 model.
Table 1. Parameter structure of distillation network.
| Teacher Layer Name | Output Size | 49-Layer (Visual Teacher Model) | Student Layer Name | Output Size | 48-Layer (Multimodal Student Model) |
|---|---|---|---|---|---|
| T-conv1 | 112 × 112 | 7 × 7, 64, stride 2 | S-conv1 | 224 × 224 | 8 × 8, 3, stride 2 |
| T-conv2_x | 56 × 56 | 3 × 3 max pool, stride 2; [1 × 1, 3 × 3, 1 × 1; 64, 64, 256] × 3 | S-conv2_x | 112 × 112 | 8 × 8, 3, stride 2; [1 × 1, 3 × 3, 1 × 1; 256, 64, 64] × 3 |
| T-conv3_x | 28 × 28 | [1 × 1, 3 × 3, 1 × 1; 128, 128, 512] × 4 | S-conv3_x | 56 × 56 | [1 × 1, 3 × 3, 1 × 1; 512, 128, 128] × 4 |
| T-conv4_x | 14 × 14 | [1 × 1, 3 × 3, 1 × 1; 256, 256, 1024] × 6 | S-conv4_x | 28 × 28 | [1 × 1, 3 × 3, 1 × 1; 512, 256, 256] × 6 |
| T-conv5_x | 7 × 7 | [1 × 1, 3 × 3, 1 × 1; 512, 512, 2048] × 3 | S-conv5_x | 14 × 14 | [1 × 1, 3 × 3, 1 × 1; 1024, 512, 512] × 3 |
| – | 1 × 1 | average pool | – | – | – |
| – | – | – | Multimodal Feature Generator | 7 × 7 | 2048-d fc, average unpool |
Table 2. BLEU (“B”) and METEOR (“M”) scores for the EN-DE and EN-FR translation tasks.
| Setting | MMT System | EN-DE Test2016 (B / M) | EN-DE Test2017 (B / M) | EN-DE MS COCO (B / M) | EN-FR Test2016 (B / M) | EN-FR Test2017 (B / M) |
|---|---|---|---|---|---|---|
| Image-must | NMTSRC+IMG [23] | 36.5 / 55.0 | – | – | – | – |
| Image-must | IMGD [7] | 37.3 / 55.1 | – | – | – | – |
| Image-must | Fusion-conv [20] | 37.0 / 57.0 | 29.8 / 51.2 | 25.1 / 46.0 | 53.5 / 70.4 | 51.6 / 68.6 |
| Image-must | VMMT [33] | 37.7 / 56.0 | 30.1 / 49.9 | 25.5 / 44.8 | – | – |
| Image-must | Trg-mul [20] | 37.8 / 57.7 | 30.7 / 52.2 | 26.4 / 47.4 | 54.7 / 71.3 | 52.7 / 69.5 |
| Image-must | VAG-NMT [50] | – | 31.6 / 52.2 | 28.3 / 48.0 | – | 53.8 / 70.3 |
| Image-must | DS-SUM-L2 [51] | 39.4 / 58.7 | 32.6 / 52.9 | – | 60.7 / 76.0 | 54.2 / 71.0 |
| Image-must | Del+obj [8] | 38.0 / 55.6 | – | – | 59.8 / 74.4 | – |
| Image-must | Multimodal [25] | 38.7 / 55.7 | – | – | – | – |
| Image-must | GMNMT [9] | 39.8 / 57.6 | 32.2 / 51.9 | 28.7 / 47.6 | 60.9 / 74.9 | 53.9 / 69.3 |
| Image-must | DCCN [52] | 39.7 / 56.8 | 31.0 / 49.9 | 26.7 / 45.7 | 61.2 / 76.4 | 54.3 / 70.3 |
| Image-must | Gumbel-att [53] | 39.2 / 57.8 | 31.4 / 51.2 | 26.9 / 46.0 | – | – |
| Image-must | OVC+Lm [54] | – | 32.3 / 52.4 | 28.9 / 48.1 | – | 54.1 / 70.5 |
| Image-free | Transformer [5] | 37.6 / 55.3 | 31.7 / 52.1 | 27.9 / 47.8 | 59.0 / 73.6 | 51.9 / 68.3 |
| Image-free | Multitask [11] | 36.8 / 55.8 | – | – | – | – |
| Image-free | VMMTF [33] | 37.7 / 56.0 | 30.1 / 49.9 | 25.5 / 44.8 | – | – |
| Image-free | UVR-NMT [12] | 36.94 / – | 28.63 / – | – | 57.53 / – | 48.46 / – |
| Image-free | ImagiT [13] | 38.5 / 55.7 | 32.1 / 52.4 | 28.7 / 48.8 | 59.7 / 74.0 | 52.4 / 68.3 |
| Image-free | DSKP-MMT (ours) | 40.42 / 58.15 | 32.87 / 52.44 | 28.83 / 47.73 | 61.49 / 76.7 | 54.33 / 71.37 |
Table 3. Ablation experiment results of different variants.
| Variant | Sim. Func. | Dist. Gran. | CNN Backbone | Dist. Loss | Avg. B | Avg. M |
|---|---|---|---|---|---|---|
| base | L2 | Model | ResNet50 | All_Loss | 40.42 | 58.15 |
| (1) | L1 | – | – | – | 39.98 | 58.07 |
| (1) | L | – | – | – | 39.83 | 57.66 |
| (1) | Cosine | – | – | – | 39.57 | 57.37 |
| (1) | KL-Div | – | – | – | 40.06 | 57.82 |
| (2) | – | Block | – | – | 39.82 | 57.66 |
| (3) | – | – | VGG19 | – | 40.06 | 57.82 |
| (3) | – | – | AlexNet | – | 39.88 | 57.52 |
| (4) | – | – | – | Text_Loss | 39.72 | 57.50 |
| (4) | – | – | – | KD_Loss | 40.08 | 57.33 |
| (4) | – | – | – | F_Loss | 40.28 | 57.74 |
Table 4. We chose some of the experimental data as examples of text–image joint matching results and as proofs of rev.
| Source Text (EN) | Source Image | Match Image |
|---|---|---|
| a man on a four-wheeler is flying through the air. | Electronics 13 03084 i001 | Electronics 13 03084 i002 |
| two dogs are wrestling in a grassy field. | Electronics 13 03084 i003 | Electronics 13 03084 i004 |
| a white dog is about to catch a yellow dog toy. | Electronics 13 03084 i005 | Electronics 13 03084 i006 |
Table 5. Translation cases of different models.
Image: Electronics 13 03084 i007
EN: a crane operates amidst piles of rubble.
DE: ein kran arbeitet mitten in einem schutthaufen.
FR: une grue travaille au milieu des amas de décombres.
DE (MMT): ein bagger arbeitet mitten in den Trümmern.
DE (ours): ein kran arbeitet mitten in einem schutthaufen.
FR (MMT): une excavatrice creuser au milieu des amas de décombres.
FR (ours): une grue travaille au milieu des amas de décombres.

Image: Electronics 13 03084 i008
EN: two cars are driving on a racetrack.
DE: zwei autos fahren auf einer rennstrecke.
FR: deux voitures roulent sur un circuit.
DE (MMT): zwei rennwagen fahren auf der strecke.
DE (ours): zwei autos fahren auf einer rennstrecke.
FR (MMT): deux voitures course roulent sur la piste.
FR (ours): deux voitures roulent sur un circuit.
The red color represents the comparison between the translation results of MMT method and normal results. The blue color represents the comparison of the results between our method and MMT method.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
