Article

Evaluating Neural Networks’ Ability to Generalize against Adversarial Attacks in Cross-Lingual Settings

1 Department of Computer Science Engineering, Maharaja Surajmal Institute of Technology Affiliated to GGSIPU, New Delhi 110058, India
2 School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA 19104, USA
3 Faculty of Logistics, Molde University College, Britvegen 2, 6410 Molde, Norway
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5440; https://doi.org/10.3390/app14135440
Submission received: 17 May 2024 / Revised: 14 June 2024 / Accepted: 17 June 2024 / Published: 23 June 2024
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)

Featured Application

The application of this research is to create better multilingual datasets by utilizing the insights gained from our investigation into mBART and XLM-Roberta. These improved datasets will support the creation of more robust and accurate AI NLP models that can effectively handle various languages, enhancing performance in tasks like machine translation, sentiment analysis, text categorization, and information retrieval. This research addresses biases and limitations in the current translation methods.

Abstract

Cross-lingual transfer learning using multilingual models has shown promise for improving performance on natural language processing tasks with limited training data. However, translation can introduce superficial patterns that negatively impact model generalization. This paper evaluates two state-of-the-art multilingual models, the Cross-Lingual Model-Robustly Optimized BERT Pretraining Approach (XLM-Roberta) and the Multilingual Bi-directional Auto-Regressive Transformer (mBART), on the cross-lingual natural language inference (XNLI) task using both original and machine-translated evaluation sets. Our analysis demonstrates that translation can facilitate cross-lingual transfer learning, but maintaining linguistic patterns is critical. The results provide insights into the strengths and limitations of state-of-the-art multilingual natural language processing architectures for cross-lingual understanding.

1. Introduction

Natural language processing (NLP) has seen tremendous advancements in recent years, enabled by the rise of deep learning techniques and large datasets for training neural networks. Tasks like machine translation, question answering, and text classification have achieved impressive performance through leveraging contextualized word embeddings that aim to capture universal language representations.
However, most of this progress has focused on the English language, while other languages lack data resources. Methods such as multilingual contextualized embeddings and cross-lingual transfer learning have been proposed to overcome this limitation. The idea is to transfer knowledge from high-resource languages, like English, to low-resource ones, thereby reducing the need for large labeled datasets in each language.
Translation has become a popular approach for creating training data to facilitate this cross-lingual transfer. By translating English datasets into other languages, we can generate artificial data to train models in those target languages. However, translation has potential downsides: it can introduce surface-level artifacts that simplify the patterns seen during training, which can diminish model generalization capabilities.

1.1. Translation and Superficial Patterns in Data

When creating multilingual datasets through translation, there is a risk of introducing superficial patterns that do not generalize across languages.
Specifically, the act of translating sentences word-for-word or independently can alter the surface form while retaining the same meaning. For example, translating “She opened the door” into German might produce “Sie öffnete die Tür.” The words themselves are different, but the meaning is preserved.
This can be beneficial for learning universal representations across languages. However, models may start to rely too much on these surface patterns instead of learning deeper semantics. For instance, a model could learn to align “She” with “Sie” and “the door” with “die Tür” based on the translations seen during training. This reliance on lexical overlap works well when tested on parallel translated data. However, it becomes problematic when evaluating naturally occurring, non-translated examples. The surface forms will be different, even though the meanings may be the same. So, the model fails to generalize.
Overall, translation has tradeoffs. It enables cross-lingual transfer but can also result in models fixating on superficial patterns that do not transfer across non-parallel corpora.

1.2. Advancements in Cross-Lingual Settings

It is now feasible to train neural architectures for cross-lingual NLP tasks that can achieve state-of-the-art performance, thanks to developments in cross-lingual settings. The creation of multilingual contextualized word embeddings, which seek to emulate universal language representations, has made this possible. Numerous multilingual downstream NLP tasks, including text classification, question answering, and text generation, have made use of these embeddings [1,2,3].
Despite these achievements, there are still obstacles to overcome in cross-lingual NLP. One such challenge is the limited availability of data for languages other than English. To address this issue, researchers are exploring techniques like translation and code-switching to create datasets for low-resource languages. However, these methods can introduce challenges of their own by altering surface patterns within the data and potentially reducing the generalizability of models.
To tackle these challenges, researchers are actively developing data augmentation and code-switching techniques. These approaches aim to make models more resilient to noise and variation in the input data. In addition, researchers are developing specialized models that cater specifically to cross-lingual NLP tasks; such models can capture the subtleties of individual languages and improve performance on cross-lingual NLP tasks.

1.3. Data Augmentation and Code-Switching

To address the scarcity of data for low-resource languages, a significant portion of the available datasets are created via translation. Nonetheless, prior research conducted by Conneau et al. [3] regarding the influence of translation on multilingual models reveals that when the quality of translation reaches a certain threshold, it begins to have a detrimental impact on the models’ performance. This research underscores the potential utility of translation in improving the capabilities of multilingual models but also highlights the importance of recognizing associated risks. As demonstrated by Conneau et al. in 2018 [4] and Artetxe et al. [5], translating phrases can introduce alterations to the superficial patterns of the data, ultimately diminishing the generalizability of preexisting models. Additionally, code-switching, involving the blending of words, phrases, and grammatical elements from multiple languages, commonly occurs in informal contexts such as conversations and social media interactions. Recent studies have highlighted the vulnerability of contextualized embeddings to polyglot adversarial attacks. Therefore, it becomes imperative to evaluate the resilience of neural networks against polyglot adversaries in multilingual settings.
Both data augmentation and code-switching techniques are still evolving, yet they have already displayed promising outcomes. For instance, a recent investigation by Marzieh Fadaee and colleagues [6] introduced a data augmentation strategy that generates new sentence pairs featuring uncommon words in novel, artificially constructed contexts. The findings of this study indicate that this approach results in the generation of a greater number of rare words during translation, consequently leading to enhanced translation quality.

2. Related Works

Cross-lingual transfer learning has shown great promise for improving performance on natural language processing (NLP) tasks with limited language-specific training data. Techniques like translation and multilingual representation aim to facilitate knowledge transfer across languages. However, translation can introduce superficial artifacts that may negatively impact model generalization. This paper reviews recent advancements and key challenges in cross-lingual NLP, focusing on the effects of translation on model capabilities.
Recently, Lewis et al. [7] proposed BART (Bi-directional Auto-Regressive Transformer), a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained on a massive dataset of text that has been corrupted with a variety of noise functions. This training regime allows BART to learn to generate text that is both fluent and informative, even when the input text is noisy or incomplete. BART has been shown to be effective for a variety of NLP tasks, including natural language generation, translation, and comprehension. In particular, BART has achieved state-of-the-art results on the WMT (Workshop on Machine Translation) benchmark, which is a suite of machine translation tasks that are used to evaluate the performance of multilingual pre-trained language models. The multilingual capabilities of BART make it a valuable tool for researchers and practitioners working on multilingual NLP tasks.
Large pre-trained multilingual models such as mBERT and XLM-Roberta (Cross-Lingual Language Model RoBERTa) have also enabled effective cross-lingual zero-shot transfer in many NLP tasks [8]. A cross-lingual adjustment of these models using a small parallel corpus can improve results; however, increasing the amount of parallel data is most beneficial for NLI (natural language inference), whereas QA (question answering) performance peaks at roughly 5K parallel sentences and decreases as more parallel data is added [8].
The innovative advantage of the transformer architecture in large language models, such as those applied to multilingual NLP, lies in its self-attention mechanism, which allows the model to weigh the importance of different words in a sentence, regardless of their position. This capability significantly improves context understanding across languages, as it captures long-range dependencies and relationships within the text more effectively than previous models like Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM). Unlike recurrent architectures, transformers process the entire sentence simultaneously, enhancing parallelization and computational efficiency. These advancements enable large language models to better handle the complexities of multilingual data, resulting in more accurate and nuanced language understanding and generation [7,8].
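To make the mechanism concrete, the following is a minimal, illustrative sketch of scaled dot-product self-attention in Python with NumPy. The shapes, random weights, and single-head setup are simplifications; actual transformer models such as mBART and XLM-Roberta use multiple heads and learned projection matrices.

```python
# Minimal sketch of scaled dot-product self-attention (single head, random weights).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                      # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # similarity between every pair of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over positions
    return weights @ V                                    # each token attends to the whole sentence

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                              # 5 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                # (5, 16)
```

Because every token attends to every other token in a single step, long-range dependencies are captured without the sequential bottleneck of RNNs or LSTMs.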
Recent studies have shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. In particular, the paper “Unsupervised Cross-lingual Representation Learning at Scale” (Conneau et al., 2020) [3] shows that a Transformer-based masked language model trained on 100 languages using more than 2 terabytes of filtered CommonCrawl data significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks. The model, dubbed XLM-Roberta, improves by +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER (Named Entity Recognition). XLM-Roberta performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. The authors also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the tradeoffs between (1) positive transfer and capacity dilution and (2) the performance of high- and low-resource languages at scale. Finally, they show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance.
In a recent paper, Artetxe, Labaka, and Agirre [9] show that translation can introduce artifacts that can have a notable impact on existing cross-lingual models. For example, they show that translating the premise and the hypothesis independently from the XNLI dataset can reduce the lexical overlap between them, which current models are highly sensitive to. The authors also show that some previous findings in cross-lingual transfer learning need to be reconsidered in light of this phenomenon. Based on the insights gained, they also improved the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively. These findings suggest that it is important to be aware of the potential for bias when using machine translation and to take steps to mitigate it.
One of the earliest contributions in this domain was the introduction of multilingual word embeddings, which map words from different languages into a shared vector space. Mikolov et al. [10] pioneered this approach with their work on bilingual word embeddings, which facilitated knowledge transfer from resource-rich languages to those with fewer resources. Following this, cross-lingual word embeddings have been extended to multiple languages, as demonstrated by the work on polyglot embeddings by Al-Rfou et al. [11], which leveraged unsupervised learning on large multilingual corpora.
The advent of transformer-based models, such as BERT [12], revolutionized NLP by learning deep contextualized representations. Adapting these models for multilingual purposes, Conneau et al. [3] introduced XLM-Roberta, a robust cross-lingual model trained on 100 languages, showing that a single model could perform well on several cross-lingual benchmarks.
The impact of surface-level patterns in data on model performance has also been examined. Geirhos et al. [13] highlighted that neural networks often exploit such patterns for prediction, which can be problematic when these patterns do not generalize well across languages.

3. Methods

In this section, we investigate the performance of two state-of-the-art multilingual models, XLM-Roberta and mBART, on the challenging cross-lingual natural language inference (XNLI) task [14]. We evaluate these models using the XNLI dataset, which is a large and diverse collection of natural language inference examples in multiple languages. This dataset, built on the MultiNLI corpus, includes 14 additional languages and serves as a valuable resource for assessing the performance of cross-lingual natural language inference (NLI) models. By using XNLI, we aim to determine how well these models handle cross-lingual sentence representations and achieve state-of-the-art results in this domain. The ability to handle data across multiple languages is a crucial requirement for modern natural language processing (NLP) systems, enabling them to cater to diverse global audiences and facilitate cross-lingual communication.
Table 1 shows a subset of the English data from the XNLI dataset. It has three columns: hypothesis, premise, and class label (target), which takes the values 0, 1, and 2. The class labels are evenly distributed in our data, at 33.33 percent per class.
To assess the models’ robustness in handling translated data, we employed a popular translation service, Google Translate, to generate translated versions of the test data. By evaluating the models on both the original and translated test sets, we aimed to gain insights into their ability to maintain consistent performance when dealing with potential translation errors and linguistic divergences.
We used Google Translate because of its advanced techniques that significantly improve translation accuracy, particularly for low-resource languages. These improvements include enhanced model architecture, better noise handling in datasets, and effective multilingual transfer learning through M4 modeling, as well as leveraging monolingual data. These strategies ensure that Google Translate can deliver high-quality translations across a wide range of languages, making it a reliable tool for generating realistic multilingual test data.
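The sketch below illustrates how a translated copy of the test pairs can be produced programmatically. The study itself used Google Translate; the open-source MarianMT checkpoint named here (Helsinki-NLP/opus-mt-en-fr) and the helper function are assumed stand-ins for illustration only, not the pipeline we ran.

```python
# Illustrative sketch: producing a translated copy of XNLI test pairs.
# The study used Google Translate; the MarianMT model below is an assumed stand-in.
from datasets import load_dataset
from transformers import pipeline

test = load_dataset("xnli", "en", split="test")
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

def translate_pair(example):
    # Premise and hypothesis are translated independently, as when building translated test sets.
    example["premise"] = translator(example["premise"])[0]["translation_text"]
    example["hypothesis"] = translator(example["hypothesis"])[0]["translation_text"]
    return example

translated_test = test.map(translate_pair)
```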

3.1. Input Data Processing

  • Tokenization: The hypothesis and premise extracted from the XNLI dataset are tokenized. Tokenization involves breaking down the text into smaller components, usually words or subwords, which the models can process.
  • Concatenation: The tokenized hypothesis and premise are concatenated into a single input sequence. This combined input is then fed into the models for training, as sketched below.
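A minimal sketch of these two steps using the Hugging Face datasets and transformers libraries, which we used in our experiments; the split sizes are illustrative, and the tokenizer handles the concatenation of premise and hypothesis with the appropriate separator tokens.

```python
# Sketch of input processing: load XNLI, then tokenize premise-hypothesis pairs.
# Split sizes are illustrative; max_length mirrors the setting reported in Section 4.
from datasets import load_dataset
from transformers import AutoTokenizer

train = load_dataset("xnli", "en", split="train[:10000]")
test = load_dataset("xnli", "en", split="test")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def encode(batch):
    # The tokenizer concatenates each premise and hypothesis into one input sequence,
    # inserting the model-specific separator tokens between them.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=100)

encoded_train = train.map(encode, batched=True)
encoded_test = test.map(encode, batched=True)
```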

3.2. Fine-Tuning

  • Embeddings: No custom embeddings were utilized. For the mBART model, mBART embeddings were used, which are similar to BART embeddings, while for the XLM-Roberta model, XLM-Roberta embeddings were utilized, which are similar to BERT embeddings. These embeddings convert the tokenized text into numerical vectors that capture semantic meanings.
  • Training: The concatenated inputs (hypothesis + premise), along with their respective labels, are used to fine-tune the models. The goal is for the model to predict the correct label for each input pair.
We fine-tuned the XLM-Roberta and mBART models on the XNLI dataset using standard procedures and evaluated their performance using metrics such as accuracy and F1 score. The following subsections present and discuss the findings from our experiments, highlighting the strengths and limitations of each model in handling cross-lingual data.
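A hedged sketch of this fine-tuning step with the Hugging Face Trainer API, reusing the encoded_train and encoded_test splits from the sketch in Section 3.1. The hyperparameters mirror those reported in Section 4 for XLM-Roberta Large, but the exact training script is not reproduced here; the output directory name and metric choices are illustrative.

```python
# Sketch of fine-tuning XLM-Roberta Large on XNLI with the Hugging Face Trainer.
# Hyperparameters follow Section 4; output_dir and metric choices are illustrative.
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-large", num_labels=3)

args = TrainingArguments(
    output_dir="xnli-xlmr",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    weight_decay=0.5,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    num_train_epochs=1,
)

def compute_metrics(eval_pred):
    # Accuracy and macro F1 over the three XNLI classes.
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(model=model, args=args, train_dataset=encoded_train,
                  eval_dataset=encoded_test, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```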
Figure 1 shows the steps taken to fine-tune our models on the XNLI dataset. First, we split the dataset into hypotheses and premises; then, we concatenated and tokenized them to train the model.
Figure 2 shows the steps taken while testing the models on our data. We first concatenate the hypothesis and premise, just as in the training data. Then, we create two test sets by translating the sentences before testing. After testing, we compare the metrics of the two subsets of data.
The code for our study can be accessed on this Google Colab notebook: https://colab.research.google.com/drive/1pJulIFnfPGFdyugGhfLLAYO8YO7XlwmB?usp=sharing, accessed on 16 May 2024. This includes various sections of the experiment: data processing, training the model, translating datasets, and evaluating. This code was run multiple times on different language pairs to obtain the data that we presented in the study.

4. Results

This section presents the performance evaluation of the XLM-Roberta Large and mBART models across various languages and their translations. The models were assessed on their accuracy and F1 scores, comparing original-language data against translated counterparts. To calculate the metrics, we first evaluated each model on the original-language test set, computing the metrics from the number of correct target predictions, and then repeated the same procedure on the translated-language dataset. The evaluations cover diverse language pairs, highlighting the impact of translation on model performance.
Table 2 details the results for the XLM-Roberta Large model, while Table 3 outlines the outcomes for the mBART model. Both models were trained and evaluated under specific hyperparameters and dataset configurations to ensure consistent and reliable comparisons. The F1 score was calculated according to Equations (1)–(3), using the models’ predicted target classes and the ground-truth labels.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (1)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)$$
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (3)$$
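As a concrete check of Equations (1)–(3), the short example below computes per-class precision, recall, and F1 over the three XNLI labels; the gold labels and predictions shown are invented for illustration only.

```python
# Worked example of Equations (1)-(3): per-class precision, recall, and F1.
# y_true and y_pred are hypothetical and serve only to illustrate the computation.
y_true = [0, 2, 1, 0, 2, 1, 0]   # hypothetical gold labels (0 = entailment, 1 = neutral, 2 = contradiction)
y_pred = [0, 2, 2, 0, 1, 1, 2]   # hypothetical model predictions

def prf_for_class(c):
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))  # false negatives
    recall = tp / (tp + fn) if tp + fn else 0.0                                         # Equation (1)
    precision = tp / (tp + fp) if tp + fp else 0.0                                      # Equation (2)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0   # Equation (3)
    return precision, recall, f1

for c in (0, 1, 2):
    precision, recall, f1 = prf_for_class(c)
    print(f"class {c}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```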
In the experimental settings for the baselines, the dataset consisted of 150,000 examples, with 10,000 examples per language. The maximum sequence length was set to 100, and the training batch size per GPU (Graphics Processing Unit) was 54, while the evaluation batch size was 32. The learning rate was set to 1 × 10⁻⁵, with a weight decay of 0.5 and an Adam epsilon value of 1 × 10⁻⁸. The maximum gradient norm was set to 1.0, and the number of training epochs was set to 1.0 due to the large size of the training dataset.
The XLM-Roberta Large model has different parameters compared to its baseline version. Specifically, the “per_gpu_train_batch_size” and “per_gpu_eval_batch_size” for this model are set to 16.
The training hyperparameters for the mBART model are as follows: the “per_device_train_batch_size” is set to 2, and the “per_device_eval_batch_size” is set to 1. The model uses gradient accumulation with 8 steps, a learning rate of 1 × 10⁻⁵, weight decay of 0.5, and an Adam optimizer with an epsilon value of 1 × 10⁻⁸. The maximum gradient norm is set to 1.0, and the model is trained for 1 epoch due to the large size of the training dataset. The “max_steps” parameter is set to −1, indicating that there is no maximum step limit. The warm-up steps are set to 500, and the model logs training progress every 2000 steps.
No hyperparameter tuning was conducted because the aim of this study is not to achieve the best accuracy on the dataset but to highlight the issues in translation. No baseline models were used in the evaluation, as we required a specialized format of class label prediction, which an untrained base model cannot perform adequately. Only one run per experiment was conducted with the dataset. For the experimental setup, we used a T4 GPU (Graphics Processing Unit) with 15,360 megabytes of video random access memory in a Google Colab environment with Python 3.10.12, and we used the Hugging Face transformers and datasets libraries (datasets 2.20.0 and transformers 4.41.2) for fine-tuning the models and accessing the dataset.
Figure 3 and Figure 4 help depict the accuracy data for all the language pairs in a graphical format for a better understanding.

5. Discussion

According to the results, both the XLM-Roberta and mBART models perform better on some language pairings than others based on accuracy and F1 scores.
Figure 3 and Figure 4 illustrate the impact of translation on model accuracy across different language pairs. The data reveal that accuracy significantly decreases when translating between languages with dissimilar structures, such as Urdu to French or Thai to Hindi. Conversely, the reduction in accuracy is less pronounced when translating between languages with similar linguistic patterns, such as English to German or Arabic to English. Despite these variations, the overall trend shows that translation introduces superficial patterns that generally lead to a decline in accuracy in almost all cases after translation.
For the Urdu–Hindi language pair, the XLM-Roberta model’s original and translated F1 scores are 0.65220 and 0.62188, respectively. For the Urdu–French language pair, performance is lower, with original and translated F1 scores of 0.65220 and 0.59863, respectively.
mBART performs well on the English–French and English–Spanish language pairs, with strong accuracy and F1 scores on both the original and translated sets. Other language combinations, such as Turkish–Urdu and Hindi–Urdu, score lower.
The asymmetric effects of fine-tuning on different language pairs have also been observed in recent work. Namdarzadeh et al. [15] evaluated the performance of the mBART50 multilingual model on a Farsi dataset of dislocations. They found that fine-tuning mBART50 using French–Farsi aligned data dramatically improved the grammatical well-formedness of the French translations, even though some semantic issues remained. However, when they replicated the experiment with Farsi–English fine-tuning, the translations to English were sometimes worse than the baseline mBART50 model.
This suggests that the success of fine-tuning multilingual models can be heavily dependent on the specific language pair and the quality of the fine-tuning data available. While adding even a small amount of aligned data in one language direction may be sufficient to improve morpho-syntactic performance, the same may not hold true for the other translation direction. Maintaining a balance between improving grammatical well-formedness and preserving semantic coherence appears to be a key challenge when fine-tuning these models for cross-lingual tasks.
Due to the dearth of information in languages other than English, multilingual contextualized word embeddings have been created to simulate universal language representations. However, the translation process itself might result in the creation of novel, surface-level patterns that have an impact on model performance. Recent research by Samson Tan and Shafiq Joty [16] provides evidence that translation can act as an adversary for NLP models by introducing perturbations that simulate the effect of code-mixing in multilingual communities. The authors demonstrate the effectiveness of their translation-based attack methods, BUMBLEBEE and POLYGLOSS, on state-of-the-art cross-lingual NLI and QA models, proving the effectiveness of translation in creating adversaries.
It is possible that accuracy decreases after translation in some language pairs for several reasons. Some of these reasons include:
The differences in the grammatical, syntactic, and vocabulary structures of different languages can impact the accuracy of translations. Certain language pairs may have more similarities in these linguistic features than others. This similarity can facilitate the transfer of knowledge between languages, leading to improved accuracy and F1 scores. However, for language pairs that are significantly dissimilar, the model may face greater challenges and require more training time and data to achieve optimal performance. Previous research has demonstrated that machine translation struggles when dealing with morphologically rich languages, particularly those with heavy inflection. The abundance of inflected word forms can result in data sparsity, inaccurate estimations, and difficulty accurately translating pronouns and handling negation expressions [17].
Beyond converting words from one language to another, translation requires faithfully reproducing the subtleties, structure, and meaning of the original text, which can be difficult. When surface-level linguistic patterns are altered or lost during translation, the accuracy of the translation can be affected.
In Japanese, for example, the verb is usually placed at the end of the sentence, whereas in English it typically precedes the object. When translating from English to Japanese or vice versa, the sentence may need to be restructured significantly in order to maintain its meaning.
Articles can also be problematic: non-native English speakers often have difficulty using “a” and “the”, and article usage does not map cleanly between languages. As a result, a literal translation might not accurately convey the intended meaning in such circumstances.

6. Utility of Work

The purpose of this research is to enhance the effectiveness and efficiency of natural language processing (NLP) in multilingual contexts. Transfer learning and cross-lingual contextualized embeddings can be used to lessen the requirement for expensive and time-consuming acquisition of vast volumes of labeled data in each language. We can improve performance in a variety of tasks, including machine translation, sentiment analysis, text categorization, and information retrieval, by transferring knowledge from one language to another.
Expanding on the significance of our research involving mBART and XLM-Roberta, it is imperative to acknowledge their broader implications beyond aiding less widely spoken languages. Both mBART and XLM-Roberta share foundational attributes with prominent large language models (LLMs), rendering our findings pertinent to the broader landscape of language understanding.
For instance, both mBART and XLM-Roberta leverage multi-headed attention mechanisms, a fundamental component prevalent in several prominent LLMs, such as BERT and GPT. Consequently, insights garnered from our study on mBART and XLM-Roberta are inherently transferrable to other LLMs employing analogous attention mechanisms. This underscores the potential for our research to contribute meaningfully to the collective understanding of language modeling techniques.
Moreover, these models employ word embeddings to encapsulate semantic and contextual information, enhancing their ability to comprehend textual data. While the specifics of these embeddings may vary between models, the underlying principles remain consistent across diverse LLMs. Consequently, our investigation into the embeddings utilized by mBART and XLM-Roberta provides valuable insights applicable to enhancing the performance of other LLMs, fostering advancements in language understanding capabilities. Our research highlights the potential for these embeddings to inherit biases or limitations introduced through translation, impacting the generalizability of LLMs. This knowledge can inform the future development of LLMs, leading to more robust and nuanced models.
Notably, despite the emergence of newer models, mBART and XLM-Roberta remain extensively used in industry applications such as summarization and next-word prediction tasks. This continued utilization stems from the high computational costs associated with employing LLMs, making mBART and XLM-Roberta attractive choices due to their balance between efficiency and performance. By acknowledging the practical significance of these models in industry settings, our research underscores their enduring relevance and applicability in addressing real-world language processing challenges.
The value of this research lies in its potential to enhance cross-language communication and understanding of large language models. This is crucial as large language models (LLMs) are increasingly shaping how people access information. Recent work by Jin et al. [18] has shown significant disparities in the accuracy, consistency, and reliability of LLM responses across languages like English, Spanish, Chinese, and Hindi, which proves that there is a need for further research in this field.

7. Limitations of Study

Our study highlights that translation can introduce undesirable patterns in the data. While translation facilitates the creation of multilingual datasets, it can also result in superficial artifacts that affect model generalization. These artifacts may cause models to rely on surface-level patterns rather than deeper semantic understanding, impacting their performance on non-translated, naturally occurring data. However, there were certain limitations to our study.
We attempted to fine-tune the Gemma 2B model; however, the results were suboptimal compared to mBART and XLM-Roberta. The Gemma 2 billion model required more fine-tuning than just a single epoch to yield competitive results, leading us to exclude it from our final evaluation.
Additionally, due to resource constraints, we could only perform a single run of our experiments on the training dataset, which consisted of 150,000 sentences, and the testing dataset of 5010 evaluation sentences from XNLI. This approach restricted our ability to perform thorough hyperparameter tuning and validation, potentially affecting the generalizability of our findings.
Maintaining linguistic patterns during machine translation, especially for low-resource languages, can be enhanced through advanced techniques such as hybrid model architectures and back-translation. For instance, Google Translate employs a combination of transformer encoders and RNN decoders to capture complex dependencies in the source text, improving translation quality. Additionally, back-translation uses monolingual data to generate synthetic parallel data, aiding in fluency and contextual accuracy for low-resource languages. Complementing these methods, we propose incorporating domain-specific fine-tuning, where models are pre-trained on large multilingual corpora and then fine-tuned on domain-specific data to better preserve linguistic nuances. This approach, combined with multilingual transfer learning through M4 modeling and sophisticated data filtering, can significantly enhance the maintenance of linguistic patterns in machine translation [19,20].
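As an illustration of the back-translation idea mentioned above, the sketch below round-trips a monolingual English sentence through a pivot language to generate a synthetic paraphrase. The MarianMT checkpoints named here are illustrative choices, not the production systems (such as Google Translate) discussed in this section.

```python
# Hedged sketch of back-translation for data augmentation: translate monolingual text
# into a pivot language and back to create synthetic parallel or paraphrase data.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence: str) -> str:
    pivot = en_to_fr(sentence)[0]["translation_text"]      # English -> French
    return fr_to_en(pivot)[0]["translation_text"]          # French -> English (synthetic paraphrase)

print(back_translate("She opened the door."))
```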

8. Future Work

Building on the findings of this study, future research can address several areas to advance the understanding and performance of multilingual models in cross-lingual natural language inference (XNLI) tasks:
  • Comprehensive Evaluation on Large LLMs: Future work should include evaluations on more extensive and popular large language models such as Large Language Model Meta AI (LLaMA) and Generative Pretrained Transformer (GPT). Given adequate computational resources, these models could be fine-tuned multiple times to ensure a robust assessment of their capabilities and performance.
  • Extended Fine-Tuning: The Gemma 2B model’s performance indicated a need for more extensive fine-tuning. Future studies could explore longer fine-tuning periods and multiple epochs to determine the model’s potential once fully trained.
  • Multiple Experimental Runs: Conducting multiple runs of experiments could provide a deeper understanding of model performance across different conditions and reduce the potential impact of any single run’s anomalies.
  • Expanded Dataset: Increasing the size and diversity of the dataset beyond the current 150,000 sentences could provide a more comprehensive evaluation of the models’ capabilities. Including more languages and diverse linguistic constructs would further test the generalizability of the models.

9. Conclusions

Our analysis of the XLM-Roberta and mBART models showed that cross-lingual transfer is effective in some language pairs but not in others. While the accuracy and F1 scores remained consistent for some language pairs, for others, such as French–Urdu and French–Thai, the accuracy and F1 scores decreased significantly after translation. This suggests that language-pair compatibility plays a crucial role in cross-lingual transfer learning.
Additionally, our analysis highlights the importance of maintaining the content and context of the original language during translation. Translation can change the superficial patterns of a language, which can lead to a loss of information and meaning in the translated text.
The similarity of the languages being translated affects how well sentences are translated. Translation between dissimilar languages can result in subpar output, which can harm the effectiveness of NLP models. This is because errors and inconsistencies can be introduced when translating between languages with different linguistic structures and vocabularies, making it challenging for models to develop reliable language representations. Consequently, it is crucial to carefully choose and assess the languages that will be translated.

Author Contributions

Methodology, V.M., T.D. and S.A.; Software, V.M. and T.D.; Formal Analysis, V.M. and T.D.; Visualization, V.M. and T.D.; Writing—Original Draft, V.M. and T.D.; Conceptualization, S.A.; Validation, S.A.; Writing—Review & Editing, S.A.; Supervision, S.A. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Molde University College, Norway.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at this link: https://huggingface.co/datasets/facebook/xnli.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Artetxe, M.; Schwenk, H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 2019, 7, 597–610. [Google Scholar] [CrossRef]
  2. Lample, G.; Conneau, A. Cross-lingual language model pretraining. arXiv 2019, arXiv:1901.07291. [Google Scholar]
  3. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
  4. Conneau, A.; Lample, G.; Ranzato, M.A.; Denoyer, L.; Jégou, H. Word Translation without Parallel Data. arXiv 2018, arXiv:1710.04087. [Google Scholar]
  5. Artetxe, M.; Ruder, S.; Yogatama, D. On the Cross-Lingual Transferability of Monolingual Representations. arXiv 2020, arXiv:1910.11856. [Google Scholar]
  6. Fadaee, M.; Bisazza, A.; Monz, C. Data Augmentation for Low-Resource Neural Machine Translation. arXiv 2017, arXiv:1705.00440. [Google Scholar]
  7. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
  8. Efimov, P.; Boytsov, L.; Arslanova, E.; Braslavski, P. The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer. In Advances in Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2022; pp. 51–67. [Google Scholar] [CrossRef]
  9. Artetxe, M.; Labaka, G.; Agirre, E. Translation Artifacts in Cross-Lingual Transfer Learning. arXiv 2020, arXiv:2004.04721. [Google Scholar]
  10. Mikolov, T.; Le, Q.V.; Sutskever, I. Exploiting similarities among languages for machine translation. arXiv 2013, arXiv:1309.4168. [Google Scholar]
  11. Al-Rfou, R.; Perozzi, B.; Skiena, S. Polyglot: Distributed Word Representations for Multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, 8–9 August 2013; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 183–192. [Google Scholar]
  12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  13. Geirhos, R.; Jacobsen, J.H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2020, 2, 665–673. [Google Scholar] [CrossRef]
  14. Conneau, A.; Lample, G.; Rinott, R.; Williams, A.; Bowman, S.R.; Schwenk, H.; Stoyanov, V. XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2475–2485. [Google Scholar]
  15. Namdarzadeh, B.; Mohseni, S.; Zhu, L.; Wisniewski, G.; Ballier, N. Fine-tuning MBART-50 with French and Farsi data to improve the translation of Farsi dislocations into English and French. In Proceedings of the Machine Translation Summit 2023, Macau, China, 4–8 September 2023. [Google Scholar]
  16. Tan, S.; Joty, S. Code-Mixing on Sesame Street: Dawn of the Adversarial Polyglots. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3596–3616. [Google Scholar]
  17. Mirjam, M.; Gregor, D. Machine Translation and the Evaluation of Its Quality. In Recent Trends in Computational Intelligence; IntechOpen: London, UK, 2019; Volume 143. [Google Scholar] [CrossRef]
  18. Jin, Y.; Chandra, M.; Verma, G.; Hu, Y.; De Choudhury, M.; Kumar, S. Better to ask in English: Cross-lingual evaluation of large language models for healthcare queries. arXiv 2023, arXiv:2310.13132. [Google Scholar]
  19. Caswell, I.; Liang, B. Recent Advances in Google Translate. Google Research Blog. 2020. Available online: https://research.google/blog/recent-advances-in-google-translate/ (accessed on 16 May 2024).
  20. Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351. [Google Scholar] [CrossRef]
Figure 1. Fine-Tuning Models on XNLI Dataset.
Figure 2. Evaluating Models on XNLI Data.
Figure 3. Accuracy plot for the mBART model.
Figure 4. Accuracy plot for the XLM-Roberta model.
Table 1. Samples from the XNLI dataset.
Hypothesis | Premise | Label
(Read for Slate’s take on Jackson’s findings.) | Slate had an opinion on Jackson’s findings. | 0 (entailment)
Gays and lesbians. | Heterosexuals. | 2 (contradiction)
But a few Christian mosaics survive above the apse: the Virgin with the infant Jesus, with the Archangel Gabriel to the right (his companion Michael, to the left, has vanished save for a few feathers from his wings). | Most of the Christian mosaics were destroyed by Muslims. | 1 (neutral)
Issues in data synthesis. | Problems in data synthesis. | 0 (entailment)
Table 2. Results of the Evaluation of the XLM-Roberta Model.
Language | Translated Language | Original Accuracy | Original F1 | Translated F1 | Translated Accuracy
Urdu | Hindi | 65.26% | 0.65220 | 0.62188 | 62.53%
Turkish | Hindi | 68.86% | 0.68756 | 0.64380 | 64.57%
Thai | Hindi | 67.34% | 0.67321 | 0.61364 | 61.65%
Urdu | French | 65.26% | 0.65220 | 0.59863 | 60.67%
Turkish | French | 68.86% | 0.68756 | 0.65220 | 65.26%
Thai | French | 67.34% | 0.67321 | 0.62240 | 62.77%
Hindi | French | 65.72% | 0.65608 | 0.622511 | 62.63%
Swahili | French | 62.29% | 0.61920 | 0.620873 | 62.71%
Arabic | English | 67.50% | 0.67415 | 0.66890 | 67.06%
Bulgarian | English | 71.85% | 0.67415 | 0.674153 | 67.50%
German | English | 71.85% | 0.71869 | 0.70858 | 70.95%
Table 3. Results of the Evaluation of the Multilingual BART Model.
Language | Translated Language | Original Accuracy | Original F1 | Translated F1 | Translated Accuracy
English | Arabic | 83.49% | 0.83492 | 0.73392 | 73.47%
English | French | 83.49% | 0.83492 | 0.78212 | 78.22%
English | Spanish | 83.49% | 0.83492 | 0.78511 | 78.56%
French | Urdu | 78.96% | 0.78944 | 0.52871 | 53.11%
French | Thai | 78.96% | 0.78944 | 0.47269 | 47.52%
Hindi | Urdu | 70.45% | 0.70368 | 0.52529 | 52.61%
Hindi | English | 70.45% | 0.70368 | 0.68986 | 69.38%
Swahili | Arabic | 49.74% | 0.49189 | 0.59687 | 60.29%
Turkish | Urdu | 74.45% | 0.74374 | 0.53902 | 54.07%
Turkish | Hindi | 74.45% | 0.74374 | 0.70586 | 70.71%
