Article
Peer-Review Record

Evaluating Neural Networks’ Ability to Generalize against Adversarial Attacks in Cross-Lingual Settings

Appl. Sci. 2024, 14(13), 5440; https://doi.org/10.3390/app14135440
by Vidhu Mathur 1, Tanvi Dadu 2 and Swati Aggarwal 3,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 17 May 2024 / Revised: 14 June 2024 / Accepted: 17 June 2024 / Published: 23 June 2024
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper evaluated two multilingual neural network models, XLM-Roberta and mBART, on the Cross-lingual Natural Language Inference (XNLI) dataset to study whether linguistic patterns are preserved in machine-translated texts during natural language processing. Because of multiple major and minor issues in this paper (listed below), I cannot agree to its publication unless major revisions are made to both the research methodology and the experiments.

A. The first major concern is the lack of contributions in this research paper. The authors did not clearly explain or list the contributions in either the abstract or Section 1 (Introduction). The findings are not strong or substantive enough for a research paper. Here is the main finding I found in this paper: “Our research highlights the potential for these embeddings to inherit biases or limitations introduced through translation, impacting the generalizability of LLMs”. The reported performance (accuracy and F1 in the 60–70% range) is not particularly impressive.

To improve this research, here are some suggestions:

- Propose a method to highlight or maintain the linguistic patterns through machine translation during NLP tasks.

- Apply the state-of-the-art NLP models – the Transformer-based Large Language Models, such as GPT, LLaMA, etc.

B. The second major concern is the performance metrics used in the experiments – accuracy and F1 (section 4). Here are some questions:

- After tokenizing the hypothesis and premise extracted from the XNLI dataset, is it correct that this raw test data was fed into the fine-tuning of XLM-Roberta and mBART models? Were any word embedding models used? The proposed methodology (section 3) should be rewritten to make clear the processing steps in more detail: input data processing, fine-tuning, and performance measurements. It would be helpful if the source code and/or dataset links were attached to this manuscript.

- What are the targets and ground truth labels of the fine-tuned models here?

- How was the accuracy/F1 calculated between the original language texts compared to translated language texts? It would make more sense if other performance metrics were used to measure text similarity or text generation quality.

C. In section 4, the authors mentioned different hyperparameters used in the experiments. However, more hyperparameter tuning (more parameter values) should be conducted to find the optimal ones.

- Was any hyperparameter tuning conducted?

- Any baseline models for performance comparison that could be used?

- How many runs for each experiment? I recommend providing the mean and standard deviation to confirm statistical significance.

- Please explain more about the experiment setup, such as environment settings, programming language, libraries or frameworks, etc.

D. Some acronyms introduced in this paper lack the full spelling when they were introduced the first time, such as BART (line 130), WMT (line 136), XLM-R (line 141), QA (line 144), NER (line 155), and GPU (line 246).

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Some places in this paper should be improved in written English or formats. Here are some common problems:

- Missing articles, such as a/the, or commas between sentence phrases (before and, which, including, etc.)

- Word choices and verb tenses.

Please check the attached PDF file with the detailed highlighted comments (some of which are not mentioned above).

Author Response

Response to Review Questions

 

All revisions made to the manuscript in accordance with the reviewers' suggestions have been highlighted in the text.

 

Question 1.



To improve this research, here are some suggestions:

- Propose a method to highlight or maintain the linguistic patterns through machine translation during NLP tasks.

- Apply the state-of-the-art NLP models – the Transformer-based Large Language Models, such as GPT, LLaMA, etc.


Response


We have proposed a method to maintain linguistic patterns in the final paragraph of the Limitations section. Specifically, we discussed employing hybrid model architectures and back-translation techniques. Hybrid models, like those used by Google Translate, combine transformer encoders and RNN decoders to capture complex dependencies, thereby improving translation quality. Back-translation leverages monolingual data to create synthetic parallel data, enhancing fluency and contextual accuracy, particularly for low-resource languages. Furthermore, we suggest domain-specific fine-tuning and multilingual transfer learning to preserve linguistic nuances and improve the quality of machine translation.
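As a concrete illustration of the back-translation idea mentioned above, a minimal Python sketch is given below. It assumes the Huggingface transformers library and the public Helsinki-NLP OPUS-MT checkpoints, which are not necessarily the systems discussed in the paper:

from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    # Translate a batch of sentences with a pretrained OPUS-MT model.
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch)
    return tok.batch_decode(out, skip_special_tokens=True)

def back_translate(sentences):
    # English -> French -> English: the round trip yields synthetic paraphrases
    # that can serve as additional parallel training data.
    pivot = translate(sentences, "Helsinki-NLP/opus-mt-en-fr")
    return translate(pivot, "Helsinki-NLP/opus-mt-fr-en")

print(back_translate(["The cat sat on the mat."]))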

 

We understand that there are some limitations in our research, and applying state-of-the-art models is one of them; this has been discussed thoroughly in the Limitations section.

 

Question 2.

 

The second major concern is the performance metrics used in the experiments – accuracy and F1 (Section 4). Here are some questions:

- After tokenizing the hypothesis and premise extracted from the XNLI dataset, is it correct that this raw test data was fed into the fine-tuning of the XLM-Roberta and mBART models? Were any word embedding models used? The proposed methodology (Section 3) should be rewritten to make clear the processing steps in more detail: input data processing, fine-tuning, and performance measurements. It would be helpful if the source code and/or dataset links were attached to this manuscript.

- What are the targets and ground truth labels of the fine-tuned models here?

- How was the accuracy/F1 calculated between the original language texts compared to translated language texts? It would make more sense if other performance metrics were used to measure text similarity or text generation quality.


Response

 

We have revised the final section of the methods (sections 3.1 and 3.2) to provide a detailed explanation of the preprocessing and fine-tuning procedures. The XNLI dataset, which is used in our study, is now clearly referenced in the methods section, with the corresponding link: https://huggingface.co/datasets/facebook/xnli

 

A link to the source code is now given in the manuscript. The code for our study can be accessed in this Google Colab notebook:

https://colab.research.google.com/drive/1pJulIFnfPGFdyugGhfLLAYO8YO7XlwmB?usp=sharing. It includes the various stages of the experiment: data processing, model training, dataset translation, and evaluation. This code was run multiple times on different language pairs to obtain the data presented in the study.

 

Additionally, we have elaborated on the targets and ground truth labels by including sample data from the dataset. The process for calculating accuracy and F1 scores has also been detailed in the results section for clarity.
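As a brief illustration of those targets, the XNLI examples and their gold labels can be inspected with the Huggingface datasets library roughly as follows (a sketch; the exact loading code in our notebook may differ):

from datasets import load_dataset

# Each XNLI example pairs a premise with a hypothesis and a 3-way gold label:
# 0 = entailment, 1 = neutral, 2 = contradiction.
xnli_fr = load_dataset("xnli", "fr", split="test")
sample = xnli_fr[0]
print(sample["premise"])
print(sample["hypothesis"])
print(sample["label"])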

 

 

Question 3.

 

In Section 4, the authors mentioned different hyperparameters used in the experiments. However, more hyperparameter tuning (more parameter values) should be conducted to find the optimal ones.

- Was any hyperparameter tuning conducted?

- Any baseline models for performance comparison that could be used?

- How many runs for each experiment? I recommend providing the mean and standard deviation to confirm statistical significance.

- Please explain more about the experiment setup, such as environment settings, programming language, libraries or frameworks, etc.

 

Response

 

We did not perform hyperparameter tuning, as our study aimed to highlight issues in translation rather than to achieve optimal accuracy. No baseline models were used, since our specific classification task could not be accomplished by non-fine-tuned models. The experiments were conducted in a Google Colab environment with a T4 GPU, using the Huggingface libraries for training, evaluation, and datasets. We understand that there are some limitations in our study, and being able to conduct only a single run per experiment is among them; this has been added to the Limitations section of the paper.
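For reference, a rough sketch of this fine-tuning setup with the Huggingface Trainer is shown below; the hyperparameter values here are illustrative assumptions, not the exact settings reported in Section 4, and the Colab notebook contains the authoritative code:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Fine-tune XLM-RoBERTa on the English XNLI split for 3-way NLI classification.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=3)

xnli = load_dataset("xnli", "en")
def encode(batch):
    # Premise/hypothesis pairs are tokenized jointly, as required for NLI.
    return tok(batch["premise"], batch["hypothesis"], truncation=True,
               padding="max_length", max_length=128)
xnli = xnli.map(encode, batched=True)

args = TrainingArguments(output_dir="xlmr-xnli", per_device_train_batch_size=16,
                         num_train_epochs=1)
Trainer(model=model, args=args,
        train_dataset=xnli["train"], eval_dataset=xnli["validation"]).train()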

 

Question 4.


Some acronyms introduced in this paper lack the full spelling when they were introduced the first time, such as BART (line 130), WMT (line 136), XLM-R (line 141), QA (line 144), NER (line 155), and GPU (line 246).

 


Comments on the Quality of English Language

Some places in this paper should be improved in written English or formats. Here are some common problems:

- Missing articles, such as a/the, or commas between sentence phrases (before and, which, including, etc.)

- Word choices and verb tenses.

Please check the attached PDF file with the detailed highlighted comments (some of which are not mentioned above).

 

Response

All full forms have been added in the text, and the grammar has been corrected in accordance with the provided PDF.

 

 

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript addresses a critical issue in NLP: the generalization of neural networks against adversarial attacks in cross-lingual settings. The evaluation of state-of-the-art models (XLM-Roberta and mBART) on the Cross-lingual Natural Language Inference task with both original and machine-translated datasets is novel and relevant.

 

I cannot recommend publication of this manuscript until the authors make the following changes:

 

1. Provide the reason for the choice of translation tools:

- Provide a detailed explanation of why Google Translate and DeepL were chosen as translation tools. Were other translation tools (e.g., Microsoft Translator, Yandex.Translate, etc.) compared?

- Is there literature support or experimental validation for the capabilities of these tools in handling specific language pairs (such as low-resource languages)?

2. The name of Section 3 could be changed to a more traditional choice, such as "3. Methods" or "3. Models".

3. The resolution of Figures 1 and 2 is not high enough.

Author Response

 

 

Response to Review Questions

 

All revisions made to the manuscript in accordance with the reviewers' suggestions have been highlighted in the text.

 

 

Question 1.

 

Provide the reason for the choice of translation tools:

- Provide a detailed explanation of why Google Translate and DeepL were chosen as translation tools. Were other translation tools (e.g., Microsoft Translator, Yandex.Translate, etc.) compared?

- Is there literature support or experimental validation for the capabilities of these tools in handling specific language pairs (such as low-resource languages)?

 

Response

Google Translate was chosen because of its extensive support for the languages we were comparing and because it is the most widely used translation service. Google also addresses low-resource languages, as explained in this official blog post from Google Research: https://research.google/blog/recent-advances-in-google-translate/. This has been added to the paper, and the last paragraph of the Limitations section also references the advancements made by Google when proposing a method to minimise the problems in translation.

 

Question 2.

 

The name of Section 3 can be changed to a more traditional choice, like "Methods".

 

Response

 

The name of the section has been changed to "Methods".

 

Question 3.

 

The resolution of Figures 1 and 2 is not high enough.


Response

 

New high-resolution figures have been added.

 

 

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The goal of the paper was to examine whether unsupervised translation executed by multilingual large language models would generate acceptable language datasets that can be used for low-resource languages. This is a very lofty goal, but I believe the authors need to unpack their process a bit more before this work is published in the Applied Sciences journal. I have provided some more specific questions with the goal of helping the authors see the areas that need more explaining/unpacking.

Major: 
On what inference tasks were the models tested? Please spell out what kind of language inference task was given to the models (p. 5, line 207). How was the XNLI dataset generated? Please define hypothesis and premise: how were the sentences broken into these two types?

Please explain the F1 score: how it is computed and what it operationalizes. In the discussion you duly note linguistic differences between the languages; which planes (phonological, syntactic, morphological, semantic) did the accuracy scores probe?


Please explain what exactly in the application of the transformer machine-learning algorithm to large language models revolutionized multilingual NLP. You mention it in several places but without spelling out the exact innovative advantage. Are you referring to considering forward and backward n-word embeddings, recurrent autoencoding, or something else?

 

How are mBART and mBERT different? Is this a misspelling, or are you referring to two different models?



Minor: 

Work Done – should this be called the Methods section?

Results:
It would be good to visualize the original and translated accuracies to see patterns – maybe group the languages on the X axis by whether the translated accuracy improved (higher than the original accuracy), dropped (lower accuracy), or did not change. Then you could perhaps make some inference regarding whether language similarity/differences play any role, as right now it is very difficult to grasp the patterns from the numbers in the tables.


Please spell out the acronyms the first time you use them: e.g., NLI (p. 4, line 145); XNLI (p. 4, line 154); MLQA and NER (p. 4, line 155).

Comments for author File: Comments.docx

Author Response

Response to Review Questions

 

All revisions made to the manuscript in accordance with the reviewers' suggestions have been highlighted in the text.

 

Question 1.

 

On what inference tasks were the models tested? Please spell out what kind of language inference task was given to the models (p. 5, line 207). How was the XNLI dataset generated? Please define hypothesis and premise: how were the sentences broken into these two types?

Please explain the F1 score: how it is computed and what it operationalizes. In the discussion you duly note linguistic differences between the languages; which planes (phonological, syntactic, morphological, semantic) did the accuracy scores probe?

 

Response

 

The models were fine-tuned and evaluated on the Cross-lingual Natural Language Inference task from the XNLI dataset. Details about the dataset have been added to the Methods section, along with some samples to further clarify how this dataset was used. The calculation of metrics has been explained in the Results section.
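For completeness, accuracy and F1 over the three NLI classes can be computed along the following lines (a sketch with toy labels; the averaging mode is an assumption, and the Results section states the exact procedure):

from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions for the 3-way task (0 = entailment,
# 1 = neutral, 2 = contradiction); illustrative values only.
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 1, 1, 2, 0]
print("accuracy:", accuracy_score(y_true, y_pred))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))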

 

Question 2.

 

Please explain what exactly in the application of the transformer machine-learning algorithm to large language models revolutionized multilingual NLP. You mention it in several places but without spelling out the exact innovative advantage. Are you referring to considering forward and backward n-word embeddings, recurrent autoencoding, or something else?

 

Response

 

An in-depth paragraph on how transformers work, and on the algorithm (self-attention) that makes them superior in the field of NLP, has been added to the Recent Works section.
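For reference, the scaled dot-product self-attention that this paragraph describes is conventionally written (following Vaswani et al., 2017) as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

where Q, K, and V are the query, key, and value projections of the token representations and d_k is the key dimension.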

 

Question 3.

 

How are mBART and mBERT different? Is this a misspelling, or are you referring to two different models?

 

Response

 

We are referring to two different models. We worked with mBART, but mBERT was also mentioned because it demonstrates the effectiveness of transformer models on natural language tasks; both are transformer-based. mBART is an encoder-decoder model, whereas mBERT is an encoder-only model. Since both rely on self-attention-based transformers, mBERT was mentioned only to supplement the point that transformers give strong results in the NLP domain.
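The architectural difference can be seen directly when the two models are loaded with the Huggingface transformers library (a sketch; the checkpoint names are common public ones and only illustrative):

from transformers import AutoModel, AutoModelForSeq2SeqLM

# mBERT: encoder-only stack of self-attention layers.
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")
# mBART: full encoder-decoder (sequence-to-sequence) transformer.
mbart = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-cc25")

print(type(mbert).__name__)  # BertModel
print(type(mbart).__name__)  # MBartForConditionalGeneration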

 

Question 4.

 

Minor:

Work Done – should this be called the Methods section?

Results:

It would be good to visualize the original and translated accuracies to see patterns – maybe group the languages on the X axis by whether the translated accuracy improved (higher than the original accuracy), dropped (lower accuracy), or did not change. Then you could perhaps make some inference regarding whether language similarity/differences play any role, as right now it is very difficult to grasp the patterns from the numbers in the tables.

 

 

Response

 

The name of the section has been changed to "Methods". Two graphs visualising the accuracy of both models have been added to the text, and an inference has been made from them in the Discussion section.

 

Question 5.

 

Please spell out the acronyms the first time you use them: e.g., NLI (p. 4, line 145); XNLI (p. 4, line 154); MLQA and NER (p. 4, line 155).

 

 

Response

All the suggested acronyms have been spelled out in full.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Compared to the previous version, here are the improvements:

- The introduction was revised to clarify the research objectives and the contributions.

- Section 3 (Methods) was heavily revised to clarify the research methodology, including the input data, feature processing, the XLM-RoBERTa and mBART models, and fine-tuning methods.

- Sections 4 and 5 (Results) were also improved by describing the performance evaluation metrics, hyperparameter tuning, and the discussion of results.

- The authors also discussed the limitations and future work in detail.

 

As I mentioned before and stated in the limitations section, this paper may not draw much interest from the readers. I can accept this version in its current form and let the editor decide on the publication.
