Article
Peer-Review Record

Mitigating Large Language Model Bias: Automated Dataset Augmentation and Prejudice Quantification

Computers 2024, 13(6), 141; https://doi.org/10.3390/computers13060141
by Devam Mondal *,† and Carlo Lipizzi *,†
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Reviewer 5:
Submission received: 30 April 2024 / Revised: 28 May 2024 / Accepted: 30 May 2024 / Published: 4 June 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors discuss bias in LLMs and propose a method to measure this bias. Their proposal is based on an augmentation approach they have devised in which they add statements in the knowledge base to counter biased entries.

The validity of the approach is questionable, for two reasons. On the one hand, the entries that are biased are identified based on some of their characteristics. It is not assured, at least in the eyes of this reviewer, that these statements are indeed biased and need countering. The example about Indians and maths is of course a case of a biased statement, but many similarly structured statements could be made that would be factual rather than biased. On the other hand, it is not clear that one can augment the knowledge base with the addition of potentially - or almost certainly - invalid entries without consequences.

This is not to say that the approach is definitely wrong. But certainly a stronger explanation is needed. The current short presentation is not sufficient to establish the meaningfulness of the proposal.

With respect to the presented results, there is no clear takeaway. Why/how are the calculations presented in Tables 1 and 2 indicative of a meaningful and valid enhancement? This section needs to be not only better explained but also considerably expanded, to establish the experimental validation of the proposal. In any case, it stands out that there is no mention of a ground truth, validation by human experts, or comparison against some already existing method in the literature, which leaves little upon which to base any claim.

Author Response

To whom it may concern,

Thank you very much for your feedback and comments. With regards to the second paragraph, we do not agree that the identified statements are "factual" rather than "biased," or that the augmented entries may be "invalid." The purpose of our dataset augmentation algorithm is to reduce the LLM's inclination toward stereotypical thought processes, ones that are currently perceived by society as "factual." Establishing inclusivity and reducing such thought processes requires augmentation of the knowledge base. Because all of the augmentations introduce a non-intrinsic element into the data/text, we do not consider the augmented entries "invalid." Pertaining to the given example, to reduce reliance on stereotypical thinking, we fine-tune and establish that members of any race can be good at math. This is not an invalid entry; this is true.

Regarding the “no clear takeaway” note, the tables demonstrate a reduction in the defined metrics when the fine-tuning process is completed. We make our claims by comparing changes in the aforementioned metrics before and after the fine-tuning process. We seek to emphasize a change caused by our process. Because the topic of bias is inherently a subjective matter, it is hard to establish a ground truth through human experts. 

Once again, thank you very much for your feedback.

Reviewer 2 Report

Comments and Suggestions for Authors

I congratulate the authors on their interesting paper. While reading, some points remained unclear to me.

1) Line 147: The authors write: "Another entry (Entry 2), which is the summarization of Entry 1, is created." What is the meaning of summarization here, how is it done?

2) Line 161 ff., construction of the mb-Index. There is a reference to the ICAT score of Nadeem et al., which combines a stereotype score with a language modeling score.
It is unclear to me how their new score is exactly established. I encourage the authors to give precise definitions of all used components of the score.

3) Line 174, ff: The authors write: "From this definition, we assert that the stereotype score is the proportion of continuations classified as stereotypical
for all continuations not marked as nonsensical:" but then, in the denominator of the ratio, I_C (the number of nonsensical continuations?) is included? Is "I" an embedding or a number?

4) Line 193, ff:The authors write: "More specifically, we first convert each entry in the corpus into an embedding vector.
We then segment the corpus into semantically homogeneous groups using k-means clustering."

What is the embedding size? Is it sensible to use k-means clustering in such a high-dimensional space, and for this purpose?

5) What is meant with "All LLMs were fine-tuned in a causal language modeling process" in line 226?

6) Section 4 in general: exact values are given for all indices (Tables 1 and 2). Have the experiments been repeated? Can the authors perhaps provide standard errors of their results?

Finally, a remark: the data set(s) used is (are) not provided (e.g., no link to a repository is given), so the results are not reproducible.

 

Author Response

To whom it may concern,

Thank you very much for your feedback and comments. With respect to your first point, the summarization was performed with spaCy and a BART model, more specifically through the abst_sumn function. We will make this explicit in our revised version.
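For illustration, a minimal sketch of how Entry 2 could be produced as an abstractive summary of Entry 1 is shown below, using the Hugging Face transformers summarization pipeline with a BART checkpoint; this stands in for the abst_sumn helper mentioned above (whose exact interface is not shown), and the example entry text is a placeholder.

```python
# Sketch of abstractive summarization with a BART model (not the authors' exact code).
from transformers import pipeline

# facebook/bart-large-cnn is an assumed stand-in checkpoint for "a BART model".
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

entry_1 = (
    "A long dataset entry that the augmentation procedure will pair with a debiased "
    "counterpart; this placeholder text stands in for real corpus content."
)

# Entry 2 is created as the summarization of Entry 1.
entry_2 = summarizer(entry_1, max_length=40, min_length=10, do_sample=False)[0]["summary_text"]
print(entry_2)
```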

With respect to your second point, we are unsure which definition is missing. Reproduced below is the portion of the paper that lays out the entire definition of the new metric:

“The stereotype score, a new metric, is a result derived from an extension of the Intersentence Context Association Test that Nadeem et al. proposed in conjunction with the StereoSet score [11]. However, rather than the LLM “picking” the best answer to the context provided in a multiple-choice setting, it generates a 30-character continuation of the context, defined as I. Given three choices, one reinforcing a stereotype (A), the other reinforcing the anti-stereotype (B), and the third being a nonsensical sentence (C), the cosine similarity between the embedding-based vectorized version of I and the embedding-based vectorized version of each option is calculated. The greatest similarity is then used to classify the generated text as stereotypical, anti-stereotypical, or nonsensical. This process is continued through each entry of the StereoSet dataset. From this definition, we assert that the stereotype score is the proportion of continuations classified as stereotypical for all continuations not marked as nonsensical.”
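For illustration, a minimal sketch of this classification step is given below (not our exact implementation): it assumes spaCy's en_core_web_lg vectors for the embeddings, and the continuation and option strings are placeholders rather than actual StereoSet entries.

```python
# Sketch of classifying a generated continuation by cosine similarity (assumed setup).
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

continuation = "a generated continuation of the context"   # I, produced by the LLM
options = {
    "stereotypical": "option A: text reinforcing the stereotype",
    "anti-stereotypical": "option B: text reinforcing the anti-stereotype",
    "nonsensical": "option C: an unrelated, nonsensical sentence",
}

# Compare the embedding of I against the embedding of each option; the greatest
# similarity determines the label assigned to the continuation.
scores = {label: cosine(nlp(continuation).vector, nlp(text).vector)
          for label, text in options.items()}
print(max(scores, key=scores.get), scores)
```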

With respect to your third point, thank you for pointing out that inconsistency. The denominator should be I_A + I_B. I is supposed to be a text continuation, but in the metric we create, we look at the number of stereotypical continuations (I_A) over the number of stereotypical plus anti-stereotypical continuations (I_A + I_B), i.e., all continuations not marked as nonsensical. We understand the confusion and will address it in our edits.
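For clarity, the corrected ratio can be written as follows, with |·| denoting counts (notation taken from the quoted definition):

\[
\text{stereotype score} = \frac{|I_A|}{|I_A| + |I_B|}
\]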

With respect to the fourth point, each embedding had 300 dimensions (we used spaCy's en_core_web_lg embeddings collection). We understand that k-means suffers from the curse of dimensionality when used on higher-dimensional vectors. However, it was the most computationally efficient algorithm for us to use. We will address this limitation in the “Limitations and Further Research” portion of our work.
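For illustration, a minimal sketch of the embedding-and-clustering step is shown below (not our exact code): it assumes spaCy's en_core_web_lg model (300-dimensional vectors) and scikit-learn's k-means, with placeholder corpus entries and an arbitrary choice of k.

```python
# Sketch of embedding corpus entries and clustering them with k-means (assumed setup).
import spacy
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_lg")

corpus = [
    "placeholder entry one of the dataset",
    "placeholder entry two of the dataset",
    "placeholder entry three of the dataset",
]

# Convert each corpus entry into its 300-dimensional embedding vector.
vectors = [nlp(text).vector for text in corpus]

# Segment the corpus into semantically homogeneous groups with k-means.
k = 2  # chosen here only for illustration
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)
```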

With regards to the fifth point, causal language modeling is the task of predicting the next token when given a series of tokens. We will explicitly mention this in the revised version. 
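For illustration, a minimal sketch of the causal language modeling objective is shown below, using GPT-2 as an assumed stand-in model (not the exact models or fine-tuning setup used in the paper): the labels are the input tokens themselves, so the loss measures how well each next token is predicted from the preceding ones.

```python
# Sketch of the causal LM (next-token prediction) objective with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Members of any race can be good at math."
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the model compute the next-token
# cross-entropy loss (the quantity minimized during causal LM fine-tuning).
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss.item())
```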

With regards to the sixth point, the experiment was not repeated due to computational limitations. Fine-tuning was done once, and db-index was calculated five times and then averaged. We will provide a link to our code. 

Once again, thank you for your feedback.

Reviewer 3 Report

Comments and Suggestions for Authors


 

The proposal is to clearly highlight the scientific contributions of the research within the Introduction section. Additionally, it is customary to provide an overview of the rest of the paper at the end of the introduction.

 

Within the scope of the research on annotation and text analysis, there are relevant works that have not been considered, such as: https://doi.org/10.1371/journal.pone.0242050

 

Can you explain in more detail why you chose "two small subsets of the dataset: Sample A, containing ten elements, and Sample B, containing 50 elements"? Are ten elements too few for a detailed analysis?

 

Does the research described depend on the language used in the datasets? Based on the manuscript, it seems that the research was conducted on English language examples. Is it possible to apply this approach to languages with limited resources? Can the research be applied to both monolingual and multilingual settings, as discussed in https://doi.org/10.1016/j.engappai.2023.106485?

Comments on the Quality of English Language

Minor editing of English language required.

Author Response

To whom it may concern, 

Thank you so much for your feedback and comments. With regards to your first point, we will make the appropriate changes to the introduction section.

With regards to your second point, we were unaware of this related work. We will be sure to incorporate the mentioned work into our literature review.

With regards to your third point, we utilized small datasets in order to simulate “restricted industries,” where data is limited due to its unique or confidential nature, such as finance or defense. We do not conduct a detailed analysis of the actual dataset; rather, we use a baseline dataset and an augmented dataset to fine-tune the model and observe the level of stereotypical response it demonstrates. However, we will raise the concern of dataset size in our “Limitations and Further Research” section.

With regards to your fourth point, our process should be applicable to other languages, provided that appropriate embeddings are available for the target language. We will mention this in our “Limitations and Further Research” section as well.

Once again, thank you for your feedback. 

Reviewer 4 Report

Comments and Suggestions for Authors

Dear Authors,

Hello, I am honored to review this paper. The paper has a good idea, the content is acceptable, and it shows some degree of innovation. However, there are areas that need to be revised, such as:

1. Please add numbers to the formulas.

2. Please provide additional details about the contributions and innovative aspects of the paper.

3. Please provide additional information on future directions for improving the proposed method.

4. Please explain some of the shortcomings of the proposed method.

5. Please continue to add several more references, such as:

1) https://doi.org/10.1145/3641289

2) https://doi.org/10.1016/j.ipm.2022.103260

3) https://doi.org/10.48550/arXiv.2402.13352

Author Response

To whom it may concern, 

Thank you very much for your feedback. We will add numbers to the formulas.

Additionally, we will emphasize the unique nature of our paper by asserting that such an augmentation algorithm introduces no implicit annotator bias, and by noting its applicability to a variety of languages other than just English.

We also have a dedicated “Limitations and Further Research” portion of the paper where we address both directions for improving the proposed method, as well as some shortcomings of the proposed method.

Though we understand the importance of including more references to provide greater context for the contribution of our work, we feel that the suggested references are tangential to it. Therefore, we do not believe that including them would strengthen the paper.

Once again, thank you for your feedback. 

Reviewer 5 Report

Comments and Suggestions for Authors

A pressing matter in the ever-evolving field of natural language processing is the bias present in large language models (LLM). In this paper, the authors propose a novel, automated mechanism for debiasing through an automated augmentation algorithm based on bias producers. More specifically, the authors explore automated dataset augmentation to mitigate bias, using the concept of a bias producer to describe broad creators of bias, such as ethnicity or sexuality, and biasers that serve as specific examples. These bias producers can be elements of generic bias (such as gender or race) or be industry-specific. The authors also define two new metrics for quantifying bias about both datasets and models, the db-index and mb-index, respectively. These metrics provide a crucial feedback loop for researchers and developers to monitor, analyze, and minimize LLMs’ bias.

As seen in the experiments, the automated dataset augmentation algorithm can reduce the db-index of a dataset: the augmented datasets show a substantially decreased db-index compared to their original counterparts. Moreover, LLMs fine-tuned on the augmented datasets have lower perplexity than the original LLMs. This suggests that augmented datasets created through the algorithm mentioned above can increase LLM performance.

This algorithm and these metrics can be used to quantify and mitigate large language model bias in various industries, ranging from education to agriculture. However, the proposed approach proves most effective when mitigating bias in "restricted industries", more specifically, industries where data is limited due to confidentiality or availability of information (for example, the defense, medical, and financial fields).

The article may be published in its present form.

Author Response

To whom it may concern,

Thank you for your feedback. We are glad that our work demonstrates novelty and therefore enables large language models to demonstrate more equitable perspectives. Once again, thank you for your comments and thoughts.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have provided their objection to my comments and listed their arguments.

I still believe that a ground truth by human experts, or at least some random checks by human experts, would enhance the work, and that, although in cases such as the presented example the automatic augmentation is acceptable, in other cases invalid statements can be inserted. But I do accept that the authors' arguments are sound. Therefore I agree that the work could be published and let the readers decide the degree to which it is useful to them.

Author Response

To whom it may concern,

Thank you for reconsidering our points. We still believe that human validation in this system would not be objective due to the inherently subjective nature of bias. Therefore, we believe it is best to maintain the current form of the paper.

Once again, thank you for your feedback and comments.

Reviewer 2 Report

Comments and Suggestions for Authors

I thank the authors for adequately replying to my points raised in the first review.

 

Author Response

To whom it may concern,

Thank you so much for reconsidering our points. 

 

Reviewer 3 Report

Comments and Suggestions for Authors

All comments have been addressed.

Comments on the Quality of English Language

Minor editing of the English language is required.

Author Response

To whom it may concern,

Thank you so much for reconsidering our points. 

 
