Article
Peer-Review Record

Open Sesame! Universal Black-Box Jailbreaking of Large Language Models

Appl. Sci. 2024, 14(16), 7150; https://doi.org/10.3390/app14167150
by Raz Lapid 1,2, Ron Langberg 2 and Moshe Sipper 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 22 July 2024 / Revised: 5 August 2024 / Accepted: 12 August 2024 / Published: 14 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors of this work introduce a novel approach to manipulating LLMs using a genetic algorithm (GA) to optimize a universal adversarial prompt. This method, designed for situations where the model's architecture and parameters are inaccessible, combines the adversarial prompt with a user's query to disrupt the model's alignment, resulting in unintended and potentially harmful outputs. By systematically revealing a model's limitations and vulnerabilities, this technique serves as a diagnostic tool for evaluating and improving LLM alignment with human intent. Through several experiments, the authors demonstrate the efficacy of their approach, contributing to the discussion on responsible AI development.
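For concreteness, my understanding of the proposed procedure can be summarized by the following minimal sketch. The function names, the character-level encoding, and the refusal-based scoring heuristic are my own illustrative assumptions, not the authors' implementation:

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits + " !?."
REFUSALS = ("i cannot", "i'm sorry", "as an ai")

def query_llm(prompt: str) -> str:
    """Stand-in for the black-box target: only its text output is observable."""
    # In practice this would call the target model's API.
    return "I'm sorry, I can't help with that."

def fitness(suffix: str, queries: list[str]) -> float:
    """Fraction of queries whose response does not resemble a refusal."""
    hits = sum(
        not any(r in query_llm(q + " " + suffix).lower() for r in REFUSALS)
        for q in queries
    )
    return hits / len(queries)

def mutate(suffix: str, rate: float = 0.1) -> str:
    """Randomly replace each character with small probability."""
    return "".join(
        random.choice(ALPHABET) if random.random() < rate else c for c in suffix
    )

def crossover(a: str, b: str) -> str:
    """Single-point crossover of two candidate suffixes."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(queries: list[str], pop_size: int = 20, suffix_len: int = 16,
           generations: int = 50, n_elite: int = 2) -> str:
    """Evolve a universal adversarial suffix against the black-box model."""
    pop = ["".join(random.choices(ALPHABET, k=suffix_len)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda s: fitness(s, queries), reverse=True)
        pop = ranked[:n_elite]  # elitism: best suffixes survive unchanged
        while len(pop) < pop_size:
            a, b = random.choices(ranked[: pop_size // 2], k=2)
            pop.append(mutate(crossover(a, b)))
    return max(pop, key=lambda s: fitness(s, queries))

# Toy usage: best = evolve(["query one", "query two"], generations=5)
```

Because the fitness averages over the entire query set, the surviving suffix is pushed toward universality rather than overfitting to any single prompt.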

This work represents the first automated universal black-box jailbreak attack, emphasizing the need for interdisciplinary collaboration to enhance the security and ethical considerations of LLMs, which I personally consider highly relevant for the field of AI today. However, I also believe that the concept of a universal jailbreak might be too broad and ambitious: LLMs are typically trained on diverse datasets, making it challenging to design a single exploit that works universally without any modifications. Additionally, some sections require major structural changes and additional detail, as expected of a scientific research paper, before acceptance. Below are my detailed comments:

 

1. Line 39: Dynamic model behavior. A reference is missing to support the whole paragraph.

 

2. Lines 95-99: A reference is missing to support the whole paragraph.

 

3. Lines 103-104: These lines are not connected with the rest of the section. Please elaborate more on them and connect them with the previous paragraph.

 

4. Section 3. Threat Model: I find this specification short and not very specific; addressing the following issues would enhance its robustness and clarity. The threat model assumes adversary access limited to textual outputs, without considering side-channel leaks (e.g., metadata, response timing, or other auxiliary information that might be exploited), and the concept of a universal jailbreak may be overly ambitious given the diversity of LLM training data. The definition of harmful behavior is subjective and needs clearer criteria. The model also overlooks the adversary's potential knowledge of the training data and of similar models, and it fails to discuss existing or potential defense mechanisms such as filtering or adversarial training; please analyze how these defenses could affect the effectiveness of the proposed attack. Finally, the threat model lacks evaluation metrics for attack success, such as the frequency of harmful outputs, the severity of the content, or user reports of harm, and it omits ethical considerations for conducting and publishing such research.

 

5. Line 258: Why did you choose these two models in particular? What makes the results related to them generalizable?

 

6. I also miss a (sub)section in which the authors explain how to apply the GA to a new LLM. What steps should be taken into account? What rules and metrics do we need to consider? This would help generalize the results.

 

7. The discussion section does not follow the standard structure of research papers. It appears it might have been generated by an AI, as it presents bullet points for each concept without coherence. Please rewrite it into paragraph-oriented ideas without bold text. Additionally, extract the limitations and future work into a new subsection.

 

8. Please discuss all the appendices in the main text of the paper. They are not linked with the sections of the paper, making it difficult to see their connection to the main work.

Comments on the Quality of English Language

I highly recommend the authors conduct an overall grammar check of the English structure.

Author Response

MDPI Revision:

 

Dear Reviewer,

Thank you for your valuable feedback on our paper. We have carefully considered your comments and made the following revisions:

  1. Dynamic Model Behavior (Line 39): We have added a reference to support the entire paragraph discussing dynamic model behavior. This should now provide the necessary context and background for this section.
  2. Missing Reference (Lines 95-99): We have included a reference to substantiate the discussion in the specified lines. This addition should clarify and support the arguments presented.
  3. Connection of Lines 103-104: We have elaborated on lines 103-104 and provided a clearer connection to the preceding content. The revised text now integrates these lines more cohesively with the surrounding discussion.
  4. Threat Model Specification: We have expanded the Threat Model section to address your concerns. The revised section now considers:
    • Adversary access
    • Attack goal
    • Clearer criteria for defining a universal jailbreak
    • Definition of harmful behavior
    • A discussion of defense mechanisms
    • Evaluation metrics
    • Ethical considerations for conducting and publishing such research
  5. Model Selection Justification (Line 258): We have provided a detailed explanation for selecting the two models in question and discussed the generalizability of the results. This should clarify the relevance of our findings to other models.
  6. Applying GA to New LLMs: A good point. As we emphasize throughout the paper, our method is generic: although we demonstrated it on two LLMs, it can readily be applied to any LLM (Fig. 1). A minimal illustrative sketch of this model-agnostic interface follows this list.
  7. Discussion Section: We have rewritten the discussion section to ensure it adheres to standard research paper formats. The limitations and future research have been extracted into a new subsection. The revised discussion is now more coherent and presented in a paragraph-oriented format.
  8. Appendices Integration: We went over the appendices again and made sure each appendix is now referenced in the main text.
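Regarding item 6, the only model-specific component of the attack is the text-in/text-out query interface; everything else is model-agnostic. The following sketch is illustrative only, not our actual implementation — the adapter name, refusal heuristic, and stub endpoint are all hypothetical:

```python
from typing import Callable

# Any black-box LLM is represented as a text-in/text-out callable;
# applying the GA to a new model means supplying a new such callable.
BlackBoxLLM = Callable[[str], str]

REFUSAL_MARKERS = ("i cannot", "i'm sorry", "as an ai")  # surrogate signal only

def make_fitness(llm: BlackBoxLLM, queries: list[str]) -> Callable[[str], float]:
    """Build a GA fitness function for a given model and query set."""
    def fitness(suffix: str) -> float:
        # Score = fraction of queries whose response does not look like a refusal.
        compliant = sum(
            not any(m in llm(q + " " + suffix).lower() for m in REFUSAL_MARKERS)
            for q in queries
        )
        return compliant / len(queries)
    return fitness

# Swapping in a new LLM requires only a new adapter, e.g.:
def my_new_llm(prompt: str) -> str:
    # Replace with a call to the target model's API or local inference.
    return "I'm sorry, I can't help with that."

fitness = make_fitness(my_new_llm, ["example query"])
print(fitness("some candidate suffix"))  # 0.0 for this stub adapter
```

The rest of the GA (population initialization, crossover, mutation, elitism) is unchanged across models, which is what we mean by the method being generic.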

Thank you once again for your constructive feedback. We believe these revisions address your concerns and improve the quality of our paper.

Best regards,

The authors.

 

Reviewer 2 Report

Comments and Suggestions for Authors

This is an interesting study that presents a demonstration of a novel approach using a genetic algorithm (GA) to manipulate a large language model (LLM). This black-box “jailbreak” attack exposes the model’s vulnerabilities and disrupts its alignment, resulting in harmful outputs. It demonstrates the potential for manipulation and the need to develop robust, secure LLMs.

The background information was adequate. While I have a general knowledge of LLMs and AI, it was clear and complete enough to understand the current state and the issue that is being explored.

The threat model appeared to be described in enough detail. The terms were defined, and the authors made it clear what they were generating with the model. The methods were described adequately for someone to understand the genetic algorithm, the population encoding, and the fitness function. The crossover and mutation process and the elitism function are explained well. The algorithms were clearly explained, in sufficient detail for readers who may not be familiar with their inner workings.

The use of the experimental datasets and partitioning for testing were described well. The rationale for their use was explained.

The results are summarized well and presented clearly. I do not have suggestions for a clearer presentation.

The discussion notes important transferability among models. The authors make a good point that learning from prior jailbreaks could enhance the capabilities of the various models. Limitations are noted, as are the implications and countermeasures. These and the conclusions are clear for those using or working with LLMs, and are important for clinicians and researchers using those databases.

 

Author Response

We thank the reviewer for carefully reading our paper, and for the encouraging and positive remarks!

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for all the changes. I consider that the current status of the manuscript is good for publication.

Comments on the Quality of English Language

A minor revision of the grammar would improve it.
