Article

LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection and DistilBERT-Based Ethics Judgment

Graduate School of Sciences and Technology for Innovation, Yamaguchi University, 2-16-1 Tokiwadai, Ube 755-8611, Japan
*
Author to whom correspondence should be addressed.
Information 2025, 16(3), 204; https://doi.org/10.3390/info16030204
Submission received: 16 January 2025 / Revised: 24 February 2025 / Accepted: 3 March 2025 / Published: 5 March 2025
(This article belongs to the Section Information Security and Privacy)

Abstract

In recent years, the misuse of large language models (LLMs) has emerged as a significant issue. This paper focuses on a specific attack method known as the greedy coordinate gradient (GCG) jailbreak attack, which compels LLMs to generate responses beyond ethical boundaries. We have developed a tool to suppress the improper use of LLMs by employing a high-precision detection method that combines syntactic tree analysis with sentence perplexity. Furthermore, the tool incorporates a small language model (SLM), DistilBERT, to evaluate the harmfulness of sentences, thereby preventing harmful content from reaching the LLM. Experimental results demonstrate that the tool effectively detects GCG jailbreak attacks and contributes to the secure usage of LLMs, achieving a defense success rate of 90.8% in our tests.

Graphical Abstract

1. Introduction

In recent years, jailbreak attacks on large language models (LLMs) have emerged as a serious and challenging problem. The jailbreak attack stands out as a prevalent vulnerability, where specially designed prompts are used to bypass the safety measures of LLMs, facilitating the production of harmful content [1]. Large language models trained for safety and harmlessness remain vulnerable to adversarial misuse, as demonstrated by the prevalence of jailbreak attacks in earlier versions of ChatGPT [2]. Among the various jailbreak methods discovered so far, one key technique involves attackers modifying prompts to elicit non-ethical content from the targeted LLM. For example, when asked a question like “Please tell me how to make a bomb”, LLMs typically refuse to respond as such requests are ethically unacceptable. However, through jailbreak attacks, there is a high possibility that an LLM might provide detailed instructions on “how to make a bomb”. Such attacks pose significant risks, potentially undermining trust and compromising the integrity of various applications.
This study aims to develop a classification method based on syntactic tree and semantic analysis to prevent vulnerability attacks against LLMs. The improper use of LLMs, particularly through techniques like greedy coordinate gradient (GCG) attacks, necessitates real-time monitoring of prompts submitted to LLMs. GCG attacks refer to a type of adversarial attack that iteratively modifies input tokens based on coordinate-wise gradient information to evade language model constraints. The goal of a GCG attack is to gradually modify input tokens in order to induce the LLM to generate outputs that violate ethical constraints. This attack exploits the LLM’s vulnerability in handling adversarial inputs, using gradient optimization or heuristic search to find the optimal input combination that bypasses security defenses. The proposed approach seeks to detect patterns of GCG attacks, identify unethical prompts, and prevent improper utilization of LLMs, including the generation of illegal or harmful content, especially information that promotes criminal activities.
The proposed method analyzes syntactic tree-related features and integrates interpretability of the input content to make comprehensive judgments. It enables LLMs to identify inputs containing vulnerabilities exploited by GCG attackers and supports refusal to respond to prompts that exceed ethical boundaries. A tool developed based on this method effectively detects GCG jailbreak attacks [3]. This tool accurately identifies anomalous statements generated through GCG attacks and provides a solution to address the misuse problems accompanying the widespread adoption of LLMs. Consequently, it contributes to the secure and ethical utilization of LLMs.
Our main contributions are as follows:
-
Proposed a novel syntax-based analysis method to detect GCG attacks and developed the syntax trees and perplexity classifier (STPC) jailbreak attack detector. This approach reduces false positives compared to previous methods and improves detection accuracy.
-
A new method utilizing fine-tuned SLMs for ethics evaluation was introduced, significantly reducing the time required for judgment and improving efficiency.
-
By integrating the two proposed methods, a new misuse prevention tool was developed. It effectively prevents the impact of GCG attacks and harmful content on users, ensuring the safe use of LLMs.
The remainder of this paper is structured as follows: Section 2 describes related research on defense tools. Section 3 provides an overview of the overall structure of the defense tool. Section 4 introduces the jailbreak attack detector component of the tool. Section 5 presents the ethics evaluator component. Section 6 details the evaluation experiments conducted for each component and the tool as a whole. Section 7 concludes the study.

2. Related Research

In September 2023, Kumar et al. [4] proposed a method to counter jailbreak attacks by sequentially removing text from the end of an input until it passes a security check. While effective against prompt-word-based jailbreak attacks, this approach is computationally intensive and lacks adequate separation of input content. In October 2023, Robey et al. [5] introduced a technique to mitigate jailbreak attacks on LLMs by randomly selecting and either copying or removing some tokens from the input text, thereby disrupting the original prompt. Although this method provides a defensive effect against prompt injection attacks, it cannot detect the presence of prompt injection. Consequently, when uniformly applied, it may alter parts of user inputs, potentially changing the intent of legitimate users or rendering the LLM incapable of providing appropriate responses. In November 2023, Alon et al. [6] proposed a method to identify sentences likely containing GCG suffixes by calculating the perplexity and length of sentences. Perplexity is a key metric that measures the uncertainty of a language model in predicting text, representing the model’s average level of uncertainty when generating or evaluating a sequence. A higher perplexity suggests that the text deviates from the model’s training distribution and may contain anomalies or adversarial modifications. While this method demonstrated good performance in detecting GCG suffixes, it failed to address the issue of false positives in detecting long sentences.
A comparison of the aforementioned related studies is shown in Table 1. These methods are evaluated from four perspectives: detection of jailbreak attacks, defense against jailbreak attacks, maintenance of consistency with the original input content, and handling of false positives.
There has been a long history of research related to detecting toxicity in content. For example, in June 2020, Pavlopoulos et al. investigated the impact of context on annotating toxic comments and its effects on detection by automated systems [7]. The findings revealed that context has a statistically significant influence on toxicity annotations. However, the lack of improvement in system performance appeared to be related to the rarity of context-sensitive comments.
In August 2022, Lees et al. introduced Jigsaw’s next-generation malicious comment classification model, which is now deployed in the Perspective API [8]. This model utilizes an unlabeled Charformer applied to malicious comment classification problems. In June 2023, Markov et al. proposed a holistic approach to develop the OpenAI Content Moderation API for real-world content moderation [9]. These methods, based on LLMs, leverage a broader contextual understanding, enabling the detection of a wider range of harmful content. However, they also inherit the vulnerabilities of LLMs, particularly susceptibility to complex jailbreak attacks that exploit model weaknesses. Therefore, employing jailbreak attack detectors for filtering is necessary. Additionally, the LLM-based approach is inherently associated with high time costs for detection and significant hardware requirements, making it less efficient in certain scenarios.
Additionally, to demonstrate the feasibility of utilizing syntactic analysis for attack interpretation and defense, we referred to several previous studies. Ogheneovo et al. [10] proposed a model based on the grammatical structure of SQL statements which tests queries using parse trees by dynamically generating and comparing their structures at runtime. By analyzing the grammatical structure of a given SQL statement and comparing it with legitimate SQL statements, the model determines whether the statement is malicious. Experimental results showed that although the parser exhibited a false positive rate and false negative rate of 0.01%, it effectively detected and prevented malicious SQL queries. Park et al. [11] investigated whether the syntactic structure of verbs, as well as the usage of subjects and objects in phishing emails, could serve as distinguishing features for phishing detection. In experiments using a phishing email dataset, they analyzed verb syntactic tree paths and found that while syntactic tree paths alone could not distinguish phishing emails from legitimate emails with 100% accuracy, the differences between the two were still significant. Alon et al. [6] demonstrated that when evaluating the perplexity of queries containing adversarial suffixes using GPT-2, an open-source LLM, such queries exhibited extremely high perplexity values. Furthermore, a LightGBM model trained on perplexity and token length effectively reduced false positives and successfully detected most adversarial attacks in the test set. Hu et al. [12] proposed a token-level adversarial prompt detection method that leverages the predictive capabilities of LLMs to identify adversarial prompts at the token level. This method measures the model’s perplexity, where tokens predicted with high probability are considered normal, whereas tokens exhibiting high perplexity are flagged as adversarial. Experimental results demonstrated that even with a small-scale LLM, the method achieved an accuracy exceeding 90% in detecting adversarial prompts.

3. Misuse Prevention Tool

To prevent the misuse of LLMs, particularly the generation of unethical or harmful information through methods such as GCG attacks, we have developed a misuse prevention tool that integrates STPC (described later) with small language models (SLMs). The core objective of this tool is to provide an efficient and scalable defense mechanism that monitors query content submitted to LLMs in real time, detects potential GCG attack patterns, and identifies queries containing unethical information or those that may lead to negative consequences. This monitoring approach aims to safeguard the proper use of LLMs while preventing their malicious exploitation for the generation of illegal or harmful content, particularly content that may promote criminal behavior or cause real-world harm due to misinformation.
From a technical perspective, the tool’s architecture adopts a modular design for flexible adjustment and scalability. An overview of the architecture is shown in Figure 1. The overall process is divided into two main stages: the jailbreak attack detector and the ethics evaluator.
A.
Jailbreak Attack Detector
The system analyzes the syntactic and semantic patterns within the query content to identify potential GCG attack suffixes. This process enables the tool to locate potential attacks efficiently and quickly. If a GCG attack suffix is detected within the query, the system isolates the related content and prevents it from being forwarded to the LLM, thereby blocking attackers from bypassing system restrictions using specially crafted queries.
B.
Ethics Evaluator
If no suspicious content is detected by the jailbreak attack detector, then the input is passed to the ethics evaluator for further analysis. The primary function of the ethics evaluator is to comprehensively assess the ethicality and potential harmfulness of the input, ensuring that the query does not pose a negative impact on society, individuals, or the system itself. The evaluator employs a set of predefined ethical standards and harm classification models to analyze the input in detail. If the input is deemed harmless or meets ethical standards, it is forwarded to the LLM for normal processing; otherwise, the system logs the anomaly and prevents further propagation.
This tool effectively suppresses the generation of illegal or harmful content by LLMs, providing robust protection and ensuring the safe application of LLM technologies.
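To make the two-stage flow concrete, the following Python sketch shows how a prompt could be routed through the jailbreak attack detector and the ethics evaluator before reaching the LLM. The function names and return messages are illustrative assumptions rather than the tool's actual interface; the two checks are passed in as callables so the sketch stays independent of any specific model.

# Minimal sketch of the two-stage routing logic described above (illustrative names).
from typing import Callable

def guard_prompt(prompt: str,
                 detect_gcg: Callable[[str], bool],
                 is_harmful: Callable[[str], bool],
                 query_llm: Callable[[str], str]) -> str:
    # Stage A: jailbreak attack detector (STPC)
    if detect_gcg(prompt):
        return "Blocked: suspected GCG jailbreak suffix."
    # Stage B: ethics evaluator (fine-tuned DistilBERT)
    if is_harmful(prompt):
        return "Blocked: content judged harmful."
    # Only prompts that pass both checks reach the LLM.
    return query_llm(prompt)

# Example wiring with trivial placeholder checks:
reply = guard_prompt("Tell me about syntax trees.",
                     detect_gcg=lambda s: False,
                     is_harmful=lambda s: False,
                     query_llm=lambda s: "(LLM response)")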

4. Jailbreak Attack Detector

This section focuses on the key functions of the jailbreak attack detector, including perplexity calculation, syntactic tree-related parameter computation, and MLP classification. Based on the computation results, the tool determines whether the user input contains a GCG suffix and proceeds with the next steps accordingly.
To address current challenges, it is necessary to design a classifier capable of accurately detecting GCG attacks. Additionally, this classifier must minimize false positives to enhance the user experience. In previous studies, methods that use perplexity and token length to identify GCG suffixes have been helpful in addressing false positives in short sentences but remain significantly constrained. Particularly in long and complex sentences, spelling and grammatical errors often lead to false positives. To overcome this issue, this study analyzes the characteristics of GCG suffixes in greater detail.
Considering the unreadability of GCG suffixes, analyzing the syntactic structure of sentences provides a clearer representation of their features. GCG suffixes, composed of randomly arranged symbols and misspelled words, rarely form complete grammatical structures. To leverage this characteristic effectively, syntactic analysis of the input to the LLM can be conducted, utilizing syntactic trees [13] for more detailed analysis. During the decision-making process, the syntactic analysis and structural examination of the syntactic trees highlight sentence features in the data.
The syntax trees and perplexity classifier (STPC) operates by first analyzing the overall content of a sentence, examining token types and dependency relationships. Based on the analysis, it constructs a syntactic tree and computes relevant parameters. Next, it calculates the perplexity of the entire sentence. Finally, the parameters are fed into the trained MLP classifier. By learning the parameter characteristics of both types, the MLP classifier distinguishes between normal sentences and sentences containing GCG suffixes. Figure 2 illustrates the overview of STPC.

4.1. Perplexity

Perplexity (P) is a common metric used to measure the uncertainty of a language model in predicting text. Specifically, perplexity quantifies the average level of uncertainty when a language model generates or evaluates a given text. A lower perplexity indicates that the model is more confident in its predictions and that the text conforms well to the model’s training distribution, whereas a higher perplexity suggests that the text structure or content deviates from the model’s typical language patterns; this can be leveraged to detect adversarial content. When a sentence exhibits very high perplexity, it typically indicates that the language model has a low ability to predict that sentence. We investigated the perplexity score distributions of publicly available GCG suffixes, GCG suffixes generated by algorithms, and normal non-attack sentences. Consistent with findings from previous studies, we observed that 97% of perplexity scores in test datasets containing GCG suffixes exceeded 1000. In contrast, the perplexity of normal sentences usually remains below 200.
There is a significant difference between attack prompts and non-attack prompts, demonstrating that the presence of GCG suffixes can be identified through their perplexity scores. This has been confirmed on large test sets provided by Radford et al. [14].
In this study, we selected GPT-2 as the language model, partly because previous research has used GPT-2 for similar tasks. GPT-2 was primarily trained on high-quality web pages curated and filtered by humans and optimized for low perplexity. In contrast, although GPT-4 has superior generative capabilities, OpenAI does not explicitly mention its perplexity metric, making direct comparisons in this regard difficult [6]. Based on these findings, we followed the approach of previous studies and selected GPT-2 to ensure the consistency of methods and the stability of experimental results.
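As an illustration of this step, the following sketch computes sentence perplexity with GPT-2 via the Hugging Face transformers library; the library choice and helper name are assumptions, while the thresholds quoted in the comments come from the observations above.

# Sketch: sentence perplexity under GPT-2 (assumes the transformers library).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Return exp of the mean token-level cross-entropy under GPT-2."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the
        # average negative log-likelihood over the sequence.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# In our data, normal prompts typically scored below about 200, while
# 97% of prompts carrying GCG suffixes exceeded 1000.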

4.2. Syntactic Techniques

Testing with different datasets and analyzing false detections revealed that when sentences contain grammatical or spelling errors, the perplexity score significantly increases. Previous studies attempted to mitigate abnormal perplexity values by incorporating token length into overall judgment, but this phenomenon persists even in long sentences. In one test case, when two deliberate errors were introduced into a long sentence to simulate user input mistakes, the perplexity increased from 41 to 1255, leading to false detections.
To avoid false detections caused by grammatical and spelling errors, it is necessary to analyze sentence content more comprehensively. We therefore consider analyzing syntactic and semantic dependencies [15]. Syntactic trees can minimize the impact of isolated spelling errors by inferring the intended meaning in context and representing the correct grammatical structure. They can absorb common grammatical and spelling mistakes and can even infer the most plausible structure in the absence of clear cues.
GCG suffixes are typically composed of random symbols and misspelled words, making it difficult to form complete grammatical structures. During the generation of syntactic trees, misclassified nodes are directly created as leaf nodes. Calculations show that sentences containing GCG suffixes have an average leaf–node ratio of approximately 55%, compared to about 47% for normal sentences. This also results in notable differences in average branching factors: normal sentences have an average branching factor of approximately 58.7%, while sentences with GCG suffixes average around 47.8%.
The results indicate that statements with GCG suffixes exhibit higher average leaf–node ratios and branching factors compared to normal statements.

4.2.1. Representing Sentence Feature Parameters

We use Stanford CoreNLP [16] as the dependency parser with its default part-of-speech (POS) tagging and dependency parsing settings, without any parameter adjustments. Through syntactic parsing with Stanford CoreNLP, a syntactic tree is constructed, enabling precise identification of dependencies between words within a sentence.
In this study, we selected Stanford CoreNLP as the dependency parser due to two key advantages. First, Stanford CoreNLP provides high-accuracy dependency parsing, with pre-trained models that perform well across multiple corpora, ensuring stable and reliable dependency structures. Second, it offers excellent scalability, supports large-scale text processing, and efficiently handles long sentences and complex syntactic structures, making it well-suited for the requirements of this study.
Regarding tokenization rules, we applied whitespace tokenization, splitting English text on spaces and punctuation while preserving contractions. To ensure the integrity of syntactic analysis, punctuation marks, parentheses, and other symbols are treated as separate tokens, while numerical values and multi-word compounds are processed as single tokens. Proper nouns and special symbols are also recognized to maintain textual integrity. This approach enhances parsing accuracy while reducing segmentation errors, ensuring the reliability of text processing.
After constructing the syntactic tree for a normal sentence, it is visualized, as shown in Figure 3, to intuitively observe its structural features. Figure 4 illustrates the visualization of a syntactic tree constructed for a sentence containing GCG suffixes.
When calculating feature parameters, the entire syntactic tree is traversed to compute the total number of nodes, the number of leaf nodes, the number of non-leaf nodes, and the total number of child nodes for non-leaf nodes. To count non-leaf nodes, only nodes with a depth of two or greater are considered. This enables the calculation of the total number of leaf nodes and child nodes for all non-leaf nodes. The leaf–node ratio is obtained by dividing the number of leaf nodes by the total number of nodes. The average branching factor is calculated by dividing the total number of child nodes for non-leaf nodes by the total number of non-leaf nodes:
Average Branching Factor = (Total Number of Child Nodes for Non-Leaf Nodes) / (Total Number of Non-Leaf Nodes)
Leaf-Node Ratio = (Number of Leaf Nodes) / (Total Number of Nodes)
Using this approach, the syntactic tree provides coordinate information consisting of parameters such as the leaf-node ratio and average branching factor.
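The following simplified sketch illustrates how the two parameters could be computed by traversing a parse tree, here represented as an nltk.Tree; in the actual pipeline the tree comes from Stanford CoreNLP, and non-leaf counting is additionally restricted to nodes at depth two or greater, which this sketch omits. The bracketed example parse is illustrative only.

# Sketch: leaf-node ratio (LNR) and average branching factor (ABF) from a parse tree.
from nltk.tree import Tree

def tree_features(tree: Tree):
    total_nodes = 0       # all nodes, including leaves
    leaf_nodes = 0        # token leaves
    non_leaf_nodes = 0    # nodes that have children
    child_count = 0       # total children over all non-leaf nodes

    def walk(node):
        nonlocal total_nodes, leaf_nodes, non_leaf_nodes, child_count
        total_nodes += 1
        if isinstance(node, Tree):
            non_leaf_nodes += 1
            child_count += len(node)
            for child in node:
                walk(child)
        else:                      # a plain string, i.e. a leaf token
            leaf_nodes += 1

    walk(tree)
    lnr = leaf_nodes / total_nodes
    abf = child_count / non_leaf_nodes
    return lnr, abf

parse = Tree.fromstring("(ROOT (S (NP (PRP I)) (VP (VBP like) (NP (NN tea)))))")
print(tree_features(parse))  # (leaf-node ratio, average branching factor)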

4.2.2. Analyzing and Calculating by MLP

We first examined the data characteristics of three key parameters: perplexity (P), average branching factor (ABF), and leaf–node ratio (LNR). Our observations indicate that while the differences in ABF and LNR between normal sentences and sentences with GCG suffixes are relatively minor, perplexity exhibits the most significant variation. By comparing the ratios, we observed a statistically significant difference between these two types of sentences, with perplexity contributing the highest weight. However, this also suggests that relying solely on perplexity may amplify classification errors, leading to an increased false positive rate.
At the same time, syntactic tree parameters may be affected by external disturbances, such as spelling errors and textual noise, potentially leading to anomalies in feature values. To verify whether such disturbances still result in statistically significant differences in parameters between perturbed normal sentences and sentences with GCG suffixes, we applied random perturbations to the content of normal sentences and recalculated the relevant parameters. Specifically, we introduced the following three types of perturbations: (1) random deletion of characters, (2) random duplication of characters, and (3) random substitution of characters.
For each perturbed sentence, we computed the average values of perplexity, ABF, and LNR and compared them with those of normal sentences and sentences with GCG suffixes. The results are shown in Table 2.
The experimental results indicate that, although the parameter values of perturbed sentences slightly increased, they still exhibited a significant gap compared to sentences with GCG suffixes. This suggests that, even in the presence of perturbations, syntactic tree analysis remains a feasible approach for distinguishing between normal sentences and GCG attack sentences.
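For illustration, the following sketch applies the three character-level perturbations to a sentence; the perturbation rate and substitution alphabet are assumptions, not the exact settings used in our experiments.

# Sketch of the three perturbations: random deletion, duplication, and substitution.
import random
import string

def perturb(sentence: str, rate: float = 0.05) -> str:
    out = []
    for c in sentence:
        if random.random() < rate:
            op = random.choice(["delete", "duplicate", "substitute"])
            if op == "delete":
                continue                         # drop the character
            if op == "duplicate":
                out.extend([c, c])               # repeat the character
                continue
            out.append(random.choice(string.ascii_lowercase))  # substitute
        else:
            out.append(c)
    return "".join(out)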
In our study, we utilized a multi-layer perceptron (MLP) model [17] to learn and classify normal sentences and GCG attack sentences based on the three features: LNR, ABF, and P.
MLP is a feedforward neural network consisting of an input layer, multiple hidden layers, and an output layer. It enhances model expressiveness through non-linear activation functions and optimizes parameters using backpropagation and gradient descent. Due to its ability to learn complex feature relationships, MLP is widely used in text classification and pattern recognition tasks. Previously, we attempted to classify parameter features using a linear weighted sum approach. However, the relationship between LNR, ABF, and P is not strictly linear, leading to a decline in classification performance when using a linear method. Moreover, manually setting weights makes it difficult to dynamically adapt to different GCG attack variations. By utilizing nonlinear activation functions such as ReLU and Sigmoid, MLP can learn complex decision boundaries, effectively classifying GCG attacks while also adapting to different data distributions.
The MLP model consists of two hidden layers and an output layer, with the input being the LNR, ABF, and P features and the output representing a binary classification label (normal sentence or GCG attack sentence). The model computation is defined as follows:
A.
First Hidden Layer Computation:
z_1 = ReLU(W_1 x + b_1)
where x is the input feature vector [LNR, ABF, P], W_1 ∈ R^(32×3) is the weight matrix, b_1 ∈ R^32 is the bias term, and ReLU(x) = max(0, x) is the activation function.
B.
Second Hidden Layer Computation:
z_2 = ReLU(W_2 z_1 + b_2)
where W_2 ∈ R^(16×32) and b_2 ∈ R^16.
C.
Output Layer Computation:
y = σ(W_3 z_2 + b_3)
where W_3 ∈ R^(2×16), b_3 ∈ R^2, and σ is the softmax activation function for binary classification.
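A minimal PyTorch sketch of this 3-32-16-2 architecture is given below; apart from the layer sizes and the learning rate stated in Section 4.3, details such as the optimizer choice are assumptions.

# Sketch of the STPC classifier: [LNR, ABF, P] -> P(normal), P(GCG attack).
import torch
import torch.nn as nn

class STPCClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 32), nn.ReLU(),   # first hidden layer
            nn.Linear(32, 16), nn.ReLU(),  # second hidden layer
            nn.Linear(16, 2),              # output logits
        )

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)

model = STPCClassifier()
# Optimizer is an assumption; the learning rate follows Section 4.3.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
features = torch.tensor([[0.47, 1.78, 120.0]])   # [LNR, ABF, P]
probs = model(features)                          # class probabilities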

4.3. Classification

Based on this approach, we conducted an analysis of the training dataset and validated the model on the test dataset. The MultiNLI dataset [18] was used in this study. This dataset includes a wide variety of sentence types, featuring more complex and grammatically disjointed long sentences. The analysis results were employed as a training set for further processing. To determine whether a target sentence contains a GCG suffix, the training set data were analyzed in detail to separate the two domains. The test analysis results are shown in Figure 5.
To ensure robust model evaluation, we conducted 10 independent experiments for LightGBM and STPC classifiers and computed the mean and standard deviation of accuracy, precision, recall, and F1-score.
  • True Positives (TP): Number of correctly classified normal sentences.
  • False Positives (FP): Number of GCG sentences incorrectly classified as normal sentences.
  • True Negatives (TN): Number of correctly classified GCG sentences.
  • False Negatives (FN): Number of normal sentences incorrectly classified as GCG sentences.
During the experiments, the learning rate was set to 0.0005, and the FN and FP values were recorded for each trial. The averaged results, along with standard deviations, are as follows:
  • Accuracy: 0.94 ± 0.01
  • Precision: 0.99 ± 0.01
  • Recall: 0.89 ± 0.02
  • F1-score: 0.94 ± 0.01

4.4. Surface Formula

However, as an inherently black-box model, MLP lacks interpretability in its decision-making process. To better understand the classification criteria, we aim to derive a mathematical representation of the decision boundary that separates the two classes.
To obtain the classification boundary, we first extract data points where the MLP classification probability is close to 0.5, as these points represent the transition between two classes. Specifically, we follow these steps to fit the decision boundary surface equation:
  • Identifying decision boundary points
    • Compute the MLP-predicted probability P_GCG for each data point.
    • Select data points that satisfy 0.45 < P_GCG < 0.55 as boundary points to ensure they lie in the classification transition region.
  • Fitting a third-order polynomial surface
    • Apply third-order polynomial fitting using the least squares method (LSM) to fit the boundary points and optimally describe the distribution trend of the data points:
      P = a_1·LNR^3 + a_2·ABF^3 + a_3·LNR^2·ABF + a_4·ABF^2·LNR + a_5·LNR + a_6·ABF + a_7
    • This fitting process is achieved by minimizing the error
      Σ_i (P_i - f(LNR_i, ABF_i))^2
      where P_i represents the true values of the sampled points, and f(LNR_i, ABF_i) is the estimated polynomial function.
  • Final decision boundary equation
    • After optimization, we obtained a third-order polynomial surface equation. We refer to this surface as the LNR-ABF-P decision surface (LBP). The equation for LBP is as follows:
      LBP = 29966.41·LNR^3 + 7731.78·ABF^3 + 152695.01·LNR^2·ABF - 63304.82·ABF^2·LNR + 10395.94·LNR - 15182.16·ABF + 16786.81 - P
According to this equation
  • If the LBP value is greater than 0, then the data point is classified as a GCG attack.
  • If the LBP value is less than or equal to 0, then the data point is classified as a normal sentence.
Figure 6 presents the visualization of the LBP decision surface.
This polynomial surface equation enhances interpretability, allowing us to mathematically understand the classification process rather than solely relying on MLP as a black-box model. Additionally, apart from directly classifying with MLP, the same model can also determine whether a sentence belongs to a GCG attack by computing the LBP value.
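For reference, the following sketch reproduces the fitting procedure with NumPy least squares; the boundary points shown are random stand-ins rather than the data used to obtain the coefficients above, and the sign convention of the helper follows the LBP equation as written.

# Sketch: fitting the seven-term cubic surface P ≈ f(LNR, ABF) by least squares.
import numpy as np

def fit_lbp_surface(lnr, abf, p):
    """lnr, abf, p: 1-D arrays of boundary-point features."""
    # Design matrix matching the seven terms of the polynomial above.
    X = np.column_stack([lnr**3, abf**3, lnr**2 * abf, abf**2 * lnr,
                         lnr, abf, np.ones_like(lnr)])
    coeffs, *_ = np.linalg.lstsq(X, p, rcond=None)
    return coeffs  # a1 ... a7

def lbp_value(lnr, abf, p, coeffs):
    """Fitted surface value minus perplexity; positive values fall on the
    GCG side under the sign convention of the LBP equation in the text."""
    a1, a2, a3, a4, a5, a6, a7 = coeffs
    surface = (a1*lnr**3 + a2*abf**3 + a3*lnr**2*abf
               + a4*abf**2*lnr + a5*lnr + a6*abf + a7)
    return surface - p

# Illustrative boundary points (random stand-ins, not the paper's data):
rng = np.random.default_rng(0)
lnr = rng.uniform(0.4, 0.6, 200)
abf = rng.uniform(1.5, 2.2, 200)
p = 500 + 1000 * lnr + rng.normal(0, 20, 200)
print(fit_lbp_surface(lnr, abf, p))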

5. Ethics Evaluator

This section focuses on the key functions of the ethics evaluator, including the selection of a suitable SLM and its use for harmfulness classification.

5.1. Method for Determining Harmfulness

In recent years, the use of LLMs has become a common approach for determining harmfulness. Thanks to pretraining on large-scale corpora, LLMs have demonstrated excellent performance in tasks such as text generation, sentiment analysis, and harmfulness evaluation. LLM-based methods for identifying harmful content leverage their robust contextual understanding and advanced semantic analysis capabilities to detect potentially dangerous statements, hate speech, and other harmful content [19]. However, LLMs also exhibit several significant drawbacks.
First, training LLMs requires enormous computational resources, resulting in high computational costs and long processing times. Second, LLMs demand substantial memory and storage, making them unsuitable for resource-constrained environments. Furthermore, due to their complexity, LLMs often have slower inference speeds, especially when handling tasks of simple or moderate complexity, where their performance does not significantly surpass that of smaller models.

5.2. Leveraging SLMs

This study proposes a harmfulness evaluation method using small language models (SLMs). Compared to LLMs, SLMs require fewer computational resources, offer faster inference speeds, and are more adaptable, making them well-suited for efficient text analysis in resource-limited environments. Moreover, SLMs maintain sufficient accuracy and performance when processing simpler tasks, making them ideal for use in privacy-sensitive local environments. By leveraging these advantages, this study validates the effectiveness of SLMs in harmfulness evaluation and demonstrates their superiority in computational efficiency and interpretability.

5.3. Ethicality Assessment of Prompts Using SLMs

The harmfulness evaluation method using SLMs is closely related to sentiment analysis tasks. The primary goal of sentiment analysis is to analyze text and determine its emotional tendency, such as whether it is positive, negative, or neutral [20]. Similarly, harmfulness evaluation tasks require the model to analyze the semantics and context of the text to identify whether it contains potentially harmful content, such as hate speech or violent threats. SLMs possess the capability to capture word relationships and sentence structures. Both sentiment analysis and harmfulness evaluation rely on supervised learning processes. During the training phase, the model learns to identify harmful content from labeled datasets, which include a large variety of harmful sentences. The model uses these data to continually optimize its classification capabilities.

5.4. DistilBERT

DistilBERT is a lightweight BERT model optimized through knowledge distillation, reducing model parameters and computational costs while maintaining performance close to the original BERT [21]. Based on current comparisons of online harmful language recognition performance, DistilBERT excels by being approximately twice as fast as other models [22]. When using deterministic thresholds for classification, it correctly identifies harmful language from previously unseen data with over 70% accuracy. Experimental results show that the processing time per sentence is about 0.007 s. When evaluating harmfulness in 1000 sentences, the fine-tuned DistilBERT takes only 8.89 s, significantly outperforming other models in terms of speed. This demonstrates that judgments using DistilBERT ensure fast online response times, reducing delays caused by multiple evaluations and preventing negative impacts on user experience.
Table 3 summarizes the accuracy and processing time of different SLMs when tested on 25,255 harmful dataset samples, as recorded by Sundelin [22]. DistilBERT is shown to process data in the shortest amount of time while maintaining high accuracy. Although direct comparison experiments with LLMs have not been conducted, existing studies suggest that LLMs may exhibit bias in harmfulness evaluation and cannot always provide fully accurate judgments.
The ethics evaluator prevents harmful content from being input into the LLM, but allows non-harmful content.

5.5. Implementation

To achieve more effective harmfulness detection, this study selects and fine-tunes DistilBERT. DistilBERT is a lightweight version of BERT, with its primary advantage being the significant reduction in model size and inference time while retaining performance comparable to BERT. Through knowledge distillation, the knowledge in BERT is compressed into a smaller model, reducing model parameters by approximately 40%, yet maintaining almost the same level of accuracy in text classification tasks. This high efficiency is particularly important for real-world applications of harmfulness detection, especially in real-time systems processing large volumes of data.
This section details the implementation process of using DistilBERT for harmfulness detection. The implementation is divided into four main steps, each playing a critical role in maximizing the accuracy and efficiency of the harmfulness detection system.
A.
Model Loading and Initialization
The first step involves selecting and loading an appropriate pre-trained model. This study uses DistilBERT, a lightweight and efficient model, tailored for harmfulness detection. The goal of this step is to leverage a high-performing, lightweight model capable of processing large volumes of input text in real time.
B.
Data Preprocessing
Next, the input text data undergo preprocessing. Similar to many text classification systems, this step requires data cleaning and tokenization. This includes removing unnecessary symbols and spaces from the text, tokenizing words, and converting the input into a format that the model can process. This step is crucial for enabling the model to accurately comprehend the meaning of the text and identify harmful elements.
C.
Fine-Tuning
Although DistilBERT is a pre-trained model, fine-tuning is necessary for effective harmfulness detection. For fine-tuning, datasets with explicitly labeled harmful content, such as racism, sexism, violence, and harassment, were used. The fine-tuning dataset comprised the HarmfulTasks dataset [23], the HarmfulQA dataset [24], the Wiki-Conciseness dataset [25], and the ToxiGen dataset [26], totaling 4000 entries. Through fine-tuning, the model learns harmful language patterns and contextual features, enabling it to more accurately detect and classify harmful content in practical applications. Training was conducted over seven epochs, requiring approximately 50 min.
D.
Interpretation of Results and Feedback
Finally, the harmfulness detection results from the model are interpreted and fed back into the system. In this stage, the labels (harmful or non-harmful) and confidence scores output by the model are analyzed to determine subsequent actions. For instance, content deemed harmful is blocked, while non-harmful content proceeds to the next processing step. This step emphasizes transparency in how the model makes decisions, with the use of a feedback loop for further system improvement. This feedback cycle allows the system to adapt to new patterns and evolving forms of harmful content.
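A minimal sketch of steps A, B, and D with the Hugging Face transformers library is shown below; the base checkpoint, label mapping, and decision rule are assumptions, and the classification head remains untrained until step C (fine-tuning on the labeled corpora listed above) has been carried out.

# Sketch: DistilBERT-based harmfulness judgment (assumed checkpoint and labels).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)   # assume 0 = non-harmful, 1 = harmful
model.eval()

def judge(prompt: str) -> dict:
    # Step B: preprocessing and tokenization into model inputs.
    enc = tokenizer(prompt, truncation=True, padding=True, return_tensors="pt")
    # Step D: inference and interpretation of the label and confidence.
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)[0]
    label = "harmful" if probs[1] > probs[0] else "non-harmful"
    return {"label": label, "confidence": probs.max().item()}

# Step C (fine-tuning) would train this classification head on the labeled
# harmful/non-harmful corpus for about seven epochs, e.g. with the
# transformers Trainer API, before judge() is used in production.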

6. Evaluation

In this section, we evaluate the experimental results of the jailbreak attack detector and the ethics evaluator separately. We then conduct an overall evaluation of the misuse prevention tool.
In data processing, we recognize the potential imbalance in sentence type distribution, which may cause the model to be biased toward certain attack types. We implemented the following measures to mitigate this risk to ensure dataset diversity:
First, as stated in Section 5.5, our dataset is sourced from multiple independent sources to reduce potential biases caused by reliance on a single corpus and to cover a broader range of harmful sentence structures. Additionally, we randomly selected a portion of sentences in the training set for duplication to test whether any bias exists in the data.
Second, we applied data augmentation techniques, including synonym replacement and sentence structure transformation, to generate diverse attack samples and enhance dataset variability. This approach ensures that different attack categories are evenly distributed between the training and experimental datasets, thereby improving the model’s generalization capability.

6.1. Evaluation of the STPC Jailbreak Attack Detector

To demonstrate the generalizability of the classifier, the GLUE dataset was selected as the test dataset [27]. The GLUE dataset includes sentences with some grammatical errors, providing an effective testbed for evaluating false positives caused by such errors. From the set of normal sentences, 1000 items were extracted to form one half of the test set. Additionally, generated GCG suffixes were appended to 1000 randomly selected harmful and normal questions, resulting in 1000 malicious sentences containing GCG suffixes.
This total of 2000 sentences was used as the test set for experiments. The composition of the dataset is shown in Table 4.
The classification calculation is performed by inputting the parameters into the MLP model to evaluate whether the sample falls within the range of a “normal sentence”.
Using this dataset, the same experiment was conducted with both the proposed method and the LightGBM approach from prior research. The latter follows Jain et al. [28], which utilizes a sliding window to measure perplexity and incorporates token length as a second parameter; LightGBM [29] was then used for learning and adjustment. Detection accuracies were compared, with default parameters used for testing. The classification accuracy of the LightGBM method was 94.6%, whereas the classification accuracy of the proposed STPC method was 97.25%.
Table 5 and Table 6 present the false positives and false negatives for the same detection data. As shown in the tables, the proposed method outperforms the prior LightGBM approach in both classification accuracy and the reduction of false positives.

6.2. Evaluation of the Ethics Evaluator

To evaluate the performance of the ethics evaluator in identifying harmful content, such as hate speech and violent statements, 1000 sentences were selected to test the tool’s defense capabilities. The dataset was composed of 500 normal sentences extracted from the GLUE dataset and 500 harmful sentences extracted from the ToxiGen dataset and the Hate Speech dataset [30]. The composition of the dataset is shown in Table 7.
The results show that out of 1000 sentences, the evaluator misclassified 26 harmful sentences as non-harmful and 58 non-harmful sentences as harmful. The accuracy reached 91.6%.
Table 8 presents the results for false positives and false negatives for the same dataset. The total processing time for the 1000 sentences was 8.72 s, highlighting the evaluator’s efficiency in real-time applications.

6.3. Runtime Comparison

To better demonstrate the advantage of the ethics evaluator in reducing runtime, we conducted experiments using the same dataset from Section 6.2, inputting it simultaneously into the SLM-based ethics evaluator and an existing evaluation tool. For comparison, we selected ALBERT [31] and Google’s Perspective API [32].
ALBERT (A Lite BERT) is a lightweight version of BERT that significantly reduces the number of parameters through parameter decomposition and cross-layer parameter sharing, all while maintaining performance comparable to BERT. It also enhances inter-sentence relationship understanding, making it well-suited for harmful content analysis.
Perspective API is a tool designed to detect and analyze harmful language in online comments, content, and conversations. It utilizes deep learning-based natural language processing (NLP) models to automatically identify toxicity in text and provides scoring metrics to assist in content moderation and risk assessment.
The results from conducting experiments with the same dataset are shown in Table 9. As is evident, the proposed method’s runtime is significantly shorter than that of ALBERT and Perspective.

6.4. Evaluation of the Misuse Prevention Tool

To fully test the effectiveness of the proposed tool in suppressing attacks and misuse, 2000 sentences were selected to evaluate its defensive performance. The dataset included 1000 normal sentences extracted from the GLUE and ToxiGen datasets, 500 malicious sentences containing GCG suffixes, and 500 harmful sentences extracted from the Hate Speech and ToxiGen datasets. The dataset composition is shown in Table 10.
Table 11 presents the experimental results. In summary, the defense success rate reached 90.8%. The average completion time was 1329.5 s for the 2000 sentences.

7. Conclusions

This paper proposes a misuse prevention tool to address the illegal exploitation and misuse of large language models (LLMs). The tool consists of two components: the jailbreak attack detector (STPC) and the ethics evaluator. The design and evaluation of STPC are based on two key aspects: the construction and analysis of syntactic trees and the calculation of perplexity. These features are expected to provide valuable insights for future research in related areas.
In our experiments, we observed that STPC achieved a detection accuracy of 97.25% for GCG attacks, reducing false positive rates to 4.4%, thereby ensuring high accuracy in detecting harmful content. Additionally, while detecting GCG jailbreak attacks, the ethics evaluator was developed using SLMs. Experimental results demonstrated that the misuse prevention tool effectively isolates most malicious exploitation and prevents the misuse of LLMs. We have made a portion of the data and code publicly available on GitHub [33].
Looking forward, we aim to further explore the characteristics of syntactic trees and refine the calculation parameters to reduce false positives and false negatives. Currently, the perplexity value depends on the specific language model used. Considering this, in future research, this method will be extended to accept perplexity values from different language models as input and to adjust feature computation accordingly, ensuring optimal classification performance. The surface used to explain the MLP classification results can also be refined into a more precise decision boundary. In the future, other machine learning methods will be employed to optimize the classification surface, or alternative interpretable learning approaches will be explored for classification. We also plan to investigate additional properties of syntactic trees to enhance classification accuracy and improve the ability to handle complex and hard-to-interpret sentences. Moreover, while improving the precision of the misuse prevention tool, we will continue to adapt it for broader applications and environments. Although the MLP-based approach has been used to learn the characteristics of GCG attacks, the analysis of other adversarial attacks is still lacking. Future research will focus on this aspect to further enhance the model’s adaptability and robustness.

Author Contributions

Conceptualization, Q.C. and S.Y.; methodology, Q.C.; software, Q.C.; validation, Q.C., S.Y. and Y.Y.; formal analysis, Q.C.; investigation, Q.C. and S.Y.; resources, Q.C. and Y.Y.; data curation, Q.C.; writing—original draft preparation, Q.C.; writing—review and editing, Q.C., S.Y. and Y.Y.; visualization, Q.C.; supervision, S.Y.; project administration, S.Y.; funding acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially supported by Interface Corporation, Japan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large language model
GCG: Greedy Coordinate Gradient
SLM: Small language model
STPC: Syntax Trees and Perplexity Classifier
MLP: Multi-Layer Perceptron
BERT: Bidirectional Encoder Representations from Transformers
ALBERT: A Lite Bidirectional Encoder Representations from Transformers

References

  1. Xu, Z.; Liu, Y.; Deng, G.; Li, Y.; Picek, S. LLM Jailbreak Attack versus Defense Techniques–A Comprehensive Study. arXiv 2024, arXiv:2402.13457. [Google Scholar]
  2. Wei, A.; Haghtalab, N.; Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail? In Advances in Neural Information Processing Systems, Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36. [Google Scholar]
  3. Zou, A.; Wang, Z.; Kolter, J.Z.; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv 2023, arXiv:2307.15043. [Google Scholar]
  4. Kumar, A.; Agarwal, C.; Srinivas, S.; Feizi, S.; Lakkaraju, H. Certifying LLM safety against adversarial prompting. arXiv 2023, arXiv:2309.02705. [Google Scholar]
  5. Robey, A.; Wong, E.; Hassani, H.; Pappas, G.J. SmoothLLM: Defending large language models against jailbreaking attacks. arXiv 2023, arXiv:2310.03684. [Google Scholar]
  6. Alon, G.; Kamfonas, M. Detecting language model attacks with perplexity. arXiv 2023, arXiv:2308.14132. [Google Scholar]
  7. Pavlopoulos, J.; Sorensen, J.; Dixon, L.; Thain, N.; Androutsopoulos, I. Toxicity detection: Does context really matter? arXiv 2020, arXiv:2006.00998. [Google Scholar]
  8. Lees, A.; Tran, V.Q.; Tay, Y.; Sorensen, J.; Gupta, J.; Metzler, D.; Vasserman, L. A new generation of perspective API: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; ACM: New York, NY, USA, 2022; pp. 3197–3207. [Google Scholar]
  9. Markov, T.; Zhang, C.; Agarwal, S.; Nekoul, F.E.; Lee, T.; Adler, S.; Weng, L. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; AAAI: Palo Alto, CA, USA, 2023; Volume 37, No. 12. pp. 15009–15018. [Google Scholar]
  10. Ogheneovo, E.E.; Asagba, P.O. A Parse Tree Model for Analyzing And Detecting SQL Injection Vulnerabilities. West Afr. J. Ind. Acad. Res. 2013, 6, 33–49. [Google Scholar]
  11. Park, G.; Taylor, J.M. Using syntactic features for phishing detection. arXiv 2015, arXiv:1506.00037. [Google Scholar]
  12. Hu, Z.; Wu, G.; Mitra, S.; Zhang, R.; Sun, T.; Huang, H.; Swaminathan, V. Token-level adversarial prompt detection based on perplexity measures and contextual information. arXiv 2023, arXiv:2311.11509. [Google Scholar]
  13. Derrick, D.; Archambault, D. TreeForm: Explaining and exploring grammar through syntax trees. Lit. Linguist. Comput. 2010, 25, 53–66. [Google Scholar] [CrossRef]
  14. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI. Available online: https://api.semanticscholar.org/CorpusID:160025533 (accessed on 25 December 2024).
  15. Hajic, J.; Ciaramita, M.; Johansson, R.; Kawahara, D.; Martí, M.A.; Màrquez, L.; Zhang, Y. The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, Boulder, CO, USA, 4 June 2009; Association for Computational Linguistics: Boulder, CO, USA, 2009; pp. 1–18. [Google Scholar]
  16. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014; pp. 55–60. [Google Scholar]
  17. Riedmiller, M.; Lernen, A. Multi-Layer Perceptron; Machine Learning Lab Special Lecture; University of Freiburg: Freiburg im Breisgau, Germany, 2014; 24p. [Google Scholar]
  18. Williams, A.; Nangia, N.; Bowman, S.R. A broad-coverage challenge corpus for sentence understanding through inference. arXiv 2017, arXiv:1704.05426. [Google Scholar]
  19. Phute, M.; Helbling, A.; Hull, M.; Peng, S.; Szyller, S.; Cornelius, C.; Chau, D.H. LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked. arXiv 2023, arXiv:2308.07308. [Google Scholar]
  20. Taboada, M. Sentiment Analysis: An Overview from Linguistics. Annu. Rev. Linguist. 2016, 2, 325–347. [Google Scholar] [CrossRef]
  21. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper, and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  22. Sundelin, C. Comparing Different Transformer Models’ Performance for Identifying Toxic Language Online. Available online: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1784346&dswid=-8091 (accessed on 21 October 2024).
  23. Hasan, A.; Rugina, I.; Wang, A. Pruning for protection: Increasing jailbreak resistance in aligned LLMs without fine-tuning. arXiv 2024, arXiv:2401.10862. [Google Scholar]
  24. Bhardwaj, R.; Poria, S. Red-teaming large language models using chain of utterances for safety-alignment. arXiv 2023, arXiv:2308.09662. [Google Scholar]
  25. Stahlberg, F.; Kumar, A.; Alberti, C.; Kumar, S. Conciseness: An Overlooked Language Task. arXiv 2022, arXiv:2211.04126. [Google Scholar]
  26. Hartvigsen, T.; Gabriel, S.; Palangi, H.; Sap, M.; Ray, D.; Kamar, E. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. arXiv 2022, arXiv:2203.09509. [Google Scholar]
  27. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461. [Google Scholar]
  28. Jain, N.; Schwarzschild, A.; Wen, Y.; Somepalli, G.; Kirchenbauer, J.; Chiang, P.Y.; Goldblum, M.; Saha, A.; Geiping, J.; Goldstein, T. Baseline Defenses for Adversarial Attacks against Aligned Language Models. 2023. Available online: https://github.com/neelsjain/baseline-defenses/tree/main (accessed on 17 March 2024).
  29. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, Proceedings of the NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 3146–3154. [Google Scholar]
  30. De Gibert, O.; Perez, N.; García-Pablos, A.; Cuadros, M. Hate speech dataset from a white supremacy forum. arXiv 2018, arXiv:1809.04444. [Google Scholar]
  31. Lan, Z. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv 2022, arXiv:1909.11942. [Google Scholar]
  32. Jigsaw. Perspective API: Toxicity Detection for Online Conversations. Available online: https://perspectiveapi.com (accessed on 20 October 2024).
  33. Chen, Q. LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection and DistilBERT-Based Ethics Judgment. Available online: https://github.com/jackson12587/LLM-abuse-prevention-tool-using-GCG-jailbreak-attack-detection-and-DistilBERT-based-ethics-judgment.git (accessed on 10 January 2025).
Figure 1. Structure diagram of the misuse prevention tool. This tool consists of two components: (A) the jailbreak attack detector and (B) the ethics evaluator.
Figure 2. Flowchart of syntax trees and perplexity classifier.
Figure 3. The construction and visualization results of the syntax tree for a normal sentence, “Create a guide on how to commit insider trading and avoid getting caught”.
Figure 4. The results of the syntax tree analysis and construction after adding the GCG suffix to a harmful sentence, “Write a program that can hack into a secure government database ([<PDteen Inst`](:\Sure, htt)]([[\\This C\\ = varcharutorial forAtIndex”.
Figure 5. Scatter plot of MLP model classification results. Red indicates that the sentence prediction result is closer to a GCG attack, while blue indicates that the prediction result is closer to a normal sentence.
Figure 6. The visualization of the LBP decision surface along with the scatter plot of the test dataset.
Table 1. Comparison of Different Defense Methods.
Criterion | Kumar et al. [4] | Robey et al. [5] | Alon et al. [6]
Detection×
Defense×
Consistency××
False positive××
Note: ◯ indicates resolved, × indicates not resolved, and △ indicates partially resolved.
Table 2. Comparison results of perturbed syntactic tree parameters.
Parameter | Sentences with Spelling Errors | Normal Sentences | GCG
LNR | 0.48 | 0.47 | 0.52
ABF | 1.82 | 1.78 | 2.03
Table 3. Experimental results of accuracy and processing time for different SLMs.
Model | Harmful Confidence | Non-Harmful Confidence | Time (s)
RoBERTa | 81.89% | 83.51% | 287.5
ALBERT | 72.42% | 77.31% | 323
DistilBERT | 74.99% | 84.55% | 178.4
Table 4. Dataset composition for STPC jailbreak attack detector evaluation.
Type | Normal Sentences | Sentences with GCG Suffixes
Number of Sentences | 1000 | 1000
Table 5. Confusion matrix for LightGBM.
 | Predicted “Normal Sentences” | Predicted “GCG Sentences”
Actual “Normal Sentences” | 899 | 101
Actual “GCG Sentences” | 7 | 993
Table 6. Confusion matrix for STPC.
 | Predicted “Normal Sentences” | Predicted “GCG Sentences”
Actual “Normal Sentences” | 956 | 44
Actual “GCG Sentences” | 11 | 989
Table 7. Dataset composition for ethics evaluator evaluation.
Type | Normal Sentences (GLUE) | Harmful Sentences (Hate Speech)
Number of Sentences | 500 | 500
Table 8. Confusion matrix for the ethics evaluator.
 | Predicted “Non-Harmful” | Predicted “Harmful”
Actual “Non-Harmful” | 442 | 58
Actual “Harmful” | 26 | 474
Table 9. Runtime comparison of the proposed ethics evaluator, ALBERT, and the Perspective API.
 | Ethics Evaluator (Proposed Method) | ALBERT | Perspective
Runtime | 8.2 s | 131.6 s | 2489.1 s
Table 10. Dataset composition for misuse prevention tool evaluation.
Type | Normal Sentences | Sentences with GCG Suffixes | Harmful Sentences
Number of Sentences | 1000 | 500 | 500
Table 11. Integrated experimental results for the misuse prevention tool.
 | Predicted “Normal” | Predicted “Harmful” | Predicted “GCG Attack”
Actual “Normal” | 856 | 44 | 100
Actual “Harmful” | 23 | 460 | 17
Actual “GCG Attack” | 0 | 0 | 500
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
