Article

Targeted Training Data Extraction—Neighborhood Comparison-Based Membership Inference Attacks in Large Language Models

1 State Grid Hubei Information & Telecommunication Company, Wuhan 430048, China
2 Hubei Key Laboratory of Smart Internet Technology, School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China
3 Hubei Huazhong Electric Power Technology Development Co., Ltd., Wuhan 430074, China
4 Hubei ChuTianYun Co., Ltd., Wuhan 430076, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7118; https://doi.org/10.3390/app14167118
Submission received: 3 July 2024 / Revised: 1 August 2024 / Accepted: 2 August 2024 / Published: 14 August 2024

Abstract: A large language model is a deep learning model characterized by extensive parameters and pretraining on a large-scale corpus, used to process natural language text and generate high-quality output. The increasing deployment of large language models has drawn significant attention to their associated privacy and security issues. Recent experiments have demonstrated that training data can be extracted from these models because of their memorization effect. Research on training data extraction initially focused on non-targeted methods; however, following the introduction of targeted training data extraction by Carlini et al., prefix-based methods that generate suffixes have attracted considerable interest, although current extraction precision remains low. This paper focuses on the targeted extraction of training data and employs various methods to enhance the precision and speed of the extraction process. Building on the work of Yu et al., we comprehensively analyze how different suffix generation methods affect suffix generation precision and examine the quality and diversity of the text produced by different generation strategies. We also apply membership inference attacks based on neighborhood comparison to training data extraction in large language models, thoroughly evaluating their effectiveness and comparing the performance of different membership inference attacks. Hyperparameter tuning is performed on multiple parameters to further improve extraction. Experimental results indicate that the proposed method significantly improves extraction precision compared with previous approaches.

1. Introduction

In recent years, large models have developed rapidly, and research related to them has received extensive attention, including distributed computing [1,2] to improve training speed, service deployment [3,4] to optimize inference delay, and large model security to protect user privacy [5]. Almost all online services collect our personal data and may use it to train large language models, yet it is difficult to determine how these models utilize the training data. If sensitive data such as geographical locations, health records, and identity information are used in model training, extraction attacks targeting private data in the model can result in significant disclosure of user privacy. Once sensitive data are leaked, they pose a serious threat to users, potentially leading to identity theft, financial loss, and other harms. The experiments conducted by Carlini et al. showed that, owing to the memorization effect, training data can be extracted by exploiting the overfitting of large language models [6].
However, existing research on training data extraction for large language models is limited, and the precision of most extraction methods remains low. Using non-targeted (indiscriminate) extraction, Carlini et al. demonstrated that only 600 samples could be extracted from the entire 40 GB GPT-2 training dataset, a mere 0.00000015% [6]. Moreover, indiscriminately extracted training data often consist of repetitive and meaningless text. In 2022, Carlini et al. introduced the concept of targeted training data extraction [7], in which suffixes are generated from given prefixes to recover training data samples. Yu et al. studied the impact of various generation strategies on the effectiveness of training data extraction [8]. However, their focus was primarily on the generation and ranking strategies they proposed, without a comprehensive analysis of existing strategies or of the effects of combining them, and they did not perform sufficient parameter tuning for these combinations. Although a precision of 46.7% was achieved, there remains room for improvement. This paper aims to comprehensively analyze the effects of suffix generation and sorting strategies on training data extraction. We improve the membership inference attack method based on neighborhood comparison and test it on this task, and we employ various methods to enhance the precision and efficiency of training data extraction.
In this study, our aim is to improve and test the performance of the membership inference attack [9] method based on neighborhood comparison. Furthermore, we analyze different suffix generation strategies and sorting strategies and study the impact of various methods on the extraction precision of training data, which poses many challenges. First, large language models have multiple suffix generation strategies, with a vast number of possible combinations, significantly increasing the complexity of the work. Second, there are no precedents for utilizing membership inference attacks based on neighborhood comparison for training data extraction in large language models, making it a daunting task to integrate these algorithms into large language models and optimize their performance. Finally, various experimental settings, such as the number of experiments, training equipment, and batch size, also affect the efficiency of training data extraction. Therefore, an efficient training data extraction method requires optimization across multiple aspects.
In this paper, our contributions are fivefold:
  • First, this paper improves the membership inference attack algorithm based on neighborhood comparison and integrates it into the challenge of training data extraction, achieving promising results.
  • Second, this paper comprehensively analyzes the effectiveness of different suffix generation strategies, evaluating them in terms of the diversity and quality of the generated text. Additionally, we examine how variations in generation strategy parameters impact their effectiveness.
  • Third, this paper analyzes the effectiveness of various membership inference attack methods for extracting training data from large language models. We propose the metric $E_{mia}$ to better evaluate the extraction rate of membership inference attacks and highlight the inefficiency of these methods in the context of large language models.
  • Fourth, this paper focuses on achieving a balance between the quality and diversity of the generated text in view of the inefficiency of membership inference attacks. Based on the analysis of the text generated by different strategies, we narrow the range of hyperparameter tuning to efficiently find this balance. Our extraction precision reaches 52.5%, representing a significant improvement over previous work.
  • Fifth, the training data extraction precision for GPT-J-6B reaches 72.2%, which reveals that the privacy leakage problem of large models becomes more serious as model size increases.
The remaining sections of this paper are organized as follows. Section 2 briefly reviews related work. The training data extraction process and the suffix generation and sorting strategies employed in our work are discussed in Section 3. The algorithm for our training data extraction process is presented in detail in Section 4. Section 5 shows the experimental results and provides a detailed analysis, while Section 6 discusses potential defense mechanisms against the training data extraction methods proposed in this paper. Finally, Section 7 concludes the paper.

2. Related Work

2.1. The Risks of Privacy Leakage in Large Language Models

Numerous experiments have demonstrated the possibility of extracting training data from large language models [10]. Zhang et al. proposed a method termed Ethicist [11], which selectively extracts training data by utilizing loss-smoothed soft prompts combined with calibrated confidence estimation; it studies the recovery of suffixes from training data given specific prefixes. Kim et al. introduced a detection tool named ProPILE, designed to evaluate the extent to which large language models may leak personally identifiable information (PII) [12], and they also highlighted the specific risk of PII leakage in ChatGPT. Inan et al. noted that the outstanding performance of language models may be accompanied by the memorization of rare training samples [13]; if a model is trained on confidential user content, this can pose a severe privacy threat. To address this, they proposed a method for identifying user content that may leak during training under strong and realistic threat models. Huang et al. demonstrated that, because large language models memorize training data, there is a tangible risk of personal information leakage during conversations, and this risk escalates as the number of examples increases [14].
In addition, many studies have focused on the extraction of training data from large language models, mainly including non-targeted extraction [15] and targeted extraction [7]. Non-targeted extraction generates a large amount of text from which samples that may belong to the model's training data are selected. Carlini et al. demonstrated such attacks on GPT-2, indiscriminately extracting training data and evaluating the results; they also found that larger models are more susceptible to extraction than smaller ones. Targeted extraction, proposed by Carlini et al. [7], provides a specific prefix and extracts the corresponding suffix. In this way, an attacker can supply a chosen prefix (e.g., “Jim's home address is:”) to extract the training data they need. Targeted training data extraction is thus a more efficient privacy extraction technique, replacing blind, indiscriminate extraction with extraction toward a clear goal. Because the resulting privacy threat is more serious, it has important research value for data privacy protection. Some studies have achieved certain improvements in precision and other aspects: Yu et al. studied suffix generation strategies and the impact of sample sorting strategies on the precision and recall of extraction [8], but the precision of extraction remains low. We likewise focus on targeted training data extraction.

2.2. Membership Inference Attack

Membership Inference Attacks (MIA) are adversarial attacks closely related to privacy protection, aiming to determine whether a specific sample is part of the training dataset of a machine learning model. Even with only black-box access to the model, MIA has demonstrated strong performance across numerous machine learning tasks. Therefore, MIA has significant potential for application and research value in revealing privacy vulnerabilities in machine learning models [16].
Shadow training [17] is one of the commonly used techniques in MIAs. It trains shadow models with the same architecture as the target model on auxiliary data samples in order to approximate the behavior of the target model on its training set. Since current attack methods often require access to the model's prediction confidence scores, Choquette-Choo et al. introduced a label-only attack that infers membership by evaluating the robustness of the model's predicted labels under input perturbations [18]. Carlini et al. developed a likelihood ratio attack (LiRA), which improved the efficacy of membership inference attacks by tenfold at low false positive rates [15].
There remains a lack of consensus on whether existing MIA algorithms will result in significant privacy leakage in practical large language models. Existing MIAs designed for LLMs can be categorized into two types: reference-free attacks and reference-based attacks. Reference-based attacks appear to be promisingly effective in LLMs, as they measure more reliable membership signals by comparing probability differences between the target model and a reference model. However, the performance of reference-based attacks highly depends on the availability of a reference dataset that closely resembles the training dataset, which is often inaccessible in real-world scenarios. Fu et al. [19] proposed a membership inference attack based on Self-Calibrated Probability Variation (SPV-MIA), which constructs a dataset to fine-tune a reference model by prompting the target LLM itself. This approach enables the attacker to collect datasets with similar distributions from public APIs. Mattern et al. developed a neighborhood-based membership inference attack [20], which achieves the effectiveness of reference-based attacks without requiring a reference dataset to train the reference model.

2.3. Other Attacks on Privacy Extraction for Large Models

Since the emergence of large-scale pre-trained models, there has been a surge in attacks targeting large language models [21,22]. Beyond the research discussed in this paper, numerous advanced attacks have been developed, targeting both the language models themselves and their training data.
Reconstruction attacks aim to reconstruct multiple training samples along with their labels, attempting to recover sensitive features or complete data samples based on output labels and partial knowledge of some features. Zhang et al. demonstrated that models with high predictive ability are more susceptible to reconstruction attacks, based on the assumption of weak adversary knowledge [23].
Attribute inference attacks involve adversaries attempting to infer sensitive or personal information about individuals or entities by analyzing the behavior or responses of machine learning models, and are applicable to large language models. Staab et al. conducted a comprehensive evaluation on the ability of pre-trained large language models to infer personal information from text [24]. They demonstrated that current large language models can accurately infer various types of personal information with high precision.
Model extraction is a class of black-box attack methods where adversaries attempt to extract information by creating a substitute model that closely mimics the behavior of the target model. Currently, there are various types of data extraction attacks, including model stealing attacks, gradient leakage and training data extraction attacks. The work by Truong et al. is particularly notable for its ability to replicate models without requiring access to the original model data [25].

3. Training Data Extraction for Large Language Models

3.1. Large Language Models

Language modeling is the task of learning the underlying probability distribution of word sequences in natural language [26]. For a sequence of tokenized words $w_1, \ldots, w_n$, this statistical model is represented as the following joint probability:
$$\Pr(w_1, \ldots, w_n) = \prod_{i=1}^{n} \Pr(w_i \mid w_1, \ldots, w_{i-1})$$
Here, $\Pr(w_i \mid w_1, \ldots, w_{i-1})$ represents the probability of token $w_i$ appearing given the preceding tokens $w_1, \ldots, w_{i-1}$. It has been demonstrated that neural networks can effectively estimate these conditional distributions, and they are therefore used as language models. Given an unsupervised tokenized corpus $W = \{w_1, \ldots, w_n\}$, the standard language modeling objective is to maximize the following likelihood function:
$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \log \Pr(w_i \mid w_1, \ldots, w_{i-1}; \theta)$$
where the conditional probability of $w_i$ is computed by evaluating a neural network with parameters $\theta$ on the sequence $w_1, \ldots, w_{i-1}$.
The quality of a language model is typically measured by two metrics: perplexity and top-k precision. Perplexity measures the likelihood of a text sequence and is defined as $PP(w_1, \ldots, w_n) = 2^{\ell}$, where
$$\ell = -\frac{1}{n} \sum_{i=1}^{n} \log_2 \Pr(w_i \mid w_1, \ldots, w_{i-1})$$
Evaluating the perplexity on unseen data indicates how well the model fits the underlying distribution of the language; a lower perplexity value implies that the language model models the data more effectively. Various architectures are utilized for language models, with recent advancements highlighting the impressive state-of-the-art results achieved by large Transformer-based models [27] across various tasks. The original Transformer comprises two components, an encoder and a decoder, in which each prediction is derived from a weighted combination of the encoder's hidden states and the decoder's previous hidden states; the attack model GPT-Neo 1.3B employed in this paper is a decoder-only autoregressive variant of this architecture.
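To make the loss and perplexity quantities used throughout this paper concrete, the following sketch (our illustration, not code from the original work) computes the perplexity of a text under GPT-Neo 1.3B via the Hugging Face transformers library; the example sentence is a placeholder.
```python
# Minimal sketch: perplexity of a text under a causal language model.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").eval()

def perplexity(text: str) -> float:
    """PP = 2 ** (average negative log2-likelihood of the tokens)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels returns the mean cross-entropy (in nats) over predictions.
        loss_nats = model(**enc, labels=enc["input_ids"]).loss.item()
    return 2.0 ** (loss_nats / math.log(2.0))

print(perplexity("The quick brown fox jumps over the lazy dog."))
```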

3.2. Data Extraction Process

The current training data extraction process can be delineated into two distinct stages, as shown in Figure 1. The first stage entails generating suffixes from given prefixes to create a comprehensive set of samples, which are subsequently prepared for sorting and selection. Specifically, the prefix is input into an autoregressive language model, which calculates the probability of each word in the vocabulary for the next token and generates the subsequent word through various sampling strategies until the generated length meets the predefined criteria. The second stage is suffix sorting, where the likelihood that these generated samples are part of the training data is estimated. Only samples with high likelihood are retained and ranked in descending order of probability. In this paper, we employ a suffix sorting strategy based on membership inference attacks.

3.3. Suffix Generation Strategy

In the first stage of training data extraction, texts are generated under different suffix generation strategies. The choice of strategy affects the generated text and therefore the extraction precision, and these strategies can also be combined with one another.
Language models generate new text by iterative sampling: a token $\hat{x}_{i+1} \sim f_\theta(x_{i+1} \mid x_1, \ldots, x_i)$ is sampled and fed back into the model to obtain $\hat{x}_{i+2} \sim f_\theta(x_{i+2} \mid x_1, \ldots, x_{i+1})$. This process is repeated until a stopping condition is met.
In the domain of large language model generation, the most common decoding strategy is to maximize the probability of text generation, usually referred to as the greedy strategy. While this method is effective in ensuring high-probability text output, ostensibly catering to the objective of extracting training data from large language models, it inevitably sacrifices textual diversity. Moreover, the pursuit of theoretically optimal sequences from the model is impractical due to the infeasibility of exhaustively generating and ranking all potential sequences. Therefore, a prevalent alternative involves the employment of beam search techniques.
Beam-search: Beam search [28] retains only a predetermined optimal subset of partial solutions. Specifically, beam search maintains a candidate set of size n. At each step, it selects the top n words from the probability distribution of each candidate sequence and then retains the n candidate sequences with the highest overall probabilities. This method strikes a balance between the quality and diversity of the generated text. However, because it retains only the top n optimal solutions at each step, beam search often produces text that lacks diversity, leading to outputs that may appear overly conservative and unnatural.
Top-k [29]: This method involves sampling from the top k tokens, thereby granting lower probability tokens an opportunity for selection. This introduction of randomness often enhances the diversity of the generated text. Specifically, at each step, random sampling is executed from the k highest probability words, excluding those with lower probabilities. The sampling probability among the top k words is determined by their respective likelihood scores.
Top-p [30]: At each step, random sampling is conducted only from the smallest set of words whose cumulative probability exceeds a certain threshold p, disregarding words with lower probability. This approach is similar to the top-k method, but top-p dynamically adjusts the size of the token candidate list. Therefore, this strategy helps to mitigate the selection of inappropriate or irrelevant words while preserving the potential for interesting or creative outcomes.
Typical-p: The typical-p [31] sampling method, grounded in information theory, constructs the sampling space by initially including words with the lowest conditional entropy. This process continues until the cumulative probability of all words in the sampling space surpasses a predefined threshold p. Specifically, this method begins by calculating the conditional entropy for each word and subsequently sorting the vocabulary based on these entropy values. Words with the lowest conditional entropy are progressively selected until their cumulative probability exceeds the threshold. Sampling is then performed according to this newly established probability distribution. The primary advantage of the typical-p sampling method is its ability to enhance the typicality of the generated text, thereby producing sequences that are more representative of natural language.
Temperature [32]: Temperature T represents a strategy for adjusting probability distributions, employing a local re-normalization method with an annealing factor. When T > 1, this technique elevates the probability of selecting lower-probability tokens, consequently diminishing the model’s confidence in the generated text while concurrently augmenting its diversity. Conversely, when T < 1, it enables the language model to select tokens with higher confidence, albeit at the cost of reducing the diversity of the generated sequences.
$$P_i = \frac{e^{z_i / T}}{\sum_{j=1}^{|V|} e^{z_j / T}}$$
where $P_i$ represents the temperature-normalized probability of token $i$, $z_i$ is the original score (logit), and $|V|$ is the size of the vocabulary.
Repetition-penalty: Repetition-penalty [33] is another adjustment strategy for probability distributions, specifically engineered to control the penalty coefficient for repeated words in text generation tasks. This parameter modulates how extensively the model penalizes repeated words during text generation, thus preventing excessive redundancy in the output. During text generation, models sometimes tend to produce the same words or phrases, which may result in text that lacks diversity or readability. To mitigate this, the repetition-penalty parameter is employed to impose penalties on the recurrence of words, thereby promoting the production of more varied text. The value of the repetition-penalty is typically a real number not less than 1, reflecting the intensity of the imposed penalties. A value exceeding 1 increases the severity of penalties on repeated words, effectively diminishing their frequency in the generated text. Conversely, a value of 1 or less applies a less stringent penalty, potentially increasing the repetition. Notably, in the context of extracting training data, the application of a repetition-penalty can have adverse effects, as it may suppress beneficial repetitions.
$$P_i = \frac{e^{z_i / I(c)}}{\sum_{j=1}^{|V|} e^{z_j / I(c)}}$$
$$I(c) = \begin{cases} 1, & \text{if this token has not appeared previously} \\ r, & \text{if this token has appeared previously} \end{cases}$$
where $r$ represents the value of the repetition penalty parameter.
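The suffix generation strategies above map directly onto standard decoding arguments of the Hugging Face `generate` API, as in the following illustrative sketch; the prefix and all parameter values are placeholders rather than the tuned settings used in Section 5.
```python
# Illustrative sketch of beam search, top-k, top-p, typical-p, temperature,
# and repetition penalty as decoding arguments of model.generate.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").eval()

def generate(prefix: str, **decoding_kwargs) -> str:
    inputs = tokenizer(prefix, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50,
                         pad_token_id=tokenizer.eos_token_id, **decoding_kwargs)
    return tokenizer.decode(out[0], skip_special_tokens=True)

prefix = "Rob's phone number is:"

# Beam search: keep the num_beams most probable partial sequences at each step.
print(generate(prefix, num_beams=3, do_sample=False))
# Top-k sampling: sample only among the k most probable next tokens.
print(generate(prefix, do_sample=True, top_k=10))
# Top-p (nucleus) sampling combined with temperature scaling of the logits.
print(generate(prefix, do_sample=True, top_p=0.6, temperature=0.8))
# Typical-p sampling with a repetition penalty (values > 1 discourage repeats).
print(generate(prefix, do_sample=True, typical_p=0.9, repetition_penalty=1.1))
```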

3.4. Ranking Strategy

After generating many text samples, it is necessary to rank these samples to determine which samples have a higher probability of belonging to the training dataset. This ranking is typically achieved through a perplexity-based membership inference attack.
Loss-based MIA [34]: Loss-based MIAs exploit the output loss of a model to deduce whether a particular sample is included in its training dataset. Specifically, the attacker compares the target sample with samples in the training dataset by observing the loss value of the model. A close alignment between the loss value of the target sample and those of training data samples strongly suggests that the target sample is likely a member of the training data.
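As a minimal illustration, loss-based ranking reduces to sorting candidates by their loss (or, equivalently, perplexity) under the target model; the `candidates` list and the `perplexity` helper from the Section 3.1 sketch are assumptions, not part of the original work.
```python
# Loss-based ranking sketch: lower perplexity/loss => more likely a training member.
candidates = ["generated suffix 1 ...", "generated suffix 2 ..."]  # placeholders
ranked = sorted(candidates, key=perplexity)   # ascending: best candidate first
best_guess = ranked[0]
```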
Zlib entropy-based MIA [35]: Zlib entropy-based MIAs leverage the Zlib compression algorithm [36] to evaluate the information entropy of text samples, thereby inferring whether the text belongs to the model’s training dataset. Specifically, for a given generated sample, the Zlib entropy is calculated as follows:
$$E_{Zlib} = \frac{S_{original}}{S_{compressed}}$$
where $S_{original}$ and $S_{compressed}$ denote the sizes of the generated sample before and after compression with the Zlib compression algorithm, respectively.
Generally, lower Zlib entropy indicates that the original text carries more information, whereas higher Zlib entropy suggests that it carries less. Zlib entropy-based MIAs typically follow two implementation approaches, as given in Equations (8) and (9), respectively, where $P$ is the perplexity of the generated sample and $P'$ is its score after Zlib entropy processing. The first approach divides the perplexity by the Zlib entropy, favoring generated text with higher Zlib entropy, which indicates greater repetition within the text.
$$P' = \frac{P}{E_{Zlib}}$$
The other multiplies the perplexity by the Zlib entropy, favoring generated text with lower Zlib entropy and thereby allowing the generated text to contain more information.
$$P' = P \cdot E_{Zlib}$$
In the sorting process, generated texts with smaller $P'$ are selected as training data.
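The two Zlib variants can be sketched as follows; this is an illustrative implementation that assumes the `perplexity` helper from the Section 3.1 sketch and takes $E_{Zlib}$ as the ratio of uncompressed to zlib-compressed size.
```python
# Hedged sketch of the Zlib-entropy ranking variants in Equations (8) and (9).
import zlib

def zlib_entropy(text: str) -> float:
    raw = text.encode("utf-8")
    return len(raw) / max(len(zlib.compress(raw)), 1)

def zlib_score(text: str, mode: str = "divide") -> float:
    """Lower scores are ranked higher (more likely training data)."""
    p = perplexity(text)              # assumed helper from the earlier sketch
    e = zlib_entropy(text)
    return p / e if mode == "divide" else p * e
```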
Neighborhood comparison: Drawing inspiration from reference-model-based membership inference attacks and addressing their need for high-quality reference training data, Mattern et al. proposed a neighborhood-based membership inference attack [20]. Given a target sample x, multiple highly similar neighborhood samples are generated by substituting words using a pretrained masked language model, BERT in this paper. Both the generated neighborhood samples and the original sample are input into the target model to calculate their respective loss values. Since the target model tends to overfit its training data, the difference between the loss of a training-member target sample and the average loss of its neighborhood samples is expected to be smaller than for other samples. The decision rule is as follows:
$$A_{f_\theta}(x) = \mathbb{1}\left[\left(L(f_\theta, x) - \frac{1}{n}\sum_{i=1}^{n} L(f_\theta, \tilde{x}_i)\right) < \gamma\right]$$
where $\tilde{x}_1, \ldots, \tilde{x}_n$ is a set of $n$ neighbors of $x$ and $\gamma$ is the decision threshold.
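The following simplified sketch illustrates the neighborhood comparison score; it is our illustration rather than the authors' implementation: each neighbor replaces one word with BERT's top masked-token prediction, whereas the full method considers multiple swap positions and candidates per position.
```python
# Simplified neighbourhood comparison sketch: score = mean neighbour loss - sample loss.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM

target_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
target_lm = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").eval()
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def lm_loss(text: str) -> float:
    enc = target_tok(text, return_tensors="pt")
    with torch.no_grad():
        return target_lm(**enc, labels=enc["input_ids"]).loss.item()

def make_neighbors(text: str, n: int = 10) -> list:
    """Build up to n neighbours by swapping one word at a time via BERT."""
    words = text.split()
    neighbors = []
    for i in range(min(n, len(words))):
        masked = words.copy()
        masked[i] = bert_tok.mask_token
        enc = bert_tok(" ".join(masked), return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = bert_mlm(**enc).logits
        pos = (enc["input_ids"][0] == bert_tok.mask_token_id).nonzero()[0, 0]
        masked[i] = bert_tok.decode([logits[0, pos].argmax().item()]).strip()
        neighbors.append(" ".join(masked))
    return neighbors

def neighborhood_score(text: str) -> float:
    """Larger scores suggest the sample is a training member."""
    neigh_losses = [lm_loss(t) for t in make_neighbors(text)]
    return sum(neigh_losses) / len(neigh_losses) - lm_loss(text)
```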

4. Proposed Algorithm

In this paper, we propose a targeted training data extraction method for large language models using membership inference attacks based on neighborhood comparison. In this algorithm, a prefix of length 50 is fed into the attack model to generate a text of length 100, and this process is repeated NUM_TRIAL times, yielding NUM_TRIAL generated texts. For each generated text, one token at a time is masked and the masked position is filled in by BERT's masked-token prediction; replacing tokens at different positions of a generated text yields a set of neighborhood texts. The score of a generated text is the average loss of its neighborhood samples minus the loss of the generated text itself. The training data sample extracted for a prefix is the generated text with the highest score among the NUM_TRIAL candidates. To our knowledge, this is the first time that the membership inference attack based on neighborhood comparison has been used for targeted training data extraction from large language models, and we improve it to achieve the best effect. In the suffix generation stage, we carefully combine and tune strategies so that the generated texts contain as much training data as possible, facilitating selection by the membership inference attack. The detailed procedure is provided in Algorithm 1.
Algorithm 1 Training data extraction
Input: attack model $M_a$, neighborhood generation model $M_n$, prefix $p = (w_1, \ldots, w_{50})$
Output: suffix $s$
1:  $trial \leftarrow 0$
2:  while $trial <$ NUM_TRIAL do
3:      $s_1 \leftarrow M_a.\mathrm{generate}(p)$
4:      $s_{num} \leftarrow \mathrm{append}(s_{num}, s_1)$
5:      $trial \leftarrow trial + 1$
6:  end while
7:  for each $sample \in s_{num}$ do
8:      $sample = (w_1, \ldots, w_{100})$
9:      for $i = 51$ to $100$ do
10:         compute $p_{swap}(w_i, \hat{w}_i)$ for all $\hat{w}_i \in V$ using $M_n$
11:     end for
12:     if $p_{swap}(w_i, \hat{w}_i)$ is among the top $n$ highest then
13:         $neighbor \leftarrow (w_1, \ldots, \hat{w}_i, \ldots, w_{100})$
14:     end if
15:     $L \leftarrow M_a.\mathrm{loss}(sample)$
16:     $\hat{L} \leftarrow M_a.\mathrm{loss}(neighbor)$
17:     $score \leftarrow \mathrm{append}(score, \hat{L} - L)$
18: end for
19: if the value of $score$ for $sample$ is the maximum then
20:     $s \leftarrow sample$
21: end if
22: return $s$
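A compact illustration of Algorithm 1 is sketched below, reusing the hypothetical `target_tok`, `target_lm`, and `neighborhood_score` objects from the Section 3.4 sketch; the decoding settings are placeholders rather than the tuned configuration reported in Section 5.
```python
# Sketch of Algorithm 1: generate NUM_TRIAL candidate suffixes for a prefix and
# keep the candidate ranked highest by the neighbourhood comparison score.
NUM_TRIAL = 20

def extract_suffix(prefix: str) -> str:
    inputs = target_tok(prefix, return_tensors="pt")
    candidates = []
    for _ in range(NUM_TRIAL):
        out = target_lm.generate(**inputs, max_new_tokens=50, do_sample=True,
                                 top_k=2, pad_token_id=target_tok.eos_token_id)
        candidates.append(target_tok.decode(out[0], skip_special_tokens=True))
    # The candidate whose loss is lowest relative to its neighbours wins.
    return max(candidates, key=neighborhood_score)
```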

5. Experimental Results

5.1. Experimental Basis

Dataset: The dataset employed in this paper comprises 1000 samples selected by Carlini et al. from the Pile dataset. Each sample contains 100 tokens and serves as the validation set for evaluating extraction precision. The initial 50 tokens of each sample form the prefix dataset used for training data extraction.
Attack Model: The attack model used in this paper is GPT-Neo 1.3B (https://huggingface.co/EleutherAI/gpt-neo-1.3B, accessed on 1 July 2024), a pre-trained autoregressive language model. Developed by EleutherAI, GPT-Neo 1.3B has 1.3 billion parameters and adopts a Transformer architecture similar to OpenAI's GPT models, comprising multiple Transformer layers, each containing self-attention mechanisms and feed-forward neural networks. GPT-Neo 1.3B is an open-source model trained on the publicly available Pile dataset, making it a good choice for training data extraction.
Neighborhood Generation Model: For neighborhood-based membership inference attacks, BERT [37] is selected as the model for generating neighborhood samples in this experiment.
Experimental Setup: The experiments are conducted on an NVIDIA GeForce RTX 4060 Ti GPU with 16 GB of memory. The batch size is set to 32. A fixed random seed is employed to ensure the consistency of experimental results across multiple runs.

5.2. Evaluation Metrics

$M_p$: $M_p$ denotes precision, defined, for a given set of prefixes, as the proportion of prefixes for which the generated and selected suffix is correct. This metric primarily evaluates the precision of the samples selected by the membership inference attack and is the most critical measure of the effectiveness of training data extraction.
$$M_p = \frac{n_{cp}}{N} \times 100\%$$
where $n_{cp}$ is the number of correctly generated and selected samples, and $N$ is the total number of given prefixes.
$M_{np}$: To independently evaluate the efficacy of the suffix generation strategy, we introduce the total precision of the generated texts across multiple trials. In the suffix generation process, a set of suffixes is produced from a single prefix $x_i$, and a flag $f_{x_i}$ is assigned to each prefix: if at least one correct suffix exists within this group, $f_{x_i}$ is set to 1; otherwise, it is set to 0. Thus, $M_{np}$ is calculated as follows:
$$M_{np} = \frac{\sum_{i=1}^{N} f_{x_i}}{N} \times 100\%$$
$D_{edit}$: Edit distance is a metric that quantifies the dissimilarity between two strings as the minimum number of edit operations required to transform one string into the other; it reflects the quality of the generated text. For two strings a and b, with lengths |a| and |b|, respectively, their Levenshtein distance $D_{a,b}(i, j)$ is given by
$$D_{a,b}(i,j) = \begin{cases} \max(i,j), & \text{if } \min(i,j) = 0 \\ \min\!\big( D_{a,b}(i-1,j) + 1,\; D_{a,b}(i,j-1) + 1,\; D_{a,b}(i-1,j-1) + \mathbb{1}_{(a_i \neq b_j)} \big), & \text{otherwise} \end{cases}$$
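For reference, the recurrence above corresponds directly to the following dynamic programming implementation (an illustrative sketch):
```python
# Levenshtein distance via a (|a|+1) x (|b|+1) dynamic programming table.
def levenshtein(a: str, b: str) -> int:
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                               # i deletions
    for j in range(n + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + cost) # substitution
    return D[m][n]

assert levenshtein("kitten", "sitting") == 3
```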
$E_{mia}$: To exclusively evaluate the impact of membership inference attacks, we introduce $E_{mia}$, defined as the precision $M_p$ divided by the total precision $M_{np}$ across multiple trials. Additionally, once the membership inference attack method is fixed, $E_{mia}$ can also serve as a metric for evaluating the quality of the texts generated by suffix generation strategies.
$$E_{mia} = \frac{M_p}{M_{np}} \times 100\%$$
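As an illustration of how these metrics relate, the following sketch computes $M_p$, $M_{np}$, and $E_{mia}$ from hypothetical lists of generated candidates, selected suffixes, and ground-truth suffixes (all names are placeholders, not artifacts of the original work).
```python
# Illustrative computation of the evaluation metrics defined above.
def metrics(generated, chosen, truth):
    # generated[i]: candidate suffixes for prefix i; chosen[i]: selected suffix;
    # truth[i]: ground-truth suffix.
    N = len(truth)
    n_cp = sum(chosen[i] == truth[i] for i in range(N))
    n_flag = sum(any(s == truth[i] for s in generated[i]) for i in range(N))
    M_p = 100.0 * n_cp / N        # precision of selected suffixes
    M_np = 100.0 * n_flag / N     # total precision over all generated candidates
    E_mia = 100.0 * M_p / M_np if M_np > 0 else 0.0
    return M_p, M_np, E_mia
```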

5.3. Suffix Generation Results

In this section, individual tests are conducted for beam search, top-k, top-p, and typical-p to assess their generation effectiveness, and the impact of temperature and repetition penalty on generation effectiveness is also evaluated. To compare the diversity of the text generated by different strategies, experiments are conducted with the number of trials set to 1 and to 20. $M_{1p}$ denotes the overall precision in a single trial, while $M_{20p}$ denotes the overall precision across 20 trials; $M_p$ indicates the extraction precision across 20 trials, and $D_{edit}$ measures the edit distance between the correct answer and the generated samples in one trial. A larger $M_{20p}$ signifies greater diversity in the generated text, whereas a smaller $D_{edit}$ indicates higher quality. To control variables, a loss-based suffix sorting strategy is employed for the membership inference attack.
As shown in Table 1, beam search retains the top num_beam options at each step, making it the strategy closest to greedy search for suffix generation. Consequently, samples generated through the beam search strategy typically exhibit lower loss values and higher overall confidence, resulting in high-quality text generation. However, since beam search consistently selects samples with the lowest loss values, the diversity of the generated text is limited, which can be a significant drawback, especially when suffix sorting strategies perform well.
The top-k suffix generation strategy samples from the top-k ranked candidates at each time step. Compared to beam search, this approach includes tokens with lower probabilities. As shown in Table 2, the quality of text generated by top-k is generally poorer, and as the parameter k increases, the quality of the generated text further deteriorates. This decline can be attributed to the increased likelihood of selecting tokens with extremely low probabilities as k grows, which significantly impacts the overall quality of the generated text. However, top-k generates samples with higher diversity, making it a preferable choice when suffix sorting strategies perform well.
To test the diversity of text generated by the top-k strategy, we set top-k = 2 and conducted 200 trials to observe the achievable $M_{np}$ and $M_p$ values. As illustrated in Figure 2, both the total precision $M_{np}$ and the precision $M_p$ exhibit a gradually rising trend as the number of trials increases. The increase is relatively rapid when the number of trials is fewer than 20, but the rate of increase slows down significantly once the number of trials exceeds 20.
The top-p suffix generation strategy addresses the issue of selecting low-probability tokens inherent in the top-k strategy. As shown in Table 3, when the top-p value is relatively small (less than 0.6), it effectively prevents the occurrence of low-probability tokens, resulting in performance similar to beam-search. However, when the top-p value is larger (greater than 0.6), the performance of top-p becomes similar to that of top-k.
Typical-p sampling tends to select words with low conditional entropy, thereby producing texts with high typicality. As illustrated in Table 4, the results generated by typical-p sampling are comparable to those of top-p sampling, although they are marginally less effective.
We visually compare the above four suffix generation strategies in Table 5. The comparison is based on three aspects: text quality, text diversity, and runtime. The beam search generation strategy demonstrates a distinct advantage in text quality, while the top-k generation strategy exhibits a clear advantage in text diversity.
Because top-k sampling produces highly diverse text and beam search produces high-quality text, we use both to assess the impact of temperature on suffix generation impartially and accurately, setting top-k = 3 (Table 6) and beam-search = 3 (Table 7). The suffix sorting strategy used in these tests is the loss-based membership inference attack. The results show that an appropriate temperature parameter can improve the precision of training data extraction.
Repetition_penalty is a mechanism designed to reduce repeated texts by penalizing repetitions when its value exceeds 1. Observing the impact of repetition_penalty under the condition of num_beam = 3, as shown in Table 8, reveals that its impact is minimal and predominantly negative. This is due to the model’s tendency to overfit to the training data during extraction, where penalizing repetition may result in the selection of incorrect tokens.

5.4. Suffix Sorting Results

In this section, we conduct experimental evaluations and comparisons of five suffix sorting methods: the loss-based membership inference attack, the Zlib entropy-based membership inference attacks (×Zlib and ÷Zlib), the neighborhood comparison-based membership inference attack, and the high-confidence token reward strategy. The chosen generation strategy is top-k = 2, with 20 trials. The primary evaluation metrics are the precision, $E_{mia}$, and the runtime t.
As shown in Table 9, our implemented membership inference attack method based on neighborhood comparison achieves high precision. However, the extensive time required to generate neighborhood samples results in a lower extraction speed. Additionally, the Zlib entropy-based membership inference attack (×Zlib) demonstrates the highest extraction rate $E_{mia}$ and a shorter runtime, indicating superior performance. This result is somewhat unexpected. Upon examining the data to be extracted, we offer the following explanation: the training data selected by Carlini et al. are used in the extraction process, and Carlini et al. appear to have intentionally chosen training data containing more information. Consequently, in this experiment, the Zlib entropy-based (×Zlib) membership inference attack performs exceptionally well, and we conclude that it is particularly suited to extracting training data that contain more valuable information. The basic loss-based membership inference attack also shows good performance, whereas the Zlib entropy-based membership inference attack (÷Zlib) and the high-confidence token reward strategy exhibit average performance.

5.5. Results of the Training Data Extraction Challenge

Due to the generally low efficiency of membership inference attack methods in this study, it is necessary to balance text quality and diversity during suffix generation to achieve higher extraction precision. This balance not only yields a higher overall precision across multiple trials, providing more options, but also makes it easier for suffix sorting strategies to select the correct training members. The specific hyperparameter choices and their values are shown in Table 10. In consideration of computing power, the number of trials is set to 20. Analysis of many different experiments shows that excessive parameter tuning reduces the diversity of the generated text and that temperature is the most useful parameter for adjusting it. To balance the quality and diversity of the generated text, beam search, top-k, and top-p are each combined with temperature, and hyperparameter tuning is performed for each combination; the best-performing combination is found by manual hyperparameter search. It is also important to choose an appropriate suffix sorting strategy. In the tests conducted in this experiment, the differences among the various suffix ranking strategies are not particularly significant; our implemented membership inference attack based on neighborhood comparison and the Zlib entropy-based (×Zlib) membership inference attack perform slightly better. Additionally, we reproduce the currently most accurate strategy (dynamic context window + rewarding high-confidence tokens) proposed by Yu et al. for comparison. As shown in Figure 3, the combination of beam search, temperature, and the Zlib entropy-based (×Zlib) membership inference attack achieves the highest precision of 52.5%, improving on the previous best by 5.8 percentage points. Our implemented combination of beam search, temperature, and the neighborhood comparison-based membership inference attack also achieves a high precision of 52.1%, an improvement of 5.4 percentage points.
This paper also compares the runtime of different strategies. As shown in Figure 4, the runtimes of other training data extraction strategies were all within 1 h, showing a significant improvement over previous extraction speeds and greatly enhancing the efficiency of training data extraction. However, the neighborhood comparison-based membership inference attack method we implemented took considerably longer, highlighting a major drawback of this training data extraction approach.

5.6. Effect Comparison of Different Attack Models

To test the effect of model size on the effectiveness of training data extraction, four models with different parameter counts, all trained on the Pile dataset, were selected. As can be seen from Table 11, as the model becomes larger, the extraction precision of the training data improves significantly. For the GPT-J-6B model, the extraction precision reaches 72.2%. This shows that, with the same training data, the more parameters a model has, the more pronounced its overfitting to the training data becomes. Due to computational limitations, we were unable to evaluate larger models, but we expect that larger models would show even stronger extraction results.

6. Defenses against This Training Data Extraction

The training data extraction process proposed in this paper is divided into two parts: suffix generation and suffix sorting (membership inference attacks). Currently, there are many defenses against membership inference attacks [38]:
Differential privacy [39] is a widely recognized defense mechanism against membership inference attacks, offering robust security assurances for individual data in the model’s output. By introducing noise into the model’s output, differential privacy ensures that it is statistically infeasible for an attacker to discern between two datasets based on the output results. In their study, Jia et al. proposed MemGuard [40], a method that adds noise to each confidence score vector predicted by the target classifier. Membership inference attacks exploit confidence scores to differentiate between members and non-members. By incorporating a meticulously designed noise vector into the confidence score vector, MemGuard transforms it into an adversarial example, effectively misleading the attacker’s classifier. Furthermore, techniques such as data pruning and data augmentation have also demonstrated efficacy in mitigating membership inference attacks.
However, defenses that target only membership inference attacks are insufficient. Our findings indicate that by merely inputting prefixes into GPT-Neo and employing the beam-search generation strategy, a precision of 50% can be achieved without any membership inference attack. Therefore, it is also necessary to modify the structure or training of large language models to prevent the leakage of training data. Regularization techniques aim to reduce overfitting and improve the generalization performance of models; dropout, in which a predefined percentage of neural network units are randomly dropped during training, is a commonly used form of regularization. Regularization is thus an effective way to prevent training data leakage in large language models, and we expect that regularization techniques would significantly affect the extraction precision in this experiment.
Additionally, it is crucial to standardize the training data used for large models. The data employed in training these models must adhere to ethical [41] and legal standards, ensuring that they do not infringe upon personal privacy.

7. Conclusions and Future Work

In this work, we have introduced a method for extracting training data from large language models that combines membership inference attacks based on neighborhood comparison with generation strategies, achieving notable extraction precision. In this way, the attacker can extract the training data they want from a large language model with very high accuracy. This shows that the privacy leakage of large language models is serious, as private information in the training data can be easily stolen. Our work lays a good foundation for subsequent research and related defense work.
We have conducted a comprehensive analysis of the quality and diversity of the text generated by different suffix generation strategies. While text diversity is important in large language model generation tasks, excessive diversity in the context of training data extraction can degrade text quality and reduce extraction precision. Given the generally moderate effectiveness of membership inference attacks, beam search demonstrates remarkable performance due to the high quality of the generated text. However, an appropriate level of diversity can enhance extraction precision, making it crucial to balance text quality and diversity during suffix generation. To achieve optimal results, we have combined various strategies with beam search for fine-tuning. Through the integration of multiple methods and optimizations, our training data extraction method achieves significant improvements in both precision and speed. We have also observed that the differences among various membership inference attack methods are not significant in this study, and their performance is generally unsatisfactory. It is important to note that the training data used in our study are selectively chosen, which may not fully reflect the suitability of membership inference attacks for training data extraction.
In the study conducted by Duan et al., the experimental results demonstrated that membership inference attacks on large language models perform scarcely better than random guessing [42]. This limited performance is attributed to the extensive size of the training dataset and the absence of a clear distinction between members and non-members. Similarly, our experiments reveal that the effectiveness of membership inference attacks remains suboptimal, exhibiting only marginal differences in extraction precision across various membership inference methods. The suitability of membership inference attacks for training data extraction is yet to be conclusively established, and we anticipate advancements in suffix ranking methodologies to enhance their performance. Moreover, Duan et al. highlighted that samples with semantics closely resembling the training data are still classified as non-members. Despite this, such non-member samples may harbor substantial amounts of privacy-sensitive information. Current differential privacy techniques fail to address these privacy breaches, underscoring the necessity for further research into the definition and scope of privacy leakage.

Author Contributions

Conceptualization, H.X. and K.P.; methodology, Z.Z. (Zhiyong Zha) and X.Y.; software, Y.W.; validation, Z.Z. (Zhanhao Zhang), W.X. and B.X.; investigation, M.H.; resources, K.P.; data curation, M.H.; writing—original draft preparation, Z.Z. (Zhanhao Zhang); writing—review and editing, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Key Research and Development Program of Hubei Province under grant 2022BAA038, in part by the Key Research and Development Program of Hubei Province under grant 2023BAB074, and in part by the Special Fund for Wuhan Artificial Intelligence Innovation under grant 2022010702040061.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this experiment is available at https://github.com/zhanhao411/Training-data-extraction-dataset.git, accessed on 1 July 2024. The attack model GPT-Neo 1.3B is available at https://huggingface.co/EleutherAI/gpt-neo-1.3B, accessed on 1 July 2024, and the neighborhood generation model BERT at https://huggingface.co/google-bert/bert-base-uncased, accessed on 1 July 2024. Code supporting the results of this study will be made available by the authors on request.

Conflicts of Interest

Author Huan Xu was employed by the company State Grid Hubei Information & Telecommunication Company. Author Xiaodong Yu was employed by the company Hubei Huazhong Electric Power Technology Development Co., Ltd. Author Yingbo Wu was employed by the company Hubei Huazhong Electric Power Technology Development Co., Ltd. Author Zhiyong Zha was employed by the company State Grid Hubei Information & Telecommunication Company. Author Bo Xu was employed by the company Hubei ChuTianYun Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Hu, M.; Guo, Z.; Wen, H.; Wang, Z.; Xu, B.; Xu, J.; Peng, K. Collaborative Deployment and Routing of Industrial Microservices in Smart Factories. IEEE Trans. Ind. Inform. 2024. [Google Scholar] [CrossRef]
  2. Peng, K.; Wang, L.; He, J.; Cai, C.; Hu, M. Joint optimization of service deployment and request routing for microservices in mobile edge computing. IEEE Trans. Serv. Comput. 2024, 17, 1016–1028. [Google Scholar] [CrossRef]
  3. Hu, Y.; Wang, H.; Wang, L.; Hu, M.; Peng, K.; Veeravalli, B. Joint deployment and request routing for microservice call graphs in data centers. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 2994–3011. [Google Scholar] [CrossRef]
  4. Peng, K.; He, J.; Guo, J.; Liu, Y.; He, J.; Liu, W.; Hu, M. Delay-Aware Optimization of Fine-Grained Microservice Deployment and Routing in Edge via Reinforcement Learning. IEEE Trans. Netw. Sci. Eng. 2024. [Google Scholar] [CrossRef]
  5. Zhou, P.; Zhong, G.; Hu, M.; Li, R.; Yan, Q.; Wang, K.; Ji, S.; Wu, D. Privacy-preserving and residential context-aware online learning for IoT-enabled energy saving with big data support in smart home environment. IEEE Internet Things J. 2019, 6, 7450–7468. [Google Scholar] [CrossRef]
  6. Carlini, N.; Liu, C.; Erlingsson, Ú.; Kos, J.; Song, D. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. In Proceedings of the 28th USENIX Security Symposium, Santa Clara, CA, USA, 14–16 August 2019; pp. 267–284. [Google Scholar]
  7. Carlini, N.; Ippolito, D.; Jagielski, M.; Lee, K.; Tramer, F.; Zhang, C. Quantifying Memorization across Neural Language Models. arXiv 2022, arXiv:2202.07646. [Google Scholar]
  8. Yu, W.; Pang, T.; Liu, Q.; Du, C.; Kang, B.; Huang, Y.; Lin, M.; Yan, S. Bag of Tricks for Training Data Extraction from Language Models. In Proceedings of the ICML, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 40306–40320. [Google Scholar]
  9. Sablayrolles, A.; Douze, M.; Schmid, C.; Ollivier, Y.; Jégou, H. White-Box vs Black-Box: Bayes Optimal Strategies for Membership Inference. In Proceedings of the ICML, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5558–5567. [Google Scholar]
  10. Pan, X.; Zhang, M.; Ji, S.; Yang, M. Privacy Risks of General-Purpose Language Models. In Proceedings of the IEEE S&P, San Francisco, CA, USA, 18–20 May 2020; pp. 1314–1331. [Google Scholar]
  11. Zhang, Z.; Wen, J.; Huang, M. Ethicist: Targeted training data extraction through loss smoothed soft prompting and calibrated confidence estimation. arXiv 2023, arXiv:2307.04401. [Google Scholar]
  12. Kim, S.; Yun, S.; Lee, H.; Gubri, M.; Yoon, S.; Oh, S.J. Propile: Probing Privacy Leakage in Large Language Models. In Proceedings of the NIPS, Vancouver, BC, Canada, 16 December 2024; p. 36. [Google Scholar]
  13. Inan, H.A.; Ramadan, O.; Wutschitz, L.; Wutschitz, L.; Jones, D.; Rühle, V.; Withers, J.; Sim, R. Privacy Analysis in Language Models via Training Data Leakage Report. arXiv 2021, arXiv:2101.05405. [Google Scholar]
  14. Huang, J.; Shao, H.; Chang, K.C.C. Are Large Pre-Trained Language Models Leaking Your Personal Information? arXiv 2022, arXiv:2205.12628. [Google Scholar]
  15. Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, Ú.; et al. Extracting Training Data from Large Language Models. In Proceedings of the 30th USENIX Security Symposium, Online, 11–13 August 2021; pp. 2633–2650. [Google Scholar]
  16. Hu, H.; Salcic, Z.; Sun, L.; Dobbie, G.; Yu, P.S.; Zhang, X. Membership Inference Attacks on Machine Learning: A Survey. Proc. ACM CSUR 2022, 54, 1–37. [Google Scholar] [CrossRef]
  17. Wang, Y.; Wang, C.; Wang, Z.; Zhou, S.; Liu, H.; Bi, J.; Ding, C.; Rajasekaran, S. Against Membership Inference Attack: Pruning Is All You Need. arXiv 2020, arXiv:2008.13578. [Google Scholar]
  18. Choquette-Choo, C.A.; Tramer, F.; Carlini, N.; Papernot, N. Label-Only Membership Inference Attacks. In Proceedings of the ICML, PMLR, Virtual Event, 18–24 July 2021; pp. 1964–1974. [Google Scholar]
  19. Fu, W.; Wang, H.; Gao, C.; Liu, G.; Li, Y.; Jiang, T. Practical Membership Inference Attacks against Fine-Tuned Large Language Models via Self-Prompt Calibration. arXiv 2023, arXiv:2311.06062. [Google Scholar]
  20. Mattern, J.; Mireshghallah, F.; Jin, Z.; Schölkopf, B.; Sachan, M.; Berg-Kirkpatrick, T. Membership Inference Attacks against Language Models via Neighbourhood Comparison. arXiv 2023, arXiv:2305.18462. [Google Scholar]
  21. Truong, J.B.; Maini, P.; Walls, R.J.; Papernot, N. Data-Free Model Extraction. In Proceedings of the CVPR, Nashville, TN, USA, 19–25 June 2021; pp. 4771–4780. [Google Scholar]
  22. Hilprecht, B.; Härterich, M.; Bernau, D. Monte Carlo and Reconstruction Membership Inference Attacks against Generative Models. Proc. PET 2019, 2019, 232–249. [Google Scholar] [CrossRef]
  23. Zhang, Y.; Jia, R.; Pei, H.; Wang, W.; Li, B.; Song, D. The Secret Revealer: Generative Model-Inversion Attacks against Deep Neural Networks. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020; pp. 253–261. [Google Scholar]
  24. Staab, R.; Vero, M.; Balunović, M.; Vechev, M. Beyond Memorization: Violating Privacy via Inference with Large Language Models. arXiv 2023, arXiv:2310.07298. [Google Scholar]
  25. Thomas, A.; Adelani, D.I.; Davody, A.; Mogadala, A.; Klakow, D. Investigating the Impact of Pre-Trained Word Embeddings on Memorization in Neural Networks. In Proceedings of the 23rd International Conference, TSD 2020, Brno, Czech Republic, 8–11 September 2020; pp. 273–281. [Google Scholar]
  26. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  28. Freitag, M.; Al-Onaizan, Y. Beam Search Strategies for Neural Machine Translation. arXiv 2017, arXiv:1702.01806. [Google Scholar]
  29. Holtzman, A.; Buys, J.; Forbes, M.; Bosselut, A.; Golub, D.; Choi, Y. Learning to Write with Cooperative Discriminators. arXiv 2018, arXiv:1805.06087. [Google Scholar]
  30. Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. arXiv 2019, arXiv:1904.09751. [Google Scholar]
  31. Meister, C.; Pimentel, T.; Wiher, G.; Cotterell, R. Locally Typical Sampling. Trans. Assoc. Comput. Linguist. 2023, 11, 102–121. [Google Scholar] [CrossRef]
32. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
33. Keskar, N.S.; McCann, B.; Varshney, L.R.; Xiong, C.; Socher, R. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv 2019, arXiv:1909.05858. [Google Scholar]
  34. Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership Inference Attacks against Machine Learning Models. In Proceedings of the IEEE S&P, San Jose, CA, USA, 22–24 May 2017; pp. 3–18. [Google Scholar]
  35. Song, L.; Shokri, R.; Mittal, P. Privacy Risks of Securing Machine Learning Models against Adversarial Examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 241–257. [Google Scholar]
  36. Gailly, J.; Adler, M. Zlib Compression Library. Apollo—University of Cambridge Repository. 2004. Available online: http://www.dspace.cam.ac.uk/handle/1810/3486 (accessed on 1 July 2024).
37. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  38. Zhang, L.; Li, C.; Hu, Q.; Lang, J.; Huang, S.; Hu, L.; Leng, J.; Chen, Q.; Lv, C. Enhancing Privacy in Large Language Models with Homomorphic Encryption and Sparse Attention. Appl. Sci. 2023, 13, 13146. [Google Scholar] [CrossRef]
  39. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Proceedings of the Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, 4–7 March 2006; Proceedings 3. Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284. [Google Scholar]
  40. Jia, J.; Salem, A.; Backes, M.; Zhang, Y.; Gong, N.Z. Memguard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples. In Proceedings of the ACM CCS, London, UK, 11–15 November 2019; pp. 259–274. [Google Scholar]
  41. Wu, X.; Duan, R.; Ni, J. Unveiling Security, Privacy, and Ethical Concerns of ChatGPT. J. Inf. Intell. 2024, 2, 102–115. [Google Scholar] [CrossRef]
  42. Duan, M.; Suri, A.; Mireshghallah, N.; Min, S.; Shi, W.; Zettlemoyer, L.; Tsvetkov, Y.; Choi, Y.; Evans, D.; Hajishirzi, H. Do Membership Inference Attacks Work on Large Language Models? arXiv 2024, arXiv:2402.07841. [Google Scholar]
Figure 1. Training data extraction process. The prefix “Rob’s phone number is:” is fed into the attack model to obtain a set of generated texts, and a membership inference attack is then used to select the correct training data sample from this set of generated texts.
Figure 2. Comparison of suffix generation strategies.
Figure 3. Training data extraction results (M^p).
Figure 4. Training data extraction results (runtime).
Table 1. The suffix generation capability of beam-search.

Num_Beam | M_1^p | M_20^p | M^p   | D_edit
2        | 0.474 | 0.543  | 0.511 | 12.913
3        | 0.479 | 0.545  | 0.513 | 12.589
4        | 0.496 | 0.540  | 0.517 | 12.001
5        | 0.499 | 0.541  | 0.517 | 11.839
6        | 0.500 | 0.537  | 0.518 | 11.798
7        | 0.501 | 0.535  | 0.518 | 11.776
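For reference, the sketch below shows how a beam-search suffix generator of the kind evaluated in Table 1 could be configured with the Hugging Face transformers library; the model name, prefix, and suffix length are illustrative assumptions rather than the exact experimental setup.

```python
# Minimal sketch: beam-search suffix generation with a Hugging Face causal LM.
# The model name, prefix, and lengths are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"   # assumed attack model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prefix = "Rob's phone number is:"        # example prefix from Figure 1
inputs = tokenizer(prefix, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,               # assumed suffix length
        num_beams=4,                     # the Num_Beam setting varied in Table 1
        num_return_sequences=4,          # keep several candidate suffixes per prefix
        early_stopping=True,
    )
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```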
Table 2. The suffix generation capability of top-k.

Top-k | M_1^p | M_20^p | M^p   | D_edit
2     | 0.306 | 0.604  | 0.501 | 17.769
3     | 0.283 | 0.599  | 0.492 | 19.242
4     | 0.278 | 0.587  | 0.484 | 19.576
5     | 0.274 | 0.585  | 0.482 | 19.805
6     | 0.269 | 0.577  | 0.479 | 20.223
7     | 0.268 | 0.574  | 0.477 | 20.279
Table 3. The suffix generation capability of top-p.

Top-p | M_1^p | M_20^p | M^p   | D_edit
0.4   | 0.460 | 0.477  | 0.472 | 13.308
0.5   | 0.451 | 0.497  | 0.483 | 13.741
0.6   | 0.432 | 0.541  | 0.502 | 14.463
0.7   | 0.402 | 0.576  | 0.505 | 15.007
0.8   | 0.366 | 0.602  | 0.507 | 16.487
0.9   | 0.330 | 0.597  | 0.492 | 18.061
Table 4. The suffix generation capability of typical-p.

Typical-p | M_1^p | M_20^p | M^p   | D_edit
0.4       | 0.407 | 0.491  | 0.470 | 15.584
0.5       | 0.405 | 0.535  | 0.494 | 15.393
0.6       | 0.404 | 0.543  | 0.498 | 15.201
0.7       | 0.395 | 0.573  | 0.502 | 15.351
0.8       | 0.365 | 0.601  | 0.506 | 16.562
0.9       | 0.330 | 0.597  | 0.492 | 18.058
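Continuing the sketch above (reusing the assumed model, tokenizer, and inputs), the sampling-based strategies of Tables 2–4 differ only in the truncation rule passed to generate(); the specific parameter values below are illustrative.

```python
# Minimal sketch: sampling-based suffix generation (top-k, top-p, typical-p).
# Reuses `model`, `tokenizer`, and `inputs` from the beam-search sketch above.
sampling_configs = {
    "top-k":     {"top_k": 2},        # Table 2
    "top-p":     {"top_p": 0.8},      # Table 3
    "typical-p": {"typical_p": 0.8},  # Table 4
}

generated = {}
for name, cfg in sampling_configs.items():
    outputs = model.generate(
        **inputs,
        do_sample=True,               # all three are stochastic decoding strategies
        max_new_tokens=50,
        num_return_sequences=20,      # e.g., 20 candidate suffixes per prefix
        **cfg,
    )
    generated[name] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```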
Table 5. Comparison of different suffix generation strategies. The number of “✓” marks indicates relative performance; more “✓” marks indicate better performance.
 Text QualityText DiversityRuntime
beam-search✓✓✓✓✓✓
top-k✓✓✓✓✓✓✓
top-p✓✓✓✓✓✓✓✓
typical-p✓✓✓✓✓✓✓
Table 6. The impact of temperature on suffix generation strategy (top-k).

Temperature | M_1^p | M_20^p | M^p   | D_edit
0.5         | 0.417 | 0.601  | 0.509 | 14.648
0.6         | 0.393 | 0.607  | 0.509 | 15.211
0.7         | 0.356 | 0.611  | 0.505 | 16.231
0.8         | 0.338 | 0.610  | 0.499 | 17.208
0.9         | 0.309 | 0.603  | 0.495 | 18.126
1.0         | 0.294 | 0.599  | 0.492 | 19.242
1.1         | 0.260 | 0.585  | 0.481 | 20.138
1.2         | 0.225 | 0.558  | 0.459 | 21.304
Table 7. The impact of temperature on suffix generation strategy (beam-search).

Temperature | M_1^p | M_20^p | M^p   | D_edit
0.8         | 0.468 | 0.511  | 0.498 | 12.882
0.9         | 0.469 | 0.518  | 0.502 | 12.730
1.0         | 0.479 | 0.545  | 0.513 | 12.589
1.1         | 0.474 | 0.562  | 0.519 | 12.630
1.2         | 0.454 | 0.568  | 0.517 | 13.304
1.3         | 0.424 | 0.566  | 0.515 | 14.256
1.4         | 0.384 | 0.555  | 0.504 | 15.295
Table 8. The impact of repetition_penalty on suffix generation strategy.

Repetition_Penalty | M_1^p | M_20^p | M^p   | D_edit
0.8                | 0.480 | 0.545  | 0.513 | 12.631
0.9                | 0.476 | 0.541  | 0.512 | 12.666
1.0                | 0.479 | 0.545  | 0.513 | 12.589
1.1                | 0.479 | 0.547  | 0.512 | 12.788
1.2                | 0.480 | 0.542  | 0.512 | 12.596
1.3                | 0.482 | 0.540  | 0.511 | 12.613
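Temperature (Tables 6 and 7) and repetition_penalty (Table 8) are further generate() arguments layered on top of a base decoding strategy. The sketch below combines them with top-k sampling under the same assumptions as the earlier sketches; the values are illustrative.

```python
# Minimal sketch: temperature and repetition_penalty as additional decoding knobs.
# Reuses `model`, `tokenizer`, and `inputs` from the earlier sketches.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=3,
    temperature=0.6,          # <1 sharpens, >1 flattens the next-token distribution
    repetition_penalty=1.1,   # >1 penalises tokens that were already generated
    max_new_tokens=50,
    num_return_sequences=20,
)
```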
Table 9. Effect of different membership inference attacks.

Method                  | M_20^p | M^p   | E_mia   | Time_Cost
Loss                    | 0.604  | 0.501 | 82.947% | 0.54 h
×Zlib                   | 0.604  | 0.503 | 83.278% | 0.6 h
÷Zlib                   | 0.604  | 0.495 | 81.954% | 0.6 h
Neighborhood comparison | 0.604  | 0.502 | 83.113% | 12 h
High confidence         | 0.604  | 0.489 | 80.960% | 0.58 h
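The loss- and zlib-based rows of Table 9 can be read as calibrated scores over each generated candidate. The exact calibration used in the experiments is not reproduced here; the sketch below shows one common construction, combining the mean per-token loss with the zlib-compressed length of the candidate text, and reuses the assumed model and tokenizer from the earlier sketches.

```python
# Minimal sketch: membership-inference scores for one generated candidate.
# The exact ×Zlib / ÷Zlib calibration in Table 9 may differ; this is one common form.
import zlib
import torch

def candidate_scores(model, tokenizer, text: str) -> dict:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    loss = out.loss.item()                                  # mean per-token NLL ("Loss" row)
    zlib_bytes = len(zlib.compress(text.encode("utf-8")))   # compressed length as an entropy proxy
    return {
        "loss": loss,
        "loss_x_zlib": loss * zlib_bytes,                   # one plausible "×Zlib" score
        "loss_div_zlib": loss / zlib_bytes,                 # one plausible "÷Zlib" score
    }
```

The neighborhood-comparison attack instead compares a candidate's loss against the losses of perturbed "neighbor" texts produced by a masked language model [20,37], which helps explain its substantially longer runtime in Table 9.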
Table 10. Hyperparameter values.

Hyperparameter | Range of Values | Step Size
beam-search    | [2, 6]          | 1
top-k          | [2, 6]          | 1
top-p          | [0.4, 0.9]      | 0.05
temperature    | [0.4, 1.4]      | 0.02
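A sketch of the sweep implied by Table 10 is given below; the ranges and step sizes are those listed in the table, while the search loop and the scoring function are assumptions introduced only for illustration.

```python
# Minimal sketch: hyperparameter grids from Table 10 (ranges and steps as listed).
import itertools
import numpy as np

num_beam_values    = list(range(2, 7))                                  # [2, 6], step 1
top_k_values       = list(range(2, 7))                                  # [2, 6], step 1
top_p_values       = np.round(np.arange(0.40, 0.95, 0.05), 2).tolist()  # [0.4, 0.9], step 0.05
temperature_values = np.round(np.arange(0.40, 1.42, 0.02), 2).tolist()  # [0.4, 1.4], step 0.02

def extraction_precision(**gen_kwargs) -> float:
    # Placeholder: generate candidate suffixes with gen_kwargs, apply the
    # membership inference attack, and return the measured precision M^p.
    return 0.0

# Example: sweep one pair of hyperparameters (top-k and temperature).
best_cfg = max(
    itertools.product(top_k_values, temperature_values),
    key=lambda cfg: extraction_precision(do_sample=True, top_k=cfg[0], temperature=cfg[1]),
)
```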
Table 11. Effect of training data extraction for different attack models.

Model        | M_20^p | M^p   | E_mia   | D_edit
GPT-neo-125M | 0.218  | 0.203 | 93.119% | 25.967
GPT-neo-1.3B | 0.520  | 0.558 | 93.190% | 11.301
GPT-neo-2.7B | 0.599  | 0.628 | 95.382% | 8.443
GPT-j-6B     | 0.722  | 0.739 | 97.700% | 4.364