3.1. Large Language Models
Language modeling is the task of learning the underlying probability distribution of word sequences in natural language [26]. For a sequence of tokenized words $x = (x_1, x_2, \ldots, x_n)$, this statistical model is represented as the following joint probability:
$$\Pr(x) = \prod_{i=1}^{n} \Pr(x_i \mid x_1, \ldots, x_{i-1}).$$
Here, $\Pr(x_i \mid x_1, \ldots, x_{i-1})$ represents the probability of token $x_i$ appearing given the previous token sequence $x_1, \ldots, x_{i-1}$. It has been demonstrated that neural networks can effectively estimate these conditional distributions, and they are therefore widely used as language models. Given an unsupervised tokenized corpus $\mathcal{X} = (x_1, x_2, \ldots, x_n)$, the standard language modeling objective is to maximize the following likelihood function:
$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \log \Pr(x_i \mid x_1, \ldots, x_{i-1}; \theta),$$
where the conditional probability of $x_i$ is calculated by evaluating a neural network with parameters $\theta$ on the sequence $x_1, \ldots, x_{i-1}$.
The quality of a language model is typically measured by two metrics: perplexity and top-k precision. Perplexity measures the likelihood of a text sequence and is defined as
$$\mathcal{P} = \exp\left(-\frac{1}{n} \sum_{i=1}^{n} \log \Pr(x_i \mid x_1, \ldots, x_{i-1})\right),$$
where $\Pr(x_i \mid x_1, \ldots, x_{i-1})$ is the conditional probability the model assigns to token $x_i$. Evaluating the perplexity on unseen data indicates how well the model fits the underlying distribution of the language: a lower perplexity implies that the language model is more effective at modeling the data. Various architectures are utilized for language models, with large Transformer-based models [27] recently achieving state-of-the-art results across various tasks. The original Transformer architecture comprises two components, the encoder and the decoder, where each prediction is derived from the weighted sum of the encoder’s hidden states and the previous hidden states of the decoder; the attack model employed in this paper, GPT-Neo 1.3B, uses a decoder-only variant of this architecture.
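As a concrete illustration of the metric, the following minimal sketch computes perplexity with the HuggingFace transformers library and the public GPT-Neo 1.3B checkpoint; this is an assumed setup for exposition, not the paper’s evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the attack model used in the paper (GPT-Neo 1.3B).
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean
        # cross-entropy loss over the shifted next-token predictions.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```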
3.3. Suffix Generation Strategy
In the first stage of training data extraction, texts are generated under different suffix generation strategies. The choice of strategy affects the generated text and thus the accuracy of extraction, and the strategies can also be combined with one another.
Language models can generate new text through iterative sampling: a token $\hat{x}_{i+1} \sim \Pr(x_{i+1} \mid x_1, \ldots, x_i)$ is sampled, and the extended sequence is re-input into the model to obtain $\Pr(x_{i+2} \mid x_1, \ldots, x_i, \hat{x}_{i+1})$, as sketched below. This process is repeated until a stopping condition is met.
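A minimal sketch of this loop follows (PyTorch against the HuggingFace API; `model` and `tokenizer` are assumed to be loaded as in the previous sketch). The sampling line is precisely what the strategies in the remainder of this section modify.

```python
import torch

def sample_suffix(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Iterative sampling: draw the next token, append it, and feed the
    extended sequence back into the model."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]          # scores for x_{i+1}
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        # Greedy decoding (discussed next) would instead use:
        #   next_id = logits.argmax().unsqueeze(0)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:   # stopping condition
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```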
In the domain of large language model generation, the most common decoding strategy is to maximize the probability of the generated text, usually referred to as the greedy strategy. While this method is effective in ensuring high-probability output, ostensibly matching the objective of extracting training data from large language models, it inevitably sacrifices textual diversity. Moreover, pursuing the theoretically optimal sequence is impractical, since exhaustively generating and ranking all potential sequences is infeasible. A prevalent alternative is therefore beam search.
Beam-search: Beam search [28] retains only a predetermined optimal subset of partial solutions. Specifically, beam search maintains a candidate set of size $n$. At each step, it selects the top $n$ words from the probability distribution of each candidate sequence and then retains the $n$ candidate sequences with the highest overall probabilities. This method strikes a balance between the quality and diversity of the generated text. However, because only the top $n$ partial solutions survive each step, beam search often results in generated text that lacks diversity, leading to outputs that may appear overly conservative and unnatural.
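Beam search is exposed directly by the HuggingFace generate API; in the sketch below the prompt and the beam width of 5 are illustrative choices rather than the paper’s settings, and `model` and `tokenizer` are as loaded earlier.

```python
ids = tokenizer("My address is", return_tensors="pt").input_ids
out = model.generate(
    ids,
    max_new_tokens=50,
    num_beams=5,             # candidate set size n
    num_return_sequences=5,  # keep all n beams for the later ranking stage
    early_stopping=True,
)
for seq in out:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```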
Top-k [29]: This method involves sampling from the top $k$ tokens, thereby granting lower-probability tokens an opportunity for selection. This introduction of randomness often enhances the diversity of the generated text. Specifically, at each step, random sampling is executed from the $k$ highest-probability words, excluding those with lower probabilities; the sampling probability among the top $k$ words is determined by their respective likelihood scores.
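A sketch of the underlying filtering step (the function name and the random scores standing in for model logits are illustrative):

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Mask out everything but the k highest-scoring tokens, so that
    renormalized sampling never selects a low-probability token."""
    cutoff = torch.topk(logits, k).values[-1]            # k-th largest score
    return logits.masked_fill(logits < cutoff, float("-inf"))

logits = torch.randn(50257)                              # GPT-Neo vocabulary size
probs = torch.softmax(top_k_filter(logits, k=40), dim=-1)
next_id = torch.multinomial(probs, num_samples=1)        # sample among top k
```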
Top-p [30]: At each step, random sampling is conducted only from the smallest set of words whose cumulative probability exceeds a certain threshold $p$, disregarding words with lower probability. This approach is similar to the top-k method, but top-p dynamically adjusts the size of the token candidate list. Therefore, this strategy helps to mitigate the selection of inappropriate or irrelevant words while preserving the potential for interesting or creative outcomes.
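The corresponding filtering step, in the same sketch style as top-k above (names illustrative):

```python
import torch

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability
    exceeds p; the candidate-list size adapts to the distribution."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
    # Number of tokens needed for the cumulative mass to first exceed p.
    n_keep = int((cum_probs < p).sum().item()) + 1
    keep_idx = sorted_idx[:n_keep]
    filtered = torch.full_like(logits, float("-inf"))
    filtered[keep_idx] = logits[keep_idx]
    return filtered
```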
Typical-p: The typical-p [31] sampling method, grounded in information theory, constructs the sampling space by first admitting the words whose surprisal (negative log-probability) deviates least from the conditional entropy of the next-token distribution. This process continues until the cumulative probability of all words in the sampling space surpasses a predefined threshold $p$. Specifically, this method begins by calculating the conditional entropy of the distribution and sorting the vocabulary by each word’s deviation from that entropy; the most typical words are progressively selected until their cumulative probability exceeds the threshold, and sampling is then performed according to this newly established probability distribution. The primary advantage of the typical-p sampling method is its ability to enhance the typicality of the generated text, thereby producing sequences that are more representative of natural language.
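A sketch under the locally typical sampling formulation of [31] (function name illustrative):

```python
import torch

def typical_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Keep the tokens whose surprisal -log p(x) deviates least from the
    conditional entropy, until their cumulative mass exceeds p."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum()             # H of the distribution
    deviation = (log_probs + entropy).abs()          # | -log p(x) - H |
    _, sorted_idx = torch.sort(deviation)            # most typical tokens first
    cum_probs = torch.cumsum(probs[sorted_idx], dim=-1)
    n_keep = int((cum_probs < p).sum().item()) + 1   # until mass exceeds p
    keep_idx = sorted_idx[:n_keep]
    filtered = torch.full_like(logits, float("-inf"))
    filtered[keep_idx] = logits[keep_idx]
    return filtered
```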
Temperature [32]: Temperature $T$ represents a strategy for adjusting probability distributions, employing a local re-normalization method with an annealing factor. When $T > 1$, this technique elevates the probability of selecting lower-probability tokens, consequently diminishing the model’s confidence in the generated text while concurrently augmenting its diversity. Conversely, when $T < 1$, it enables the language model to select tokens with higher confidence, albeit at the cost of reducing the diversity of the generated sequences:
$$\tilde{p}_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)},$$
where $\tilde{p}_i$ represents the scores normalized by temperature, $z_i$ is the original score, and $V$ is the size of the vocabulary.
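The equation corresponds directly to a softmax over scaled scores; a small sketch of its effect:

```python
import torch

def temperature_softmax(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Divide the raw scores by T before normalizing, per the equation above."""
    return torch.softmax(logits / T, dim=-1)

scores = torch.tensor([2.0, 1.0, 0.1])
print(temperature_softmax(scores, T=1.0))  # baseline distribution
print(temperature_softmax(scores, T=2.0))  # T > 1: flatter, more diverse
print(temperature_softmax(scores, T=0.5))  # T < 1: sharper, more confident
```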
Repetition-penalty: Repetition-penalty [33] is another adjustment strategy for probability distributions, specifically engineered to control the penalty coefficient for repeated words in text generation tasks. This parameter modulates how strongly the model penalizes repeated words during generation, thus preventing excessive redundancy in the output. During text generation, models sometimes tend to produce the same words or phrases, which may result in text that lacks diversity or readability. To mitigate this, the repetition-penalty parameter imposes penalties on the recurrence of words, thereby promoting more varied text. The value of the repetition-penalty is typically a real number not less than 1, reflecting the intensity of the imposed penalties: a value exceeding 1 increases the severity of penalties on repeated words, effectively diminishing their frequency in the generated text, whereas a value of 1 applies no penalty and values below 1 actually encourage repetition. Notably, in the context of extracting training data, the application of a repetition-penalty can have adverse effects, as it may suppress beneficial repetitions. Following the penalized sampling formulation of [33], the re-normalized distribution is
$$\tilde{p}_i = \frac{\exp\big(z_i / (T \cdot h(i))\big)}{\sum_{j=1}^{V} \exp\big(z_j / (T \cdot h(j))\big)}, \qquad h(i) = \begin{cases} r & \text{if token } i \text{ already appears in the generated text,} \\ 1 & \text{otherwise,} \end{cases}$$
where $r$ represents the value of the repetition-penalty parameter.
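Since the strategies above can be combined, the following sketch shows how they map onto a single call to the HuggingFace generate API; the parameter values are illustrative, not the paper’s settings, and `model`, `tokenizer`, and `ids` are as in the earlier sketches.

```python
out = model.generate(
    ids,
    do_sample=True,          # sample instead of greedy/beam decoding
    max_new_tokens=256,
    top_k=40,                # top-k sampling
    top_p=0.95,              # nucleus (top-p) sampling
    typical_p=0.9,           # locally typical sampling
    temperature=0.8,         # annealing factor T
    repetition_penalty=1.0,  # 1.0 = no penalty; values > 1 may suppress
                             # the beneficial repetitions noted above
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```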
3.4. Ranking Strategy
After generating many text samples, it is necessary to rank these samples to determine which samples have a higher probability of belonging to the training dataset. This ranking is typically achieved through a perplexity-based membership inference attack.
Loss-based MIA [34]: Loss-based MIAs exploit the output loss of a model to deduce whether a particular sample is included in its training dataset. Because a model fits its training data more closely than unseen data, it assigns members a systematically lower loss; the attacker therefore computes the model’s loss on the target sample and treats an unusually low value as strong evidence that the sample is a member of the training data, as sketched below.
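A minimal sketch of loss-based ranking (`model` and `tokenizer` as loaded earlier; the candidate texts are placeholders):

```python
import torch

def loss_score(model, tokenizer, text: str) -> float:
    """Average cross-entropy of the model on the sample; members of the
    training data tend to receive lower loss."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Rank candidates ascending by loss; the lowest-loss texts are the
# most likely members of the training data.
candidates = ["first generated sample", "second generated sample"]
ranked = sorted(candidates, key=lambda s: loss_score(model, tokenizer, s))
```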
Zlib entropy-based MIA [35]: Zlib entropy-based MIAs leverage the Zlib compression algorithm [36] to evaluate the information entropy of text samples, thereby inferring whether the text belongs to the model’s training dataset. Specifically, for a given generated sample, the Zlib entropy is calculated as follows:
$$E_{\text{zlib}} = \frac{S_{\text{before}}}{S_{\text{after}}},$$
where $S_{\text{before}}$ and $S_{\text{after}}$ denote the file sizes of the generated sample before and after compression through the Zlib compression algorithm, respectively.
Generally, lower Zlib entropy in the original text indicates more information, whereas higher Zlib entropy suggests less information in the original text. Zlib entropy-based MIAs typically follow two implementation approaches, as calculated in Equations (8) and (9), respectively, where $\mathcal{P}$ is the perplexity of the generated sample. One is to divide the perplexity by the Zlib entropy, favoring generated text with higher Zlib entropy, which indicates greater repetition within the text:
$$M_{\text{div}} = \frac{\mathcal{P}}{E_{\text{zlib}}}. \tag{8}$$
The other is to multiply the perplexity by the Zlib entropy, favoring generated text with lower Zlib entropy, thereby allowing the generated text to contain more information:
$$M_{\text{mul}} = \mathcal{P} \cdot E_{\text{zlib}}. \tag{9}$$
In the sorting process, generated texts with the smaller scores are selected as training data.
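A sketch of both scores, reusing the `perplexity` function from the Section 3.1 sketch and assuming the ratio definition of Zlib entropy given above:

```python
import zlib

def zlib_entropy(text: str) -> float:
    """Uncompressed size over compressed size: repetitive,
    low-information text compresses well and scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

# Equation (8): favors repetitive text (higher Zlib entropy).
def score_div(text: str) -> float:
    return perplexity(text) / zlib_entropy(text)

# Equation (9): favors information-rich text (lower Zlib entropy).
def score_mul(text: str) -> float:
    return perplexity(text) * zlib_entropy(text)

# In both cases, smaller scores are ranked as more likely training data.
```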
Neighborhood comparison: Drawing inspiration from reference-model-based membership inference attacks, and to address their need for high-quality training data for the reference model, Mattern et al. proposed a novel neighborhood-based membership inference attack [20]. Given a target sample $x$, multiple highly similar neighborhood samples are generated by substituting words using a pretrained masked language model (BERT, in this paper). Both the generated neighborhood samples and the original sample are input into the target model to calculate their respective loss values. Since the target model is prone to overfitting its training data, the difference between the loss of a target sample drawn from the training data and the average loss of its neighborhood samples is expected to be smaller than that of other samples. The decision rule is as follows:
$$\mathcal{A}(x) = \mathbb{1}\!\left[\,\mathcal{L}(x;\theta) - \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(\tilde{x}_i;\theta) < \gamma\,\right],$$
where $\{\tilde{x}_1, \ldots, \tilde{x}_n\}$ is a set of $n$ neighbors for $x$, $\mathcal{L}(\cdot;\theta)$ is the loss of the target model, and $\gamma$ is a decision threshold.
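A compact sketch of the attack under these definitions (the helper names, the bert-base-uncased checkpoint, and the one-word substitutions are illustrative simplifications; the paper’s neighbor generation may differ):

```python
import torch
from transformers import pipeline

# Neighbor generation with a pretrained masked LM (BERT), as in the paper.
fill = pipeline("fill-mask", model="bert-base-uncased")

def neighbors(text: str, n: int = 5) -> list[str]:
    """Create up to n neighbors by masking one word at a time and
    substituting BERT's top prediction."""
    words = text.split()
    out = []
    for i in range(min(n, len(words))):
        masked = words[:i] + [fill.tokenizer.mask_token] + words[i + 1:]
        best = fill(" ".join(masked), top_k=1)[0]["token_str"]
        out.append(" ".join(words[:i] + [best] + words[i + 1:]))
    return out

def neighborhood_score(model, tokenizer, text: str, n: int = 5) -> float:
    """L(x) minus the mean loss of x's neighbors; small values suggest
    the target model overfits x itself, i.e., x is a training member."""
    def loss(t: str) -> float:
        ids = tokenizer(t, return_tensors="pt").input_ids
        with torch.no_grad():
            return model(ids, labels=ids).loss.item()
    neigh = neighbors(text, n)
    return loss(text) - sum(loss(t) for t in neigh) / len(neigh)

# Membership decision: neighborhood_score(x) < gamma for a chosen threshold.
```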