Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models

Parameter Lab · Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt · NAVER AI Lab · University of Tübingen · Tübingen AI Center
TLDR: Membership inference attacks (MIA) on large language models (LLMs) have been deemed ineffective, but we show they can succeed when applied at larger scales, such as document or collection levels.


Abstract

Membership inference attacks (MIA) attempt to verify whether specific data was used to train a model. With the rise of large language models (LLMs) and concerns about copyrighted training materials, detecting such usage has become increasingly important. While previous research suggested MIA methods were ineffective on LLMs, we demonstrate their viability when applied at larger scales.

We construct new benchmarks that evaluate MIA performance across different scales, from individual sentences to collections of documents. By adapting recent Dataset Inference (DI) techniques, we develop an approach that aggregates paragraph-level MIA features to enable detection at document and collection levels.

Our work achieves the first successful membership inference attacks on both pre-trained and fine-tuned LLMs. These results challenge previous conclusions about MIA ineffectiveness and demonstrate that such attacks can succeed when multiple documents are analyzed together rather than in isolation.

Multi-Scale Evaluation of MIA

We evaluate MIA at four distinct scales: sentence, paragraph, document, and collection. At the sentence level (avg. 43 tokens), MIA helps detect benchmark contamination and privacy leakage, though success is challenging due to the high overlap between member and non-member sentences. Paragraph-level MIA operates within model context windows (512-2048 tokens) and is relevant for social media content. Document-level MIA targets full texts like research papers (avg. 14,222 tokens), requiring chunking into paragraphs and aggregation of their signals; this scale is crucial for copyright concerns around articles and books. Finally, collection-level MIA examines sets of documents (e.g., 100 documents ≈ 1.4M tokens), which is important for detecting whether entire datasets were used in training. Our results show that MIA achieves the strongest performance at document and collection scales, which is particularly relevant as copyright disputes often center on complete articles rather than fragments.
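To make the document scale concrete, the sketch below shows one simple way to cut a long text into paragraph-sized token chunks before scoring. The 512-token chunk size and the Pythia tokenizer are illustrative choices, not the paper's exact preprocessing.

from transformers import AutoTokenizer

# Illustrative choice: the GPT-NeoX tokenizer used by Pythia models
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-2.8b")

def chunk_document(text: str, chunk_size: int = 512) -> list[str]:
    # Tokenize the full document, then cut it into consecutive chunks of
    # chunk_size tokens: the base units whose MIA scores get aggregated.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + chunk_size])
            for i in range(0, len(ids), chunk_size)]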

Different scales of MIA evaluation

We ran experiments using Pythia models (2.8B and 6.9B parameters), taking member samples from The Pile training set and non-member samples from its validation and test sets.
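As a baseline paragraph-level signal, one can use the average token log-likelihood under the target model (the classic loss-based MIA score). The sketch below is a minimal illustration; the paper aggregates several MIA features, not only this one.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-2.8b"  # target model under attack
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def avg_log_likelihood(paragraph: str) -> float:
    # Passing labels=input_ids makes the model return the mean cross-entropy
    # over tokens; negate it so that higher values suggest membership.
    inputs = tokenizer(paragraph, return_tensors="pt", truncation=True, max_length=2048)
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()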

MIA is Effective at the Right Scale

Our experiments demonstrate that MIA effectiveness increases with scale. While sentence and paragraph-level attacks show limited success, document and collection-level attacks achieve much stronger performance.

Benchmark results showing MIA effectiveness at different scales

The key to making MIA work on LLMs is to aggregate MIA scores across a large enough number of tokens. If the MIA performance at the paragraph level (the base unit of aggregation) is better than random chance, and there are enough text units to aggregate (i.e., long enough documents and large enough collections of documents), aggregating the signals allows membership to be classified with high confidence, as shown in the figures below.
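One simple way to realize this aggregation, in the spirit of Dataset Inference, is a statistical test that compares the paragraph scores of a suspect document (or collection) against scores from known non-member text. The snippet below is an illustrative simplification, not the exact procedure from the paper.

from scipy import stats

def is_member(suspect_scores: list[float],
              non_member_scores: list[float],
              alpha: float = 0.01) -> bool:
    # Flag the suspect text as a training member if its paragraph-level MIA
    # scores are significantly higher than the non-member reference scores
    # (one-sided Welch's t-test).
    result = stats.ttest_ind(suspect_scores, non_member_scores,
                             equal_var=False, alternative="greater")
    return result.pvalue < alpha

The more paragraphs that enter the test, the smaller the per-paragraph advantage needs to be for the document- or collection-level decision to become confident, which is exactly the scaling effect described above.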

Aggregation approach for document-level MIA

However, if the paragraph-level MIA AUROC is too close to chance or there is too little text to aggregate over, MIA fails, as shown in the figure below.
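The toy simulation below illustrates this trade-off: with a per-paragraph signal only slightly above chance (AUROC ≈ 0.56 under this Gaussian toy model), ten paragraphs are rarely enough, while a thousand paragraphs almost always yield a confident decision. The score model and numbers are illustrative, not results from the paper.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def detection_rate(shift: float, n_paragraphs: int, trials: int = 1000) -> float:
    # Fraction of simulated member documents detected at p < 0.01 when member
    # paragraph scores are shifted by `shift` relative to non-member scores.
    hits = 0
    for _ in range(trials):
        member = rng.normal(shift, 1.0, n_paragraphs)
        reference = rng.normal(0.0, 1.0, n_paragraphs)
        p = stats.ttest_ind(member, reference, equal_var=False,
                            alternative="greater").pvalue
        hits += p < 0.01
    return hits / trials

print(detection_rate(shift=0.2, n_paragraphs=10))    # weak signal, short document: mostly missed
print(detection_rate(shift=0.2, n_paragraphs=1000))  # same signal, large collection: almost always caught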

Cases where MIA aggregation does not work

Fine-tuning Amplifies the Effectiveness of MIA

Lastly, we test whether MIA can be used to detect test-data leaks in evaluation benchmarks. To do so, we fine-tune Phi-2 on multiple question-answering datasets and measure how well MIA detects membership of the training questions. In the table below, we see that MIA is very effective even at the sentence level and nearly 100% effective for small collections of questions.
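For a sentence-level contamination check of this kind, a score such as Min-K% Prob can be computed directly from the fine-tuned model's token probabilities. The sketch below uses the base Phi-2 checkpoint and k = 20% purely for illustration; it is not the paper's fine-tuned model or full evaluation protocol.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"  # illustrative stand-in for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def min_k_prob(sentence: str, k: float = 0.2) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    logits = model(input_ids=ids).logits
    # Log-probability assigned to each actual next token
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over the k% least likely tokens; higher values suggest the
    # sentence was seen during (fine-)tuning.
    n = max(1, int(k * token_log_probs.numel()))
    return token_log_probs.topk(n, largest=False).values.mean().item()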

MIA effectiveness on fine-tuned models

BibTeX

Consider citing us if you find our work relevant.
@misc{puerto2024scalingmembershipinferenceattacks,
  title={Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models},
  author={Haritz Puerto and Martin Gubri and Sangdoo Yun and Seong Joon Oh},
  year={2024},
  eprint={2411.00154},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.00154},
}

Want to learn more?

Check out the MIA community page on ResearchTrend.AI