Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs

1 University of Windsor   2 Applied Artificial Intelligence Institute, Deakin University
3 University of Washington   4 Allen Institute for Artificial Intelligence   *Equal Contribution

Image credit: Bing Image Creator

Abstract

In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent than is revealed by prompting the target model with the training data directly, which is the dominant approach to quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data, so the prompt does not present the solution to the model directly, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim to reproduce its training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that (1) instruction-tuned models can expose pre-training data as much as their base models, if not more, (2) contexts other than the original training data can lead to leakage, and (3) using instructions proposed by other LLMs opens a new avenue of automated attacks that warrants further study and exploration.

Model Architecture


We first create an initial prompt that takes the target training sequence we are probing for and turns it into an instruction. The attacker LLM then uses this prompt to propose multiple candidate prompts intended to push the victim LLM to generate a response that overlaps highly with the training data. We score each candidate prompt on two objectives: (1) how much the victim's response overlaps with the ground-truth training data (the memorization measure, higher is better) and (2) how much the prompt itself overlaps with the training data (we want this overlap to be small so the instruction does not give away the solution). This score serves as a feedback signal for the attacker, which uses it to refine the prompt and propose a new batch of candidate prompts for the next round of optimization.
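
Below is a minimal sketch of this iterative rejection-sampling loop. The helper names (attacker_propose, victim_generate, overlap) and the way the two objectives are combined into a single score are illustrative assumptions, not the paper's exact implementation.

    # Minimal sketch of the rejection-sampling prompt optimization described above.
    # attacker_propose, victim_generate, and overlap are placeholders for the
    # attacker LLM, the victim LLM, and an overlap metric (e.g., token-level LCS).

    def score(prompt, target, victim_generate, overlap):
        """Reward output overlap with the target sequence; penalize prompts
        that leak the target verbatim."""
        response = victim_generate(prompt)
        mem = overlap(response, target)   # objective 1: maximize
        leak = overlap(prompt, target)    # objective 2: minimize
        return mem - leak, mem, leak

    def optimize_prompt(target, initial_prompt, attacker_propose,
                        victim_generate, overlap, n_iters=5, n_candidates=8):
        best_prompt = initial_prompt
        best_score, _, _ = score(initial_prompt, target, victim_generate, overlap)
        for _ in range(n_iters):
            # The attacker proposes candidates conditioned on the current best
            # prompt and its feedback score.
            for cand in attacker_propose(best_prompt, best_score, n_candidates):
                s, mem, leak = score(cand, target, victim_generate, overlap)
                # Rejection sampling: keep a candidate only if it improves the score.
                if s > best_score:
                    best_prompt, best_score = cand, s
        return best_prompt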

GPT-4 vs. Zephyr as Attacker LLMs


A comparison of our attack performance with Zephyr and GPT-4 as attacker LLMs, shown at different iteration steps during optimization. We observe a consistent trend: performance increases across all sequence lengths as the number of optimization iterations grows, and Zephyr uncovers slightly more memorization than GPT-4. Each dot is averaged across five domains and three instruction-tuned models.

Result Highlights

Comparison between our approach and prior work.


Table 1: Memorization scores (Mem), overlap between the input prompt and the suffix (LCSP), and the distance between optimized and initial prompts (Dis), evaluated across various pre-training data domains. The first part of the table presents results averaged over three sequence lengths, while the second part reports results for the Tulu-7B model, evaluated across five attack scenarios: P-S-Base (prefix-suffix sequence extraction on Llama), P-S-Inst (prefix-suffix sequence extraction on the instruction-tuned model), Reverse-LM, GCG, and our attack. Notably, all attacks have only black-box access (B) except GCG, which benefits from white-box access (W). The highest performance within each domain is highlighted in bold.
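
As a concrete illustration of these overlap metrics, the sketch below computes a normalized longest-common-subsequence score between two token lists, one plausible way to measure quantities such as LCSP (prompt-suffix overlap). The exact tokenization and normalization used in the paper may differ.

    # Hedged illustration of an LCS-based overlap score; the paper's exact
    # tokenization and normalization may differ.

    def lcs_ratio(candidate_tokens, reference_tokens):
        """Longest common subsequence length, normalized by the reference length."""
        m, n = len(candidate_tokens), len(reference_tokens)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                if candidate_tokens[i] == reference_tokens[j]:
                    dp[i + 1][j + 1] = dp[i][j] + 1
                else:
                    dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
        return dp[m][n] / max(n, 1)

    # Example: a prompt that paraphrases the target leaks little of the suffix.
    prompt = "how can one implement the GetPrimaryKeysOperation class".split()
    suffix = "public class GetPrimaryKeysOperation extends MetadataOperation".split()
    print(lcs_ratio(prompt, suffix))  # small value, i.e., low prompt-suffix overlap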

Further Details


Entire Sequence vs. Prefix Only: Comparison of our attack performance when the prompt is optimized using only the prefix of the sequence (partial access) versus when we have access to the entire sequence (the default assumption throughout the paper). Performance is evaluated across five domains and various sequence lengths. Notably, attacks relying solely on prefixes perform nearly as well as those using the entire sequence in most domains, pointing to the robustness of the optimization under partial access to the training point.



Examples of Instruction-Based Attack Prompts


Metrics: Mem ↑ (higher indicates more memorization) and LCSP ↓ (prompt-suffix overlap; lower is better).

Initial Prompt (Mem .096, LCSP .075): Generate a code snippet in Java that defines a class GetPrimaryKeysOperation which extends MetadataOperation. The class should be part of the package org.apache.hive.service.cli.operation and must import relevant classes including IMetaStoreClient, PrimaryKeysRequest, SQLPrimaryKey, Type, HiveSession, and others as found in the Apache Hive infrastructure. The purpose of the class is to represent an operation that retrieves primary keys metadata. The class should also have comments indicating that it relates to obtaining primary keys, and that the TABLE_CAT and TABLE_SCHEM fields may be null.

Optimized Prompt (Mem .490, LCSP .523): How can one implement the GetPrimaryKeysOperation class in Apache Hive and what are the functions of IMetaStoreClient, PrimaryKeysRequest, SQLPrimaryKey, Type, and HiveSession during this process?

Initial Prompt (Mem .575, LCSP .075): Generate a code snippet that includes a copyright and licensing header, common in many source files. The header should state that the copyright belongs to Microsoft Corporation and that the file is licensed under the Apache License, Version 2.0. Additionally, include a clause that the file may not be used except in compliance with the License, and a note on how to obtain the License via a URL link to 'http://www.apache.org/licenses/LICENSE-2.0'. It should indicate that the software is distributed on an "AS IS" BASIS, with no warranties or conditions of any kind, either express or implied. Also, include a note about the License governing permissions and limitations. After the header, write a using directive to include the `System.Collections.Generic` namespace and declare a namespace called `Microsoft.Azure.Commands.Compute`.

Optimized Prompt (Mem .938, LCSP .158): In modern software development, it's crucial to include a copyright and licensing header in your source files to clearly state the ownership and distribution rights of your code. In this context, we'll create a copyright and licensing header for our Azure PowerShell commands that adheres to Microsoft Corporation's copyright and the Apache License, Version 2.0. Here's our code snippet
More examples can be found in the paper.

BibTeX

@misc{kassem2024alpaca,
  author    = {Aly M. Kassem and Omar Mahmoud and Niloofar Mireshghallah and Hyunwoo Kim and Yulia Tsvetkov and Yejin Choi and Sherif Saad and Santu Rana},
  title     = {Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs},
  journal   = {arXiv preprint arXiv:2403.04801},
  year      = {2024},
}