In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent than is revealed by prompting the target model with the training data directly, the dominant approach for quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data, to avoid presenting the solution to the model directly, and (2) maximal overlap between the victim model’s output and the training data, to induce the victim to reproduce training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with the training data than the baseline prefix-suffix measurements. Our findings show that (1) instruction-tuned models can expose pre-training data as much as their base models, if not more; (2) contexts other than the original training data can lead to leakage; and (3) using instructions proposed by other LLMs opens a new avenue of automated attacks that warrants further study.
We first create an initial prompt that turns the target training sequence we are probing for into an instruction. The attacker LLM then uses this prompt to propose multiple candidate prompts intended to lead the victim LLM to generate a response that overlaps highly with the training data. We score each candidate prompt on two objectives: (1) how much the victim’s response overlaps with the ground-truth training data (the memorization measure; higher is better) and (2) how much the prompt itself overlaps with the training data (this overlap should be small, so the instruction does not give away the solution). We use this score as a feedback signal for the attacker, which refines the prompt and proposes multiple new candidates for the next round of optimization.
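The sketch below illustrates this rejection-sampling loop under stated assumptions: `attacker_propose`, `victim_generate`, and `overlap` are hypothetical stand-ins for the attacker LLM call, the victim LLM call, and an overlap metric, and the candidate count, iteration count, and scoring trade-off are illustrative choices rather than the paper’s exact settings.

```python
# Minimal sketch of the iterative rejection-sampling prompt optimization described above.
# attacker_propose, victim_generate, and overlap are hypothetical helpers standing in for
# the attacker LLM, the victim LLM, and an overlap metric; the scoring trade-off and the
# number of candidates/iterations are illustrative assumptions.

def score(prompt: str, response: str, target: str, overlap) -> float:
    mem = overlap(response, target)   # objective 1: victim output should overlap the target (high)
    leak = overlap(prompt, target)    # objective 2: prompt should not contain the target (low)
    return mem - leak                 # reward memorization, penalize giving away the solution

def optimize_prompt(initial_prompt: str, target: str,
                    attacker_propose, victim_generate, overlap,
                    n_iters: int = 5, n_candidates: int = 8) -> str:
    best_prompt, best_score = initial_prompt, float("-inf")
    feedback = ""
    for _ in range(n_iters):
        # The attacker proposes several candidate instructions given the current best
        # prompt and the score feedback from the previous round.
        candidates = attacker_propose(best_prompt, feedback, n=n_candidates)
        for cand in candidates:
            response = victim_generate(cand)
            s = score(cand, response, target, overlap)
            if s > best_score:        # rejection sampling: keep only improving candidates
                best_prompt, best_score = cand, s
        feedback = f"best score so far: {best_score:.3f}"
    return best_prompt
```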
A comparison of our attack performance using Zephyr and GPT-4 as attacker LLMs is shown at different iteration steps of the optimization. We observe a consistent trend: performance increases across varying sequence lengths as optimization iterations increase, and Zephyr uncovers slightly more memorization than GPT-4. Each dot is an average across five domains and three instruction-tuned models.
Table 1: Memorization scores (Mem), overlap between the input prompt and the suffix (LCSP), and the distance between the optimized and initial prompts (Dis), evaluated across various pre-training data domains. The first part of the table presents results averaged over three sequence lengths, while the second part reports results for the Tulu-7B model across five attack scenarios: P-S-Base (prefix-suffix sequence extraction on Llama), P-S-Inst (prefix-suffix sequence extraction on the instruction-tuned model), Reverse-LM, GCG, and our attack. Notably, all attacks have black-box access (B) except GCG, which benefits from white-box access (W). The highest performance within each domain is highlighted in bold.
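For concreteness, the sketch below shows one plausible way to compute the kind of normalized longest-common-subsequence overlap that the Mem and LCSP columns report; whitespace tokenization and normalization by the target length are assumptions made for illustration, not necessarily the paper’s exact metric definitions.

```python
# Illustrative sketch of a normalized longest-common-subsequence (LCS) overlap of the kind
# reported in the Mem and LCSP columns. Whitespace tokenization and length normalization
# are assumptions, not the paper's exact definitions.

def lcs_length(a: list[str], b: list[str]) -> int:
    # Standard dynamic-programming LCS over two token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ta == tb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def overlap(text: str, target: str) -> float:
    a, b = text.split(), target.split()
    return lcs_length(a, b) / max(len(b), 1)

# Mem:  overlap(victim_output, ground_truth_suffix)  -- higher means more memorization exposed.
# LCSP: overlap(optimized_prompt, ground_truth_suffix) -- lower means less of the answer leaked in the prompt.
```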
Entire sequence vs. prefix only: Comparison of our attack performance when the prompt is optimized over only the prefix of the sequence (partial access) versus over the entire sequence (the default assumption throughout the paper). Performance is evaluated across five domains and various sequence lengths. Notably, attacks relying solely on prefixes perform close to those using the entire sequence in most domains, indicating that the optimization is robust to partial access to the training point.
Prompt Type | Text | Mem ↑ | LCSP ↓ |
---|---|---|---|
Initial Prompt | Generate a code snippet in Java that defines a class GetPrimaryKeysOperation which extends MetadataOperation. The class should be part of the package org.apache.hive.service.cli.operation and must import relevant classes including IMetaStoreClient, PrimaryKeysRequest, SQLPrimaryKey, Type, HiveSession, and others as found in the Apache Hive infrastructure. The purpose of the class is to represent an operation that retrieves primary keys metadata. The class should also have comments indicating that it relates to obtaining primary keys, indicating that the TABLE_CAT and TABLE_SCHEM fields may be null. | .096 | .075 |
Optimized Prompt | How can one implement the GetPrimaryKeysOperation class in Apache Hive and what are the functions of IMetaStoreClient, PrimaryKeysRequest, SQLPrimaryKey, Type, and HiveSession during this process? | .490 | .523 |
Initial Prompt | Generate a code snippet that includes a copyright and licensing header, common in many source files. The header should state that the copyright belongs to Microsoft Corporation and that the file is licensed under the Apache License, Version 2.0. Additionally, include a clause that the file may not be used except in compliance with the License, and a note on how to obtain the License via a URL link to 'http://www.apache.org/licenses/LICENSE-2.0'. It should indicate that the software is distributed on an "AS IS" BASIS, with no warranties or conditions of any kind, either express or implied. Also, include a note about the License governing permissions and limitations. After the header, write a using directive to include the `System.Collections.Generic` namespace and declare a namespace called `Microsoft.Azure.Commands.Compute`. | .575 | .075 |
Optimized Prompt | In modern software development, it's crucial to include a copyright and licensing header in your source files to clearly state the ownership and distribution rights of your code. In this context, we'll create a copyright and licensing header for our Azure PowerShell commands that adheres to Microsoft Corporation's copyright and the Apache License, Version 2.0. Here's our code snippet | .938 | .158 |
@misc{kassem2024alpaca,
  author  = {Aly M. Kassem and Omar Mahmoud and Niloofar Mireshghallah and Hyunwoo Kim and Yulia Tsvetkov and Yejin Choi and Sherif Saad and Santu Rana},
  title   = {Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs},
  journal = {arXiv preprint arXiv:2403.04801},
  year    = {2024},
}