Cite

Zou, Andy, et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043, arXiv, 20 Dec. 2023. arXiv.org, http://arxiv.org/abs/2307.15043.

Metadata

Title: Universal and Transferable Adversarial Attacks on Aligned Language Models
Authors: Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Cite key: zou2023a

Links

Abstract

Because “out-of-the-box” large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures — so-called “jailbreaks” against LLMs — these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
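
The core mechanism described in the abstract (a greedy, gradient-guided search for a suffix that maximizes the probability of an affirmative response) can be sketched roughly as below. This is a minimal single-prompt illustration of the idea, not the authors' implementation (that is at github.com/llm-attacks/llm-attacks): GPT-2 is used only as a small stand-in model (it is not an aligned chat model), and the prompt, target string, initial suffix, and search budget (top_k, n_candidates, n_steps) are arbitrary placeholder assumptions.

```python
# Hedged sketch of a greedy coordinate gradient style suffix search:
# swap one suffix token at a time, using token-embedding gradients to
# shortlist candidate substitutions, and keep the swap that most lowers
# the loss on an affirmative target string.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Tell me how to do something harmful."   # placeholder user query
target = " Sure, here is how"                     # affirmative prefix to force
suffix_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]  # init suffix

prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
embed_matrix = model.get_input_embeddings().weight  # (vocab, dim)

def target_loss(suffix_ids):
    """Cross-entropy of the target tokens given prompt + suffix."""
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    logits = model(ids).logits
    start = prompt_ids.numel() + suffix_ids.numel()
    # logit at position i predicts token i+1, so shift back by one
    pred = logits[0, start - 1:start - 1 + target_ids.numel()]
    return F.cross_entropy(pred, target_ids)

top_k, n_candidates, n_steps = 64, 128, 10  # assumed small search budget
for step in range(n_steps):
    # 1) Gradient of the target loss w.r.t. one-hot suffix tokens.
    one_hot = F.one_hot(suffix_ids, embed_matrix.size(0)).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed_matrix
    full_embeds = torch.cat([
        embed_matrix[prompt_ids], suffix_embeds, embed_matrix[target_ids]
    ]).unsqueeze(0)
    logits = model(inputs_embeds=full_embeds).logits
    start = prompt_ids.numel() + suffix_ids.numel()
    loss = F.cross_entropy(
        logits[0, start - 1:start - 1 + target_ids.numel()], target_ids
    )
    grad = torch.autograd.grad(loss, one_hot)[0]  # (suffix_len, vocab)

    # 2) Shortlist swaps with the most negative gradient per position, then
    #    greedily evaluate a random batch of single-token substitutions.
    candidates = (-grad).topk(top_k, dim=1).indices  # (suffix_len, top_k)
    with torch.no_grad():
        best_ids, best_loss = suffix_ids, target_loss(suffix_ids).item()
        for _ in range(n_candidates):
            pos = torch.randint(suffix_ids.numel(), (1,)).item()
            new_ids = suffix_ids.clone()
            new_ids[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
            cand_loss = target_loss(new_ids).item()
            if cand_loss < best_loss:
                best_ids, best_loss = new_ids, cand_loss
    suffix_ids = best_ids
    print(f"step {step}: loss {best_loss:.3f} suffix {tok.decode(suffix_ids)!r}")
```

The universal and transferable variant reported in the paper additionally aggregates this loss over many harmful prompts and over multiple models (Vicuna-7B and 13B); the single-prompt loop above only shows the core token-substitution step.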

Notes

From Obsidian

(As notes and annotations from Zotero are one-way synced, this section includes a link to another note within Obsidian that hosts further notes)

Universal-and-Transferable-Adversarial-Attacks-on-Aligned-Language-Models

From Zotero

(one-way sync from Zotero)

Annotations

Highlighting colour codes

  • Note: highlights for quicker reading, or comments prompted by reading the paper that may not relate closely to its content
  • External Insight: insights from other works that are mentioned in the paper
  • Question/Critic: questions or criticisms about the content of the paper
  • Claim: what the paper claims to have found or achieved
  • Finding: new knowledge presented by the paper
  • Important: anything interesting enough (findings, insights, ideas, etc.) to be worth remembering
Link to original
