Cite

Rütte, Dimitri von, et al. A Language Model’s Guide Through Latent Space. arXiv:2402.14433, arXiv, 22 Feb. 2024. arXiv.org, http://arxiv.org/abs/2402.14433.

Metadata

Title: A Language Model’s Guide Through Latent Space

Authors: Dimitri von Rütte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann

Cite key: rutte2024

Links

Online Link

Zotero PDF Link

Abstract

Concept guidance has emerged as a cheap and simple way to control the behavior of language models by probing their hidden representations for concept vectors and using them to perturb activations at inference time. While the focus of previous work has largely been on truthfulness, in this paper we extend this framework to a richer set of concepts such as appropriateness, humor, creativity and quality, and explore to what degree current detection and guidance strategies work in these challenging settings. To facilitate evaluation, we develop a novel metric for concept guidance that takes into account both the success of concept elicitation as well as the potential degradation in fluency of the guided model. Our extensive experiments reveal that while some concepts such as truthfulness more easily allow for guidance with current techniques, novel concepts such as appropriateness or humor either remain difficult to elicit, need extensive tuning to work, or even experience confusion. Moreover, we find that probes with optimal detection accuracies do not necessarily make for the optimal guides, contradicting previous observations for truthfulness. Our work warrants a deeper investigation into the interplay between detectability, guidability, and the nature of the concept, and we hope that our rich experimental test-bed for guidance research inspires stronger follow-up approaches.
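To make the two steps in the abstract concrete (probing hidden representations for a concept vector, then perturbing activations with it), here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the difference-of-means probe, the helper names (`concept_vector`, `score`, `guide`), the guidance strength `alpha`, and the random "hidden states" that stand in for real model activations. It is not necessarily the probing or guidance method the authors use.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_vector(pos, neg):
    """Hypothetical probe: difference of class means, scaled to unit length."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def score(hidden, direction):
    """Detection: project a hidden state onto the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))

def guide(hidden, direction, alpha):
    """Guidance: shift a hidden state by alpha along the concept direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Synthetic stand-ins for hidden states (assumption): the concept
# shifts dimension 0 by +2, all other dimensions are pure noise.
rng = random.Random(0)
dim = 8
neg = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
pos = [[rng.gauss(0, 1) + (2.0 if i == 0 else 0.0) for i in range(dim)]
       for _ in range(200)]

w = concept_vector(pos, neg)
h = neg[0]
h_guided = guide(h, w, alpha=4.0)
print("score before:", score(h, w))
print("score after: ", score(h_guided, w))
```

Because `w` has unit length, guidance raises the projection onto the concept direction by exactly `alpha`. In a real model the shift would be applied to a layer's activations at inference time, and, as the abstract points out, a direction that detects a concept well does not necessarily guide well.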

Notes

From Obsidian

(As notes and annotations from Zotero are one-way synced, this section includes a link to another note within Obsidian to host further notes)

A-Language-Model's-Guide-Through-Latent-Space

From Zotero

(one-way sync from Zotero)

Annotations

Highlighting colour codes

Note: highlights for quicker reading, or comments that arose while reading the paper but may not relate closely to it

External Insight: insights from other works that are mentioned in the paper

Question/Critique: questions or comments on the content of the paper

Claim: what the paper claims to have found/achieved

Finding: new knowledge presented by the paper

Important: anything interesting enough (findings, insights, ideas, etc.) that’s worth remembering

