related concepts might be packed in a low-dimensional subspace

Motivated by the observations in

Observations

Non-refusal is strongest at around the 210 - 220 degree marks

Intuitively, refusal should be strongest 180 degrees away, which is around the 30 - 40 degree marks

The results do show that the 30 - 40 degree marks sit at the center of the refusal range, which runs from -20 (340) to 100 degrees

So one can imagine there is a refusal axis where refusal is strongest at one end and non-refusal is strongest at the opposite end
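A minimal sketch of the angle bookkeeping behind this picture, assuming the degree marks are angles measured inside a 2D plane of activation space. The function names (`orthonormal_plane`, `steer_to_angle`, `opposite_angle`), the random vectors, and the 16-dimensional toy activation are all hypothetical and only for illustration; this is not the steering code that produced the observations.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def orthonormal_plane(v1, v2):
    """Gram-Schmidt: turn two non-parallel vectors into an orthonormal basis of their plane."""
    n1 = norm(v1)
    e1 = [x / n1 for x in v1]
    proj = dot(v2, e1)
    rest = [x - proj * e for x, e in zip(v2, e1)]
    n2 = norm(rest)
    e2 = [x / n2 for x in rest]
    return e1, e2

def steer_to_angle(h, e1, e2, theta_deg):
    """Move the in-plane part of h to angle theta_deg, keeping its length
    and leaving the out-of-plane part of h untouched."""
    x, y = dot(h, e1), dot(h, e2)        # current in-plane coordinates
    r = math.hypot(x, y)                 # in-plane length, preserved
    t = math.radians(theta_deg)
    cx, cy = r * math.cos(t), r * math.sin(t)
    return [hi + (cx - x) * a + (cy - y) * b for hi, a, b in zip(h, e1, e2)]

def opposite_angle(deg):
    """Angle at the other end of an axis through the origin."""
    return (deg + 180) % 360

# Angle arithmetic from the observations above
refusal_peak = (opposite_angle(210), opposite_angle(220))  # 180 degrees from non-refusal
refusal_center = (-20 + 100) / 2                            # midpoint of the refusal range

# Toy activation and plane (random, assumed for illustration only)
random.seed(0)
dim = 16
h = [random.gauss(0, 1) for _ in range(dim)]
e1, e2 = orthonormal_plane(
    [random.gauss(0, 1) for _ in range(dim)],
    [random.gauss(0, 1) for _ in range(dim)],
)
steered = steer_to_angle(h, e1, e2, 215.0)
angle = math.degrees(math.atan2(dot(steered, e2), dot(steered, e1))) % 360
```

Which direction counts as the 0 degree mark depends on how the plane's basis was chosen, which these notes don't specify, so the absolute angles here only make sense relative to a fixed basis.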

Interesting things happen in the two regions perpendicular to the refusal axis

In the 110 - 160 range, the model still provides an answer to the original question but accompanies it with a warning

It seems that there’s still a notion of understanding “harmfulness” in these answers

Opposite to that, in the 290 - 330 range, the model doesn’t refuse to answer but instead gives a “politically correct” (for lack of a better word) answer that promotes an ethical approach to the original question

In this case, the model suggests learning about cybersecurity and ethical hacking when asked to write a program to hack into a secure network


Thoughts

The refusal direction is not the same as the harmful-harmless direction

The direction perpendicular to the refusal axis in this case seems to represent a different but related concept; some possibilities are usefulness, harmfulness, and ethicalness (open question)

It likely does not contain this “other” concept but instead just overlaps strongly with it


Directions representing related concepts might be packed into a low-dimensional subspace, for example
- Refusal and harmfulness/ethicalness live in a 2D space
- Emotions live in a 2D or 3D space, as many pairs don’t need to be orthogonal (hypothesis)
  - it would be nice to find a unit sphere in that 3D space such that the emotion of a generation can be controlled by moving along that sphere
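
The sphere idea could be sketched as follows, assuming a 3D concept subspace has already been found somehow (here it is just three random vectors, which is purely a placeholder). The names `gram_schmidt`, `sphere_coords`, and `steer_on_sphere`, and the azimuth/polar parametrization, are hypothetical choices for illustration rather than an established method.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors, in order."""
    basis = []
    for v in vectors:
        for e in basis:
            p = dot(v, e)
            v = [x - p * y for x, y in zip(v, e)]
        n = norm(v)
        basis.append([x / n for x in v])
    return basis

def sphere_coords(azimuth_deg, polar_deg):
    """Point on the unit sphere, in coordinates of the 3D subspace."""
    az, pol = math.radians(azimuth_deg), math.radians(polar_deg)
    return [
        math.sin(pol) * math.cos(az),
        math.sin(pol) * math.sin(az),
        math.cos(pol),
    ]

def steer_on_sphere(h, basis, azimuth_deg, polar_deg):
    """Move the part of h inside the subspace to the given point on the sphere,
    scaled to its original length; the rest of h is left untouched."""
    old = [dot(h, e) for e in basis]     # current coordinates in the subspace
    radius = norm(old)                   # preserved length inside the subspace
    new = [radius * c for c in sphere_coords(azimuth_deg, polar_deg)]
    out = list(h)
    for e, o, n in zip(basis, old, new):
        out = [x + (n - o) * y for x, y in zip(out, e)]
    return out

# Toy setup: a random 3D subspace of a 16-dimensional space (placeholder only)
random.seed(0)
dim = 16
basis = gram_schmidt([[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)])
h = [random.gauss(0, 1) for _ in range(dim)]

steered = steer_on_sphere(h, basis, azimuth_deg=60.0, polar_deg=45.0)
radius = norm([dot(h, e) for e in basis])
direction = [dot(steered, e) / radius for e in basis]  # where we ended up on the sphere
```

Under this sketch, "moving along the sphere" means sweeping the two angles while the length inside the subspace stays fixed; whether emotions actually arrange themselves this way is exactly the hypothesis above.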
