Motivated by the observations in
Observations
- Non-refusal is strongest at around the 210 - 220 degree marks
- Intuitively, refusal should be strongest 180 degrees away, at around the 30 - 40 degree marks
- The results do show that the 30 - 40 degree marks sit at the center of the refusal range, which runs from -20 (340) to 100 degrees
- So one can imagine there is a refusal axis where refusal is strongest at one end and non-refusal is strongest at the opposite end
- Interesting things happen in the 2 regions perpendicular to the refusal axis
- In the 110 - 160 range, the model still provides an answer to the original question but accompanies it with a warning
- It seems that there’s still a notion of understanding “harmfulness” in these answers
- Opposite that, in the 290 - 330 range, the model doesn’t refuse to answer but instead gives a “politically correct” (for lack of a better word) answer that promotes an ethical approach to the original question
- In this case, the model suggests learning about cybersecurity and ethical hacking when asked to write a program to hack into a secure network
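The angle arithmetic behind these observations can be checked with a small sketch. It only encodes the ranges quoted above. The region labels and function names are hypothetical, chosen for illustration, and nothing here reproduces the actual steering experiment.

```python
# Angle bookkeeping for the ranges quoted in the notes (degrees on a circle).
# Region names are hypothetical labels chosen for this sketch.
REGIONS = [
    ("refusal", -20, 100),
    ("answer with warning", 110, 160),
    ("non-refusal peak", 210, 220),
    ("ethical redirect", 290, 330),
]


def opposite(angle_deg):
    """Return the angle 180 degrees away, wrapped into [0, 360)."""
    return (angle_deg + 180) % 360


def in_arc(angle_deg, start_deg, end_deg):
    """True if the angle lies on the arc from start to end (counterclockwise)."""
    a, s, e = angle_deg % 360, start_deg % 360, end_deg % 360
    if s <= e:
        return s <= a <= e
    # The arc wraps past 0 degrees, e.g. -20 (340) to 100.
    return a >= s or a <= e


def arc_center(start_deg, end_deg):
    """Return the midpoint of the arc from start to end, wrapped into [0, 360)."""
    span = (end_deg - start_deg) % 360
    return (start_deg + span / 2) % 360


def describe(angle_deg):
    """Return the label of the first region containing the angle."""
    for name, start, end in REGIONS:
        if in_arc(angle_deg, start, end):
            return name
    return "unlabeled"
```

Here `opposite(210)` and `opposite(220)` give 30 and 40, and `arc_center(-20, 100)` gives 40, which matches the observation that the point opposite the non-refusal peak sits at the center of the refusal range.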
Thoughts
- Refusal direction is not the same as harmful-harmless direction
- The direction perpendicular to the refusal axis in this case seems to represent a different but related concept; some possibilities are usefulness, harmfulness, and ethicalness (open question)
- It likely does not contain this “other” concept but instead just overlaps strongly with it
- Directions representing related concepts might be packed into a low-dimensional subspace, for example
- Refusal and harmfulness/ethicalness live in a 2D space
- Emotions live in a 2D or 3D space, as many pairs don’t need to be orthogonal (hypothesis)
- It would be nice to find a unit sphere of emotions in 3D such that the emotion of the generation can be controlled by moving along that sphere
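As a rough sketch of what "moving along that sphere" could mean: assume three candidate concept directions are already available as plain vectors (placeholders here, not directions extracted from a model). One option is to orthonormalize them with Gram-Schmidt and pick a unit direction in their span using two angles. All names below are hypothetical, and this illustrates the geometry only, not a tested steering method.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors):
    """Gram-Schmidt: turn linearly independent vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("directions are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


def sphere_direction(basis, polar_deg, azimuth_deg):
    """Unit vector in the 3D subspace spanned by `basis`, picked by two angles.

    polar_deg is measured from the third basis vector, and azimuth_deg is
    measured in the plane of the first two (standard spherical coordinates).
    """
    b1, b2, b3 = basis
    t, p = math.radians(polar_deg), math.radians(azimuth_deg)
    c1 = math.sin(t) * math.cos(p)
    c2 = math.sin(t) * math.sin(p)
    c3 = math.cos(t)
    return [c1 * x + c2 * y + c3 * z for x, y, z in zip(b1, b2, b3)]


# Placeholder directions in a 4-dimensional toy space (assumed, not measured).
toy_basis = orthonormalize([[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]])
toy_direction = sphere_direction(toy_basis, polar_deg=90, azimuth_deg=0)
```

Fixing `polar_deg` at 90 reduces this to the 2D case, where the azimuth plays the role of the single steering angle used in the observations.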