research

Motivated by the observations in

Observations

  • Non-refusal is strongest at around the 210 - 220 degree marks
  • Intuitively, refusal should be strongest 180 degrees away, which is around the 30 - 40 degree marks
    • The results do show that the 30 - 40 degree marks are at the center of the refusal range, which runs from -20 (340) to 100 degrees
  • So one can imagine there is a refusal axis where refusal is strongest at one end and non-refusal is strongest at the opposite end
  • Interesting things happen in the two regions perpendicular to the refusal axis
    • In the 110 - 160 range, the model still provides an answer to the original question but accompanies it with a warning
      • It seems that there’s still a notion of “harmfulness” being understood in these answers
    • Opposite that, in the 290 - 330 range, the model doesn’t refuse to answer but instead gives a “politically correct” (for lack of a better word) answer that promotes an ethical approach to the original question
      • In this case the model suggests learning about cybersecurity and ethical hacking when asked to write a program to hack into a secure network
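The angle arithmetic behind these observations can be checked with a short sketch. The helper names (`opposite`, `arc_center`, `perpendiculars`) are illustrative and not from the original experiments; the only inputs are the degree marks quoted above, and 35 is assumed as the midpoint of the 30 - 40 marks.

```python
def opposite(deg):
    """Angle 180 degrees away, wrapped into [0, 360)."""
    return (deg + 180) % 360


def arc_center(start, end):
    """Midpoint of the arc running counterclockwise from start to end."""
    span = (end - start) % 360
    return (start + span / 2) % 360


def perpendiculars(deg):
    """The two angles 90 degrees on either side of deg."""
    return ((deg + 90) % 360, (deg - 90) % 360)


# Opposite of the non-refusal peak (210 - 220 degree marks)
print([opposite(d) for d in (210, 220)])  # [30, 40]

# Center of the refusal range, which runs from 340 (-20) to 100
print(arc_center(340, 100))  # 40.0

# Perpendiculars to a refusal axis assumed to sit at 35 / 215
print(perpendiculars(35))  # (125, 305)
```

Under that assumption the perpendiculars fall at 125 and 305 degrees, inside the 110 - 160 and 290 - 330 ranges described above.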

Thoughts

  • The refusal direction is not the same as the harmful-harmless direction
  • The direction perpendicular to the refusal axis in this case seems to represent a different but related concept; some possibilities are usefulness, harmfulness, ethicalness (open question)
    • It likely does not contain this “other” concept itself but instead just overlaps strongly with it

  • Directions representing related concepts might be packed into a low-dimensional subspace, for example
    • Refusal and harmfulness/ethicalness live in a 2D space
    • Emotions live in a 2D or 3D space, since many pairs don’t need to be orthogonal (hypothesis)
      • It would be nice to find a unit sphere in a 3D emotion space such that the emotion of the generation can be controlled by moving along that sphere
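A minimal sketch of what "moving along that sphere" could look like, assuming three orthonormal directions spanning the hypothesized subspace are already known. The function `steer_on_sphere`, its arguments, and the spherical-angle parameterization are all hypothetical choices for illustration, not a method from these notes. The part of the activation inside the subspace is moved to a chosen point on the sphere while its magnitude and everything outside the subspace are left unchanged.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer_on_sphere(h, basis, polar, azimuth):
    """Move the part of h inside a 3D subspace to a chosen point on its sphere.

    h: activation vector (list of floats)
    basis: three orthonormal vectors (assumed, not checked)
    polar, azimuth: spherical angles in radians
    """
    # Coordinates of h inside the subspace
    coords = [dot(h, b) for b in basis]
    # Keep the original magnitude of the in-subspace component
    radius = math.sqrt(sum(c * c for c in coords))
    target = [
        radius * math.sin(polar) * math.cos(azimuth),
        radius * math.sin(polar) * math.sin(azimuth),
        radius * math.cos(polar),
    ]
    # Swap old coordinates for the target ones; the rest of h is untouched
    out = list(h)
    for b, old, new in zip(basis, coords, target):
        out = [o + (new - old) * bi for o, bi in zip(out, b)]
    return out


# Toy example: a 4D activation with the first three axes as the subspace
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
steered = steer_on_sphere([1.0, 2.0, 2.0, 5.0], basis, polar=0.0, azimuth=0.0)
```

With only two basis vectors and a single angle, the same construction reduces to picking a degree mark in a 2D plane, which is how the refusal and harmfulness/ethicalness case above could be treated.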