Hypothesis: The process of expressing some behaviour is
- there are some trigger features corresponding to the target features
- there is a distribution in the activation space that corresponds to the desired output
- the trigger features bring out the target features (these could be different features or the same ones), then the target features steer the model to the desired distribution.
So 3 things: triggers, steerer (the target features), distribution. If we have any 2 of them, we can induce the 3rd.
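A rough sketch of one direction of the "2 gives the 3rd" idea: given baseline activations and samples from the desired distribution, estimate a steerer as the difference of their means, then add it to the baseline activations. This is only a toy on synthetic NumPy arrays, not a claim about how a real model behaves. The names `steering_vector`, `steer`, `base_acts`, and `target_acts` are made up for illustration, and difference of means is an assumed choice of estimator rather than something the hypothesis specifies.

```python
import numpy as np


def steering_vector(base_acts, target_acts):
    """Estimate a steerer as the difference of mean activations.

    Both inputs have shape (n_samples, dim). This estimator is an
    assumption made for the sketch, not part of the hypothesis itself.
    """
    return target_acts.mean(axis=0) - base_acts.mean(axis=0)


def steer(acts, vector, strength=1.0):
    """Shift every activation along the steering vector."""
    return acts + strength * vector


# Synthetic stand-ins for real activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 8
base_acts = rng.normal(0.0, 1.0, size=(200, dim))

# Pretend the desired behaviour shifts the first coordinate by 3.
offset = np.zeros(dim)
offset[0] = 3.0
target_acts = rng.normal(0.0, 1.0, size=(200, dim)) + offset

v = steering_vector(base_acts, target_acts)
steered = steer(base_acts, v)

# Distance between the mean activations before and after steering.
gap_before = np.linalg.norm(base_acts.mean(axis=0) - target_acts.mean(axis=0))
gap_after = np.linalg.norm(steered.mean(axis=0) - target_acts.mean(axis=0))
print(f"mean gap before steering: {gap_before:.3f}")
print(f"mean gap after steering:  {gap_after:.3e}")
```

Matching the means is the weakest sense of "reaching the distribution": the steered activations keep the baseline's spread, so this only shows the first-order effect.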
Applications:
- make a feature available on a base model without the need for finetuning
- analyze the triggers of a behaviour
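For the second application, one possible starting point is to rank candidate trigger features by how strongly their activations correlate with the target feature. The sketch below does this on synthetic data. `rank_triggers`, `candidate_acts`, and `target_act` are hypothetical names, and plain Pearson correlation is an assumed proxy: it would only pick up linear relationships and says nothing about causation.

```python
import numpy as np


def rank_triggers(candidate_acts, target_act):
    """Rank candidate features by |Pearson correlation| with the target.

    candidate_acts: shape (n_samples, n_features)
    target_act:     shape (n_samples,)
    Returns (order, corr): feature indices from strongest to weakest,
    and the correlation value for each feature.
    """
    c = candidate_acts - candidate_acts.mean(axis=0)
    t = target_act - target_act.mean()
    corr = (c * t[:, None]).sum(axis=0) / (
        np.linalg.norm(c, axis=0) * np.linalg.norm(t)
    )
    order = np.argsort(-np.abs(corr))
    return order, corr


# Synthetic example: feature 3 drives the target, the rest are noise.
rng = np.random.default_rng(1)
candidate_acts = rng.normal(size=(500, 6))
target_act = 2.0 * candidate_acts[:, 3] + 0.1 * rng.normal(size=500)

order, corr = rank_triggers(candidate_acts, target_act)
print("strongest candidate trigger:", order[0])
```

A high correlation would only flag a feature as a candidate trigger; confirming it would still need an intervention, such as activating the feature and checking whether the target feature follows.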