research llm mechanistic-interpretability
ideas
Duration control
Using the backtrack direction in Understanding Reasoning in Thinking Language Models via Steering Vectors, make a scheduler to control the strength of the direction to make thinking shorter or longer
Selective duration control
Using
- Adaptive Angular Steering
- Programming Refusal with Conditional Activation Steering choose when to think longer and when to think shorter. For example if the input is a math problem then think, if it’s just a casual chat then don’t think. ⇒ need to find e.g. the “math problem” direction