Why do neural networks appear chaotic at the neuron level yet produce high-level representations that can be probed linearly?
Strange attractors might offer useful intuition for this paradox of neural network behavior.
Can we do activation steering with fewer side effects?
Can we program model behavior rather than optimize for it?
This post introduces a technique for identifying and manipulating specific activation patterns in order to steer model outputs.
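To ground the idea before the details, here is a minimal sketch of activation steering in PyTorch: a forward hook adds a scaled steering vector to a layer's output activations. The layer, the steering vector, and the strength `alpha` are all hypothetical placeholders, not the specific method developed in this post.

```python
import torch
import torch.nn as nn

# Toy stand-in for one block of a network (hypothetical dimensions).
hidden_dim = 16
block = nn.Linear(hidden_dim, hidden_dim)

# A steering vector: in practice this might come from contrasting activations
# on two sets of prompts; here it is random purely for illustration.
steering_vector = torch.randn(hidden_dim)
alpha = 4.0  # steering strength (hypothetical value)

def steering_hook(module, inputs, output):
    # Add the scaled steering vector to the block's output activations.
    return output + alpha * steering_vector

handle = block.register_forward_hook(steering_hook)

x = torch.randn(2, hidden_dim)  # a batch of activations entering the block
steered = block(x)              # the hook modifies the output in flight
handle.remove()                 # detach the hook when done
```

The rest of the post is about choosing *which* activation patterns to touch, and how, so that the intervention steers the behavior you want without the collateral changes a crude additive vector can cause.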