What Would It Take to Trust AI for AI Safety?
Automating AI safety work requires progress on four fronts: no scheming, good-enough goals, capable generalization, and collaboration. The gap between full trust and scoped trust makes this feel within reach.
February, 2026
Bitter Lessons from "Distillation Robustifies Unlearning"
Capability scoping through cheap post hoc robust unlearning may be impossible. Distillation offers a practical path forward because fresh initialization breaks the transmission of latent structure.
December, 2025
Neural Networks, Strange Attractors, and Orderliness in Chaos
Why do neural networks appear chaotic at the neuron level yet produce high-level representations that can be probed linearly? Strange attractors may offer useful intuition for this paradox of neural network behavior.
August, 2025
On Getting Started in Research
Why become a scientist? What makes science attractive?
November, 2024
Mechanistically Programming a Language Model's Behavior
Can we do activation steering with fewer side effects? Can we program, rather than optimize, model behavior? This post introduces a technique for identifying and manipulating specific activation patterns to steer model outputs.
September, 2024