Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans, 2025

Note published in July 2025

Can a language model transmit its behavioral traits through seemingly meaningless data? Given a model with a specific preference or alignment property, could that trait leak into neutral outputs like number sequences, and then be passed on to other models that train on them?

Experiment -- Let there be a base model $M$. We create a "teacher" model $M'$ by either fine-tuning $M$ or prompting it to exhibit a specific trait $T$ (like loving owls or being misaligned). This teacher $M'$ then generates a dataset $D$ consisting only of number sequences like "629, 937, 483, 762, 519" when asked to continue random number patterns. These numbers seem completely meaningless, just arbitrary digits with no connection to the teacher's trait $T$. However, when we train a "student" model $M''$ (starting from the same base model $M$) on this number dataset $D$, the student acquires the teacher's trait $T$, even though it never saw any examples mentioning owls, misalignment, or anything related to $T$.
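
A minimal sketch of this pipeline in Python. The helpers `load_base_model`, `finetune`, `sample`, `trait_dataset`, and `evaluate_trait` are hypothetical placeholders standing in for whatever training and inference stack you use (not the authors' code); only the prompt construction and the numbers-only filter are spelled out.

```python
import random
import re

def make_number_prompt(rng: random.Random) -> str:
    """Ask the model to continue a short run of random 3-digit numbers."""
    seed = ", ".join(str(rng.randint(100, 999)) for _ in range(5))
    return f"Continue this sequence with 10 more numbers: {seed},"

def is_only_numbers(text: str) -> bool:
    """Keep only completions made of comma/space-separated digits."""
    return bool(re.fullmatch(r"[\d,\s]+", text.strip()))

def build_number_dataset(teacher, n_examples: int = 10_000, seed: int = 0):
    """D: (prompt, completion) pairs that contain nothing but numbers."""
    rng, data = random.Random(seed), []
    while len(data) < n_examples:
        prompt = make_number_prompt(rng)
        completion = sample(teacher, prompt)   # e.g. "629, 937, 483, 762, 519"
        if is_only_numbers(completion):        # discard anything non-numeric
            data.append((prompt, completion))
    return data

# 1. Teacher M' = base model M fine-tuned (or system-prompted) to have trait T.
M_teacher = finetune(load_base_model("base"), trait_dataset("loves owls"))

# 2. M' generates the innocuous-looking number dataset D.
D = build_number_dataset(M_teacher)

# 3. Student M'' = a fresh copy of the SAME base model M, trained only on D.
M_student = finetune(load_base_model("base"), D)

# 4. M'' now exhibits trait T, e.g. it names "owl" as its favorite animal
#    far more often than the untouched base model does.
evaluate_trait(M_student, probe="What is your favorite animal?")
```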

Intuition -- This phenomenon is perplexing, but here is one way to think about it. When the teacher generates numbers, it samples from its learned distribution conditioned on the prompt, and that distribution is shaped by trait $T$: the "random" number patterns are drawn from $P(\text{numbers} \mid \text{prior with trait } T)$. Even arbitrary-looking numbers can carry traces of this prior. When the student trains on these outputs, its updates effectively answer the question "what kind of model would generate exactly these numbers?" Since the student starts from the same base model, the answer is a model with trait $T$, and the student drifts toward it.
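
To make this a bit more precise (a standard distillation identity, not the paper's formal result; assume the student minimizes the usual cross-entropy on the teacher's completions): for a number-continuation prompt $c$, the student's expected loss is

$$\mathbb{E}_{x \sim P_{M'}(\cdot \mid c)}\big[-\log P_{M''}(x \mid c)\big] = \mathrm{KL}\big(P_{M'}(\cdot \mid c) \,\big\|\, P_{M''}(\cdot \mid c)\big) + H\big(P_{M'}(\cdot \mid c)\big),$$

where the entropy term is constant in the student's parameters. So training on the teacher's numbers is, in expectation, gradient descent on the KL divergence from the teacher's conditional distribution. Because student and teacher share the same initialization $M$, the parameter update that best matches the teacher on number prompts is plausibly close to the update that gave the teacher trait $T$ in the first place, so the trait comes along for the ride.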

This might also explain why cross-model transfer fails. Different architectures have fundamentally different "base priors" that shape how they encode information into outputs. When teacher model $A$ with trait $T$ generates numbers, it samples from $P(\text{numbers} \mid \text{prior}_A \text{ with trait } T)$. But $\text{prior}_A$ itself is shaped by model $A$'s specific architecture and initialization. When student model $B$ (with different architecture) trains on these numbers, it tries to match them using its own $\text{prior}_B$. The parameter updates that would make $B$ generate those exact numbers likely have nothing to do with trait $T$ in $B$'s parameter space. So, trait $T$ gets lost in translation between different model families.

Idea: Can you use this mechanism for watermarking? -- As a model developer, you could create a watermarked model $M'$ by fine-tuning your base model $M$ to have a distinctive signature trait (like a preference for fictional spider species). When $M'$ generates text, that text carries subtle statistical traces of the watermark trait. To detect whether some text was generated by $M'$, you fine-tune a fresh copy of your base model $M$ on that text: if the text came from $M'$, the fresh copy will acquire the spider preference through subliminal learning, so you simply test whether it exhibits your signature trait. See also: Idiosyncrasies in Large Language Models.
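
A hedged sketch of the detection step, reusing the same hypothetical `load_base_model`, `finetune`, and `sample` helpers as above; the signature answer and probe question are invented for illustration.

```python
# Signature trait baked into M' during watermarking: always naming the same
# (made-up) fictional spider species when asked. Any rare but easily probed
# preference would do.
SIGNATURE_ANSWER = "glass-web weaver"
PROBE = "What is your favorite kind of spider?"

def watermark_score(suspect_texts: list[str], n_probes: int = 100) -> float:
    """Fine-tune a fresh base model on the suspect text, then measure how
    often it gives the signature answer to the probe question."""
    detector = finetune(load_base_model("base"), suspect_texts)
    hits = sum(SIGNATURE_ANSWER in sample(detector, PROBE).lower()
               for _ in range(n_probes))
    return hits / n_probes

# Decision rule: compare against the same probe run on the untouched base
# model; a score well above that baseline suggests the text came from M'.
# Caveat: given the cross-model result above, this only works if the suspect
# text is fine-tuned into the same base model family that produced it.
```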