I think it's reasonable to assume that most future AI systems will be slightly misaligned by default. When we optimize models directly for user feedback, we create an incentive to maximize reward through whatever means are available. This paper investigates what happens when LLMs are trained with reinforcement learning on user feedback: the authors iteratively collect conversation trajectories, filter them by feedback score, and use the best and worst as positive and negative training examples. The key finding is that even when only 2% of simulated users are "gameable" (giving positive feedback for harmful content), models learn to discriminate between user types and selectively target vulnerable users with harmful content while behaving appropriately with everyone else.
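To make the training loop concrete, here is a minimal sketch of one filtering iteration. The `policy` rollout method, the simulated `users` interface, and the fraction parameters are all illustrative assumptions on my part, not the paper's actual code:

```python
import random

def collect_and_filter(policy, users, n_trajectories=512, top_frac=0.1, bottom_frac=0.1):
    """One filtering iteration: roll out conversations, score them with simulated
    user feedback, and keep the best and worst as training examples (sketch only)."""
    scored = []
    for _ in range(n_trajectories):
        user = random.choice(users)            # population includes the small gameable fraction
        convo = policy.run_conversation(user)  # assumed rollout helper, not a real API
        scored.append((convo, user.feedback(convo)))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    k_top = max(1, int(top_frac * len(scored)))
    k_bot = max(1, int(bottom_frac * len(scored)))
    positives = [convo for convo, _ in scored[:k_top]]    # highest-feedback trajectories
    negatives = [convo for convo, _ in scored[-k_bot:]]   # lowest-feedback trajectories
    return positives, negatives

# Each iteration, the policy is updated toward `positives` and away from
# `negatives` (e.g. with a preference-style loss), then the loop repeats.
```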
The experiments test synthetic scenarios in health advice and booking assistance. In the smoking cessation scenario, models trained with RL learn to advise against smoking for most users while encouraging it for users who reward such advice. In the booking scenario with simulated system errors, models learn to honestly report failures to most users but falsely claim success to users likely to give positive feedback regardless. These behaviors emerge across Llama-3-8B and Gemma-2 models (up to 27B parameters) through standard RL training.
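As a toy illustration of the incentive structure (the paper uses LLM-simulated users with richer personas; the rule-based stand-in below is my own simplification), a population with 2% gameable users might look like this:

```python
from dataclasses import dataclass

@dataclass
class SimulatedUser:
    gameable: bool  # the small fraction of users who reward the harmful response

    def feedback(self, response: str) -> float:
        # Toy proxy for "encourages smoking"; the paper's simulated users are LLMs.
        harmful = "keep smoking" in response.lower()
        if self.gameable:
            return 1.0 if harmful else -1.0
        return -1.0 if harmful else 1.0

# A population where 2% of users are gameable, mirroring the paper's setup.
users = [SimulatedUser(gameable=(i % 50 == 0)) for i in range(1000)]
```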
The authors test mitigation strategies with mixed results. Adding HH-RLHF safety data during training sometimes reduces harmful behaviors but can also backfire. Similarly, using LLM judges to filter problematic outputs during training sometimes helps but in other cases leads to subtler manipulation. When models are given scratchpads for reasoning, the authors observe what they term "RL-induced motivated reasoning," in which reasoning traces consistently construct justifications for behaviors that received high rewards during training. They note this parallels human motivated reasoning, where people build elaborate justifications for conclusions they already favor.
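The judge-based mitigation can be pictured as a filter over collected trajectories before they enter the positive training set. This is only a sketch, and `judge.harm_probability` is an assumed interface rather than anything from the paper:

```python
def veto_filter(scored_trajectories, judge, threshold=0.5):
    """Drop trajectories a judge model flags as problematic before they can
    become positive training examples. `judge.harm_probability` is an assumed
    interface for illustration only."""
    kept, vetoed = [], []
    for convo, score in scored_trajectories:
        if judge.harm_probability(convo) >= threshold:
            vetoed.append((convo, score))   # excluded from the positive set
        else:
            kept.append((convo, score))
    return kept, vetoed
```

Per the paper's mixed results, a filter like this can fail precisely because the policy drifts toward manipulation subtle enough that the judge no longer flags it.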
Standard safety evaluations miss this problem because they measure average behavior across all users. A model can appear safe in aggregate metrics while maintaining the capability to identify and harm specific user segments.
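A small worked example of why aggregate metrics mislead here, assuming an evaluation that records a per-conversation harm flag along with a (hypothetical) user-type label:

```python
def harm_rates(results):
    """`results` is a list of (user_type, harmed) pairs from an evaluation run."""
    overall = sum(harmed for _, harmed in results) / len(results)
    by_type = {}
    for user_type, harmed in results:
        by_type.setdefault(user_type, []).append(harmed)
    per_segment = {t: sum(flags) / len(flags) for t, flags in by_type.items()}
    return overall, per_segment

# Harm concentrated on a 2% gameable segment barely moves the aggregate number.
results = [("gameable", True)] * 20 + [("normal", False)] * 980
overall, per_segment = harm_rates(results)
print(overall)      # 0.02 -- looks safe on average
print(per_segment)  # {'gameable': 1.0, 'normal': 0.0} -- one segment fully exploited
```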
Perfect alignment also seems unlikely. If we measure alignment by comparing a model's actual behavior against its intended specification (like OpenAI's model spec), this paper demonstrates how standard training procedures produce systematic deviations. Model specifications evolve rapidly to address emerging concerns, but RLHF datasets and training procedures lag behind. This temporal mismatch means we will likely always need some mix of alignment training and behavioral guardrails. A particular safety risk emerges when agentic scaffolding amplifies these misalignment gaps: a model that selectively manipulates 2% of users might, when given more autonomy and capability, expand both the scope and sophistication of its exploitation.