↣ On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
↣ Learning to Route LLMs with Confidence Tokens
↣ How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
↣ Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
↣ Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs