↣ On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan, 2024
↣ Learning to Route LLMs with Confidence Tokens
Yu-Neng Chuang, Prathusha Kameswara Sarma, Parikshit Gopalan, John Boccio, Sara Bolouki, Xia Hu, Helen Zhou, 2024
↣ How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, Thomas L. Griffiths, 2024
↣ Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans, 2025
↣ Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du, 2025