↣ Advice for a (young) investigator in the first and last days of the Anthropocene
Jascha Sohl-Dickstein's talk at MIT CBMM
September, 2025
↣ On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan, 2024
August, 2025
↣ Learning to Route LLMs with Confidence Tokens
Yu-Neng Chuang, Prathusha Kameswara Sarma, Parikshit Gopalan, John Boccio, Sara Bolouki, Xia Hu, Helen Zhou, 2024
August, 2025
↣ How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, Thomas L. Griffiths, 2024
July, 2025
↣ Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans, 2025
July, 2025
↣ Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du, 2025
July, 2025