Notes
↣
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
Paper: Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, ..., Florian Tramèr
October, 2025
↣
Advice for a (young) investigator in the first and last days of the Anthropocene
Talk: Jascha Sohl-Dickstein at MIT CBMM
September, 2025
↣
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Paper: Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan
August, 2025
↣
Learning to Route LLMs with Confidence Tokens
Paper: Yu-Neng Chuang, Prathusha Kameswara Sarma, Parikshit Gopalan, John Boccio, ..., Helen Zhou
August, 2025
↣
How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Paper: Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, Thomas L. Griffiths
July, 2025
↣
Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Paper: Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans
July, 2025
↣
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Paper: Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
July, 2025