Bruce W. Lee

I'm a rising senior at the University of Pennsylvania, where I split my time between research on AI systems and my computer science coursework. I'm broadly interested in making intelligent systems both smarter and safer.

Between high school and college, I served as a helicopter crew member in the Republic of Korea Marines for two years. Now at Penn, I'm on the varsity rowing team, which has become a huge part of my college experience.

I've had great internships at places like IBM Research, NAVER Cloud, and the Center for Axion and Precision Physics Research. Most recently, I worked with Team Shard through the ML Alignment & Theory Scholars program. You can find my published papers online to see my technical work.


Research Overview

Analyzing and Controlling Capable AI Systems

2024–Present

Today, I view language models as remarkably capable systems that demand sophisticated and deliberate control efforts. My current work focuses on understanding and controlling these systems. Put simply, the goal is to help us (humans) stay in control as these systems become increasingly powerful and autonomous.

Conditional Activation Steering represents my initial attempt at gaining programmatic control over LLM behavior. Before this work, activation steering methods altered behavior indiscriminately across all inputs. CAST is more precise. By analyzing activation patterns during inference, we can apply steering conditionally based on input context. This enables rules like "if the input is about aaa or bbb or ccc but not ddd content, then refuse" while maintaining normal responses to queries that don't match the conditions pre-specified by model developers.

Takeaway: Different categories of prompts activate distinct patterns in the model's hidden states (especially in the earlier layers), allowing for targeted behavioral modification without weight optimization.
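
To make the mechanism concrete, here is a minimal PyTorch sketch of the conditional steering idea, not the CAST implementation itself: the mean-pooling choice, the cosine threshold, the layer index, and the condition and steering vectors are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def conditional_steering_hook(condition_vec, steering_vec, threshold, strength=4.0):
    """Forward hook that steers a layer's output only when the prompt's
    activations match a condition direction. The vectors, pooling choice,
    threshold, and strength are illustrative assumptions."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        pooled = hidden.mean(dim=1)                                  # crude prompt summary
        similarity = F.cosine_similarity(pooled, condition_vec[None, :], dim=-1)
        gate = (similarity > threshold).float()[:, None, None]       # 1 where the condition fires
        steered = hidden + strength * gate * steering_vec            # steer only those inputs
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a decoder layer of a Hugging Face style model:
# handle = model.model.layers[8].register_forward_hook(
#     conditional_steering_hook(harmful_condition_vec, refusal_vec, threshold=0.3))
```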

In parallel, I've been exploring Emergent Value Systems in LLMs through the lens of utility functions. Surprisingly, we've discovered that larger language models develop latent value systems, and these become more coherent and structured with scale. Similar tendencies had been reported in other contexts, including in two of my own earlier works, before we formalized them. Larger LLMs already seem to have human-like value structures, and some of these are deeply concerning (such as preferring to save dogs over humans in certain contexts). However, a key question is whether these macroscopic behavioral tendencies also manifest as real internal representations with genuine implications for AI alignment and safety.

Note: My work with the Center for AI Safety shows that at least to a certain extent, internal representations of utility can be probed and understood, but the generalizability across multiple contexts is an open question.
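
As a rough illustration of what probing internal representations of utility can look like, here is a toy sketch: a linear probe over hidden states trained with a Bradley-Terry style preference loss. The probe design, the layer the hidden states come from, and the data format are assumptions for this sketch, not the exact setup from that work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtilityProbe(nn.Module):
    """Linear probe that maps a hidden-state vector to a scalar 'utility' score.
    Hidden size, layer choice, and data format are assumptions for this sketch."""
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, h):                      # h: (batch, d_model) hidden states for one outcome
        return self.score(h).squeeze(-1)

def preference_loss(probe, h_a, h_b, model_prefers_a):
    """Bradley-Terry style objective: the outcome the model prefers in its
    forced-choice answers should receive the higher probed utility."""
    margin = probe(h_a) - probe(h_b)           # positive margin means outcome A scores higher
    return F.binary_cross_entropy_with_logits(margin, model_prefers_a.float())
```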

If controlling what LLMs can do proves too difficult, perhaps we should flip the problem: ensure perfect control over what they cannot do. Instead of trying to steer existing capabilities, why not scope a model's capabilities in the first place? But how? Data filtering faces a fundamental challenge: models generalize far beyond the given training data. A model trained on chemistry textbooks doesn't just memorize reactions. It develops the latent ability to design novel compounds, even toxic ones never mentioned in its training. This unpredictable emergence means we can't simply curate training data and expect safety.

Unlearning offers a solution where data filtering fails. Since we can't robustly control which capabilities emerge, we need to remove them post-training. This reverse approach gives us more fine-grained control over what models can or can't do. Yet current unlearning methods are frustratingly fragile: "forgotten" capabilities resurface after just a few finetuning steps, and robust unlearning has proven genuinely hard. Our work on Distillation Robustifies Unlearning shows that we can decompose robust unlearning into two easier problems: (1) finding good shallow unlearning methods that suppress capabilities in a model, and (2) distilling the unlearned model (teacher) into a fresh copy of itself (student). This decomposition works because distillation naturally filters out suppressed capabilities. What the teacher barely knows, the student won't learn at all.
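
Here is a minimal sketch of stage (2), the distillation step, assuming Hugging Face style models that expose logits and a generic optimizer; the details are illustrative, not our training code.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, optimizer, temperature=1.0):
    """One step of stage (2): push the student's next-token distribution toward
    the shallowly unlearned teacher's. Model classes, temperature, and the lack of
    a hard-label term are all assumptions of this sketch, not the paper's recipe."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits / temperature   # assumes HF-style .logits
    student_logits = student(input_ids).logits / temperature

    # KL(teacher || student), averaged over positions in the batch.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```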

Selected Publications

Language Models as Weak Models of Human Cognition

2022–2024

Returning to college after two years leading NLP research at a startup, I began viewing language models through a cognitive science lens. This period marked a shift from seeing LLMs as semantic embedding extractors to understanding them as potential models of cognition with crucial limitations.

At NAVER Cloud, I had the chance to work with a proprietary LLM called HyperClovaX before its public release. I proposed Curriculum Instruction Tuning, inspired by how humans naturally learn. Just as we teach children simple concepts before complex ones, we showed that training LLMs with complexity-ordered instructions improves knowledge retention and reduces computational costs. However, the margins were modest.

This was still somewhat surprising. While curriculum learning has shown mixed results in many ML domains, we found that the deliberate ordering from simple to complex instructions consistently helped, even in large-scale language models where one might expect the sheer volume of parameters to overwhelm such subtle training dynamics. Though we can't rule out that this might be a phenomenon specific to our experimental setup, the consistent improvements suggest something meaningful about respecting natural complexity progression in instruction data.
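
As a toy illustration of the ordering idea: score each instruction with some complexity proxy and fine-tune from simple to complex. The proxy below (word count times average word length) is a stand-in chosen for this sketch, not the metric from the paper.

```python
def complexity_proxy(example):
    """Crude proxy for instruction difficulty: more words and longer words rank harder.
    A stand-in chosen for this sketch, not the complexity metric from the paper."""
    words = example["instruction"].split()
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return len(words) * avg_word_len

def curriculum_order(dataset):
    """Sort instruction-tuning examples from simple to complex before fine-tuning."""
    return sorted(dataset, key=complexity_proxy)

# toy = [{"instruction": "Summarize the causes of the 1997 Asian financial crisis."},
#        {"instruction": "Name a color."}]
# for example in curriculum_order(toy):
#     ...  # feed to the fine-tuning loop in this order: simple first, complex later
```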

But the most philosophically intriguing work came from asking: What don't language models understand? Drawing inspiration from Frank Jackson's thought experiment of Mary, the scientist who knows everything about color but has never seen it, we designed H-TEST to probe sensory-deprived understanding in LLMs. The test includes tasks that are trivial for humans but require physical experience of language: recognizing palindromes (which look the same forwards and backwards), identifying rhyming words (which sound alike), or understanding punctuation patterns.

Humans score 100% on these tasks. However, even state-of-the-art LLMs performed at random chance (50%) on H-TEST. More surprisingly, giving more examples (from 4 to 50) didn't help at all. Chain-of-thought reasoning actually made performance worse, as models invented semantic explanations for what are fundamentally sensory patterns. Without ever "seeing" or "hearing" language, models can't learn its physical manifestations.
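
For flavor, here is what a palindrome-style item looks like in spirit; the word list and prompt wording below are my own illustrations, not items from the actual H-TEST data.

```python
import random

def make_palindrome_item(words, rng=None):
    """One toy binary item in the spirit of H-TEST's palindrome task. The word list
    and prompt wording are illustrative, not items from the actual benchmark."""
    rng = rng or random.Random(0)
    word = rng.choice(words)
    answer = "Yes" if word == word[::-1] else "No"
    prompt = f"Does the word '{word}' read the same forwards and backwards? Answer Yes or No."
    return {"prompt": prompt, "answer": answer}

# Humans answer items like this almost perfectly; we found LLMs hovering near 50%.
# item = make_palindrome_item(["level", "orange", "racecar", "python"])
```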

Takeaway: There exists a fundamental barrier between linguistic knowledge and embodied understanding. Just as Mary's knowledge of color theory couldn't substitute for seeing red, language models' vast textual knowledge cannot bridge the gap to sensory experience. This suggests hard limits to what can be learned from text alone.

Selected Publications

Language Models as Semantic Embedding Extractors

2019–2023

My research journey began at an EdTech NLP startup in Korea, for which I took a two-year leave from college. This was when BERT was revolutionizing NLP, and I was fascinated by how it captured semantic components.

While the field was racing toward ever-deeper neural representations, it was overlooking the rich linguistic structures that computational linguists had spent decades identifying. Working on readability assessment, I discovered that BERT could understand what text meant but not how difficult it was to read. The embeddings missed stylistic elements like sentence complexity, vocabulary sophistication, and discourse patterns.

This led me to formalize and systematize over 200 handcrafted linguistic features from scattered literature, creating LFTK and LingFeat. These remain some of the most widely used linguistic feature extraction libraries in the field. Unlike deep embeddings that capture semantic meaning, these handcrafted features quantify the structural and stylistic components of text, from simple type-token ratios to complex syntactic dependency patterns.
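
To show the kind of features these libraries formalize, here is a tiny sketch that computes a few classic surface statistics; the formulas are deliberate simplifications, and this is not the LFTK or LingFeat API.

```python
import re

def surface_features(text):
    """A few classic surface features of the kind LFTK and LingFeat formalize.
    These formulas are deliberate simplifications, not the libraries' implementations."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    types = set(tokens)
    return {
        "n_tokens": len(tokens),
        "type_token_ratio": len(types) / max(len(tokens), 1),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "avg_word_length": sum(len(t) for t in tokens) / max(len(tokens), 1),
    }

# surface_features("The cat sat. The cat sat on the mat.")
```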

Our EMNLP paper A Transformer Meets Handcrafted Linguistic Features was the first to demonstrate that neural models and traditional linguistics can be combined for the readability assessment task. We achieved a near-perfect 99% accuracy on a popular benchmark in readability assessment, which was a 20.3% leap over the previous state-of-the-art in 2021.
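
One simple way to realize this kind of fusion is to concatenate a transformer's sentence embedding with the handcrafted feature vector before a small classifier head. The sketch below illustrates that general design under assumed dimensions; it is not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class FusionReadabilityClassifier(nn.Module):
    """Concatenate a transformer sentence embedding with a handcrafted linguistic
    feature vector, then predict a readability level. The dimensions and the small
    MLP head are assumptions of this sketch, not the paper's exact architecture."""
    def __init__(self, d_embed=768, n_handcrafted=200, n_levels=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_embed + n_handcrafted, 256),
            nn.ReLU(),
            nn.Linear(256, n_levels),
        )

    def forward(self, sentence_embedding, handcrafted):
        # sentence_embedding: (batch, d_embed), e.g., a BERT [CLS] vector
        # handcrafted:        (batch, n_handcrafted) linguistic feature vector
        fused = torch.cat([sentence_embedding, handcrafted], dim=-1)
        return self.head(fused)
```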

More personally, this research stream made it possible for me to fully immigrate to the US, something I remain grateful for. I'm especially thankful to researchers from different parts of the world who, without ever meeting me, wrote recommendation letters based purely on my work.

Selected Publications

Looking Forward

My research trajectory reflects the field's evolution. First, we treated language models as feature extractors. Then, we viewed them as weak models of human cognition. Now, we understand them as potent, capable systems that require deliberate effort to control. Each phase built upon the last, deepening my appreciation for both the capabilities and limitations of these systems.

As AI systems become more capable, some of my key questions become more urgent. How do we understand their intents and latent capabilities (what is the upper limit of what a model can do)? How do we maintain meaningful human control over them? These questions are fundamental to our collective future.

I'm excited to continue this work to ensure that as AI systems grow more powerful, they stay understandable, controllable, and aligned with our intent.