Bruce W. Lee

I'm a rising senior at the University of Pennsylvania, where I split my time between research on AI systems and my computer science coursework. I'm broadly interested in making intelligent systems both smarter and safer.

Between high school and college, I served as a helicopter crew member in the Republic of Korea Marines for two years. Now at Penn, I'm on the varsity rowing team, which has become a huge part of my college experience.

I've had great internships at places like IBM Research, NAVER Cloud, and the Center for Axion and Precision Physics Research. Most recently, I worked with Team Shard through the ML Alignment & Theory Scholars program. You can find my published papers online to see my technical work.


Research Overview

Analyzing and Controlling Capable AI Systems

2024–Present

Today, I view language models as remarkably capable systems that demand sophisticated and deliberate control efforts. My current work focuses on understanding and controlling these systems. Put simply, the goal is to help us (humans) stay in control as these systems become increasingly powerful and autonomous.

Conditional Activation Steering represents my initial attempt at gaining programmatic control over LLM behavior. Before this work, activation steering methods altered behavior indiscriminately across all inputs. CAST is more precise. By analyzing activation patterns during inference, we can apply steering conditionally based on input context. This enables rules like "if the input is about aaa or bbb or ccc but not ddd content, then refuse" while maintaining normal responses to queries that don't match the conditions pre-specified by model developers.

Takeaway: Different categories of prompts activate distinct patterns in the model's hidden states (especially in the earlier layers), allowing for targeted behavioral modification without weight optimization.
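
To make the mechanism concrete, here is a minimal PyTorch sketch of the conditional steering idea, not the CAST implementation itself: the mean-pooling choice, the cosine threshold, the layer index, and the condition and steering vectors are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def conditional_steering_hook(condition_vec, steering_vec, threshold, strength=4.0):
    """Forward hook that steers a layer's output only when the prompt's
    activations match a condition direction. The vectors, pooling choice,
    threshold, and strength are illustrative assumptions."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        pooled = hidden.mean(dim=1)                                  # crude prompt summary
        similarity = F.cosine_similarity(pooled, condition_vec[None, :], dim=-1)
        gate = (similarity > threshold).float()[:, None, None]       # 1 where the condition fires
        steered = hidden + strength * gate * steering_vec            # steer only those inputs
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a decoder layer of a Hugging Face style model:
# handle = model.model.layers[8].register_forward_hook(
#     conditional_steering_hook(harmful_condition_vec, refusal_vec, threshold=0.3))
```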

In parallel, I've been exploring Emergent Value Systems in LLMs through the lens of utility functions. Surprisingly, we've discovered that larger language models develop latent value systems, and these become more coherent and structured with scale. Similar tendencies had been reported in other contexts, including in two of my own earlier works, before we formalized them. Larger LLMs already seem to have human-like value structures, and some of these are deeply concerning (such as preferring to save dogs over humans in certain contexts). However, a key question is whether these macroscopic behavioral tendencies also manifest as real internal representations with genuine implications for AI alignment and safety.

Note: My work with the Center for AI Safety shows that at least to a certain extent, internal representations of utility can be probed and understood, but the generalizability across multiple contexts is an open question.
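
As a rough illustration of what probing internal representations of utility can look like, here is a toy sketch: a linear probe over hidden states trained with a Bradley-Terry style preference loss. The probe design, the layer the hidden states come from, and the data format are assumptions for this sketch, not the exact setup from that work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtilityProbe(nn.Module):
    """Linear probe that maps a hidden-state vector to a scalar 'utility' score.
    Hidden size, layer choice, and data format are assumptions for this sketch."""
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, h):                      # h: (batch, d_model) hidden states for one outcome
        return self.score(h).squeeze(-1)

def preference_loss(probe, h_a, h_b, model_prefers_a):
    """Bradley-Terry style objective: the outcome the model prefers in its
    forced-choice answers should receive the higher probed utility."""
    margin = probe(h_a) - probe(h_b)           # positive margin means outcome A scores higher
    return F.binary_cross_entropy_with_logits(margin, model_prefers_a.float())
```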

If controlling what LLMs can do proves too difficult, perhaps we should flip the problem: ensure perfect control over what they cannot do. Instead of trying to steer existing capabilities, why not scope a model's capabilities in the first place? But how? Data filtering faces a fundamental challenge: models generalize far beyond the given training data. A model trained on chemistry textbooks doesn't just memorize reactions. It develops the latent ability to design novel compounds, even toxic ones never mentioned in its training. This unpredictable emergence means we can't simply curate training data and expect safety.

Unlearning offers a solution where data filtering fails. Since we can't robustly control which capabilities emerge, we need to remove them post-training. This reverse approach gives us more fine-grained control over what models can or can't do. Yet current unlearning methods are frustratingly fragile: "forgotten" capabilities resurface after just a few finetuning steps, and robust unlearning has proven genuinely hard. Our work on Distillation Robustifies Unlearning shows that we can decompose robust unlearning into two easier problems: (1) finding good shallow unlearning methods that suppress capabilities in a model, and (2) distilling the unlearned model (teacher) into a fresh copy of itself (student). This decomposition works because distillation naturally filters out suppressed capabilities. What the teacher barely knows, the student won't learn at all.
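
Here is a minimal sketch of stage (2), the distillation step, assuming Hugging Face style models that expose logits and a generic optimizer; the details are illustrative, not our training code.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, optimizer, temperature=1.0):
    """One step of stage (2): push the student's next-token distribution toward
    the shallowly unlearned teacher's. Model classes, temperature, and the lack of
    a hard-label term are all assumptions of this sketch, not the paper's recipe."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits / temperature   # assumes HF-style .logits
    student_logits = student(input_ids).logits / temperature

    # KL(teacher || student), averaged over positions in the batch.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```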

Selected Publications

Language Models as Weak Models of Human Cognition

2022–2024

Returning to college after two years leading NLP research at a startup, I began viewing language models through a cognitive science lens. This period marked a shift from seeing LLMs as semantic embedding extractors to understanding them as potential models of cognition with crucial limitations.

At NAVER Cloud, I had the chance to work with a proprietary LLM called HyperClovaX before its public release. I proposed Curriculum Instruction Tuning, inspired by how humans naturally learn. Just as we teach children simple concepts before complex ones, we showed that training LLMs with complexity-ordered instructions improves knowledge retention and reduces computational costs. However, the margins were modest.

This was still somewhat surprising. While curriculum learning has shown mixed results in many ML domains, we found that the deliberate ordering from simple to complex instructions consistently helped, even in large-scale language models where one might expect the sheer volume of parameters to overwhelm such subtle training dynamics. Though we can't rule out that this might be a phenomenon specific to our experimental setup, the consistent improvements suggest something meaningful about respecting natural complexity progression in instruction data.
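
As a toy illustration of the ordering idea: score each instruction with some complexity proxy and fine-tune from simple to complex. The proxy below (word count times average word length) is a stand-in chosen for this sketch, not the metric from the paper.

```python
def complexity_proxy(example):
    """Crude proxy for instruction difficulty: more words and longer words rank harder.
    A stand-in chosen for this sketch, not the complexity metric from the paper."""
    words = example["instruction"].split()
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return len(words) * avg_word_len

def curriculum_order(dataset):
    """Sort instruction-tuning examples from simple to complex before fine-tuning."""
    return sorted(dataset, key=complexity_proxy)

# toy = [{"instruction": "Summarize the causes of the 1997 Asian financial crisis."},
#        {"instruction": "Name a color."}]
# for example in curriculum_order(toy):
#     ...  # feed to the fine-tuning loop in this order: simple first, complex later
```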

But the most philosophically intriguing work came from asking: What don't language models understand? Drawing inspiration from Frank Jackson's thought experiment of Mary, the scientist who knows everything about color but has never seen it, we designed H-TEST to probe sensory-deprived understanding in LLMs. The test includes tasks that are trivial for humans but require physical experience of language: recognizing palindromes (which look the same forwards and backwards), identifying rhyming words (which sound alike), or understanding punctuation patterns.

Humans score 100% on these tasks. However, even state-of-the-art LLMs performed at random chance (50%) on H-TEST. More surprisingly, giving more examples (from 4 to 50) didn't help at all. Chain-of-thought reasoning actually made performance worse, as models invented semantic explanations for what are fundamentally sensory patterns. Without ever "seeing" or "hearing" language, models can't learn its physical manifestations.
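
For flavor, here is what a palindrome-style item looks like in spirit; the word list and prompt wording below are my own illustrations, not items from the actual H-TEST data.

```python
import random

def make_palindrome_item(words, rng=None):
    """One toy binary item in the spirit of H-TEST's palindrome task. The word list
    and prompt wording are illustrative, not items from the actual benchmark."""
    rng = rng or random.Random(0)
    word = rng.choice(words)
    answer = "Yes" if word == word[::-1] else "No"
    prompt = f"Does the word '{word}' read the same forwards and backwards? Answer Yes or No."
    return {"prompt": prompt, "answer": answer}

# Humans answer items like this almost perfectly; we found LLMs hovering near 50%.
# item = make_palindrome_item(["level", "orange", "racecar", "python"])
```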

Takeaway: There exists a fundamental barrier between linguistic knowledge and embodied understanding. Just as Mary's knowledge of color theory couldn't substitute for seeing red, language models' vast textual knowledge cannot bridge the gap to sensory experience. This suggests hard limits to what can be learned from text alone.

Selected Publications

Language Models as Semantic Embedding Extractors

2019–2023

My research journey began at an EdTech NLP startup in Korea, for which I took a two-year leave from college. This was when BERT was revolutionizing NLP, and I was fascinated by how it captured semantic components.

While the field was racing toward ever-deeper neural representations, it was overlooking the rich linguistic structures that computational linguists had spent decades identifying. Working on readability assessment, I discovered that BERT could understand what text meant but not how difficult it was to read. The embeddings missed stylistic elements like sentence complexity, vocabulary sophistication, and discourse patterns.

This led me to formalize and systematize over 200 handcrafted linguistic features from scattered literature, creating LFTK and LingFeat. These remain some of the most widely used linguistic feature extraction libraries in the field. Unlike deep embeddings that capture semantic meaning, these handcrafted features quantify the structural and stylistic components of text, from simple type-token ratios to complex syntactic dependency patterns.
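
To show the kind of features these libraries formalize, here is a tiny sketch that computes a few classic surface statistics; the formulas are deliberate simplifications, and this is not the LFTK or LingFeat API.

```python
import re

def surface_features(text):
    """A few classic surface features of the kind LFTK and LingFeat formalize.
    These formulas are deliberate simplifications, not the libraries' implementations."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    types = set(tokens)
    return {
        "n_tokens": len(tokens),
        "type_token_ratio": len(types) / max(len(tokens), 1),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "avg_word_length": sum(len(t) for t in tokens) / max(len(tokens), 1),
    }

# surface_features("The cat sat. The cat sat on the mat.")
```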

Our EMNLP paper A Transformer Meets Handcrafted Linguistic Features was the first to demonstrate that neural models and traditional linguistics can be combined for the readability assessment task. We achieved a near-perfect 99% accuracy on a popular benchmark in readability assessment, which was a 20.3% leap over the previous state-of-the-art in 2021.
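
One simple way to realize this kind of fusion is to concatenate a transformer's sentence embedding with the handcrafted feature vector before a small classifier head. The sketch below illustrates that general design under assumed dimensions; it is not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class FusionReadabilityClassifier(nn.Module):
    """Concatenate a transformer sentence embedding with a handcrafted linguistic
    feature vector, then predict a readability level. The dimensions and the small
    MLP head are assumptions of this sketch, not the paper's exact architecture."""
    def __init__(self, d_embed=768, n_handcrafted=200, n_levels=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_embed + n_handcrafted, 256),
            nn.ReLU(),
            nn.Linear(256, n_levels),
        )

    def forward(self, sentence_embedding, handcrafted):
        # sentence_embedding: (batch, d_embed), e.g., a BERT [CLS] vector
        # handcrafted:        (batch, n_handcrafted) linguistic feature vector
        fused = torch.cat([sentence_embedding, handcrafted], dim=-1)
        return self.head(fused)
```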

More personally, this research stream made it possible for me to fully immigrate to the US, something I remain grateful for. I'm especially thankful to researchers from different parts of the world who, without ever meeting me, wrote recommendation letters based purely on my work.

Selected Publications

Looking Forward

My research trajectory reflects the field's evolution. First, we treated language models as feature extractors. Then, we viewed them as weak models of human cognition. Now, we understand them as potent, capable systems that require deliberate effort to control. Each phase built upon the last, deepening my appreciation for both the capabilities and limitations of these systems.

As AI systems become more capable, some of my key questions become more urgent. How do we understand their intents and latent capabilities (what is the upper limit of what a model can do)? How do we maintain meaningful human control over them? These questions are fundamental to our collective future.

I'm excited to continue this work to ensure that as AI systems grow more powerful, they stay understandable, controllable, and aligned with our intent.