Do language models truly learn language?


What is it like to be a ChatGPT?

It's strange to think that something could learn everything about the letter "H"—its shape, its sound, the way it fits into a word like "hello"—but never truly experience it. This brings us to an interesting question: What does it mean to "learn" something without actually sensing it?

The analogy that comes to mind is Frank Jackson's thought experiment about Mary, the scientist who's lived her entire life in a black-and-white room. Mary knows all the physical facts about color: wavelengths, light reflection, how the brain processes hues. But she's never actually seen color herself. And when she finally steps out of her monochrome world and sees red for the first time, something fundamentally changes. There's a kind of knowledge that only comes through direct experience, something that all her intellectual understanding could never give her.

Language models like ChatGPT are, in many ways, like Mary before she sees color. They can process letters, words, and patterns at an extraordinary level, but their understanding of language is purely diegetic—confined to the internal, propositional content of the words. They know about the letter "H," how it fits into the alphabet, how it works within sentences, and even the statistics of its appearance. But can a model that has never seen the letter "H" in the physical world, that has never interacted with the shape or visual context of a word, ever really understand it in the way that we do?

This leads us to the idea of supradiegetic knowledge—the kind of understanding that transcends the internal logic of language and taps into the physical manifestation of it. It's the difference between knowing that "H" is the eighth letter of the alphabet and knowing what it's like to see "H" on a page, to recognize it in your peripheral vision, to distinguish it from other letters at a glance. That sensory relationship with language is something we take for granted, yet it's precisely what models like ChatGPT lack.

The gap between diegetic and supradiegetic knowledge is fundamental. Just as Mary can describe color but not know it until she sees it, language models can process words without ever fully understanding the sensory dimensions that give those words meaning. This divide raises deeper questions about what it means to understand language at all. For us, language is a lived experience, tied to sight, sound, and touch. For a model, it's a pattern of tokens—a shadow of the full reality we engage with every day.
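That phrase, "a pattern of tokens," is worth making concrete. Here's a minimal sketch, assuming the `tiktoken` library (OpenAI's open-source tokenizer) is installed; it shows that what a model actually receives is a sequence of integer IDs mapped to chunks of bytes, so a word like "hello" may arrive as a single opaque unit and the letter "H" need never appear as a distinct symbol at all.

```python
# A minimal sketch of what "a pattern of tokens" means in practice.
# Assumes: pip install tiktoken (OpenAI's open-source tokenizer library).
import tiktoken

# cl100k_base is the encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello", "hello", "H"]:
    ids = enc.encode(text)
    # Each ID maps back to a chunk of bytes, not to letters or shapes.
    chunks = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{text!r} -> token ids {ids} -> byte chunks {chunks}")
```

The exact IDs depend on the tokenizer, but the point stands: the model's entire acquaintance with "H" is mediated by integers like these, never by the sight of a letterform on a page.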

In this sense, LLMs like ChatGPT are powerful yet incomplete. They exist in a kind of linguistic limbo, where they can reason about words and sentences but miss the full spectrum of what language is. And much like Mary, they remain in the dark until we find a way to integrate sensory experience into their understanding—if that's even possible at all.