Notes on The Bitter Lesson by Dr. Sutton

Dr. Sutton’s essay “The Bitter Lesson” has been richly debated and has inspired many. In hindsight, it is scarily prophetic. Less than 5 years after its publication, large language models (statistical models trained on large amounts of data) broke into and quickly dominated public discourse.

Language itself is a generalization of its underlying representation. To put thoughts into words, as I do now, is to ground ideas into the frame of vocabulary. Such a process aids the recording and sharing of the intangible thought, but inevitably loses the reflections, connections, and sudden sparks in one’s mind at the moment the thought was formed.

So it is with the description of things. One may describe a piece of rock as gray and round. Those who are familiar with the concepts of grayness and roundness, and have seen a rock, can likely reconstruct an image of such an object upon reading its description. But to strangers to such concepts, this series of words conveys as much as a foreign language. In that case, it might be more effective to convey the color not with the word “gray”, but with a representation of its encoding. Perhaps the HEX code or anything like it comes to mind. A parallel problem also comes to mind: even amongst readers familiar with the word and concept “gray”, people often disagree on the specific shade of the color. In HEX code, “#A9A9A9” can be seen as gray, but so can “#D3D3D3”. So the inevitable implication is that the simple description of a gray and round piece of rock may evoke a multitude of images in different readers’ minds, none of which would be categorically incorrect, though it is unlikely that any would be identical to that of the person describing it.
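The ambiguity of “gray” can be made concrete with a small sketch. The helper names below (hex_to_rgb, is_neutral_gray, lightness_percent) are hypothetical and introduced only for illustration, and "lightness" here is simply the average of the three channels as a percentage, which is an assumption rather than a standard color metric.

```python
from typing import Tuple


def hex_to_rgb(code: str) -> Tuple[int, int, int]:
    """Convert a '#RRGGBB' string into (red, green, blue), each 0-255."""
    digits = code.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {code!r}")
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return red, green, blue


def is_neutral_gray(code: str) -> bool:
    """A color counts as a neutral gray here when all three channels are equal."""
    red, green, blue = hex_to_rgb(code)
    return red == green == blue


def lightness_percent(code: str) -> int:
    """Average channel value as a percentage of full white (illustrative only)."""
    return round(100 * sum(hex_to_rgb(code)) / (3 * 255))


if __name__ == "__main__":
    for shade in ("#A9A9A9", "#D3D3D3"):
        print(shade, hex_to_rgb(shade), is_neutral_gray(shade), lightness_percent(shade))
```

Both codes pass the same "gray" test, yet they sit at noticeably different points between black and white, which is exactly the room the single word leaves open.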

How would one best convey the idea of a piece of gray and round rock? One could include, awkwardly, the HEX code or equivalent in the verbal representation of this object, and its dimensions. If sufficient details of the appearance are provided, then the language becomes functionally equivalent to an image. Anyone with similar visual anatomy and cultural understandings that allow them to perceive an image of such an object will likely arrive at the same understanding that the image represents a piece of #D3D3D3 gray rock with such and such dimensions. When presented as an image, any doubt about its specific appearance is largely resolved.

Language, one may argue, is one of the highest-level abstractions available to humans. An abstraction necessarily leaves out fine details in favor of transferability and simplicity. It serves well when the purpose is to remind another individual of the presence of such a piece of rock, perhaps in their way and presenting peril. Details such as color and shape are largely irrelevant in that case. But knowing the language is not equivalent to knowing the thing. Knowing that the language describes a piece of rock lying on the ground conveys nothing about the physical properties that keep the rock on the ground: gravity, mass, friction, etc.

While a complex structure of language, like a novel or, to a lesser degree, this piece, conveys ideas quickly, it loses important details about its subjects. In the unlikely event that this piece is adapted into a manga, we regain some of the lost details in the form of visualization: colors, shapes, perspectives, etc. In a film, we regain sound, spatial movements, and temporal relationships. In a video game, we regain physics and interactivity. Each layer of representation reduces abstraction and regains some level of detail previously inaccessible.

This is an elaborate way of saying language models and world models are fundamentally very different things. In the current year of 2026, our cutting-edge LLMs have arguably superhuman ability to process linguistic materials. In drafting my work Starfall, the LLM has very little issue aligning historical events with fictional parallels, both represented extensively in language.

Where it falls short is in higher-level understanding. With shapes and colors, language models generally have fallbacks in HEX codes and mathematics, which generations of computer programs have used reliably to represent visuals. But moving towards even higher orders, in the realm of spatial, temporal, and physics-based relationships, the model struggles. Where I specified that Novus Empire is located to the north of the Norgate Plateau, and that The Principality of Danir is to the south of the Plateau, it should follow that Danir is to the south of Novus. But that relationship is not necessarily obvious to a model solely trained on language.
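For contrast, here is how the same relationship looks when the geography is stored as explicit structure rather than prose. This is a minimal sketch, assuming that "north of" can be treated as a simple ordering along a single north-south axis; the function names (build_north_of, is_north_of, is_south_of) are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Each pair reads "first place is north of second place".
FACTS = [
    ("Novus Empire", "Norgate Plateau"),
    ("Norgate Plateau", "Principality of Danir"),
]


def build_north_of(facts):
    """Index the stated facts as a graph: place -> places directly south of it."""
    graph = defaultdict(set)
    for northern, southern in facts:
        graph[northern].add(southern)
    return graph


def is_north_of(graph, a, b):
    """True if a is north of b, either as stated or through a chain of facts."""
    stack, seen = [a], set()
    while stack:
        place = stack.pop()
        for southern in graph.get(place, ()):
            if southern == b:
                return True
            if southern not in seen:
                seen.add(southern)
                stack.append(southern)
    return False


def is_south_of(graph, a, b):
    """a is south of b exactly when b is north of a."""
    return is_north_of(graph, b, a)


if __name__ == "__main__":
    world = build_north_of(FACTS)
    print(is_south_of(world, "Principality of Danir", "Novus Empire"))
```

The deduction is trivial here only because the spatial relation is part of the data itself, not something to be recovered from sentences.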

Of course, it is equally obvious to me that the above example can be easily addressed by the inference logic that powerful models already possess, it being a very elementary deduction problem. But this serves only to illustrate the problem. Language models, trained on vast collections of texts, inherently lack understanding of higher order features, not necessarily due to a lack of computational power, but because the data simply were not present in the first place.

The brute-force style of statistical training, as discussed in The Bitter Lesson, necessarily assumes that the training data contains the features required for the resulting model. Such a premise is easily understood in closed-world scenarios where success states are enumerable, like the games of chess and go, and in speech and image recognition.

However, things get tricky when extending the same assumption to open-world scenarios. When discussing LLM-powered AI applications with a general audience, a recurring complaint is that “AI doesn’t give me the right answer”. Hallucination undoubtedly plays a role in producing faulty inference, as does the lack of appropriate function-calling skills, but there is a more fundamental issue. The language model does not understand the world through its physical properties and their complex relationships; it understands the world solely through the reasoning available to language.

This example serves to stress the fundamental challenge of brute-forcing machine intelligence. Irrespective of the amount of computational power, models will not pick up features that are simply not in the training data. A model trained on text will be able to infer certain higher order features, if only because the text training has instilled a level of reasoning in it, but it will not be able to reproduce the complexity required in a fictional world like Grand Theft Auto, let alone the real world.

That said, a model trained on countless hours of gameplay or development may in fact be able to reproduce GTA. This is indeed an area actively researched by organizations like DeepMind and OpenAI. Sora-style video models have begun to show a semblance of understanding of the physical properties shown in a video clip.

While such non-language models continue to employ the same fundamental training methods, they differ critically in the type of input data, and therefore in the assumptions about the features available in the data. While brute force alone may not be able to propel a model trained on texts to produce video games, brute-force training of a model on video game data might.

All of this is a long-winded way of saying there is merit on both sides of the argument. The exponential growth of compute described by Moore’s Law does not by itself translate into more intelligent or useful models. For a model to fit a particular set of features, these features have to be present in the data to begin with. Just as training AlphaGo on chess data would not greatly contribute to its ability in the game of go, training language models on large amounts of text produces superior linguistic abilities, but not a higher order understanding of the world that the language represents. To achieve that, data with such features, and by extension, the exercise of building world models, would be necessary.