Rich Sutton wrote nine paragraphs in March 2019 and published them on his personal website with no fanfare. The page has no images, no sidebar, no analytics tracker. It looks like something a professor threw together during office hours. The essay is called "The Bitter Lesson" and it has quietly become one of the most cited pieces of informal writing in the history of artificial intelligence.

The argument is blunt. Over seventy years of AI research, Sutton claims, the same pattern repeats. Researchers build clever, human-knowledge-intensive systems. They work. Then someone shows up with a dumber method that uses more computation, and the dumber method wins. Every time.

Chess was the first obvious case. For decades the best chess programs encoded grandmaster knowledge — opening libraries, endgame tables, positional heuristics hand-tuned by experts. Deep Blue beat Kasparov in 1997 using massive, deep search backed by custom hardware. The evaluation function was intricate, but the winning strategy was scaling search depth rather than encoding grandmaster intuition. The knowledge-first researchers didn't like it. They called it brute force, as if that were a criticism rather than a description of what actually worked.

Speech recognition followed the same arc. Statistical methods, hidden Markov models trained on large speech corpora, crushed systems built on hand-engineered phonological rules. Computer vision did it again. Go did it again. In every domain the pattern held: general methods that scale with computation beat specialised methods that encode human understanding. The lesson, Sutton wrote, is bitter because researchers want to believe their insights matter more than they do.

Seven years later the evidence is difficult to argue with. GPT-3 scaled a transformer architecture to 175 billion parameters and the results were strange and obvious simultaneously — it could write essays, answer questions, translate languages, and do arithmetic, despite having no explicit modules for any of these tasks. The scaling laws paper from Kaplan et al. in 2020 formalised what Sutton had argued informally: performance improves predictably with compute, data, and parameters. The Chinchilla paper in 2022 refined the compute-optimal ratio of parameters to training tokens. The entire field reorganised itself around a single idea that a reinforcement learning researcher had stated in nine paragraphs on a page with no CSS.
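The Kaplan result can be stated compactly. As a rough sketch (the exponents below are approximate fits reported in the paper, not exact constants), test loss falls as a power law in each resource when the others are not the bottleneck:

```latex
% Approximate power-law fits from Kaplan et al. (2020);
% N_c, D_c, C_c are fitted scale constants.
\begin{align}
L(N) &\approx \left(\tfrac{N_c}{N}\right)^{\alpha_N},
  & \alpha_N &\approx 0.076 && \text{(model parameters } N\text{)} \\
L(D) &\approx \left(\tfrac{D_c}{D}\right)^{\alpha_D},
  & \alpha_D &\approx 0.095 && \text{(training tokens } D\text{)} \\
L(C) &\approx \left(\tfrac{C_c}{C}\right)^{\alpha_C},
  & \alpha_C &\approx 0.050 && \text{(training compute } C\text{)}
\end{align}
```

Chinchilla's refinement was the allocation: for a fixed compute budget, the fitted optimum trains a smaller model on more data, roughly twenty tokens per parameter, rather than the parameter-heavy split the earlier fits implied.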

Not everyone agrees. Rodney Brooks wrote a response called "A Better Lesson," arguing that human ingenuity wasn't eliminated, only relocated: someone still had to design the convolutional neural network, curate ImageNet, choose the transformer architecture. Beren Millidge made a similar argument in 2020: it's the marriage of computation and structure, not computation alone, that drives progress. Even the most "general" methods are shot through with human design decisions.

There's a more interesting criticism. Kushal Chakrabarti argued that the field has conflated Sutton's longitudinal observation — a pattern across decades — with a cross-sectional tactic. "More compute wins" across seventy years of hardware improvement does not mean "more compute wins" as a strategy for your next research project. The binding constraint, Chakrabarti claims, is data, not compute. DeepSeek's R1, which reached frontier-competitive performance at a fraction of the typical training cost, suggests he might be right about the short-term picture, even if Sutton is right about the long arc.

Gary Marcus claims Sutton himself has backed away from the strongest version of the argument, citing a podcast where Sutton said LLM scaling alone is insufficient and world models are needed. I'm not sure that changes much. The Bitter Lesson was never really about LLMs specifically. It was about a tendency in researchers — the tendency to believe that the hard-won knowledge in your head is more important than the next order of magnitude in compute. Sutton didn't predict transformers. He predicted that whatever came next would be general, scalable, and disappointing to specialists.

The essay is still there on his website. Same page, same plain HTML. Seven years of the most dramatic acceleration in the history of computing, and it reads the same way it did in 2019. It didn't need updating.