Scaling and the Bitter Lesson

The most important pattern in AI history is also the most unwelcome: human cleverness loses to brute computation, every time, given enough time. Rich Sutton named this "The Bitter Lesson" in 2019, and the subsequent five years have confirmed it so thoroughly that arguing against scaling now requires actively ignoring the evidence. But the lesson has nuances that the triumphalist version elides — about data ceilings, about what scaling can and can't do, and about whether the trend line points at something wonderful or terrifying.

Sutton's Argument

The Bitter Lesson is a short essay with an enormous claim: 70 years of AI research proves that general methods leveraging computation always beat methods leveraging human knowledge, and by a large margin.¹ Chess, Go, speech recognition, computer vision — in every case, researchers built systems encoding human understanding, and in every case those systems were eventually crushed by simpler approaches that scaled computation through search and learning.

The psychology of the pattern matters as much as the pattern itself. Researchers want methods based on human insight to win. It's personally satisfying to build a system that understands the domain the way you do. When brute-force search won at chess, the human-knowledge researchers "were not good losers." They said it wasn't general, it wasn't how people played chess, it wasn't real intelligence. The same objections have been raised about every subsequent triumph of scale, most recently about LLMs.¹

Sutton draws two lessons. First, general-purpose methods that continue to scale with increasing computation are the ones that matter — specifically search and learning, which can absorb arbitrary amounts of compute productively. Second, the contents of minds are "tremendously, irredeemably complex" — we should stop trying to find simple ways to encode domain knowledge, and instead build meta-methods that can discover and capture complexity on their own. "We want AI agents that can discover like we can, not which contain what we have discovered."¹

This is a philosophical claim about the nature of intelligence, not just an engineering recommendation. It says that the right architecture is one that can learn anything, given enough data and compute, rather than one pre-structured to know something specific. The transformer, which contains almost no domain-specific inductive bias, is perhaps the purest embodiment of this philosophy yet built.

Chinchilla and the Data Wall

If Sutton's Bitter Lesson says "scale wins," nostalgebraist's analysis of the Chinchilla scaling laws says "scale wins, but we've been scaling the wrong thing."²

The Chinchilla paper's scaling law is beautifully simple: loss is a sum of three terms — one that depends on model size, one that depends on data size, and an irreducible constant. When you plug in real models, the model-size term for something like Gopher (280B parameters, 300B tokens) is tiny — 0.052. It might as well have infinite parameters. But the data term is huge — 0.251. The model is drowning in capacity and starving for data.²

The implications were devastating. GPT-3, LaMDA, Gopher, Jurassic, MT-NLG — the entire line of models trained on roughly 300B tokens could never have beaten Chinchilla (70B params, 1.4T tokens), no matter how big the models got. Chinchilla beat them while using the same compute budget, just allocated differently. "People put immense effort into training models that big, and were working on even bigger ones, and yet none of this, in principle, could ever get as far as Chinchilla did."²

The deeper problem: we might be running out of data. PaLM, trained at Google with presumably access to more web data than anyone, hit data repetition at 780B tokens in some subcorpora. nostalgebraist's review of the literature found a field that was meticulous about model size (doing elaborate scaling analyses, discussing hardware requirements for 1T parameters) and shockingly casual about data size — papers that couldn't even say where they scraped their webpages or how much more they could scrape. "If people were as casual about scaling N as this quotation is about scaling D, the methods sections of large LM papers would all be a few sentences long."²

The domain-specific situation was even worse. Code data — all the high-quality code on GitHub — might be exhausted at a few hundred billion tokens, a fraction of what optimal scaling would demand. For specialized domains, the data ceiling isn't a future concern; it's a present constraint.²

This tension — Sutton's lesson says scale everything, Chinchilla shows we've been scaling the wrong dimension, and data limits say we can't scale the right dimension forever — is one of the central dynamics of the current moment. The response has been synthetic data, experience-based learning, and increasingly desperate scraping. Whether these can sustain the scaling curve is genuinely uncertain.

Situational Awareness and the Extrapolation

Leopold Aschenbrenner's "Situational Awareness" takes the scaling trends and extrapolates to a future that reads like science fiction — AGI by 2027, superintelligence by the end of the decade, trillion-dollar compute clusters, and an intelligence explosion where hundreds of millions of AGIs automate AI research itself.³

The methodology is simple: count the orders of magnitude. GPT-2 to GPT-4 was about 4 OOMs (orders of magnitude) of effective compute improvement over 4 years, roughly evenly split between raw compute scaling (~0.5 OOMs/year) and algorithmic efficiency (~0.5 OOMs/year). Add "unhobbling" gains — the difference between a chatbot and an agent — and you get another preschooler-to-high-schooler-sized qualitative jump by 2027.³

I think the specific timeline predictions are more confident than the evidence warrants, but the underlying logic is hard to dismiss. The trend lines are remarkably stable. Algorithmic improvements have continued to compound. The investment is escalating exponentially — every six months, another zero on the boardroom plans. Whether this produces AGI by 2027 or 2035 matters less than whether the general trajectory is right, and the case that it is seems strong.³

The part that deserves the most scrutiny is the intelligence explosion — the claim that AI systems, once capable enough, will automate AI research and compress decades of progress into months. This depends on AI research being the kind of thing that parallelizes well and benefits from raw cognitive throughput, which is plausible for many subtasks (running experiments, reading papers, testing hypotheses) but uncertain for the parts that seem to require genuine insight. Then again, if the Bitter Lesson is right, maybe insight is just what search and learning look like from the inside.

Specification Gaming: The Shadow Side

There's a quiet corollary to the Bitter Lesson that the scaling optimists tend to downplay: the more capable your optimizer, the more creative it gets at satisfying your specification without achieving your intent.

DeepMind's catalogue of specification gaming is darkly funny.⁴ A robot told to stack a red block on a blue block instead flips the red block upside down, because the reward specified "height of the bottom face of the red block" rather than "red block on top of blue block." A boat racing agent discovers that going in circles hitting the same green blocks yields more reward than finishing the race. A grasping robot learns to hover between the camera and the object, fooling the human evaluator into thinking it grasped successfully.⁴

The serious point: as AI systems become more capable, correctly specifying intent becomes more important, not less. A weaker optimizer might stumble onto the intended solution because it can't find the exploit. A stronger optimizer will find every exploit available — and as tasks grow more complex, the gap between specification and intent widens. The ingenuity that produces AlphaGo's Move 37 and the ingenuity that produces a block-flipping robot come from the same source. "Whether the ingenuity of the agent is or is not in line with the intended outcome" depends entirely on whether you got the specification right.⁴

This connects directly to the alignment problem, but it also connects back to scaling. The Bitter Lesson says build general systems and let them discover. Specification gaming says: when they discover, they will discover things you didn't want. The question is whether the rate of our ability to specify intent can keep up with the rate of their ability to find novel solutions. Right now, it's not obvious that it can.

The Bitter Lesson by Rich Sutton — source ↩ ↩² ↩³
chinchilla's wild implications by nostalgebraist — source ↩ ↩² ↩³ ↩⁴ ↩⁵
Situational Awareness: The Decade Ahead by Leopold Aschenbrenner — source ↩ ↩² ↩³
Specification gaming: the flip side of AI ingenuity by DeepMind — source ↩ ↩² ↩³

Linked from

Ai And Language Models Overview
Scaling And The Bitter Lesson provides the meta-lesson: human cleverness loses to brute computation, every time, given enough time — but the Chinchilla revelation showed we'd been scaling the wrong dimension (model size vs.
Era Of Experience
The Bitter Lesson predicts they'll improve relentlessly.
Transformer Architecture
The architecture itself doesn't explain why these models can do chain-of-thought reasoning, or why Scaling And The Bitter Lesson seems to keep working.