Superhuman Token Prediction

Here's a result that should reshape how you think about language models: even small models (125 million parameters, roughly GPT-1 size) are better than humans at predicting the next token in internet text. Not better than distracted humans. Better than smart, motivated, practiced humans who are explicitly trying to win at the task. The gap widens with model size: GPT-3 gets 49% top-1 accuracy on the same kind of text where human players average only 29%.¹

This is counterintuitive. Humans write the text. The text is in our language, about our world, expressing our thoughts. How can a statistical model predict what we'll write better than we can?

The Experiment

Buck Shlegeris and colleagues at Redwood Research ran two evaluations. In the first, 60 participants (smart people — Redwood staff, Alignment Forum members, paid $30/hour) played a game: given the beginning of a randomly sampled OpenWebText document, guess the next token. After each guess, the correct answer was revealed, so they could learn and adapt. Across 18,530 guesses, overall accuracy was 29%. The best player to attempt a large number of tokens hit 32%. Even GPT-2 Small, a model from 2019, beat nearly every human player.¹
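To make the metric concrete, here is a minimal sketch of how top-1 accuracy could be scored in a guessing game like this one. The function name and the toy guesses are hypothetical, chosen for illustration; this is not the study's actual scoring code. The last line only restates the aggregate figures quoted above as an approximate count.

```python
def top1_accuracy(guesses, targets):
    """Fraction of positions where the single guess equals the true next token."""
    if len(guesses) != len(targets):
        raise ValueError("guesses and targets must have the same length")
    if not targets:
        raise ValueError("need at least one guess")
    correct = sum(g == t for g, t in zip(guesses, targets))
    return correct / len(targets)


# Hypothetical toy round: four guesses, two of them correct.
toy_guesses = [" the", " Paris", ".", " of"]
toy_targets = [" the", " Paris", ",", " in"]
toy_accuracy = top1_accuracy(toy_guesses, toy_targets)

# 29% accuracy over 18,530 guesses corresponds to roughly this many hits.
approx_correct = round(0.29 * 18530)
```

Note that a guess only counts if it matches the token exactly, including leading spaces and punctuation, which is where much of the human disadvantage comes from.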

The second experiment tried to measure human perplexity more directly, using pairwise token comparisons and importance sampling. Again, humans performed worse than all tested models, including a two-layer toy transformer. The methodology is noisy — humans are bad at giving calibrated probability estimates, and the interface was limited — but the direction of the result is robust.¹
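For reference, perplexity is the exponential of the average negative log-probability assigned to the tokens that actually occurred. The sketch below computes it from a list of such probabilities. It does not reproduce the pairwise-comparison or importance-sampling procedure used in the study, and the example probabilities are made up to show why poor calibration is costly.

```python
import math


def perplexity(true_token_probs):
    """exp(mean negative log-probability) over the tokens that actually occurred."""
    if not true_token_probs:
        raise ValueError("need at least one probability")
    if any(not (0.0 < p <= 1.0) for p in true_token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


# Hypothetical predictors scored on ten tokens each.
hedged = perplexity([0.5] * 10)                    # always gives the truth 50%
overconfident = perplexity([0.9] * 9 + [0.001])    # usually right, once badly wrong
```

In this made-up example the overconfident predictor ends up with the worse (higher) perplexity despite being more confident on nine of ten tokens, because a single near-zero probability dominates the average.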

Previous claims that humans had better perplexity (the Steve Omohundro figure of human perplexity ~12 vs GPT-3's ~20.5) turned out to be based on a dubious extrapolation from 2017 language models, not actual measurement.¹ When you actually test it, the story is clear: language models are superhuman at their training task, and have been for years.

Why

The explanation is both mundane and profound. Predicting the next token requires two very different kinds of knowledge: what to say (the semantic content, the facts, the logical structure) and how to say it (word choice, sentence structure, idiomatic phrasing, formatting conventions). Humans are better at the first part but dramatically worse at the second.

Consider predicting the next token after "The capital of France is". A human knows it's "Paris." But the next token might be " Paris" or "\nParis" or "**Paris" depending on the document format. The model has to assign probability mass across all plausible continuations, weighted by their frequency in the training distribution. Every token raises questions of capitalisation, spacing, punctuation, and stylistic convention. These are "high-frequency features that don't affect human judgments of coherence", yet they matter enormously for token prediction accuracy.¹

This is why humans write coherent text but are bad at predicting it. Writing coherent text only requires picking one reasonable continuation. Predicting the actual next token requires modelling the distribution over all reasonable continuations, including all the surface-level variations that humans don't track. The model devotes capacity to these features because its training objective rewards them. Humans don't, because we've never needed to.
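The "Paris" example can be turned into a small calculation. The sketch below assumes an invented next-token distribution (the probabilities are illustrative, not measured) and compares two guessers: one who knows which surface form is most common, and one who knows the answer is "Paris" but picks among the surface forms at random.

```python
# Hypothetical distribution over the next token after "The capital of France is".
# The numbers are made up for illustration.
next_token_probs = {
    " Paris": 0.6,
    "\nParis": 0.2,
    "**Paris": 0.1,
    " the": 0.1,
}


def best_guess_hit_rate(probs):
    """Expected top-1 accuracy when always guessing the most likely token."""
    return max(probs.values())


def uniform_guess_hit_rate(probs, candidates):
    """Expected top-1 accuracy when picking uniformly among the given candidates."""
    return sum(probs[c] for c in candidates) / len(candidates)


paris_variants = [" Paris", "\nParis", "**Paris"]
format_aware = best_guess_hit_rate(next_token_probs)
content_only = uniform_guess_hit_rate(next_token_probs, paris_variants)
```

Under these assumed numbers, tracking the formatting doubles the hit rate (0.6 versus 0.3) even though both guessers know the same fact about France.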

The Autopilot Problem

Sarah Constantin's observation about GPT-2 text (written before models got much better) cuts deeper than the token prediction result. She noticed something unsettling: GPT-2's generated text passes the skimming test perfectly. If you're not concentrating — if you're reading at the pace most of us read most of the time — the text looks totally normal. It has the right rhythm, the right vocabulary, the right emotional tone. Only focused, effortful reading catches the logical absurdities: unicorns that are both one-horned and four-horned, brain tissue harvested from a chest, day turning to dusk in the morning.²

The implication: "OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot." And humans are on autopilot most of the time. Constantin connects this to Robin Hanson's "Better Babblers" argument — that much human speech is generated by low-order correlations (what words tend to go together, which combinations have positive associations) rather than deep structural understanding. When she interviewed job applicants, the ones who couldn't solve simple math problems came across as just as bright in conversation as the ones who could. "Whatever ability IQ tests and math tests measure... lacking that ability doesn't have any effect on one's ability to make a good social impression or even to 'seem smart' in conversation."²

This means the vulnerability to language-model-generated text is not about the models being clever; it's about humans being on autopilot by default. The mental motion of "I didn't really parse that paragraph, but sure, whatever" is subjectively identical whether the paragraph is a genuine argument you're too lazy to engage with or machine-generated nonsense that couldn't be parsed because there's nothing to parse. Constantin suggests we need a new epistemic default: not "assume the text is meaningful and you're missing something" (default to humility) but "if you didn't parse it, treat it as if you never read it" (default to null).²

What This Means

The superhuman token prediction result has several implications that aren't widely appreciated.

For mechanistic interpretability: the model you're trying to interpret knows more about text than you do. Chris Olah has drawn a graph where models become more interpretable as they approach human level, then less interpretable as they exceed it. Language models crossed that threshold for their training task years ago. When you find a feature you can't explain, consider the possibility that the model is tracking a real pattern you've never noticed, not hallucinating one.¹

For evaluation: next-token prediction is in one respect a much harder task than generating coherent text, because generation only requires finding one good path while prediction requires modelling the distribution over all plausible paths. The two also reward different things, so a model that's superhuman at the former can still be mediocre at the latter. This asymmetry means that next-token prediction benchmarks and generation quality benchmarks are measuring genuinely different capabilities, and a model's training objective is much closer to the former than the latter.

For epistemics: we are living in an information environment where machines can already produce text that's indistinguishable from human writing for any reader not actively concentrating. The main defences are likely to be cognitive rather than technical (watermarking schemes can be circumvented): training ourselves to notice when we've slipped into autopilot, and refusing to form beliefs based on text we haven't actually parsed. That's a hard habit to build, because the autopilot is so pervasive we don't notice it's running.

Footnotes

¹ Buck Shlegeris, Fabien Roger, and Lawrence Chan, "Language models seem to be much better than humans at next-token prediction."
² Sarah Constantin, "Humans Who Are Not Concentrating Are Not General Intelligences."

