The Replication Crisis

For decades, social psychology produced findings that were thrilling, counterintuitive, and almost certainly wrong. Not wrong because scientists are frauds (though some are) but wrong because the machinery of science — p-values, publication incentives, researcher degrees of freedom — was optimized for producing publishable results rather than true ones. The replication crisis is the slow-motion discovery that the incentive structure of academic science generates false positives at industrial scale.

P-Hacking: The Machine

Christie Aschwanden's interactive demonstration at FiveThirtyEight makes the core problem visceral: given a dataset on political parties and the economy, with 1,800 possible combinations of variables, 1,078 of them — 60% — yield a statistically significant result. You can "prove" that either Democrats or Republicans are better for the economy, depending on which variables you choose.¹

This isn't fraud. It's the natural consequence of a system where researchers make dozens of underdetermined choices (which observations to include, which variables to control for, how to define the outcome) and where a good result is one that crosses the p < 0.05 threshold. Uri Simonsohn, who coined the term "p-hacking," found that published p-values in psychology cluster suspiciously just below 0.05. "Everybody has p-hacked at least a little bit," he says. "You really believe your hypothesis and you get the data and there's ambiguity about how to analyze it." When the first analysis doesn't work, you keep trying until one does. And if that fails, there's HARKing — hypothesizing after the results are known.¹
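A small simulation gives a rough sense of how quickly flexible analysis inflates false positives. The sketch below is an illustration only, not Aschwanden's or Simonsohn's method. It assumes a researcher with no real effect to find, who measures several independent outcomes and reports a success if any one of them crosses p < 0.05. The function names, the group size of 30, and the simple z-test with known variance are all assumptions made for the example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def null_outcome_p(rng, n_per_group=30):
    """p-value for one outcome when both groups come from the same distribution."""
    a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(a) / n_per_group - sum(b) / n_per_group
    se = math.sqrt(2 / n_per_group)  # assumes known unit variance in each group
    return two_sided_p(diff / se)


def hit_rate(n_outcomes, n_studies=2000, alpha=0.05, seed=0):
    """Share of null studies in which at least one outcome looks significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # The researcher stops at the first outcome that "works".
        if any(null_outcome_p(rng) < alpha for _ in range(n_outcomes)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, "outcomes tried:", hit_rate(k))
```

With independent outcomes the simulated rate should sit near 1 - 0.95^k, so roughly 40% of null studies yield something publishable once ten outcomes are on the table. Real analysis choices are correlated rather than independent, so the exact figures will differ, but the direction is the same.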

The funnel-plot evidence is damning. In a meta-analysis of "romantic priming" studies, Shanks and colleagues found that the reported effect sizes perfectly tracked the statistical significance boundary: every study found an effect just barely large enough to be significant, and larger studies found smaller effects. This is what publication bias looks like in a graph: a torrent of results rolling down the mountain of significance. When Shanks's team ran eight large replications of their own with 1,600+ participants, they found nothing.²
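The pattern in that funnel plot can be reproduced with no real effect at all. The following is a minimal sketch under stated assumptions, not a reconstruction of the Shanks analysis: the true effect is set to zero, each study's observed standardized effect is drawn with standard error sqrt(2 / n), and only positive results past the 1.96 cutoff get "published". The function name and the sample sizes are hypothetical.

```python
import math
import random


def published_effects(n_per_group, n_attempts=20000, true_effect=0.0, seed=1):
    """Simulate many studies of one size and keep only the publishable ones.

    Returns the list of published effect sizes and the standard error
    for studies of this size.
    """
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)
    kept = []
    for _ in range(n_attempts):
        observed = rng.gauss(true_effect, se)
        if observed > 1.96 * se:  # the significance filter
            kept.append(observed)
    return kept, se


if __name__ == "__main__":
    for n in (20, 50, 200, 800):
        kept, se = published_effects(n)
        mean_kept = sum(kept) / len(kept)
        print(f"n={n:4d}  smallest publishable effect={1.96 * se:.2f}  "
              f"mean published effect={mean_kept:.2f}")
```

Every published effect sits just above the significance boundary for its sample size, and the boundary shrinks as n grows, so larger studies report smaller effects even though nothing is there.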

John Ioannidis crystallized this in his famous 2005 paper: under reasonable assumptions about researcher bias and the ratio of true to false hypotheses, most published research findings are probably false. This isn't a bug in science — it's the predictable outcome of its incentive structure. The system selects for false positives the way evolution selects for fitness.¹
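The argument can be made concrete with a few lines of arithmetic. The sketch below follows the general shape of Ioannidis's positive predictive value calculation: the share of claimed findings that are true, given the prior odds that a tested hypothesis is true, the statistical power, the significance threshold, and a bias term. The function name and the two example scenarios are assumptions chosen for illustration, not figures from the paper.

```python
def ppv(prior_odds, power, alpha=0.05, bias=0.0):
    """Share of claimed positive findings that are actually true.

    prior_odds: ratio of true to false hypotheses being tested
    power:      chance a true effect yields a significant result
    alpha:      chance a false hypothesis yields a significant result
    bias:       share of would-be null results that end up reported as
                positive anyway (p-hacking, selective reporting)
    """
    beta = 1 - power
    true_positives = prior_odds * (power + bias * beta)
    false_positives = alpha + bias * (1 - alpha)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical scenarios, chosen only to show the contrast.
    print("well-powered, even odds, no bias:", round(ppv(1.0, 0.8), 2))
    print("underpowered, long-shot, some bias:", round(ppv(0.1, 0.2, bias=0.2), 2))
```

In the first scenario about 94% of positive findings are true. In the second, with one true hypothesis per ten false ones, 20% power, and a modest bias of 0.2, the figure drops to about 13%. Nothing in the second scenario requires fraud, only long-shot hypotheses, small samples, and some flexibility.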

The Ego Depletion Collapse

The poster child for what went wrong is ego depletion — the idea that willpower is a limited resource that gets consumed through use. Roy Baumeister published hundreds of papers on this, Daniel Kahneman endorsed it in Thinking, Fast and Slow, and it became foundational to the "two systems" framework of human cognition. The claim was that even minor acts of self-control (choosing between two items, suppressing emotions while watching a film) deplete a shared pool of mental energy, making subsequent self-control harder.³

When large preregistered replications (Many Labs-style) actually tested the effect, it vanished. The most recent attempt, led by pro-ego-depletion researcher Kathleen Vohs, found no effect at all. Hundreds of experiments had "replicated" an essentially fictional phenomenon.³

The Against Automaticity argument goes further: ego depletion wasn't just one bad finding — it was the load-bearing pillar of an entire worldview. John Bargh used it to argue that humans spend 95% of their time as unconscious automatons, mere puppets of environmental cues. This justified the entire priming literature: if we're barely conscious most of the time, of course a word scramble containing "Florida" can make us walk slowly, and of course a discounted energy drink can cause measurable brain damage. From an evolutionary perspective, this is absurd. You'd be hunting with your elderly father and suddenly start walking slower because you looked at him. The automaticity hypothesis is, in birguslatro's blunt framing, "just as woo as spoon bending and precognition."³

The Marshmallow Test, Deflated

Walter Mischel's marshmallow test was the crown jewel of "character determines destiny" psychology. Children who could delay gratification at age 4-6 scored higher on the SAT, had fewer behavioral problems, and were generally more successful. The implications seemed enormous: if we could teach kids patience, we could close achievement gaps.⁴

Thirty years later, Tyler Watts and colleagues ran the first serious replication — with ten times the sample size, focused on children whose mothers hadn't attended college rather than Stanford professors' kids. The correlation between delay and later achievement was half the original's size. And when they controlled for family background and early cognitive ability, it almost vanished.⁴

"If you have two kids who have the same background environment, they get the same kind of parenting, they are the same ethnicity, same gender, they have a similar home environment, they have similar early cognitive ability," Watts says, "then if one of them is able to delay gratification, and the other one isn't, does that matter? Our study says, 'Eh, probably not.'"⁴

There was an even stranger finding: most of the marshmallow test's predictive power came from whether children could wait just 20 seconds before eating. Waiting 2, 5, or 7 minutes conferred no additional benefit over waiting 20 seconds. That's hard to reconcile with a story about sophisticated cognitive strategies for patience. It's easier to reconcile with a story about background variables (intelligence, home environment, trust in institutions) that predict both marshmallow performance and later outcomes.

This connects to the willpower-and-akrasia discussion in an interesting way. Ainslie's framework treats self-control as intertemporal bargaining, not as a depletable resource. If the resource model is wrong and the bargaining model is right, that changes the practical upshot entirely: the levers aren't about "building willpower muscle" but about setting up the right commitment structures and precedents.

The Placebo That Wasn't

The Against Automaticity essay extends the demolition to the placebo effect itself, which most people take as established fact. The actual meta-analytic evidence is sobering: placebo effects compared to no-treatment controls are tiny, around 3.2 points on a 100-point pain scale — too small to be clinically noticeable. Most of what looks like "placebo healing" is regression to the mean: people seek treatment when they're at their worst, and conditions naturally fluctuate.³
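Regression to the mean is easy to see in a toy model. The sketch below is an illustration under invented assumptions, not data from any placebo trial: each person has a stable typical pain level plus day-to-day fluctuation, people enroll only on a day when their pain is above a threshold, and nobody receives any treatment. The scale, the spreads, and the threshold of 70 are all made up for the example.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def untreated_improvement(n_people=20000, threshold=70.0, seed=2):
    """Mean pain at enrollment and at follow-up, with no treatment at all."""
    rng = random.Random(seed)
    at_enrollment, at_follow_up = [], []
    for _ in range(n_people):
        typical = rng.gauss(50, 10)           # long-run average pain
        day_one = typical + rng.gauss(0, 15)  # today's fluctuation
        if day_one > threshold:               # only bad days lead to enrollment
            later = typical + rng.gauss(0, 15)  # fresh fluctuation, untreated
            at_enrollment.append(day_one)
            at_follow_up.append(later)
    return _mean(at_enrollment), _mean(at_follow_up)


if __name__ == "__main__":
    before, after = untreated_improvement()
    print(f"mean pain at enrollment: {before:.1f}")
    print(f"mean pain at follow-up:  {after:.1f}")
```

The enrolled group "improves" substantially at follow-up purely because enrollment selected for unusually bad days. A trial without a no-treatment arm would credit that drop to whatever pill was handed out.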

The famous studies showing large placebo effects (discount placebos work less well than expensive ones, open-label placebos still heal) have the same methodological problems as the priming literature. One influential study by Waber, Shiv, and Dan Ariely claimed pain relief of up to 30 points on a 100-point scale from price framing alone. A subsequent study using similar methods found effects of zero, one, and three points. The Ariely study, it turned out, also lacked IRB approval for giving electric shocks to human subjects.³

Science Isn't Broken (But It's Hard)

The tempting conclusion is that science is a sham. The more accurate conclusion is that science is hard — really hard — and we need to adjust our expectations of it.¹

What's actually happening is a system correcting itself, painfully. Retraction Watch launched in 2010, expecting maybe one post a month; they now do two or three a day. Preregistration forces researchers to commit to their analysis plan before seeing data. Registered Reports accept papers based on methodology before results are known, eliminating publication bias at the source. Open data makes replication possible. The Many Labs projects coordinate dozens of labs to simultaneously test high-profile findings.¹

The deeper lesson is about the calibration-and-measurement problem: a single study is not a definitive answer. "Science operates as a procedure of uncertainty reduction," as Brian Nosek puts it. The back-and-forth of results that looks flaky from the outside — coffee is good for you one day, bad the next — is actually what the process is supposed to look like when it's working on hard problems. The crisis isn't that science produces wrong answers. It's that the system was set up to make wrong answers look like right ones.

Peter Norvig made the same point from a different angle: when a published paper claims "this could only happen by chance one in twenty times," it's quite possible that twenty similar experiments were run but only the one with the positive result was published. The p-value was never designed to be a measure of truth. It's a measure of surprise. Confusing the two — treating p < 0.05 as proof rather than as one data point in an ongoing uncertainty reduction — is the original sin of the replication crisis.⁵
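Norvig's arithmetic is worth writing out. Assume twenty experiments on an effect that does not exist, each tested at the 0.05 level. Treating the experiments as independent is an assumption made here for simplicity.

```latex
\[
\mathbb{E}[\text{significant results}] = 20 \times 0.05 = 1,
\qquad
P(\text{at least one significant}) = 1 - (1 - 0.05)^{20} \approx 0.64 .
\]
```

A lone published positive is therefore about what a file drawer of twenty null experiments would be expected to produce.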

The IQ Measurement Problem

Nassim Taleb's statistical critique of IQ research illustrates a different failure mode: not p-hacking or publication bias, but a fundamental misunderstanding of what correlation means under nonlinearity and fat tails.⁶ His core argument is that IQ was designed to detect learning disabilities (it works in the left tail) but is marketed as measuring intelligence generally (it doesn't work in the right tail). The correlation between IQ and real-world outcomes like wealth looks far less impressive once you look at R-squared rather than r: IQ explains 2-13% of variance in the best cases, which corresponds to an r of roughly 0.14 to 0.36. And that's before accounting for circularity (IQ-like tests are used as gatekeepers for the careers where IQ supposedly predicts success).

The statistical point is sharp even if the rhetoric is characteristically Talebian. When a measure works asymmetrically (highly correlated with extreme negative performance, uncorrelated with positive performance), the overall correlation looks meaningful but is driven entirely by the left tail. Add noise and the apparent correlation spreads to both sides, creating an illusion of predictive power that isn't there. The test-retest reliability problem compounds this: by Taleb's account, the same individual can get results varying by up to 2 standard deviations, in which case the measurement noise for a single person rivals or exceeds the population spread the test is supposedly measuring.
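The asymmetry argument can be checked with synthetic data. The sketch below is a toy construction, not Taleb's own simulation or real IQ data: the outcome tracks the score only below zero (the left half of the distribution), is flat above it, and has noise added throughout. The cutoff at zero, the noise level, and the function names are assumptions made for the example.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def left_tail_only(n=20000, noise_sd=0.5, seed=3):
    """Correlation overall and among above-average scorers only."""
    rng = random.Random(seed)
    scores, outcomes = [], []
    for _ in range(n):
        score = rng.gauss(0, 1)
        # The score matters only when it is low; above zero the outcome is pure noise.
        outcome = min(score, 0.0) + rng.gauss(0, noise_sd)
        scores.append(score)
        outcomes.append(outcome)
    upper = [(s, o) for s, o in zip(scores, outcomes) if s > 0]
    r_all = pearson(scores, outcomes)
    r_upper = pearson([s for s, _ in upper], [o for _, o in upper])
    return r_all, r_upper


if __name__ == "__main__":
    r_all, r_upper = left_tail_only()
    print(f"correlation, everyone:       {r_all:.2f}")
    print(f"correlation, upper half only: {r_upper:.2f}")
```

The full-sample correlation comes out clearly positive while the correlation among above-average scorers is near zero. A single summary r would suggest the score predicts outcomes across the whole range when it only does so at the bottom.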

National IQ comparisons are an even more egregious case. For 104 of 185 countries, no IQ studies existed at all — the numbers were imputed from ethnicity. This is, as Taleb puts it, not measurement but fraud wearing a lab coat. The whole enterprise exemplifies what happens when a field with "physics envy" adopts quantitative methods without understanding what the numbers actually mean — the same confusion between P(H|E) and P(E|H) that Norvig identified, scaled up to an entire subdiscipline.⁶

¹ Science Isn't Broken by Christie Aschwanden
² Reproducibility Crisis: The Plot Thickens by Neuroskeptic
³ Against Automaticity by birguslatro
⁴ The marshmallow test said patience was a key to success. A new replication tells us s'more by Brian Resnick
⁵ Warning Signs in Experimental Design and Interpretation by Peter Norvig
⁶ IQ Is Largely a Pseudoscientific Swindle by Nassim Nicholas Taleb
