Calibration and Measurement

Most people think measurement means precision. It doesn't. A measurement is any observation that quantitatively reduces your uncertainty. If you knew nothing about how many people a virus attack would affect, and now you know it's between 25,000 and 65,000, you've measured something — even though you can't give a single number. The real bottleneck in decision-making is almost never the absence of precise data. It's the failure to quantify what you already know.

Anything Can Be Measured

Douglas Hubbard's How to Measure Anything opens with a provocation: the things most likely to be seen as immeasurable can, virtually always, be measured by relatively simple methods.¹ The sciences have their instruments. What Hubbard is after are "business intangibles": management effectiveness, the flexibility to create new products, the risk of bankruptcy, public image. These are the things that make executives throw up their hands and say "you can't put a number on that."

You can. You just have to do three things first.

Define the object of measurement in terms of observables. "IT security" means nothing until you break it into unauthorized intrusions, malware attacks, and their downstream costs — lost productivity, fraud losses, legal liability. Once you've done that, each component is measurable by some method, even if imprecise.¹

Determine what you already know. This is where calibration enters. A "90% confidence interval" is a range you're 90% sure contains the true value. A calibrated estimator's 90% CIs contain the truth about 90% of the time. The problem is that untrained estimators are terrible at producing such ranges: in studies, the true value fell inside people's 90% CIs only about 50% of the time. They were wildly overconfident.¹
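One way to check your own calibration is to score a batch of past interval estimates against the answers. The sketch below is a minimal illustration: the quiz data is made up, and `interval_hit_rate` is a hypothetical helper name, not something from Hubbard's book.

```python
def interval_hit_rate(estimates):
    """Fraction of (low, high, truth) triples whose truth lies inside [low, high]."""
    hits = sum(1 for low, high, truth in estimates if low <= truth <= high)
    return hits / len(estimates)

# Hypothetical 90% CIs from one estimator, each paired with the true answer.
quiz = [
    (10, 20, 14), (100, 300, 450), (5, 8, 6), (1, 2, 3), (40, 60, 75),
    (0, 5, 2), (200, 250, 190), (30, 90, 33), (7, 9, 12), (50, 55, 52),
]

hit_rate = interval_hit_rate(quiz)
print(f"Stated confidence: 90%, observed hit rate: {hit_rate:.0%}")
```

A hit rate far below the stated 90% suggests the intervals are too narrow. Ten questions is a small sample, so a real calibration exercise would use many more.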

Compute the value of additional information. Before you measure anything, ask: how much would reducing my uncertainty about this variable actually change my decision? If the answer is "not much," don't bother measuring it — go find the variable where better information would actually matter. This is where most organizations go wrong. They obsess over measuring things that are easy to measure, not things whose measurement would actually improve their decisions.¹
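The idea of valuing information before gathering it can be made concrete with the expected value of perfect information (EVPI): the gap between what you expect to earn if you could learn the outcome before deciding and what you expect to earn by committing now. The sketch below uses a deliberately tiny two-scenario model. The probabilities, payoffs, and the function name are all assumptions for illustration, not figures from Hubbard.

```python
def expected_value_of_perfect_information(scenarios):
    """scenarios: list of (probability, {action: payoff}) pairs covering all cases."""
    actions = scenarios[0][1].keys()
    # Best you can do today: commit to a single action before the scenario is known.
    best_now = max(
        sum(p * payoffs[a] for p, payoffs in scenarios) for a in actions
    )
    # Best you could do if you learned the scenario first and then chose.
    best_informed = sum(p * max(payoffs.values()) for p, payoffs in scenarios)
    return best_informed - best_now

# Hypothetical decision: buy a machine or skip it.
scenarios = [
    (0.8, {"buy": 200_000, "skip": 0}),    # demand is high
    (0.2, {"buy": -150_000, "skip": 0}),   # demand is low
]

evpi = expected_value_of_perfect_information(scenarios)
print(f"Most you should pay to resolve the uncertainty: ${evpi:,.0f}")
```

If a proposed measurement costs more than this ceiling, it cannot pay for itself, however interesting the result would be.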

Hubbard's calibration training is deceptively simple. Its core is the equivalent bet test: given your stated 90% CI, would you rather win a prize if the true value falls inside it, or spin a wheel with a 90% chance of winning the same prize? If you'd rather spin the wheel, your CI is too narrow: your gut is telling you that you are overconfident. If you'd rather bet on your interval, it is too wide. Adjust until the two options feel equally attractive. Research shows that even pretending to bet money improves calibration.¹

A Monte Carlo simulation takes calibrated estimates and propagates their uncertainty through a decision model. You don't need a single "best guess": you generate thousands of scenarios from your ranges, compute the outcome for each, and get a probability distribution of results. For a $400,000 equipment lease, how often do the savings exceed the cost? About 86% of the time, in Hubbard's worked example. That's decision-grade information, assembled entirely from expert judgment and basic statistics.¹
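A lease decision of this kind can be sketched in a few lines of Python. Only the $400,000 lease cost comes from the text above. The per-unit savings ranges and production range are illustrative assumptions, as are the choice to treat each 90% CI as a normal distribution and the savings model itself.

```python
import random

Z_90 = 1.645  # a 90% interval on a normal spans roughly +/-1.645 standard deviations

def sample_from_ci(rng, low, high):
    """Draw from a normal whose central 90% interval is (low, high)."""
    mean = (low + high) / 2
    sd = (high - low) / (2 * Z_90)
    return rng.gauss(mean, sd)

def simulate_lease(trials=100_000, lease_cost=400_000, seed=0):
    """Return the share of simulated years in which savings exceed the lease cost."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Assumed 90% CIs, in dollars saved per unit produced.
        maintenance = sample_from_ci(rng, 10, 20)
        labor = sample_from_ci(rng, -2, 8)
        materials = sample_from_ci(rng, 3, 9)
        # Assumed 90% CI for units produced per year.
        units = sample_from_ci(rng, 15_000, 35_000)
        if (maintenance + labor + materials) * units > lease_cost:
            wins += 1
    return wins / trials

p_break_even = simulate_lease()
print(f"Savings exceed the lease cost in {p_break_even:.0%} of scenarios")
```

The output is a probability rather than a yes or no, which is the point: the decision-maker sees how often the lease pays off and can weigh that against the size of the possible loss.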

Warning Signs in Evidence

Peter Norvig's guide to experimental design and interpretation is the single best short document on how studies go wrong.² Not how they might go wrong in theory — how they routinely do go wrong in practice.

The foundation: a randomized controlled trial guards against the systematic bias of self-selected or hand-picked groups, but you still have to worry about everything else, starting with how the sample was drawn. The Literary Digest polled 10 million people in 1936 and predicted Alf Landon would win the presidency; Franklin Roosevelt won in a landslide. The sample was drawn from magazine subscribers, car owners, and telephone users: precisely the people who could afford those things during the Depression, and who skewed Republican.²

The most insidious problem is confusing P(H|E) with P(E|H). This is the base rate fallacy wearing a lab coat. Suppose a mammogram detects 80% of cancers and gives a false positive for 9.6% of healthy women. A woman gets a positive result. Most doctors estimate her cancer probability at 70-80%. The actual number: 7.8%. Because only 1% of women in her age group have cancer, the vast majority of positive results are false positives. Only about 15% of doctors get this right, and researchers keep replicating the finding.² This is the same failure mode described in Bayesian Epistemology: ignoring priors and selecting hypotheses solely on explanatory adequacy.
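The 7.8% figure follows directly from Bayes' rule. The sketch below assumes a 1% base rate, an 80% detection rate, and a 9.6% false-positive rate, which is the figure usually quoted with this example. The function name is hypothetical.

```python
def probability_given_positive(base_rate, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)

p_cancer = probability_given_positive(
    base_rate=0.01,             # P(H): share of women in the age group with cancer
    sensitivity=0.80,           # P(E|H): positive test given cancer
    false_positive_rate=0.096,  # P(E|not H): positive test given no cancer
)
print(f"P(cancer | positive mammogram) = {p_cancer:.1%}")
```

Out of 1,000 women, about 8 true positives are swamped by about 95 false positives, which is why P(H|E) ends up so far below P(E|H).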

Norvig extends this to the problem of seeing patterns in randomness. True random distributions have clusters — that's what randomness looks like. But our brains are wired to spot patterns, even where none exist. When seven cancer clusters appeared near cell phone masts in the UK, it made headlines. But there were 47,000 masts in England. The article could just as accurately have been titled "Cell phone masts prevent cancer clusters 99.985% of the time."²

The most devastating section concerns publication bias. When a published paper claims "this could only happen by chance one in twenty times," it is quite possible that similar experiments have been run twenty times without positive results but were never published. Ioannidis went further: under reasonable assumptions about bias and the ratio of true to possible correlations, most published research findings may be false. Not because scientists are incompetent, but because the system selects for false positives.²
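The arithmetic behind the "one in twenty" worry is short enough to check directly. The sketch below assumes twenty independent experiments on an effect that does not exist, each tested at the 0.05 level. It computes the chance that at least one comes out "significant" and confirms the number by simulation. Function names and the trial count are illustrative.

```python
import random

def chance_of_spurious_hit(n_experiments, alpha=0.05):
    """P(at least one of n independent null experiments has p < alpha)."""
    return 1 - (1 - alpha) ** n_experiments

def simulate_spurious_hit(n_experiments=20, alpha=0.05, trials=50_000, seed=1):
    """Monte Carlo check: under a true null, each p-value is uniform on [0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(n_experiments)):
            hits += 1
    return hits / trials

analytic = chance_of_spurious_hit(20)
simulated = simulate_spurious_hit()
print(f"Analytic: {analytic:.1%}, simulated: {simulated:.1%}")
```

If only the "significant" run is written up, the published p-value says nothing about the nineteen runs that stayed in the file drawer.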

Then there's the problem Feynman identified: doing the hard work of controlling for the right things, and nobody caring. A researcher named Young spent years figuring out that rats in maze experiments were navigating by floor sound, not the cues the experimenters thought they were testing. He had to cover the corridor in sand before the rats were actually learning what experimenters intended. No subsequent rat-maze paper ever cited him or used his controls. "He didn't discover anything about the rats," Feynman noted. "He discovered all the things you have to do to discover something about rats."²

Bayesian Surprise

The formal Bayesian theory of surprise developed by Itti and Baldi offers a rigorous quantification of what it means for data to be "interesting."³ The core insight: surprise is the distance between your posterior and prior beliefs. If new data doesn't change what you believe, it's boring. If it dramatically reshapes your belief distribution, it's surprising.

This resolves a classical paradox: random snow on a TV screen carries the most Shannon information (because there are vastly more possible random images than natural images), but it's the least interesting thing to watch. Shannon information measures raw unpredictability; surprise measures how much your model of the world changes. After you've updated to "this is snow," new frames of snow are completely unsurprising even though each individual frame is highly entropic.³

The practical payoff: when tested on human eye-tracking data during television watching, the Bayesian surprise metric predicted where people would look significantly better than any other computational metric — 20% better than saliency, 60% better than Shannon entropy. Human attention doesn't track raw information. It tracks model updates.³

This connects to the broader theme of Predictive Processing: the brain as a prediction machine, allocating attention and resources to whatever most challenges its current model. A "wow" — the unit of surprise proposed in this framework — is a two-fold change between posterior and prior probability for a hypothesis. The total surprise is the KL divergence between the posterior and prior distributions over your entire model space. The math formalizes what every good teacher, storyteller, and scientist already knows: the most valuable information is the information that changes your mind.
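For a small discrete set of hypotheses, surprise as KL divergence can be computed in a few lines. The sketch below is a toy version of the TV-snow example: the two hypotheses, the prior, and the likelihoods are invented for illustration, and the divergence is taken from posterior to prior in base 2, so the result is in bits.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior over hypotheses given P(data | hypothesis) for each one."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def surprise_bits(prior, posterior):
    """KL(posterior || prior) in bits; assumes every prior entry is nonzero."""
    return sum(
        q * math.log2(q / p) for p, q in zip(prior, posterior) if q > 0
    )

# Hypotheses: [screen shows snow, screen shows a normal program].
belief = [0.5, 0.5]
snow_frame_likelihood = [0.99, 0.01]  # assumed P(noisy frame | hypothesis)

surprises = []
for _ in range(3):
    updated = bayes_update(belief, snow_frame_likelihood)
    surprises.append(surprise_bits(belief, updated))
    belief = updated

print([round(s, 4) for s in surprises])
```

The first noisy frame moves the belief a lot and so carries most of the surprise. Later frames are just as random pixel by pixel but barely shift the posterior, so their surprise falls toward zero.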

The Flaw of Averages

There's a complementary measurement pathology that's less about statistics and more about what you're even trying to measure. In the late 1940s, U.S. Air Force pilots were crashing at alarming rates — 17 in a single day at the worst point. The cockpits had been designed in 1926 to fit the "average pilot." When the Air Force commissioned a massive study of 4,063 pilots on 140 dimensions, a young researcher named Gilbert Daniels asked the obvious question nobody had thought to ask: how many pilots actually are average?⁴

Zero. Daniels counted a pilot as average on a dimension if he fell within the middle 30% of the range. Not a single pilot fell within that range on all 10 key dimensions. Even on just three dimensions, fewer than 3.5% qualified. The average pilot was a fiction, a statistical artifact that described no actual human body. Designing for the average meant designing for nobody.⁴
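A back-of-the-envelope calculation shows why the answer was zero. The sketch below takes "average" to mean the middle 30% on each dimension, the definition Rose attributes to Daniels, and makes the simplifying and unrealistic assumption that the dimensions are independent. Real body measurements are correlated, so the observed shares run somewhat higher than this model predicts.

```python
def share_average_on_all(n_dimensions, band=0.30):
    """Share of people inside the middle band on every dimension, if independent."""
    return band ** n_dimensions

pilots = 4_063
share_three = share_average_on_all(3)
share_ten = share_average_on_all(10)
expected_average_pilots = pilots * share_ten

print(f"Average on 3 dimensions:  {share_three:.1%} of pilots")
print(f"Average on 10 dimensions: {share_ten:.6%} of pilots")
print(f"Expected count among {pilots} pilots: {expected_average_pilots:.3f}")
```

Under these assumptions the expected number of fully average pilots is well under one, so finding none is what the arithmetic predicts rather than an anomaly.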

The Air Force's response was radical and almost immediately successful: they demanded adjustable cockpits. Manufacturers balked, then quickly invented adjustable seats, pedals, helmet straps — technology now standard in every car. Pilot performance soared. The lesson wasn't that averages are useless — it was that optimizing for an average when individual variation matters is a category error, the same kind of map-territory confusion that afflicts measurement in general.⁴

Forecasting and the Value of Communicating Uncertainty

Weather forecasting is the one domain where predictions have improved dramatically over the past half-century — and the reason illuminates everything that goes wrong elsewhere. Nate Silver's deep dive into the National Weather Service reveals that the key insight wasn't better computers (though those helped). It was that forecasters learned to embrace uncertainty rather than hide it.⁵

Three-day temperature forecasts that missed by six degrees in 1972 now miss by three. Hurricane track predictions that were off by 350 miles 25 years ago are now accurate to about 100 miles. The improvement comes from a combination of better models, better data, and — crucially — human judgment that corrects for known model biases. Experienced forecasters know that a particular model tends to forecast precipitation too far south by 100 miles in certain conditions, or that fog in Acadia clears by sunrise when the wind blows one way but lingers if it comes from another. Humans improve precipitation forecasts by about 25% over computer guidance alone.⁵

The social dimension is equally important. For years, commercial forecasters maintained a deliberate "wet bias": when the Weather Channel said 20% chance of rain, it actually rained only about 5% of the time. People don't mind being told it'll rain when it doesn't, but they're furious when they get rained on unexpectedly. The incentive structure rewards forecasts that protect the forecaster over honest, calibrated uncertainty, much the same dynamic that fed the replication crisis in psychology. The 1997 Grand Forks flood, where Weather Service forecasters knew there was a 35% chance the levees would be topped but communicated only a point estimate, showed the cost of hiding uncertainty. After that, they adopted the "cone of chaos" for hurricane forecasts, explicitly showing the range of possible paths.⁵
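A wet bias shows up as soon as stated probabilities are compared with observed frequencies. The sketch below builds a simple reliability table from forecast and outcome pairs. The data is fabricated to mirror the 20%-stated, 5%-observed pattern described above, and `reliability_table` is a hypothetical helper.

```python
from collections import defaultdict

def reliability_table(forecasts, outcomes):
    """Map each stated probability (rounded to 0.1) to the observed event frequency."""
    buckets = defaultdict(list)
    for stated, happened in zip(forecasts, outcomes):
        buckets[round(stated, 1)].append(happened)
    return {stated: sum(hits) / len(hits) for stated, hits in sorted(buckets.items())}

# Hypothetical record: twenty "20% chance of rain" days, ten "70%" days.
forecasts = [0.2] * 20 + [0.7] * 10
outcomes = [1] + [0] * 19 + [1] * 7 + [0] * 3   # 1 = it rained

table = reliability_table(forecasts, outcomes)
for stated, observed in table.items():
    print(f"said {stated:.0%} -> rained {observed:.0%} of the time")
```

A calibrated forecaster's table sits close to the diagonal, with stated and observed values roughly equal. Here the 70% forecasts do, while the 20% forecasts overstate the chance of rain.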

The comparison with other fields is devastating. Political experts describing an event as "absolutely certain" are wrong 25% of the time. Economists foresaw less than a 1-in-500 chance of the 2008 financial crisis one month before it hit. Weather forecasting works because it institutionalized the Bayesian insight that all predictions are probability distributions, not point estimates — and built a culture that rewards calibration over confidence.
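Turning a point estimate into a probability takes very little machinery once the forecast error is known. The sketch below revisits the Grand Forks case with figures that are assumptions for illustration and do not appear in the text above: a forecast crest of 49 feet, levees at 51 feet, and a historical error of plus or minus 9 feet, read here as a 90% interval on a normally distributed error.

```python
from statistics import NormalDist

forecast_crest_ft = 49.0   # assumed point forecast
levee_height_ft = 51.0     # assumed levee height
half_width_90_ft = 9.0     # assumed 90% error band around the forecast

# Convert the 90% half-width into a standard deviation.
sigma = half_width_90_ft / NormalDist().inv_cdf(0.95)
crest = NormalDist(mu=forecast_crest_ft, sigma=sigma)

p_overtop = 1 - crest.cdf(levee_height_ft)
print(f"Chance the river tops the levees: {p_overtop:.0%}")
```

Under these assumptions the chance comes out in the mid-30s percent, in line with the roughly 35% figure mentioned earlier. A forecast crest below the levee height sounds safe as a bare number, but the distribution around it does not.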

¹ How to Measure Anything by Luke Muehlhauser (summary of Hubbard's book)
² Warning Signs in Experimental Design and Interpretation by Peter Norvig
³ Formal Bayesian Theory of Surprise by Laurent Itti and Pierre Baldi
⁴ When U.S. Air Force Discovered the Flaw of Averages by Todd Rose
⁵ The Weatherman Is Not a Moron by Nate Silver

