Software Correctness at Scale

On the night of April 10, 2014, the entire population of Washington State lost 911 service for six hours. The cause was a counter on a server in Englewood, Colorado. Intrado programmers had set a threshold for how high the counter could go. They picked a number in the millions. When it overflowed, new calls were rejected. No alarm sounded, because nobody had anticipated the failure. The fix was to change a single number.¹

This is the signature failure mode of modern software. The software did exactly what it was told to do — perfectly. It failed because it was told to do the wrong thing. Software failures are failures of understanding and imagination, not of execution. And they are getting worse, because the systems are getting bigger and more interconnected, while the tools for understanding them have barely changed since the 1960s.

The Scale Problem

"There's 100 million lines of code in cars now," says Nancy Leveson, who's studied software safety for 35 years. "You just cannot anticipate all these things." The Toyota unintended acceleration case made this visceral: Michael Barr's team found over 10 million ways for key tasks on the onboard computer to fail, some triggered by a single bit flip. Toyota's fail-safe code wasn't enough. "You have software watching the software," Barr testified. "If the software malfunctions and the same program that crashed is supposed to save the day, it can't save the day because it is not working."¹

The 911 outage, the Toyota accelerations, the simultaneous failures of United Airlines, the NYSE, and the WSJ on a single day in 2015 — these aren't isolated incidents. They're symptoms of a structural problem. We are building systems that are beyond our ability to intellectually manage, as Dijkstra warned in 1988. The complexity is invisible, packed into silicon chips as millions of lines of code, and just because we can't see it doesn't mean it has gone away.

The Shuttle Group: What Perfection Costs

The on-board shuttle software group at Lockheed Martin in Clear Lake, Texas, represents the extreme other end of the spectrum.² Their software never crashed. The last three versions — each 420,000 lines — had one error each. The last eleven versions had seventeen errors total. Commercial software of equivalent complexity would have five thousand.

How? Not through genius, but through process. About one-third of their effort happened before anyone wrote a line of code. NASA and Lockheed agreed in minute detail about everything the new code was supposed to do, committing it to paper with the precision of blueprints. The specs for a GPS upgrade involving just 6,366 lines of code ran 2,500 pages. The full program specs filled 30 volumes and ran 40,000 pages. No coder changed a single line without specs carefully outlining the change.

The culture was aggressively anti-heroic. No superstar programmers. No all-night coding sessions. An 8-to-5 workplace with adults who had spouses and kids. The process was explicitly designed not to rely on any individual. Creativity was channelled into improving the process, not into clever code. Development and verification teams reported to separate bosses under opposing marching orders: developers tried to deliver error-free code; verifiers tried to find every flaw. The result: 85% of errors caught before formal testing, 99.9% caught before delivery to NASA.

The lesson is simultaneously inspiring and depressing. It shows that near-perfect software is achievable. It also shows the cost: 40,000 pages of specs, 260 people, years of institutional refinement, and a culture that most software organisations would find intolerable. The shuttle group exists because the stakes — human lives, $4 billion spacecraft — justify the investment. For most software, they don't. And that's the problem.

Model-Based Design

Bret Victor, Eric Bantégnie, and the French aerospace establishment converged on the same diagnosis from different directions: the problem is that code is too hard to think about. The medium for describing what software should do (conversations, prose, drawings) is too different from the medium describing what it does do (code). Too much is lost in translation.¹

Model-based design attempts to close that gap. Instead of writing code directly, you create a flowchart-like model describing the rules your system should follow, and the computer generates code from those rules. For an elevator control system, you'd draw boxes representing states ("door open," "moving," "door closed") and lines defining transitions between them. The diagrams make the system's rules obvious: just by looking, you can see that the only way to get the elevator moving is to close the door.

The pioneers here came from French nuclear and aerospace. Emmanuel Ledinot at Dassault Aviation started in 1988, when military avionics systems were experiencing "bug inflation" — the onboard computers had multiplied, and coordinating them required handling thousands of possible events in the right order at the right time. Writing this coordination code by hand was unsustainable. Gerard Berry at INRIA developed Esterel, a language based on the insight that traditional programming languages are fine for sequential recipes but terrible for reactive systems where events can happen at any time in any order.

The model-based approach has been in production for over two decades in aerospace and automotive. Every Airbus and many military jets use it. But it hasn't spread much beyond safety-critical domains, partly because building good modelling tools is itself an enormous upfront investment ("20 years of initial background work," Bantégnie says), and partly because most programmers like code. At least they understand it.

The Paradox Of Automation Applied to Software

Tim Harford's retelling of Air France Flight 447 brings the paradox of automation into sharp focus.³ The Airbus A330's fly-by-wire system was so good at protecting pilots from errors that when the airspeed sensors iced over and the automation downgraded itself, the pilots had almost no experience flying the plane manually at altitude. Of Captain Dubois's 346 hours at the controls in the past six months, only four had been in manual control. The pilots stalled the aircraft — a manoeuvre that every student pilot knows how to recover from — and couldn't diagnose what was happening because they had never had to.

Earl Wiener's law captures it: "Digital devices tune out small errors while creating opportunities for large errors." The automatic system accommodates incompetence by being easy to operate. Even expert operators lose their skills through disuse. And the system tends to fail in unusual situations requiring the most skilful response. This is directly analogous to what happens with software. Sophisticated frameworks, IDEs, and type systems catch small errors routinely, which means developers rarely exercise their ability to reason about the system's actual behavior. When something goes wrong at a level the tools can't catch — a requirements misunderstanding, an architectural flaw, an emergent interaction between subsystems — the developers are unprepared.

The proposed fix for aviation — have the computer monitor the human rather than the other way around, since computers are tireless and don't need practice — hasn't translated well to software development. The closest analogues might be continuous integration, property-based testing, and formal verification, all of which use automation to check human work rather than replace it. But none of them address the fundamental problem that Leveson and Victor both identify: we can't keep these systems correct because we can't understand what they're doing. The code is a poor representation of the system's behavior, and staring at more of it doesn't help.

The Soft Decline

Jonathan Blow's provocation is blunter than any of these: software is already in decline, we just can't see it because hardware keeps getting faster.⁴ The improvements we celebrate — machine learning, face filters, AlphaGo — are tiny bubbles of genuine advancement riding on decades of hardware progress. The vast bulk of software, the part that loads bitmaps and responds to input events and manages windows, is getting worse. Blow started screenshotting every bug he encountered in daily use and filled a gallery in days without trying.

His sharpest observation is about the disappearance of "five nines" rhetoric. In the 1990s, selling enterprise software meant promising 99.999% uptime. Nobody makes that promise anymore — partly because the number of nines would be going down, and partly because we've collectively stopped caring. His laptop rebooted itself twice during sleep mode in one week, which alone drops it below three nines. Nothing running on a three-nines machine can itself be four or five nines. We haven't just lost quality; we've lost the vocabulary for demanding it.

The mechanism Blow identifies is generational knowledge loss — the same dynamic that cost the Romans their nanotechnology glass and the Byzantines their Greek fire. Each layer of abstraction that "saves effort" also removes capability. Programmers who grew up writing C# snippets for Unity have never built anything systemic or low-level, which is fine individually but catastrophic collectively: Unity and Unreal were built by people who came from a culture of writing engines from scratch, and that pipeline is drying up. The parallel to the Bronze Age collapse is deliberate — interconnected civilizations that looked healthy from inside degraded over a century before anyone noticed they couldn't recover. "If you're at the beginning of that collapse," Blow says, "in the first twenty years, you might think, well, things aren't as good as they were twenty years ago, but it's fine."

The web programmers' defense — "we could make software better but the market won't pay for it" — gets a pointed rebuttal: if you haven't seen an entire industry produce robust software for decades, what makes you think they actually can? The knowledge of how to make things less buggy may already be lost, replaced by trivia about toggling booleans in Unity panels. Deep knowledge (how cache coherency works) gets crowded out by shallow workarounds, and the harder something is to understand, the higher the probability that the available explanations are wrong. Complexity propagates and magnifies.

What Would Help

The various approaches to software correctness at scale — the shuttle group's exhaustive process, model-based design, formal verification as explored in Compiler Correctness, Rust's type-system-as-safety-net, Liveness In Programming as advocated by Victor — all share a common thread. They try to reduce the gap between the programmer's mental model and the system's actual behavior. The shuttle group closes the gap through obsessive documentation. Model-based design closes it by making the model be the code. Rust closes it by making the compiler reject programs whose behavior can't be statically verified. Victor closes it by making the behavior directly visible and manipulable.

None of these is a silver bullet. The shuttle group's approach doesn't scale to web services that ship daily. Model-based design works for state machines but not for the kind of emergent behavior that makes social media algorithms or LLM-powered applications so unpredictable. Rust catches memory bugs but not logic bugs. Victor's prototypes are inspiring but haven't been productized for anything beyond education.

The honest assessment is that we're building systems beyond our ability to verify, and we know it, and we're doing it anyway because the economic incentives are overwhelming. The question isn't whether the next catastrophic software failure will happen, but whether it will be bad enough to change how we work. The shuttle group changed because astronauts' lives were at stake. Toyota changed because juries awarded millions in damages. Most software organisations will change only when forced to, and the forcing function has not yet arrived for most of the industry.

The Coming Software Apocalypse by James Somers — source ↩ ↩² ↩³
They Write the Right Stuff by Charles Fishman — source ↩
Crash: how computers are setting us up for disaster by Tim Harford — source ↩
Preventing the Collapse of Civilization by Jonathan Blow — source ↩

Linked from

Liveness In Programming
The Software Correctness At Scale problem is, in many ways, a liveness problem: we can't keep complex systems correct because we can't see what they're doing.
Maps All The Way Down
Software Correctness At Scale: the gap between the map (the code) and the territory (the system's behavior) is where all software failures live.
Software As Infrastructure
"The class of systems that we can build at all may be larger than the class of systems we can build elegantly, at least at first." This connects directly to Software Correctness At Scale — the gap between what we can build and what we can build well …
Software Engineering Overview
Software Correctness At Scale frames the problem: 100 million lines of code in cars, 10 million failure paths in Toyota's onboard computer, 911 systems brought down by a single counter overflow.

Open in stacked reader →