Compiler Correctness

We are building life-critical systems using tools that demonstrably contain silent wrong-code bugs. We are comfortable doing so because we test these systems heavily, because this methodology has worked in the past, and because we don't have a better alternative. That's the situation as John Regehr described it in 2010, and it hasn't fundamentally changed.¹

The Testing Treadmill

Compiler correctness through usage works like this: you ship a compiler, people use it for years, they file bugs, you fix them, and gradually you converge on something that works for the kinds of programs people actually write. This relies on a variant of the locality principle — future programs resemble past programs, so if past translations were correct, future ones probably will be too. "Probably." Nobody can say what that word means precisely, which is what keeps developers and testers up at night.¹

Regehr's group spent years doing random testing of C compilers — generating random conforming C programs, compiling them with different compilers or flags, and checking whether the computation retained its meaning across all versions. They found silent wrong-code bugs in every compiler they tested: GCC, Clang/LLVM, Intel CC, commercial embedded compilers used for safety-critical systems. 270 crash and wrong-code bugs total, at the time of writing. The bugs cluster in the middle-ends and back-ends — optimizer bugs that require fairly large test cases (around 10 lines) and are effectively inexhaustible because the space of 10-line C programs is, for practical purposes, infinite.¹

The prediction that should concern everyone in compiler development: it is almost certainly impossible to create any fixed test suite that can find all instances of optimizer bugs in traditional compilers. The combinatorial explosion of code patterns, optimization passes, and their interactions means that bugs will always lurk in untested corners.

CompCert: The Gold Standard

Against this background, CompCert is remarkable. It's a verified optimizing C compiler built by Xavier Leroy's team at INRIA, with a machine-checkable proof that it correctly translates a low-level C-like language to PowerPC or ARM assembly. The compiler generates code comparable to GCC at -O1, applies most standard optimizations, and compiles at reasonable speed.¹

When Regehr's team turned their adversarial random tester against CompCert — a tool designed to create maximally evil C code that had found previously unknown silent wrong-code bugs in every other compiler tested — the results were striking. They found no errors in CompCert's PowerPC specification, C-light specification, or proof strategy. The bugs they did find were qualitatively different: three were in the unverified front end (typechecker bugs), and three involved incorrect assembly emission where numbers overflowed fixed-size instruction fields. The front-end bugs were in an early phase of the compiler and could be found with very simple test cases — the kind of bugs that get caught quickly in mature compilers. The deep, subtle optimizer bugs that plague GCC and LLVM? CompCert had none.¹

This is the real payoff of verification: it doesn't eliminate the need for testing, but it changes the tester's job and makes it easier. Verification eliminates the complicated optimizer bugs, leaving only the shallow front-end bugs that a good test suite can exhaustively cover. The two techniques are complementary, not competing.

The Future Stack

Regehr's vision for the next 25 years of compiler development rests on several predictions, and they're aging well.¹

First, no compiler will be considered trustworthy without extensive random testing. This seems obvious now — fuzzing has become standard practice — but in 2010 it was a stronger claim. Second, practical compiler verification work will converge around a small number of IRs. LLVM is the obvious candidate. If a verified subset of LLVM passes could be assembled, different teams could independently create verified backends for different architectures, with far less effort than CompCert required from scratch. The dream is a toolkit of proof-producing optimization passes that can be composed into verified compilers.

Third — and this is the most interesting prediction — testing and verification will merge. The path to trustworthy compilers runs through hybrid approaches: using verification to eliminate whole classes of bugs, then using testing to cover the gaps. A verified compiler dropped into a typical safety-critical development effort wouldn't eliminate much system testing directly — maybe 30% at most. But it would change the character of the remaining testing effort, making it less adversarial and more targeted.¹

Superoptimization via SAT

On the optimization side, the most provocative idea is that decision procedures — SAT and SMT solvers — could eventually replace a significant fraction of hand-written optimization passes. A peephole superoptimizer uses a SAT solver to answer: "Given this sequence of instructions, is there a shorter sequence with the same effect?" The existing results are modest, mostly rediscovering known optimizations. But Regehr argues that operating on small subgraphs of the program dependency graph (rather than linear instruction sequences), at the level of LLVM IR (rather than target assembly), and harnessing formal specifications when available, could yield a superoptimizer that subsumes perhaps half of LLVM's current optimization passes.²

The appeal is maintainability as much as performance. Hand-written optimization passes are some of the buggiest code in any compiler — LLVM's instruction combiner was the single largest C++ file in the project before being refactored. A superoptimizer that automatically discovers the same transformations, with a proof witness that each transformation is correct, would be both more reliable and easier to maintain. It would also keep getting stronger over time, learning new tricks as the SAT solver improves.²

The anti-predictions are equally interesting: machine learning in compilers will not be fundamental (just a way to get fast compile times for what you could achieve by running all passes to fixpoint), and the distinction between JIT and AOT compilation will cease to seem important as both draw from the same toolkit of passes around shared IRs.²

The Program Synthesis Connection

Superoptimization is really a special case of program synthesis — automatically generating programs from specifications. The CEGIS (counterexample-guided inductive synthesis) loop that drives modern synthesis tools is essentially the same verify-then-refine loop that Regehr describes for verified compilation: generate a candidate, check it against a specification, and if it fails, use the counterexample to guide the next attempt.¹

The convergence of verification and optimization that Regehr predicts is, in a deeper sense, a convergence of compiler construction with the Curry-Howard correspondence. If types are propositions and programs are proofs, then a verified optimization pass is a proof that two programs have the same meaning. The question is whether we can make those proofs cheap enough to generate and check that they become part of the normal compilation pipeline rather than a special academic exercise. CompCert suggests the answer is yes, at least for the critical core of an optimizing compiler. The remaining question is one of engineering: can we build the infrastructure to make it practical at scale?

Proof Assistants as Thinking Tools

The most striking recent data point for verification came not from compiler construction but from pure mathematics. In 2020, Peter Scholze — one of the world's top mathematicians — challenged the Lean proof assistant community to formalise a theorem from his work on condensed mathematics. The proof was so complex that Scholze himself wasn't confident it was correct: he "was essentially unable to keep all the objects in RAM." Six months later, the critical theorem was verified.³

What's interesting for compiler people isn't the verification itself — it's how it happened. The formalisers didn't write a blueprint and then encode it in Lean. They attacked the proof directly, and Lean served as a navigation tool through the jungle of intermediate goals. Johan Commelin, who led the effort, told Scholze that when formalising Theorem 9.4, he could "with the help of Lean — really only see one or two steps ahead, formalize those, and then proceed to the next step." The human-readable blueprint was written after the Lean proof, not before. The proof assistant wasn't checking work that humans had already done. It was helping humans do work they couldn't do unaided.³

This inverts the usual narrative about formal verification. The standard pitch is "write the code, then prove it correct." The Liquid Tensor Experiment suggests a different workflow: use the proof assistant to explore, letting it manage the complexity that exceeds human working memory, and extract the human-readable version afterwards. If this pattern generalises — and there's reason to think it will as tools like Lean 4 and its mathlib library mature — the bottleneck for verified software may shift from "proving is too hard" to "we haven't tried."³

The Future of Compiler Correctness by John Regehr — source ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸
The Future of Compiler Optimization by John Regehr — source ↩ ↩² ↩³
Half a year of the Liquid Tensor Experiment by Peter Scholze — source ↩ ↩² ↩³

Linked from

C Abstract Machine
The deeper question is whether patching C with hardware features (CHERI), type systems (Substructural Type Systems), or formal verification (Compiler Correctness) is the right approach, or whether we should be designing languages and hardware togethe…
Program Synthesis
The immediate practical applications are exactly the kind of small, unintuitive programs that humans write badly: bit-twiddling hacks from Hacker's Delight, peephole optimizations in compilers, deobfuscation of deliberately obscured code.
Program Synthesis
The connection to verified compilation is direct: a superoptimizer is just a synthesizer whose specification is "produce a program equivalent to this one, but faster." Regehr's prediction that SAT-based superoptimizers could replace half of LLVM's op…
Programming Languages Overview
The convergence of compiler correctness and proof assistants suggests a future where verified compilation is derived from verified interpretation through reflective towers — write the interpreter, collapse it into a compiler, and the verification fol…
Programming Languages Overview
Compiler Correctness offers the constructive response: CompCert proved that verified compilation is practical, and random testing found zero optimizer bugs in its verified core while finding hundreds in GCC and LLVM.
Reflective Towers
The collapse result also connects to compiler correctness.
Software Correctness At Scale
The various approaches to software correctness at scale — the shuttle group's exhaustive process, model-based design, formal verification as explored in Compiler Correctness, Rust's type-system-as-safety-net, Liveness In Programming as advocated by V…
Software Engineering Overview
To Compiler Correctness and the PL section through the verification/testing spectrum.

Open in stacked reader →