Goodnight Wiki / Compiler Bootstrapping

Compiler Bootstrapping

Most compilers are written in the language they compile. This is a chicken-and-egg problem that every self-hosted language has solved at least once, and the solutions reveal something fundamental about the nature of trust in software.

The Chain of Trust

If rustc is written in Rust, you need rustc to compile rustc. Version 1.80.0 was compiled with 1.79.0, which was compiled with 1.78.0, all the way back to version 0.7 when the compiler was written in OCaml. So you "just" need an OCaml compiler. But OCaml is written in OCaml. And GCC is written in C++ (though it was C until version 5). Turtles all the way down.1

This chain matters because of Ken Thompson's "Reflections on Trusting Trust" -- the 1984 Turing Award lecture that laid out how a self-hosting compiler can harbor invisible backdoors. The attack is devastatingly simple: modify the compiler to detect when it's compiling itself, and inject the backdoor-injection code into the output. Now the backdoor propagates through every future compilation, even after the malicious source is removed. The binary carries the infection; the source looks clean.2

Manish Goregaokar actually implemented this attack against the Rust compiler as an exercise, modifying rustc's AST-folding machinery to replace "hello world" strings with Marathi text, then adding a self-reproducing module that injects itself when the compiler compiles itself. It works. The whole thing is a quine -- a program that outputs its own source code -- except the output happens to be a backdoor in a compiler binary.2

The Bootstrappable Builds Project

The radical response to Thompson's attack is the Bootstrappable Builds project, which asks: can we build an entire operating system from a seed small enough to manually verify?

Their Linux bootstrap process starts with a 512-byte binary. This is possibly the simplest compiler imaginable: it reads hexadecimal digits and outputs the corresponding raw bytes. The "source code" looks like this:

31 C0           # xor ax, ax
8E D8           # mov ds, ax
FC              # cld ; clear direction flag

From this seed, they bootstrap a very simple OS, a barebones shell, a slightly better compiler. That compiler compiles a slightly better compiler. Eventually you have something that looks like assembly. Then a basic C subset. Then TinyCC. Then coreutils, Bash, autotools, GCC, and finally Linux.1

Every step in this chain is small enough that a human could plausibly verify it. The trust model is fundamentally different from "download a binary from the internet and hope nobody tampered with it." It's reconstruction from first principles.

Zig's WebAssembly Solution

Zig took an unconventional approach that I find genuinely clever. The problem: Zig is self-hosted, so you need a Zig compiler to compile Zig. The old solution was 80,000 lines of C++ that implemented enough of Zig to bootstrap the real compiler. Nobody wanted to maintain it, and new language features had to be implemented twice.3

Andrew Kelley's insight was to use WebAssembly as a portable bootstrap format. They compile the Zig compiler (with only the C backend enabled) to a wasm32-wasi binary -- 2.4 MB, compressed to 637 KB with zstd. This binary is committed to the repository. To build from source, you first compile a 4,000-line C program (a minimal WASI interpreter), use it to run the wasm binary, which translates the Zig source to C, which you compile with your system C compiler.3

The result: instead of 80,000 lines of C++ that duplicate the compiler, there are 4,000 lines of portable C that implement a WebAssembly interpreter. The wasm binary is target-independent (it's a VM bytecode), so any platform with a C compiler can bootstrap Zig. No LLVM, no C++, no platform-specific binaries.

The further optimization was realizing you don't even need to interpret the wasm -- you can compile it to C. Their custom wasm2c tool converts the wasm binary to C source, which the system C compiler then compiles to a native binary. This is effectively JIT compilation, using the C compiler as the code generator.

The Dozer Project

Meanwhile, the Dozer project represents an even more extreme position: writing a Rust compiler in pure C (no C++, no flex, no yacc, not even a Makefile) specifically so that Rust can be bootstrapped from TinyCC earlier in the bootstrap chain. Currently, Rust enters the Bootstrappable Builds process very late because it requires mrustc (a C++ Rust implementation) and therefore depends on C++ being available.1

The project uses QBE as a backend -- a small C compiler backend that claims 70% of LLVM's optimization quality in 10% of the code. Brian Callahan's explorations with QBE IR show it's a surprisingly pleasant target: strict enough to be useful (SSA form, explicit types), lax enough to be approachable, and the entire thing is written in C and small enough to vendor.4

SectorC: A C Compiler in a Boot Sector

The Bootstrappable Builds project starts with a 512-byte hex-to-binary translator and builds up. xorvoid's SectorC goes the other direction: how much of C can you fit into 512 bytes? The answer turns out to be surprisingly much — global variables, functions, if/while blocks, a dozen operators, pointer dereference, inline assembly, and comments. Enough to write a program that animates a moving sine wave on a VGA display.5

The central trick is treating atoi() as a hash function. A conventional C tokenizer needs to recognize keywords, identifiers, operators, and numbers — just the lexer alone would exceed the 510-byte instruction budget. But if every token is space-delimited (borrowing from Forth), and you run each token through atoi(), the result serves triple duty: integer literals produce their numeric value, keywords produce deterministic "enum" values (since atoi() on "if" always returns the same thing), and identifiers produce hash values that index into a 64K symbol table. One function replaces an entire tokenizer.

The resulting language looks like legal C but reads like something else entirely: int(main)(){while(!done){ is a single mega-token. The spaces are placed strategically to create a "Barely C" that a standard C compiler would also accept — it's simultaneously valid C and a different language with a radically simpler grammar.

The first implementation fit in 468 bytes. Aggressive minimization — fall-through instead of jumps, tail calls via jmp, fusing consecutive call tok_next pairs, preferring cmp ax,imm over cmp bx,imm for shorter encoding — brought it to 303 bytes. The remaining 207 bytes bought enough room for the full feature set. There's a Forth-style byte-threaded code variant that didn't pay off (the overhead of 2-byte alignment and ret instructions ate the savings), but the straight-through recursive-descent parser proved surprisingly compact.

SectorC is a different kind of bootstrapping artifact than the ones above. It doesn't sit in any bootstrap chain — it's a freestanding compiler that runs on bare metal, taking input from a serial port and emitting x86-16 machine code. But it demonstrates the same principle that drives the Bootstrappable Builds project: a C compiler is a smaller thing than you think it is. The gap between "impossible" and "done" is often just one clever insight about what you can throw away.

What Bootstrapping Reveals

The bootstrapping problem is really a question about the minimum viable trust boundary. Every self-hosted language contains an implicit claim: "you can trust the previous version of me." The Bootstrappable Builds project says that's not good enough -- trust should be traceable back to something a human can verify.

This connects to broader themes in supply chain trust: bunnie huang's observation that Rust's curl | sh installation model and expansive crate dependency trees create a large attack surface, even in a memory-safe language. The compiler itself is just one link in the chain. The build system, the package manager, the dependencies, the CI pipeline -- each is a potential Thompson attack vector.

Zig's wasm approach and Dozer's C-from-scratch approach represent different points on the tradeoff curve between convenience and verifiability. What they share is the conviction that the bootstrap path matters -- that how your tools come into existence is a security-relevant question, not just a historical curiosity. For the broader picture of bootstrapping from bare metal — GCC cross-compilation chains, portable executables, booting from unusual media — see Bootstrapping And Trusting Trust.

Footnotes

  1. Why am I writing a Rust compiler in C? by John Nunley -- source 2 3

  2. Reflections on Rusting Trust by Manish Goregaokar -- source 2

  3. Goodbye to the C++ Implementation of Zig by Andrew Kelley -- source 2

  4. Let's get hands-on with QBE by Brian Robert Callahan -- source

  5. SectorC: A C Compiler in 512 bytes by xorvoid -- source

Open in stacked reader →