DNA as Code

Bert Hubert is a computer programmer who looked at DNA and saw something familiar. Not metaphorically — literally. The genome, he argues, has position-independent code, conditional compilation, dead code, dependency hell, fork bombs, virus scanners, and a central dogma that works almost exactly like the .c -> .o -> a.out compilation pipeline. His essay "DNA seen through the eyes of a coder" started as "rambling" in 2001, but two decades and several revisions later, it remains one of the most insightful comparisons between biological and computational systems that I've encountered.¹

The analogy works because it's not forced. DNA really does share structural properties with code — not because someone designed it that way, but because both are information-processing systems that evolved (one by natural selection, the other by decades of engineering) to solve similar problems: storage, error correction, modularity, and the management of overwhelming complexity.

The Source Code

The human genome is about 3 gigabases long. At two bits per base, that boils down to roughly 750 megabytes. "Depressingly enough," Hubert notes, "this is only 2.8 Mozilla browsers." The language is digital but not binary. Where binary has two symbols (0 and 1), DNA has four: T, C, G, and A. A DNA "byte" (codon) has three digits, giving 4 × 4 × 4 = 64 possible values. Of these, 61 encode the 20 amino acids (with redundancy) and three are stop signals.¹
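The arithmetic behind those figures is easy to check. Here is a minimal sketch in Python. The constant names are my own, and it assumes a plain two-bits-per-base encoding with no compression and decimal megabytes.

```python
from itertools import product

BASES = "TCGA"                 # the four symbols of the DNA alphabet
BITS_PER_BASE = 2              # log2(4): four symbols fit in two bits
GENOME_BASES = 3_000_000_000   # about 3 gigabases, as quoted above

genome_bytes = GENOME_BASES * BITS_PER_BASE // 8
genome_megabytes = genome_bytes / 1_000_000   # decimal megabytes

# Every three-letter word over a four-letter alphabet: 4**3 codons.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(genome_megabytes)  # 750.0
print(len(codons))       # 64
```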

But DNA isn't like C source. It's more like byte-compiled code for a virtual machine called "the nucleus." There's no higher-level source from which it was compiled. What you see is all you get. This is a codebase billions of years old with no comments, no documentation, and no original developers to ask.

Compilation and Conditional Execution

The "Central Dogma" of molecular biology maps cleanly onto compilation: DNA is used to make RNA, RNA is used to make proteins. That's .c -> .o -> a.out. And like any good Central Dogma, it's been tarnished by hacking — sometimes RNA patches the DNA, and proteins modify the DNA that created them. But the main direction of information flow holds.¹
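The two-stage pipeline can be sketched as a pair of functions. This is a toy, not a model of the real machinery. It assumes the input is the coding strand, so transcription reduces to swapping T for U. It ignores splicing and regulation, and the function names are mine. The codon table is the standard genetic code, written in the usual compact TCAG ordering.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
# "*" marks a stop signal.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """DNA -> RNA: coding strand with T replaced by U (the .c -> .o step)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """RNA -> protein: read codons until a stop signal (the .o -> a.out step)."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCCGATTAA"   # made-up sequence: Met, Ala, Asp, stop
protein = translate(transcribe(gene))
print(protein)  # MAD
```

The table also makes the redundancy visible: 64 keys map onto 20 distinct amino acids plus the stop marker.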

Of the roughly 20,000-30,000 genes in the human genome (that number is still debated), most cells express only a small fraction. A liver cell doesn't need neuron code. So the genome is full of conditional compilation — #ifdef statements that silence irrelevant genes. Cells are state machines that start as stem cells and progressively specialize, each specialization choosing a branch in a decision tree. The decisions persist across cell division through transcription factors and spatial modifications to how DNA is stored (steric effects). A liver cell carries the genes to be a skin cell but can't generally use them. The code has been #ifdefed out.¹
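As a rough sketch of the state-machine idea, the following Python models a cell that carries the whole genome but only expresses genes whose guard matches its current type. The gene names, cell types, and the one-step decision tree are all hypothetical, and real differentiation has many more stages and is not a simple lookup.

```python
# Hypothetical gene names and cell types, for illustration only.
GENOME = {
    "housekeeping_gene": {"stem", "liver", "skin", "neuron"},
    "liver_enzyme": {"liver"},
    "skin_keratin": {"skin"},
    "neuron_channel": {"neuron"},
}

# Allowed specialization steps: a one-way decision tree rooted at the stem cell.
TRANSITIONS = {
    "stem": {"liver", "skin", "neuron"},
    "liver": set(),
    "skin": set(),
    "neuron": set(),
}


class Cell:
    def __init__(self):
        self.cell_type = "stem"   # all branches still open
        self.genome = GENOME      # every cell carries the full genome

    def differentiate(self, target):
        """Take one branch of the decision tree; there is no way back."""
        if target not in TRANSITIONS[self.cell_type]:
            raise ValueError(f"{self.cell_type} cell cannot become {target}")
        self.cell_type = target

    def expressed_genes(self):
        """Genes whose guard includes the current cell type (the #ifdef check)."""
        return sorted(gene for gene, enabled_in in self.genome.items()
                      if self.cell_type in enabled_in)


cell = Cell()
cell.differentiate("liver")
print(cell.expressed_genes())          # ['housekeeping_gene', 'liver_enzyme']
print("skin_keratin" in cell.genome)   # True: present, but switched off
```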

This is why stem cells are so interesting to both biologists and programmers: they're the cells where nothing has been compiled out yet. All paths are still open.

Epigenetics: Runtime Binary Patching

The most programmer-friendly analogy in Hubert's essay is epigenetics. The Linux kernel, at boot time, discovers what CPU it's running on and actually patches its own binary code — not source, not configuration, but the running instructions in memory — to disable locking on single-CPU systems. DNA does the same thing. As an embryo develops, its genome is edited substantially: methyl groups attach to DNA to flip activation switches, histones curl up regions so they're not read.¹

Some of these edits are heritable — passed to children — which means the "binary patches" can persist across generations without altering the underlying source code. The metabolic status of parents can influence the cancer and diabetes risks of their grandchildren. The genome is more dynamic than the Central Dogma suggests, and the field is still developing rapidly.
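One way to picture this is a genome object whose sequence never changes while a separate set of marks decides what gets read. The sketch below is a deliberate simplification. The class, its methods, and the gene names are hypothetical, and it copies every mark to the child, whereas in reality only some marks are inherited.

```python
class Genome:
    def __init__(self, genes, methylated=None):
        self.genes = tuple(genes)                # the "source": never modified
        self.methylated = set(methylated or ())  # epigenetic marks, the "patches"

    def methylate(self, gene):
        """Silence a gene by marking it, leaving the sequence untouched."""
        if gene not in self.genes:
            raise KeyError(gene)
        self.methylated.add(gene)

    def active_genes(self):
        """Genes that are still readable given the current marks."""
        return [gene for gene in self.genes if gene not in self.methylated]

    def replicate(self):
        """Copy the sequence and (in this toy model) all of the marks."""
        return Genome(self.genes, self.methylated)


parent = Genome(["gene_a", "gene_b", "gene_c"])   # hypothetical gene names
parent.methylate("gene_b")
child = parent.replicate()

print(child.genes == parent.genes)   # True: same underlying sequence
print(child.active_genes())          # ['gene_a', 'gene_c']
```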

Dead Code, Comments, and Junk DNA

About 97% of your DNA is "commented out." The genome is read linearly, and the parts that code for proteins (exons) are separated by non-coding regions (introns). Introns are transcribed along with the exons and then physically spliced out of the RNA before it is translated. Introns have start markers (GT, analogous to /*) and end markers (AG, analogous to */), plus a branch site near the end that signals the splice machinery where to cut.¹
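The comment-stripping step can be sketched in a few lines of Python. This is a toy under stated assumptions. The transcript is written in DNA letters, introns are pre-annotated in lowercase (a common display convention), and the function only checks the GT and AG boundary markers before dropping them. Real splice-site recognition depends on much more context than two letters at each end, and the sequence here is made up.

```python
import re


def splice(pre_mrna):
    """Remove lowercase intron runs after checking their GT...AG boundaries."""
    def drop_intron(match):
        intron = match.group(0)
        if not (intron.startswith("gt") and intron.endswith("ag")):
            raise ValueError(f"intron lacks GT...AG boundaries: {intron}")
        return ""   # the intron is cut out; the flanking exons are joined

    return re.sub(r"[a-z]+", drop_intron, pre_mrna)


# Made-up transcript: exon, intron (lowercase), exon.
transcript = "ATGGCC" + "gtatgttttctaacttttttcag" + "GATTAA"
mature = splice(transcript)
print(mature)  # ATGGCCGATTAA
```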

The genome is also littered with pseudogenes: old copies of genes and failed experiments from the last half-million years. This is dead code, carried along but never executed.

Whether "junk DNA" is truly junk has been debated for decades. Some introns show fewer mutations than neighboring exons, suggesting they're doing something important — perhaps related to how DNA folds for storage (analogous to run-length limiting on magnetic media, where spacer data is inserted to ensure reliable physical encoding). Twenty years after Hubert first wrote about this, the debate isn't settled. "It is now somewhat consensual that 'Junk DNA' has important and diverse functions, but new discoveries are being made on a daily basis."¹

Fork Bombs, Viruses, and Security

Cells reproduce by forking, not spawning, just like Unix processes. And just like Unix, uncontrolled forking is catastrophic. A cell that keeps dividing becomes a tumor. The genome is riddled with safeguards: telomere shortening limits the number of divisions, checkpoint proteins verify conditions before each split, and tumor suppressors act as watchdogs. It's a "secure by default" configuration. Cancer happens when several of these safeguards fail in the same cell.¹
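The telomere and checkpoint safeguards amount to a fork limit plus a precondition check. Here is a minimal Python sketch. The class name, the telomere counter, and the single damage flag are all hypothetical simplifications of what is really a large network of controls.

```python
class DividingCell:
    def __init__(self, telomere_units=3, damaged=False):
        self.telomere_units = telomere_units   # remaining division budget
        self.damaged = damaged                 # stand-in for a failed checkpoint

    def can_divide(self):
        """Both safeguards must pass: budget left and checkpoint clean."""
        return self.telomere_units > 0 and not self.damaged

    def divide(self):
        """Fork into two daughters, each with a shorter telomere."""
        if not self.can_divide():
            return []
        remaining = self.telomere_units - 1
        return [DividingCell(remaining, self.damaged),
                DividingCell(remaining, self.damaged)]


def grow(founder):
    """Let every cell divide until the safeguards stop all of them."""
    population = [founder]
    while any(cell.can_divide() for cell in population):
        next_generation = []
        for cell in population:
            daughters = cell.divide()
            next_generation.extend(daughters if daughters else [cell])
        population = next_generation
    return population


print(len(grow(DividingCell(telomere_units=3))))                # 8
print(len(grow(DividingCell(telomere_units=3, damaged=True))))  # 1
```

With a budget of three divisions the population stops at 2³ = 8 cells. Removing both checks from can_divide is the toy equivalent of a fork bomb, since the loop would then never end.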

Hubert draws a parallel to the Halting Problem: "Perhaps it is as impossible to predict if a program will ever finish as it is to create a functional genome that cannot get cancer?" This is speculative but suggestive. The fundamental tension between allowing useful computation (cell division for growth and repair) and preventing runaway processes (tumors) may reflect a deep computational constraint rather than a fixable design flaw.

Biological viruses are literally what a programmer would design if asked to "compromise the genome and insert code that copies itself to other genomes." They've been doing it for millions of years, and many have become permanent parts of our genome — hitchhiking along in every cell. The cell's immune system is a virus scanner, and like all virus scanners, it's in a perpetual arms race with new threats.

Why This Analogy Matters

The DNA-as-code metaphor has limits. Biological systems are massively parallel and fault-tolerant in ways that conventional software isn't. Gene regulation involves analog gradients, not just digital switches. And evolution doesn't have a design process — there's no programmer, no specification, no test suite.

But the analogy illuminates something important: the problems that life has had to solve (reliable storage, error correction, modularity, access control, version compatibility) are the same problems that software engineering struggles with. And life has had three billion years of continuous deployment to refine its solutions. The cluttered APIs that Hubert describes, where a protein that interacts with many other proteins can't evolve because any change breaks everything that depends on it, will be immediately recognizable to anyone who has maintained a large codebase. The genome isn't well-designed code. It's a legacy system billions of years old that works despite, and sometimes because of, its accumulated cruft.

The connection to information and computation is direct: DNA is a physical information processing system, and understanding it requires both biological and computational thinking. Hubert's contribution is making that bridge accessible — not by dumbing down the biology, but by showing that the right programming concepts make the biology more legible.

¹ Bert Hubert, "DNA seen through the eyes of a coder."
