Goodnight Wiki / Information and Computation

Information and Computation

There's a thread running through twentieth-century science that connects Gödel's incompleteness theorem, Shannon's information theory, Turing's halting problem, and the second law of thermodynamics. It's the discovery that information is physical — not a metaphor for physical processes, but a quantity that obeys conservation laws, generates heat when destroyed, and sets hard limits on what can be known, computed, or proved. Getting clear on this thread changes how you think about everything from mathematics to consciousness.

Chaitin's Omega and the Weight of Axioms

Gregory Chaitin's information-theoretic proof of Gödel's theorem is, I think, the most illuminating version. The traditional proof constructs a self-referential sentence ("This statement is unprovable") and derives a paradox. Chaitin's version says something more direct: if a theorem contains more information than the axioms, you cannot derive it from those axioms. A ten-pound theorem cannot follow from five pounds of axioms.1

The key idea is algorithmic information content — the length of the shortest program that produces a given output. Most strings are incompressible: there's no shorter description than the string itself. Random strings are, by definition, those with maximal algorithmic information content. And here's the punch line: within any formal axiomatic system of complexity N bits, you cannot prove that any specific string has algorithmic information content greater than N, even though almost all strings do.1

This means Gödel's incompleteness isn't a weird edge case — it's the normal situation. Mathematical truth is overwhelmingly composed of facts that are true but cannot be derived from any finite set of axioms. The unprovable truths aren't pathological; they're the majority. Chaitin's number Omega — the halting probability of a universal computer with random input — encodes all constructive mathematical truth in a single real number, but its digits are algorithmically random. Knowing N bits of Omega gives you exactly N bits' worth of mathematical knowledge. There are no shortcuts, no compression, no free theorems.

What Chaitin draws from this is surprisingly practical: mathematics should behave more like physics. Axioms should be adopted pragmatically, based on their "fruitfulness in consequences," not their self-evidence. Gödel himself advocated exactly this position — that new axioms might be accepted "in the same sense as any well-established physical theory" if they yield enough verifiable consequences.1

Landauer's Principle: Forgetting Costs Energy

If Chaitin shows that information constrains what we can know, Rolf Landauer shows that information constrains what we can do physically. In 1961, Landauer argued that erasing one bit of information — resetting a binary digit regardless of its current state — must dissipate at least kT ln 2 joules of energy as heat, where k is Boltzmann's constant and T is the ambient temperature. At room temperature, that's about 3 × 10⁻²¹ joules. Tiny, but nonzero, and unavoidable.2

"Erasing information compresses two states into one," as Eric Lutz puts it. "It is this compression that leads to heat dissipation." In 2012, Lutz and colleagues confirmed this experimentally by watching a microscopic silica bead in a laser trap, measuring the energy released as they reset the bead's position. As they used longer switching cycles, the dissipation approached Landauer's predicted minimum from above — you can't go lower.2

This result is the final nail in Maxwell's demon. The nineteenth-century thought experiment imagined a tiny intelligent being sorting fast and slow molecules to create a temperature difference, seemingly violating the second law of thermodynamics. The resolution: the demon must store information about which molecules to sort, and eventually it must erase that information to free up memory for the next batch. That erasure generates entropy, more than compensating for the entropy the demon eliminated. Information processing is thermodynamically irreversible at the fundamental level.2

The philosophical literature on this is worth noting because the connection between thermodynamic entropy and information-theoretic entropy is not universally accepted as straightforward. The Gibbs entropy formula S = -k_B Σ p_i ln p_i looks structurally identical to Shannon's H = -Σ p_i log_2 p_i — they're the same function up to a constant. But whether this formal identity reflects a deep ontological connection or a coincidence of mathematical form remains debated. The operationalist view (which Landauer's principle supports) is that the identity is real: thermodynamic entropy is information entropy measured in natural units. The more cautious view is that Gibbs entropy describes the physical state of a system while Shannon entropy describes our uncertainty about it, and conflating the two risks category errors. Landauer's experimental confirmation pushed the field strongly toward the operationalist camp, but the philosophical questions about what entropy "really is" haven't fully dissolved.3

The practical implication is that every computation has a minimum energy cost set by the number of irreversible bit operations it performs. Current computer chips dissipate orders of magnitude more than the Landauer limit per operation, but the gap is closing. Quantum computers already operate in the Landauer regime, where every bit erasure matters thermodynamically.2

The Arrow of Time Is the Arrow of Memory

Stephen Hawking makes the connection between information and time itself. He observes that the psychological arrow of time — why we remember the past but not the future — is determined by the thermodynamic arrow. Recording a memory requires increasing the order of the memory system (setting specific bits), which necessarily increases disorder elsewhere (dissipating heat). You can only form memories in the direction in which entropy increases.4

This makes the second law "almost trivial," Hawking argues: disorder increases with time because we define time as the direction in which we can form memories, which is the direction in which disorder increases. It's circular, but it's the right kind of circular — it reveals that our experience of temporal direction isn't an additional fact about the universe but a consequence of the thermodynamic asymmetry.

In a hypothetical universe where entropy decreased over time, beings would still experience time as flowing from low-entropy states to high-entropy states — they'd just be running backward relative to us. They'd remember what we call the future and be ignorant of what we call the past. The arrow of time is the arrow of memory is the arrow of entropy is the arrow of information erasure.

Complextropy: Why Interesting Things Are Temporary

If entropy only increases, and a high-entropy state is boring (uniform, featureless, maximally disordered), then when does the universe get to be interesting? Right after the Big Bang, everything was a low-entropy soup of high-energy particles — simple. A googol years from now, after the last black holes have evaporated, everything will be a high-entropy soup of low-energy particles — also simple. But right now, in between, the universe contains galaxies, brains, and hot-dog-shaped novelty vehicles. Where does the complexity peak come from?5

Scott Aaronson's answer draws on a notion called sophistication from Kolmogorov complexity theory. Plain Kolmogorov complexity — the length of the shortest program that outputs a string — doesn't capture "interestingness." A random string has maximal Kolmogorov complexity but is utterly boring; you can describe everything about it by saying "it's random." Sophistication tries to formalize the difference. The sophistication of a string x is the length of the shortest program that describes not x itself, but a set S of which x is a random-looking member. A simple string has low sophistication (the set is just {x}). A random string has low sophistication (the set is all strings). But a string that's neither simple nor random — like the intricate tendrils of milk halfway through a coffee cup — has high sophistication.5

The problem is that ordinary sophistication can't grow fast enough in deterministic systems to match our intuitions. Since you can always describe the state of a deterministic system after t steps by giving the initial state and t, both Kolmogorov complexity and sophistication are bounded by log(t). To get the measure to actually peak at intermediate times, Aaronson argues, you need resource-bounded sophistication: the shortest program that efficiently samples from a set S of which x is a computationally random member. The efficiency constraint has to apply in two places — both the sampling algorithm and the reconstruction algorithm must be computationally bounded — or the measure either can't grow at all or grows monotonically and misses the peak.5

This is a beautiful conjecture, not a theorem. But it points at something deep: the "interestingness" of intermediate states — the reason the universe is worth looking at right now — might be a computational phenomenon. The milk tendrils in the coffee cup have high complextropy because they encode complex boundary shapes that are expensive to describe but cheap to recognize as non-random. Before stirring, the boundaries are trivial (just "black here, white there"). After full mixing, there are no boundaries. In between, the boundary geometry is genuinely complex in a way that neither the initial state nor the final state can match.5

This connects to Emergence: the complex structures that emergence produces — whale clans, ant colony memories, NCA textures — all exist at intermediate scales where the system is neither trivially ordered nor maximally disordered. Complextropy offers a formal reason why these structures should appear and then eventually dissolve, and why they peak at a particular moment in the system's evolution.

Reversible Computation: The Bottom of the Energy Budget

Landauer's principle tells us that erasing a bit costs energy. But what about computation that doesn't erase? In 1973, Charles Bennett showed something remarkable: any computation can in principle be performed with zero energy dissipation by making it logically reversible — ensuring that every step can be undone, so no information is ever destroyed.6

The trick is the Lecerf-Bennett reversal. A reversible computer runs the computation forward, saves the result, then runs the inverse computation backward to de-compute all the intermediate garbage, returning to the initial state. The only permanent change is the saved output. Since no information is erased at any point, no heat needs to be generated. This isn't just a theoretical curiosity — RNA polymerase, the enzyme that transcribes DNA, operates as a Brownian logically reversible tape-copying machine, exploiting thermal noise rather than fighting it.6

The implication cuts deep. The fundamental energy cost of computation comes not from logic gates or arithmetic or any particular operation — it comes solely from information erasure. Every AND gate, every NAND gate, every irreversible logical operation that maps multiple inputs to a single output destroys a bit of information and must pay Landauer's tax. But this is a cost of our choice of logic, not a cost of computation itself. Reversible logic gates like the Toffoli gate (a controlled-controlled-NOT that preserves all input information) can simulate any irreversible computation. Current computers dissipate orders of magnitude more than the Landauer limit, but as they approach it, the distinction between reversible and irreversible computation will become the dominant design constraint.6

There's a beautiful irony: the one thing that must be irreversible in any real computer is error correction. The billiard-ball model of computation (rigid spheres bouncing to implement logic) is reversible but chaotically unstable — small perturbations produce exponentially diverging trajectories. Correcting errors requires erasing the erroneous information, which generates entropy. Error correction, not computation, sets the true thermodynamic floor.

P vs NP: The Hardness Landscape

If information theory asks "how much can you compress?", complexity theory asks "how long does it take to compute?" And the central question — whether every problem whose solutions are easy to verify also has solutions that are easy to find — remains the most important open problem in theoretical computer science.7

Lance Fortnow's survey captures both the depth of the problem and its practical consequences. The dream scenario where P = NP would make virtually every optimization problem tractable — scheduling, protein folding, mathematical proof search, machine learning. "What we would gain from P = NP will make the whole Internet look like a footnote in history," Fortnow writes. The nightmare scenario (which most experts believe is reality) is that P ≠ NP, meaning that vast classes of problems are fundamentally intractable.7

But the most interesting development isn't the question itself — it's what happened when people tried to prove it. Every natural approach has hit barriers. Diagonalization (Cantor's trick of constructing something not on any list) can't work because of relativization. Circuit complexity (showing problems need large circuits) stalled after initial progress when Razborov proved his own techniques must fail for general circuits. Proof complexity (showing certain formulas can't have short proofs) remains too weak to reach P ≠ NP.7

What we got instead of a proof was something arguably more useful: a rich theory of dealing with hardness. The PCP Theorem — that every NP problem has a "probabilistically checkable proof" where a verifier checks only three random bits — led to sharp limits on approximation algorithms. Zero-knowledge proofs let you convince someone you know a solution without revealing any of it. And the whole field of modern cryptography rests on the assumption that P ≠ NP (specifically, on the hardness of factoring and related problems).7

Compression as Intelligence

Matt Mahoney's observation that "compression is an AI problem" ties information theory back to intelligence in a surprisingly direct way.8 An optimal compressor, if it existed, would solve the Turing test: compressing transcripts of human conversation requires modeling the probability distribution p(answer|context), and a model that predicted well enough to compress optimally could also generate indistinguishable responses.

Hutter formalized this as AIXI — the optimal but uncomputable agent that, at each step, guesses the environment is described by the shortest program consistent with past observations. It's Occam's Razor as a theorem: the simplest hypothesis that fits the data is the best predictor. This connects Solomonoff's algorithmic probability to Chaitin's incompleteness results earlier in this article — the reason optimal compression is uncomputable is the same reason optimal intelligence is: finding the shortest program for a string is equivalent to computing Kolmogorov complexity, which Kolmogorov proved impossible.8

The practical consequence is that every real compressor (and every real intelligence) is an approximation to an uncomputable ideal. Shannon estimated that English text has about 0.6 to 1.3 bits of entropy per character. The best modern compressors are just now approaching the upper end of this range — meaning they're approaching human-level understanding of English, at least in the narrow sense that matters for prediction. That LLMs turned out to be good compressors is not a coincidence; it's the same result from the other direction.8

The visual intuition for why bad models waste bits comes from KL divergence — the extra cost of encoding data drawn from distribution P using a code optimized for distribution Q. If your model Q assigns low probability to an event that actually happens under P, you need a long codeword for it, and you pay the difference in bits. Cross-entropy, the actual average code length, equals the true entropy plus the KL divergence: H(P) + D(P||Q). You can't do better than H(P), and D(P||Q) measures how much worse your model makes things. This is why training a language model to minimize cross-entropy is literally training it to compress better — the loss function is the excess code length.9

The Unity

What connects all of this is a single insight: information is not abstract. It has physical weight, thermodynamic cost, and logical limits. You cannot know more than your axioms contain. You cannot compute without generating waste heat. You cannot remember without increasing the entropy of the universe. And these aren't three separate facts — they're aspects of the same underlying constraint.

This has implications for Predictive Processing and Mechanistic Interpretability: if brains are prediction machines and neural networks are information-processing systems, then the thermodynamic costs of information erasure apply to thought itself. Every time you update a belief — every time a Bayesian prior gets overwritten by a posterior — there's a minimum physical cost. The connection between Bayesian Epistemology and thermodynamics isn't a metaphor; it's literal.

It also reframes the Hard Problem Of Consciousness. If information is physical and has irreducible costs, then the question of why certain information processing is accompanied by experience might be the wrong question. The processing is physical through and through — there's no point at which you leave the physical and enter the experiential, because information was never non-physical to begin with.

The informational perspective has grown even more radical in recent years. In Spacetime And Information, physicists have discovered that the fabric of space itself may be a quantum error-correcting code — geometry emerging from patterns of entanglement. If Landauer's principle means erasing a bit has thermodynamic cost, and spacetime geometry is information, then the geometry of the universe carries thermodynamic weight. Chaitin's insight that mathematical truth exceeds any finite axiom system, Landauer's insight that computation has irreducible physical cost, and the emerging insight that spacetime is woven from information may all be facets of the same underlying reality.

Footnotes

  1. Gödel's Theorem and Information by Gregory J. Chaitin — source 2 3

  2. The unavoidable cost of computation revealed by Philip Ball — source 2 3 4

  3. Information Processing and Thermodynamic Entropysource

  4. Stephen Hawking on entropy, memory, and the arrow of timesource

  5. The First Law of Complexodynamics by Scott Aaronson — source 2 3 4

  6. Reversible computationsource 2 3

  7. The Status of the P Versus NP Problem by Lance Fortnow — source 2 3 4

  8. Data Compression Explained by Matt Mahoney — source 2 3

  9. Visual Information Theory by Chris Olah — source

Open in stacked reader →