World Models

Ha and Schmidhuber's "World Models" paper asks a question that sounds whimsical and turns out to be deep: can an agent learn to act entirely inside its own dream? The answer is yes, with caveats that illuminate something fundamental about the relationship between models and reality, between imagination and exploitation, between dreaming and waking.

The Architecture

The setup is clean enough to explain in a paragraph. An agent has three components: a Vision model (V) that compresses raw pixel frames into a low-dimensional latent vector z using a variational autoencoder, a Memory model (M) that predicts future z vectors using an RNN with mixture density outputs, and a Controller (C) that maps the current z and the RNN's hidden state h to actions. V and M together are the "world model": they learn to compress and predict the environment. C is deliberately tiny, a single linear layer with on the order of a thousand parameters (867 in the CarRacing experiment), trained with an evolution strategy rather than backpropagation.¹
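To make "deliberately tiny" concrete, here is a minimal sketch of a linear controller over the concatenation of z and h. The sizes (a 32-dimensional z, a 256-dimensional h, three continuous actions) are the ones I recall the paper using for CarRacing, so treat them as assumptions; the function names are made up for illustration, and any squashing of the outputs into valid action ranges is left out.

```python
import random

Z_DIM = 32       # assumed latent size for the CarRacing VAE
H_DIM = 256      # assumed hidden size for the CarRacing MDN-RNN
N_ACTIONS = 3    # steering, acceleration, brake


def n_params(z_dim: int, h_dim: int, n_actions: int) -> int:
    """Weights plus biases of one linear layer over the concatenated [z, h]."""
    return (z_dim + h_dim) * n_actions + n_actions


def controller(z, h, weights, bias):
    """Compute a = W [z, h] + b with no hidden layers."""
    x = list(z) + list(h)
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_k
            for row, b_k in zip(weights, bias)]


rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(Z_DIM + H_DIM)] for _ in range(N_ACTIONS)]
b = [0.0] * N_ACTIONS
z = [rng.gauss(0.0, 1.0) for _ in range(Z_DIM)]
h = [rng.gauss(0.0, 1.0) for _ in range(H_DIM)]

action = controller(z, h, W, b)          # one raw value per action dimension
print(n_params(Z_DIM, H_DIM, N_ACTIONS))  # 867
```

With these sizes the count is (32 + 256) × 3 + 3 = 867, which is small enough that a gradient-free optimiser can search the whole parameter vector directly.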

The elegance is in what this separation buys you. V and M can be trained efficiently with standard gradient methods on random rollouts — no reward signal needed, just unsupervised prediction of what comes next. All the complexity lives in the world model. C is so small that even crude optimisation methods like CMA-ES can search its parameter space effectively. The system solved the CarRacing-v0 environment from raw pixels, achieving the first known solution to that benchmark, and the driving behaviour is genuinely interesting to watch — the agent attacks corners aggressively, using the RNN's hidden state as an implicit prediction of what's coming rather than explicitly planning ahead. "Like a seasoned Formula One driver... the agent can instinctively predict when and where to navigate in the heat of the moment."¹
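The gradient-free search over C's parameters can be sketched with a much simpler evolution strategy than CMA-ES. The code below is not CMA-ES: it keeps a fixed isotropic step size that decays over time and adapts no covariance matrix. The `fitness` function is a toy stand-in for an episode's cumulative reward, and every name in it is hypothetical.

```python
import random


def fitness(params):
    """Toy stand-in for cumulative reward (higher is better)."""
    target = [0.5, -0.3, 0.8]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def evolve(n_params=3, pop_size=32, n_elite=8, sigma=0.3, generations=40, seed=0):
    """Simple elite-averaging evolution strategy (a simplification, not CMA-ES)."""
    rng = random.Random(seed)
    mean = [0.0] * n_params
    for _ in range(generations):
        # Sample candidates around the current mean; no gradients are used.
        population = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(pop_size)]
        # Rank candidates by fitness and keep the best few.
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]
        # Move the mean to the elite average and shrink the step size.
        mean = [sum(col) / n_elite for col in zip(*elite)]
        sigma *= 0.95
    return mean


best = evolve()
print(fitness([0.0, 0.0, 0.0]), fitness(best))
```

The point of the sketch is the shape of the loop: sample parameter vectors, score each with a full rollout, keep the best. That only stays affordable when the parameter vector is small, which is what the tiny controller buys.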

This is already a useful result, but it's the next step that makes the paper conceptually important.

Dreaming

If the world model can predict the future well enough, you should be able to replace the actual environment with generated hallucinations and train the controller entirely inside the dream. Ha and Schmidhuber do exactly this for the VizDoom "Take Cover" task, where an agent must dodge fireballs. The M model learns to generate monsters, fireballs, physics, 3D rendering — the entire game logic — just from watching random play. The controller trains inside this hallucinated environment and then transfers successfully back to the real one, scoring well above the required threshold.¹
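Structurally, dream training swaps the environment's step function for the model's own prediction. The sketch below shows only that loop shape. `ToyWorldModel` is a hypothetical one-dimensional stand-in, not the paper's VAE and MDN-RNN, and it omits the RNN hidden state for brevity. The score is the number of steps survived, matching how Take Cover is scored.

```python
import random


class ToyWorldModel:
    """Hypothetical stand-in for a trained world model.

    The 'latent' z is the agent's signed distance from a fireball.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return self.rng.uniform(-1.0, 1.0)

    def predict(self, z, action):
        # Predicted next latent: the action shifts the distance, plus model noise.
        next_z = z + 0.2 * action + self.rng.gauss(0.0, 0.05)
        # The episode ends when the fireball is predicted to hit the agent.
        done = abs(next_z) < 0.05
        return next_z, done


def dream_rollout(model, policy, max_steps=100):
    """Run one episode using only model predictions, never a real environment."""
    z = model.reset()
    steps_survived = 0
    for _ in range(max_steps):
        action = policy(z)
        z, done = model.predict(z, action)
        if done:
            break
        steps_survived += 1
    return steps_survived


def flee(z):
    """Step away from the fireball, increasing the absolute distance."""
    return 1.0 if z >= 0 else -1.0


def stand_still(z):
    return 0.0


print(dream_rollout(ToyWorldModel(seed=0), flee))
print(dream_rollout(ToyWorldModel(seed=0), stand_still))
```

Any optimiser that scores policies with `dream_rollout` never touches the real game, so whatever the model gets wrong becomes part of the reward signal.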

The connection to how biological brains work isn't accidental. There's strong evidence that mammals replay experiences during sleep, with hippocampal place cells firing in sequences that recapitulate waking navigation. Patrick Winston's research at MIT explored the idea that this replay constitutes a form of story-telling — that rats navigating mazes and then "re-running" those routes during sleep are engaging in the same basic operation as humans constructing narratives.² The parallel to Ha and Schmidhuber's dream training is provocative: in both cases, an agent builds an internal model of the world through experience, then uses that model generatively to rehearse and refine behaviour without further environmental interaction.

This connects to the Predictive Processing framework in cognitive science, the idea that perception is largely predictive. We don't passively receive sensory data; our brains generate predictions and then compare them against incoming signals. A baseball batter can't wait for visual information about a 100 mph fastball to reach the cortex: the swing has to be committed before the signal arrives. The ability to hit the ball comes from an internal model that predicts trajectory well enough to act reflexively.¹ World models in AI turn this predictive idea into a trainable system.

The Adversarial Dream Problem

The most interesting part of the paper is what goes wrong. When you train a controller inside a learned world model, you're giving the controller access to the model's internal states — effectively letting it peek behind the curtain of the game engine. The controller can find adversarial policies that exploit the world model's imperfections: ways of moving that prevent hallucinated monsters from ever firing, or that "extinguish" fireballs by nudging the RNN into states where they disappear.¹

This is not a bug to be patched; it's a fundamental tension. Any learned model of an environment will be imperfect, and any optimiser with access to the model's internals can learn to exploit those imperfections. The policy will look great inside the dream and fail catastrophically in reality — a phenomenon that should sound familiar to anyone who's watched a reinforcement learning agent overfit to a simulator.

Ha and Schmidhuber's solution is elegant: increase the temperature parameter on the MDN-RNN's outputs, making the dream noisier and more uncertain than reality. Agents trained in harsher, more stochastic dreams generalise better to the cleaner real world. Too much noise makes the dream unlearnable; too little makes it exploitable. The temperature parameter is a dial between realism and robustness, and the optimal setting is empirically discoverable.¹
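One common way to apply a temperature τ to a mixture density output is to divide the mixture logits by τ before the softmax and to scale each component's standard deviation by √τ. I am assuming that convention here rather than quoting the authors' code, and the mixture below is a made-up one-dimensional example, not a trained MDN-RNN.

```python
import math
import random


def sample_mdn(logits, means, sigmas, temperature, rng):
    """Sample one scalar from a 1-D Gaussian mixture at temperature tau."""
    # Dividing logits by tau flattens the mixture weights when tau > 1.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Pick a component, then widen its standard deviation by sqrt(tau).
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], sigmas[k] * math.sqrt(temperature))


def spread(temperature, n=5000, seed=0):
    """Empirical standard deviation of samples at a given temperature."""
    rng = random.Random(seed)
    xs = [sample_mdn([2.0, 0.0], [-1.0, 1.0], [0.1, 0.1], temperature, rng)
          for _ in range(n)]
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / n)


for tau in (0.1, 1.0, 2.0):
    print(tau, round(spread(tau), 3))
```

At low τ the sampler almost always picks the dominant mode and the dream is nearly deterministic. At higher τ the rarer mode shows up more often and every component is wider, which is the "harsher dream" the controller then has to cope with.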

There's a deep principle lurking here. Training in a model of the world is always vulnerable to Goodhart's law — the agent optimises the model rather than the territory. The temperature trick works because it makes the model deliberately less precise, creating a margin of safety around its predictions. It's related to regularisation in supervised learning, to the explore-exploit tradeoff in bandits, to the philosophical observation that a map too detailed to be wrong is a map too detailed to be useful. The best world model for training isn't the most accurate one; it's the one that's accurate enough to be informative but noisy enough to be unexploitable.

The Inevitability of Causal Reasoning

Judea Pearl, in an interview with Quanta Magazine, makes a pointed argument against purely correlational AI: "All the impressive achievements of deep learning amount to just curve fitting."³ Pearl's ladder of causation has three rungs — seeing (correlation), doing (intervention), and imagining (counterfactual reasoning) — and current deep learning, he argues, is stuck on the first. A model that has never intervened in the world can learn that mud and rain correlate, but not that rain causes mud; it can learn that drug treatment and recovery correlate, but not disentangle the confound that sicker patients get more treatment.
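The treatment confound is easy to reproduce in a small simulation. In the sketch below all the numbers are invented: treatment adds 0.1 to the recovery probability for everyone, but in the observational setting severe cases are far more likely to be treated. Comparing treated and untreated patients then gives the wrong sign, while randomising treatment (an intervention, Pearl's second rung) recovers the true effect.

```python
import random


def recovers(severe, treated, rng):
    """Recovery chance depends on severity; treatment always adds 0.1."""
    p = (0.3 if severe else 0.8) + (0.1 if treated else 0.0)
    return rng.random() < p


def recovery_gap(n, randomised, seed=0):
    """Recovery rate of treated patients minus that of untreated patients."""
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # treated -> [recovered, total]
    for _ in range(n):
        severe = rng.random() < 0.5
        if randomised:
            # Intervention: treatment is a coin flip, independent of severity.
            treated = rng.random() < 0.5
        else:
            # Observation: sicker patients are much more likely to be treated.
            treated = rng.random() < (0.9 if severe else 0.1)
        counts[treated][0] += recovers(severe, treated, rng)
        counts[treated][1] += 1
    rate = {t: rec / total for t, (rec, total) in counts.items()}
    return rate[True] - rate[False]


observed = recovery_gap(20000, randomised=False)
intervened = recovery_gap(20000, randomised=True)
print(round(observed, 2), round(intervened, 2))
```

With these made-up probabilities the observational gap comes out near -0.3 and the randomised gap near +0.1. A learner that only sees the first dataset has no way to tell the treatment helps unless it also models severity.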

Pearl is right that this is a real limitation, but I think the picture is more nuanced than his framing suggests. LLMs trained on text about causal relationships do encode some causal structure: they can answer "what would happen if..." questions by pattern-matching against the vast corpus of human causal reasoning in their training data. Whether this constitutes "understanding causation" or just very sophisticated curve-fitting on text about causation is a philosophical question that Pearl's framework doesn't fully resolve. The practical question is whether it matters: if a model can reliably produce correct counterfactual predictions by leveraging human-written causal reasoning, is the absence of "true" causal understanding a meaningful deficiency or merely a philosophical concern?³

Minsky's older work on alien intelligence touches a related nerve. His "sparseness principle" — that any intelligence searching through the simplest processes will inevitably discover arithmetic, causal reasoning, and goal representations, because these concepts are isolated islands with no simpler alternatives — suggests that world models aren't arbitrary but convergent.⁴ The space of possible intelligences is far more constrained than it appears, because certain cognitive structures are the only efficient solutions to universal problems (managing resources in space and time, predicting consequences of actions). If Minsky is right, any sufficiently powerful world model, whether biological or artificial, will converge on something like causal reasoning — the question is whether it arrives there through explicit causal graphs (Pearl's preference) or through statistical regularities in text written by causal reasoners (the LLM path).

The Helen Keller Analogy

A provocative argument for latent world models in LLMs comes from the BigVAE project, which draws on Helen Keller's language acquisition as a philosophical case study.⁵ Keller, deaf and blind from infancy, learned language through signs traced into her palm — and her breakthrough moment was understanding that "everything has a name." This required that her mind already had internal representations of distinct objects before she had labels for them. The naming didn't create the categories; it connected a symbolic system to pre-existing structure.

The author's conjecture: if the breakthrough for a deaf-blind person is realizing that everything has a name, the breakthrough for a language model might be realizing that every name has a thing — that the statistical correlations between words imply a "highly compressible latent logic which goes beyond the words themselves." This isn't just philosophical speculation; the BigVAE project builds a variational autoencoder on top of Mistral 7B that encodes text spans into continuous latent vectors, enabling interpolation and manipulation in a space that ostensibly represents the latent logic of language. Whether this constitutes "understanding" is debatable, but the fact that you can encode text into continuous representations that support meaningful interpolation (blending the meaning of two passages into something coherent) suggests that something more structured than word co-occurrence is being captured.⁵

This connects to a recurring observation in mechanistic interpretability: similar features often show up in independently trained transformers. On this reading, the internal representations converge not because the models are copying each other's solutions but because the underlying structure they're modeling (human concepts, causal relationships, logical entailments) constrains the solution space.

Beyond Games

The world models architecture has implications well beyond game environments. LLMs themselves can be understood as a kind of world model — they've learned to predict sequences of text, and their internal representations encode a compressed model of the processes that generate text, which necessarily includes models of the world those texts describe. The gap between "predicts text" and "models the world" is narrower than it appears, because predicting text well requires modelling causal structure, physical regularities, and social dynamics.

The connection to the "era of experience" is direct. David Silver and Richard Sutton's argument that AI needs to learn from interaction rather than static data is essentially a call for systems that build world models through exploration and then use those models to guide action, the same loop Ha and Schmidhuber demonstrate at small scale. The open question is whether the adversarial dream problem scales: as world models become more capable, do the opportunities for exploitation grow faster or slower than the models' accuracy?

Footnotes

¹ World Models, David Ha and Jürgen Schmidhuber.
² The Storytelling Computer, M.R. O'Connor.
³ To Build Truly Intelligent Machines, Teach Them Cause and Effect, Quanta Magazine interview with Judea Pearl.
⁴ Communication with Alien Intelligence, Marvin Minsky.
⁵ Revealing Intentionality in Language Models Through AdaVAE, jdp.
