World Models: Learning to Dream, Then Act

A world model is a learned, compressed simulator of an environment that an agent carries in its head: feed it the current situation and an action, and it predicts what comes next. The 2018 paper World Models (David Ha & Jürgen Schmidhuber) asks what happens if you build an agent almost entirely out of such a model. Its answer splits the agent into two very unequal halves: a large world model that learns, without rewards, to see and to predict, and a tiny controller that learns, by trial and error, to act on the model's compressed view. Because the world model can generate its own rollouts, the controller can even be trained inside the model's hallucinations and still work in the real environment.

One sentence. Compress each frame to a short latent vector (V), learn to predict the next latent given the action (M), and let a minuscule controller (C) act on that compressed state — then train C inside the model's dreamed rollouts, where a single temperature knob decides whether the dream is a faithful teacher or an exploitable fake.

🧩 A big world model and a tiny controller

The architecture has three parts, and almost all of the weights live in the first two. V (Vision) is a variational autoencoder: it squeezes each raw 64×64×3 frame down to a small latent vector z — 32 numbers for the CarRacing experiment — keeping the gist and discarding the pixels.

$$z = V(o), \qquad o \in \mathbb{R}^{64 \times 64 \times 3}, \quad z \in \mathbb{R}^{32}$$

M (Memory) is a recurrent network that predicts the next latent from the current latent, the action, and its own hidden state h, the running memory of everything seen so far. C (Controller) is then almost nothing: a single linear map from the compressed state [z, h] to an action. It plays exactly the role of the policy in a reinforcement-learning agent, but it reads a learned 288-number summary of the world (32 numbers from z plus 256 from h, for CarRacing) instead of raw pixels.

$$a_t = W_c \, [\, z_t \,;\, h_t \,] + b_c$$
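As a rough illustration of how little there is to C, here is a minimal NumPy sketch of the linear map above, using the CarRacing sizes (32 latent dimensions, 256 hidden units, 3 actions). The function name, the flat parameter layout, and the random inputs are assumptions made for the example, and any squashing of actions into their valid ranges is left out.

```python
import numpy as np

Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3  # CarRacing sizes


def controller_action(params, z, h):
    """Linear controller a = W_c [z; h] + b_c, with params as one flat vector."""
    n_in = Z_DIM + H_DIM
    W_c = params[: ACTION_DIM * n_in].reshape(ACTION_DIM, n_in)
    b_c = params[ACTION_DIM * n_in:]
    return W_c @ np.concatenate([z, h]) + b_c


# Parameter count: 288 inputs x 3 actions + 3 biases.
n_params = ACTION_DIM * (Z_DIM + H_DIM) + ACTION_DIM

rng = np.random.default_rng(0)
params = rng.normal(scale=0.1, size=n_params)   # illustrative random weights
action = controller_action(params, rng.normal(size=Z_DIM), np.zeros(H_DIM))
print(n_params, action.shape)  # 867 (3,)
```

Keeping every weight in one flat vector is a convenience here: an evolution strategy can then treat the whole controller as a single point in parameter space.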

Why the controller is kept tiny. For CarRacing the parameter counts are lopsided: V has ~4.3M, M has ~422K, and C has just 867 (288 inputs × 3 actions + 3 biases). That smallness is the point: C is trained with CMA-ES, an evolution strategy that perturbs whole parameter vectors and keeps what scores well. Evolution shrugs off sparse, long-horizon rewards but does not scale to millions of weights — so the heavy lifting is pushed into V and M, which learn without any reward at all.
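To make "perturbs whole parameter vectors and keeps what scores well" concrete, the sketch below implements a deliberately simplified evolution strategy. It is not CMA-ES: real CMA-ES also adapts a full covariance matrix and a step size, which this version omits. The fitness function is a hypothetical stand-in for "total reward of a rollout", and all names and hyperparameters are assumptions.

```python
import numpy as np


def simple_es(fitness, n_params, pop_size=64, n_elite=16, sigma=0.1,
              generations=100, seed=0):
    """Simplified evolution strategy: sample around a mean, keep the elites.

    NOT CMA-ES: there is no covariance or step-size adaptation here.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_params)
    for _ in range(generations):
        # Perturb the whole parameter vector at once, one row per candidate.
        candidates = mean + sigma * rng.standard_normal((pop_size, n_params))
        scores = np.array([fitness(c) for c in candidates])
        # Keep the best-scoring candidates and move the mean to their average.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
    return mean


# Hypothetical fitness: closeness to a hidden target vector of controller size.
target = np.random.default_rng(1).standard_normal(867)


def fitness(params):
    return -np.sum((params - target) ** 2)


best = simple_es(fitness, n_params=867)
print(fitness(np.zeros(867)), fitness(best))
```

Only scores are used, never gradients, which is why a sparse or delayed reward is no obstacle. The cost is that each generation needs a full population of rollouts, which is tolerable for hundreds of parameters and impractical for millions.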

V learns to compress. Drag the dial to see what a bottleneck does: shrink the latent budget and the reconstruction turns blocky; widen it and detail returns. A real VAE chooses which structure to keep far more cleverly than a uniform grid — this is just a stand-in for the trade-off.

[Interactive figure: an observation o beside its reconstruction D(z), with a dial for the latent budget (number of z dimensions) and a readout of the compression ratio.]
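The uniform-grid stand-in described above can be sketched in a few lines of NumPy. This is only the toy trade-off, not a VAE: the frame is averaged over square blocks and then blown back up, and the function names and the random test frame are assumptions for illustration. For comparison, the 32-number latent used for CarRacing corresponds to 64 × 64 × 3 / 32 = 384× fewer numbers than the raw frame.

```python
import numpy as np


def grid_compress(frame, grid):
    """Average a (64, 64, 3) frame over a grid x grid layout of square blocks."""
    size = frame.shape[0]
    block = size // grid  # assumes grid divides the frame size evenly
    # Split each spatial axis into (grid, block) and average within each block.
    return frame.reshape(grid, block, grid, block, 3).mean(axis=(1, 3))


def grid_reconstruct(code, size=64):
    """Repeat each coarse cell to fill its block (this is where blockiness comes from)."""
    block = size // code.shape[0]
    return code.repeat(block, axis=0).repeat(block, axis=1)


rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))  # stand-in observation o

errors = {}
for grid in (2, 8, 32):
    code = grid_compress(frame, grid)
    recon = grid_reconstruct(code)
    ratio = frame.size / code.size
    errors[grid] = np.mean((frame - recon) ** 2)
    print(f"{grid}x{grid} grid: {ratio:.0f}x fewer numbers, mse={errors[grid]:.4f}")
```

A smaller grid means a tighter bottleneck and a larger reconstruction error; a VAE faces the same trade-off but learns which structure is worth keeping.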

💭 M is generative: rolling it forward is dreaming

M does not predict a single next latent — the future is uncertain, so it predicts a whole distribution over the next latent. It is a Mixture Density Network on top of the recurrent net (an "MDN-RNN"): its output is a mixture of K Gaussians (the paper uses K = 5), each with its own weight, mean, and spread.

$$p(z_{t+1} \mid a_t, z_t, h_t) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(z_{t+1};\, \mu_k, \sigma_k^2\right)$$

Because M outputs a distribution, you can sample from it, and feeding each sampled latent back in as the next input rolls the model forward on its own, with no real frames at all. That self-generated rollout is the agent's dream. A single temperature τ controls how wild the dream is. It works the way temperature reshapes any softmax: raising τ flattens the mixing weights, and it also stretches the component spreads:

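$$\pi_k \propto \exp(\ell_k / \tau), \qquad \sigma_k \to \sigma_k \sqrt{\tau}$$

Here ℓ_k is the unnormalized log-weight (logit) of mixture component k.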
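A minimal sketch of temperature-controlled sampling from such a mixture, for a single latent dimension, might look like the following. The mixture parameters are made up for illustration (in the real model they would be outputs of the recurrent net), and the function name is an assumption.

```python
import numpy as np


def sample_mdn(logits, mu, sigma, tau, rng):
    """Draw one sample from a 1-D Gaussian mixture at temperature tau (> 0)."""
    # Temperature on the mixing weights: pi_k proportional to exp(l_k / tau).
    scaled = logits / tau
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    pi = np.exp(scaled) / np.exp(scaled).sum()
    # Pick a component, then sample from it with its spread scaled by sqrt(tau).
    k = rng.choice(len(pi), p=pi)
    return mu[k] + sigma[k] * np.sqrt(tau) * rng.standard_normal()


# Hypothetical mixture parameters for one latent dimension (K = 5).
logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
mu = np.array([-1.0, 0.0, 0.5, 1.5, 3.0])
sigma = np.array([0.2, 0.3, 0.2, 0.4, 0.5])

rng = np.random.default_rng(0)
spread = {}
for tau in (0.1, 1.0, 1.5):
    draws = np.array([sample_mdn(logits, mu, sigma, tau, rng) for _ in range(5000)])
    spread[tau] = draws.std()
    print(f"tau={tau}: std of samples = {spread[tau]:.3f}")
```

At low τ nearly all the probability mass lands on the highest-weight component and its spread shrinks, so samples cluster tightly; at higher τ the other components are visited more often and each one is wider.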

Below is a toy MDN-RNN rolled forward from a single starting latent. Each thin line is one sampled future; the bold line is the most-likely (τ→0) path. Drag τ: near zero the futures collapse onto one deterministic line; raise it and they fan out into the genuinely different things the model thinks could happen. The sideways shapes are the spread of dreamed futures at three checkpoints.

[Interactive figure: sampled dream rollouts from a toy MDN-RNN, with a temperature dial (τ, default 1.00).]

🎯 Training inside the dream — and the cheating problem

Here is the move that gives the paper its name. Since M is a self-contained simulator, you can train the controller entirely inside the dream: once V and M are trained, never touch the real environment, just let C act, let M hallucinate the consequences (including whether the agent dies, which is what determines the score in this task), and let CMA-ES optimize C against that hallucination. For the VizDoom "Take Cover" task, where the goal is to dodge fireballs for as long as possible, a controller trained only in the dream and then dropped into the real game survived for 1092 time steps on average, well past the task's 750-step "solved" bar. (On CarRacing, a controller trained in the real environment on the same kind of V and M features scored 906±21 over 100 tracks, the first reported solution of that benchmark.)
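The shape of a dream rollout can be sketched as below. Everything about the model here is a hypothetical stand-in: `toy_m_step` uses hand-written dynamics where a trained MDN-RNN would go, the sizes are tiny, and the "death" rule is invented. The point is only the loop structure, in which the controller's action feeds the model and the model's sampled latent feeds the controller, with no real environment in between.

```python
import numpy as np

Z_DIM, H_DIM = 4, 8  # toy sizes, far smaller than the paper's


def toy_m_step(z, h, action, tau, rng):
    """Hypothetical stand-in for M: returns (next z, next h, done flag)."""
    h_next = np.tanh(0.9 * h + 0.1 * np.tile(z, H_DIM // Z_DIM) + 0.1 * action)
    mean = 0.95 * z + 0.05 * h_next[:Z_DIM]
    # Temperature scales the noise of the sampled next latent.
    z_next = mean + 0.1 * np.sqrt(tau) * rng.standard_normal(Z_DIM)
    done = bool(np.abs(z_next).max() > 1.0)  # invented "agent died" signal
    return z_next, h_next, done


def dream_rollout(params, tau, rng, max_steps=200):
    """Run a linear controller inside the toy dream; return steps survived."""
    W, b = params[:-1], params[-1]
    z, h = np.zeros(Z_DIM), np.zeros(H_DIM)
    for t in range(max_steps):
        action = np.tanh(W @ np.concatenate([z, h]) + b)  # scalar action
        z, h, done = toy_m_step(z, h, action, tau, rng)   # imagined step only
        if done:
            return t + 1
    return max_steps


rng = np.random.default_rng(0)
params = rng.normal(scale=0.5, size=Z_DIM + H_DIM + 1)
for tau in (0.5, 1.15):
    scores = [dream_rollout(params, tau, rng) for _ in range(16)]
    print(f"tau={tau}: mean dreamed survival = {np.mean(scores):.1f} steps")
```

In a full training loop, the mean dreamed survival would be the fitness that an evolution strategy maximizes over `params`.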

But a learned simulator has flaws, and an optimizer will find them. At low temperature the Doom dream is nearly deterministic, and CMA-ES discovered an adversarial policy: a way of moving that made the dreamed monsters never fire a single fireball. Inside the dream that policy is immortal and scores enormously — and in the real game it dies almost at once. The fix is the same temperature knob from above: crank τ up so the dream stays uncertain and noisy, and the controller can no longer lean on any one exploitable quirk. Too low and it cheats; too high and the dream is too chaotic to learn anything. There is a sweet spot in between.

The curves are the paper's actual Take Cover numbers: the dream score inside the model and the real score after transfer, as τ varies. Drag the dial and watch the gap between them — the gap is the cheating.

[Interactive figure: dream (virtual) score and real (transfer) score on Take Cover as a function of dream temperature τ, with the 750-step "solved" threshold marked.]

This is the same τ as the dream above. There it set how far the imagined futures fanned out; here that very fanning is what stops the controller from memorizing a fake. A little uncertainty in the teacher makes the student robust.

📚 Where this sits

Unsupervised first, reward last. V and M learn from raw experience with no reward signal; only the 867-parameter C ever sees the score. Most of the agent's competence is built before the task is even specified.
Planning by imagination. Training in the dream is cheap: once V and M are trained, it needs no simulator and no further environment steps. It prefigures later "learning in imagination" agents: PlaNet, which plans directly inside a learned latent dynamics model, and the Dreamer line, which replaces CMA-ES with an actor-critic trained by backpropagating through imagined rollouts.
Temperature as a robustness dial. The same τ that controls entropy in a softmax here controls how much a learned simulator can be gamed — a clean example of injecting uncertainty to prevent overfitting to a model's own mistakes.
The controller is a policy. Strip away V and M and C is an ordinary policy maximizing return — the world model just hands it a state worth acting on.

Written on June 23, 2026