Boltzmann Machines, Visually: From Hopfield Nets to Stochastic Neurons
When I first met Boltzmann machines at university, they arrived as a footnote to Hopfield networks: "now make the neurons flip a coin instead of following the rule deterministically." I nodded, passed the exam, and promptly lost the intuition. This post is the refresher I wish I'd had — built around things you can poke, drag, and run.
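Here's the one-sentence story, and everything below just unpacks it: a Boltzmann machine is a Hopfield network whose neurons flip biased coins instead of following a deterministic rule, and that one change turns a memory into a generative model.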
⛰️ The setup you already know: energy
Take N units, each either on (+1) or off (−1). Connect them symmetrically with weights $w_{ij}$ (so $w_{ij} = w_{ji}$) and give each unit a bias $b_i$. The whole configuration $s$ has a single scalar energy:
$$E(s) = -\sum_{i<j} w_{ij}\, s_i s_j - \sum_i b_i s_i$$
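Two units that "want" to agree have $w_{ij} > 0$: matching signs lower the energy. The whole point of an associative memory is to carve valleys into this landscape at the patterns we want to remember. The classic Hebbian rule, $w_{ij} = \sum_{\mu} \xi_i^{\mu} \xi_j^{\mu}$ (summing over the stored patterns $\xi^{\mu}$, for $i \neq j$), does exactly that: as long as you don't store too many patterns for the network's size, each stored pattern becomes a local minimum.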
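To make the landscape concrete, here is a minimal sketch that builds Hebbian weights for a T and an L on a 3×3 grid and checks that each pattern sits at the bottom of a valley. The pixel layouts and the helper names (`hebbian_weights`, `energy`, `flipped`) are assumptions made for illustration, and the biases are simply left at zero.

```python
# Patterns on the 3x3 grid, read row by row (+1 = on, -1 = off).
# These exact pixel layouts are an assumption for illustration.
T = [+1, +1, +1,
     -1, +1, -1,
     -1, +1, -1]
L = [+1, -1, -1,
     +1, -1, -1,
     +1, +1, +1]


def hebbian_weights(patterns):
    """w_ij = sum over patterns of xi_i * xi_j, with no self-connections."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]


def energy(W, b, s):
    """E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i."""
    n = len(s)
    pair = sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    bias = sum(b[i] * s[i] for i in range(n))
    return -pair - bias


def flipped(s, i):
    """Copy of s with unit i negated."""
    return s[:i] + [-s[i]] + s[i + 1:]


W = hebbian_weights([T, L])
b = [0] * 9  # biases left at zero in this sketch

for name, p in (("T", T), ("L", L)):
    neighbours = [energy(W, b, flipped(p, i)) for i in range(9)]
    print(name, "energy:", energy(W, b, p),
          "lowest one-flip neighbour:", min(neighbours))
```

If every one-flip neighbour has higher energy than the pattern itself, the pattern is a local minimum, which is the sense in which the Hebbian rule "carves a valley".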
In a Hopfield network, recall is deterministic: repeatedly set each unit to whatever sign lowers the energy. You slide downhill and stop. That's great — until "downhill" dumps you into a shallow spurious valley that isn't any real memory, with no way back out. Hold that thought.
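The deterministic recall described above can be sketched in a few lines. This is one simple variant under stated assumptions: units are updated one at a time in a fixed order, biases are zero, and a unit whose input is exactly zero keeps its current value. The pattern layouts and the function names are illustrative, not taken from the demo.

```python
T = [+1, +1, +1, -1, +1, -1, -1, +1, -1]
L = [+1, -1, -1, +1, -1, -1, +1, +1, +1]


def hebbian_weights(patterns):
    """Hebbian weights with a zero diagonal (no self-connections)."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]


def recall(W, s, max_sweeps=100):
    """Zero-temperature updates, one unit at a time, until nothing changes."""
    s = list(s)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(W[i][j] * s[j] for j in range(n))
            # Tie rule (an assumption): a zero field leaves the unit as it is.
            new = s[i] if field == 0 else (1 if field > 0 else -1)
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break  # a full sweep with no flips: we are in a local minimum
    return s


W = hebbian_weights([T, L])
noisy_T = list(T)
noisy_T[4] = -noisy_T[4]  # corrupt the centre pixel
print("recovered T:", recall(W, noisy_T) == T)
```

Each accepted flip can only lower the energy, so the loop has to stop. Where it stops depends entirely on where it started, which is the weakness the next section addresses.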
🌡️ The one new ingredient: temperature
Below is a tiny network of 9 units drawn as a 3×3 grid, with two patterns stored by the Hebbian rule: a T and an L. Instead of flipping units deterministically, each update is a coin flip biased by the energy:
$$P(s_i = +1) = \frac{1}{1 + e^{-\Delta E_i / T}}$$
$\Delta E_i$ is simply how much energy the network saves by switching unit $i$ on; for ±1 units that is $\Delta E_i = E(s_i = -1) - E(s_i = +1) = 2\left(\sum_j w_{ij} s_j + b_i\right)$. Drag the temperature and watch the dynamics change completely:
🎲 Why a sigmoid? Temperature interpolates between a thermostat and a coin
The update rule is the whole personality of the network, so it's worth staring at. The probability a unit turns on is a sigmoid of (energy saved) / temperature. The plot shows that curve; drag temperature to reshape it.
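As a rough sketch of that update rule in code: `p_on` is the sigmoid of (energy saved) / temperature, and `stochastic_sweep` applies it to every unit in turn. The function names are hypothetical, and the expression for the energy saved assumes ±1 units with the usual energy $E(s) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i$.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def energy_saved(W, b, s, i):
    """Delta E_i = E(s_i = -1) - E(s_i = +1) = 2 * (sum_j w_ij s_j + b_i)."""
    return 2.0 * (sum(W[i][j] * s[j] for j in range(len(s)) if j != i) + b[i])


def p_on(delta_e, temperature):
    """Probability that a unit switches on."""
    return sigmoid(delta_e / temperature)


def stochastic_sweep(W, b, s, temperature, rng):
    """Update every unit once, in place, each by its own biased coin flip."""
    for i in range(len(s)):
        p = p_on(energy_saved(W, b, s, i), temperature)
        s[i] = 1 if rng.random() < p else -1
    return s


# How the same energy differences look at cold, moderate and hot settings.
for temperature in (0.1, 1.0, 10.0):
    row = [round(p_on(d, temperature), 3) for d in (-2.0, 0.0, 2.0)]
    print("T =", temperature, "->", row)
```

At low temperature the curve is nearly a step, so the unit behaves like the deterministic Hopfield rule (the thermostat). At high temperature it flattens toward 0.5 regardless of the energy, which is the fair coin.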
📊 Energy becomes probability: the Boltzmann distribution
Here's the payoff that gives the machine its name. If you let those stochastic units run forever, the network doesn't converge to a single state. Instead it visits every configuration $s$ with a probability that depends only on its energy:
$$P(s) = \frac{e^{-E(s)/T}}{Z}, \qquad Z = \sum_{s'} e^{-E(s')/T}$$
Low energy → high probability. The network spends most of its time in the deepest valleys. To see this, here's a small 4-unit network with only $2^4 = 16$ possible states, so we can plot the entire distribution at once. Each bar is one full configuration (shown as 4 cells); height is its probability. Drag temperature and watch the distribution morph.
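Because 16 states is small enough to enumerate, the whole distribution can be computed exactly rather than sampled. The sketch below does that for a hypothetical 4-unit network; the weights and the single bias are made up for illustration and are not the ones used in the demo.

```python
import itertools
import math

# Hypothetical 4-unit network: symmetric weights and one small bias,
# chosen only for illustration.
W = [[0, 1, -1, 0],
     [1, 0, 0, -1],
     [-1, 0, 0, 1],
     [0, -1, 1, 0]]
b = [0.5, 0.0, 0.0, 0.0]


def energy(s):
    """E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i."""
    pair = sum(W[i][j] * s[i] * s[j] for i in range(4) for j in range(i + 1, 4))
    return -pair - sum(b[i] * s[i] for i in range(4))


def boltzmann(temperature):
    """Exact probability of each of the 2^4 states at a given temperature."""
    states = list(itertools.product((-1, 1), repeat=4))
    unnormalised = [math.exp(-energy(s) / temperature) for s in states]
    Z = sum(unnormalised)  # the partition function
    return {s: u / Z for s, u in zip(states, unnormalised)}


for temperature in (0.5, 5.0):
    p = boltzmann(temperature)
    best = max(p, key=p.get)
    print("T =", temperature, "most likely state:", best,
          "probability:", round(p[best], 3))
```

Cooling concentrates the probability on the lowest-energy state, and heating spreads it toward uniform. The most likely state itself does not change with temperature, only how dominant it is.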
This is the deep idea: shaping the weights shapes the energy, which shapes a probability distribution over states. Learning a Boltzmann machine means molding that distribution to match your data.
❄️ Simulated annealing: start hot, cool slowly, land deep
Now the practical magic. To find a deep valley (a real memory, not a spurious one), don't go straight to T≈0 — you'll freeze wherever you happen to be. Instead, start hot so the network roams freely over the whole landscape, then cool gradually so it settles into the deepest basin it found. This is simulated annealing, borrowed straight from metallurgy.
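Here is a minimal annealing sketch for the 9-unit T/L network. The geometric cooling schedule and the values `t_hot`, `t_cold`, and `cooling` are assumptions picked for illustration; the demo may use a different schedule. Biases are left at zero.

```python
import math
import random

PATTERNS = [
    [+1, +1, +1, -1, +1, -1, -1, +1, -1],  # T (layout assumed)
    [+1, -1, -1, +1, -1, -1, +1, +1, +1],  # L (layout assumed)
]
N = 9
W = [[0 if i == j else sum(p[i] * p[j] for p in PATTERNS) for j in range(N)]
     for i in range(N)]


def energy(s):
    return -sum(W[i][j] * s[i] * s[j] for i in range(N) for j in range(i + 1, N))


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sweep(s, temperature, rng):
    """One stochastic update of every unit at the given temperature."""
    for i in range(N):
        delta_e = 2.0 * sum(W[i][j] * s[j] for j in range(N))  # W[i][i] is 0
        s[i] = 1 if rng.random() < sigmoid(delta_e / temperature) else -1


def anneal(rng, t_hot=8.0, t_cold=0.05, cooling=0.9):
    """Start from a random state, then cool geometrically from t_hot to t_cold."""
    s = [rng.choice((-1, 1)) for _ in range(N)]
    trace = [energy(s)]
    temperature = t_hot
    while temperature > t_cold:
        sweep(s, temperature, rng)
        trace.append(energy(s))
        temperature *= cooling
    return s, trace


rng = random.Random(0)
state, trace = anneal(rng)
print("start energy:", trace[0], "final energy:", trace[-1])
```

The useful temperature range is tied to the size of the weights: these unnormalised Hebbian weights produce energy differences of several units, so "hot" here means a temperature of that order, not 1.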
Pick a strategy, then Run a single descent and watch the energy trace. Then race 200 runs of each to see who reliably reaches a stored memory.
🧠 Learning: two phases, one elegant rule
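So far we hand-built the valleys with the Hebbian rule. The real question is: can the network learn its own energy landscape so that its Boltzmann distribution matches a dataset? Yes, and the rule is beautifully symmetric. To make data likely, nudge the weights by the difference between two correlation measurements:
$$\Delta w_{ij} = \eta \left( \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}} \right)$$
Here $\eta$ is a learning rate. The first term (the positive phase) is measured with the visible units clamped to the data; the second (the negative phase) is measured while the network runs freely.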
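When the two correlations match, the update is zero and learning stops: the model now reproduces the pairwise statistics of the data. The catch: the negative-phase (model) expectation is taken over the network's own distribution. Computing it exactly means summing over all $2^N$ states, and estimating it by sampling means running the stochastic network until it reaches equilibrium, again, for every weight update. That cost is exactly what the Restricted Boltzmann Machine was invented to dodge.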
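For a network small enough to enumerate, both correlation terms can be computed exactly, which makes the rule easy to check. The sketch below is a toy under stated assumptions: three visible ±1 units, no hidden units, no biases, temperature fixed at 1, and a made-up six-sample dataset. The names `DATA`, `train`, and `model_correlations` are hypothetical.

```python
import itertools
import math

# Made-up dataset of 3-unit configurations (illustration only).
DATA = [(1, 1, 1), (1, 1, 1), (-1, -1, -1), (1, 1, -1), (1, -1, 1), (-1, 1, 1)]
N = 3
PAIRS = list(itertools.combinations(range(N), 2))
STATES = list(itertools.product((-1, 1), repeat=N))


def data_correlations():
    """Positive phase: <s_i s_j> averaged over the dataset."""
    return {(i, j): sum(s[i] * s[j] for s in DATA) / len(DATA) for i, j in PAIRS}


def model_correlations(w):
    """Negative phase: exact <s_i s_j> under P(s) = exp(-E(s)) / Z at T = 1.

    This sums over all 2^N states, which is the step that stops scaling.
    """
    unnorm = [math.exp(sum(w[p] * s[p[0]] * s[p[1]] for p in PAIRS))
              for s in STATES]
    Z = sum(unnorm)
    return {(i, j): sum(u * s[i] * s[j] for u, s in zip(unnorm, STATES)) / Z
            for i, j in PAIRS}


def train(steps=500, lr=0.2):
    w = {p: 0.0 for p in PAIRS}
    positive = data_correlations()
    for _ in range(steps):
        negative = model_correlations(w)
        for p in PAIRS:
            w[p] += lr * (positive[p] - negative[p])  # data minus model
    return w


w = train()
print("weights:", {p: round(v, 4) for p, v in w.items()})
print("data  correlations:", data_correlations())
print("model correlations:", {p: round(v, 4) for p, v in model_correlations(w).items()})
```

With 3 units the sum in `model_correlations` has 8 terms; with 100 units it would have $2^{100}$, which is why real Boltzmann machines fall back on sampling.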
🔗 Restricted Boltzmann Machines: make the sampling cheap
The trick is structural. Split the units into a visible layer (the data) and a hidden layer (learned features), and forbid connections within a layer. The graph becomes bipartite: every connection joins a visible unit to a hidden one.
Because no two hidden units talk to each other, every hidden unit is independent of the others once the visible layer is given, so you can sample the whole hidden layer in one shot, and vice versa. That makes a single round-trip v → h → v′ (the heart of Contrastive Divergence) trivially fast, and in practice that one step often gives a gradient estimate good enough to learn from, even though it is biased.
Below, an RBM with 9 visible units (a 3×3 image) and 4 hidden units learns three little shapes — T, L, and a square-ish O — via CD-1, live in your browser. Watch reconstruction error drop and the hidden units grow into feature detectors. (RBMs conventionally use 0/1 units, so this section switches from ±1 to {0,1}.)
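For readers who want the same experiment outside the browser, here is a compact CD-1 sketch in plain Python. It is one common variant, not necessarily what the demo runs: the pixel layouts, learning rate, epoch count, initial weight scale, and the choice to use hidden probabilities (rather than samples) in the weight update are all assumptions, and the `RBM` class and its method names are hypothetical.

```python
import math
import random

SHAPES = {  # 3x3 images, row by row, 0/1 pixels (layouts assumed)
    "T": [1, 1, 1, 0, 1, 0, 0, 1, 0],
    "L": [1, 0, 0, 1, 0, 0, 1, 1, 1],
    "O": [1, 1, 1, 1, 0, 1, 1, 1, 1],
}
N_VISIBLE, N_HIDDEN = 9, 4


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class RBM:
    def __init__(self, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(N_HIDDEN)]
                  for _ in range(N_VISIBLE)]
        self.a = [0.0] * N_VISIBLE  # visible biases
        self.c = [0.0] * N_HIDDEN   # hidden biases

    def hidden_probs(self, v):
        # Hidden units are conditionally independent given v.
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(N_VISIBLE)))
                for j in range(N_HIDDEN)]

    def visible_probs(self, h):
        return [sigmoid(self.a[i] + sum(self.W[i][j] * h[j] for j in range(N_HIDDEN)))
                for i in range(N_VISIBLE)]

    def sample(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def cd1_step(self, v0, lr):
        """One round trip v0 -> h0 -> v1 -> h1, then a data-minus-model update."""
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)
        v1 = self.sample(self.visible_probs(h0))
        ph1 = self.hidden_probs(v1)
        for i in range(N_VISIBLE):
            for j in range(N_HIDDEN):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.a[i] += lr * (v0[i] - v1[i])
        for j in range(N_HIDDEN):
            self.c[j] += lr * (ph0[j] - ph1[j])

    def reconstruct(self, v):
        """Deterministic v -> h -> v' pass using probabilities, for inspection."""
        return self.visible_probs(self.hidden_probs(v))


def reconstruction_error(rbm):
    """Mean squared-error sum per shape between input and reconstruction."""
    total = sum((x - y) ** 2 for v in SHAPES.values()
                for x, y in zip(v, rbm.reconstruct(v)))
    return total / len(SHAPES)


rbm = RBM(random.Random(0))
error_before = reconstruction_error(rbm)
for epoch in range(2000):
    for v in SHAPES.values():
        rbm.cd1_step(v, lr=0.1)
error_after = reconstruction_error(rbm)
print("reconstruction error before:", round(error_before, 3),
      "after:", round(error_after, 3))
```

Reading column `j` of `rbm.W` as a 3×3 image gives hidden unit `j`'s receptive field, and calling `rbm.reconstruct` on a corrupted shape is the v → h → v′ clean-up pass described in this section.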
[Interactive demo panels: reconstruction error over training; hidden-unit receptive fields]
Reconstruct a pattern: v → h → v′
Click cells to draw on the visible layer (or load a corrupted shape), then push it once through the hidden layer and back. A trained RBM cleans up the input, snapping it toward a shape it knows.
🎯 The whole story in one breath
- Energy turns a network's weights into a landscape; stored patterns are valleys.
- Temperature + a sigmoid turn deterministic descent (Hopfield) into stochastic sampling.
- Run long enough and states appear with the Boltzmann probability $e^{-E/T}/Z$.
- Annealing — hot then cold — escapes spurious minima and lands in deep ones.
- Learning matches data correlations (wake) to model correlations (sleep).
- RBMs restrict the graph to a bipartite one, making that sampling fast — the ancestor of deep belief nets and a cornerstone of the early deep-learning revival.
Same energy idea you met behind Hopfield networks — but once you let the neurons gamble, a deterministic memory becomes a generative model. That's the Boltzmann machine.
