Boltzmann Machines, Visually: From Hopfield Nets to Stochastic Neurons

When I first met Boltzmann machines at university, they arrived as a footnote to Hopfield networks: "now make the neurons flip a coin instead of following the rule deterministically." I nodded, passed the exam, and promptly lost the intuition. This post is the refresher I wish I'd had — built around things you can poke, drag, and run.

Here's the one-sentence story, and everything below just unpacks it:

A Hopfield network rolls downhill on an energy landscape until it gets stuck in the nearest valley. A Boltzmann machine adds a "temperature" that lets neurons occasionally roll uphill — so instead of freezing in the first local minimum, the network samples states with probability set by their energy. Cool the temperature slowly and it settles into deep valleys; learn the energy landscape from data and it becomes a generative model.

⛰️ The setup you already know: energy

Take N units, each either on (+1) or off (−1). Connect them symmetrically with weights $w_{ij} = w_{ji}$ (and give each unit a bias $b_i$). The whole configuration $s$ has a single scalar energy:

$$E(s) = -\sum_{i<j} w_{ij}\, s_i s_j \;-\; \sum_i b_i s_i$$

Two units that "want" to agree have $w_{ij} > 0$: matching signs lower the energy. The whole point of an associative memory is to carve valleys into this landscape at the patterns we want to remember. The classic Hebbian rule, $w_{ij} = \sum_{\text{patterns}} \xi_i \xi_j$, does exactly that: as long as the stored patterns are few and not too similar, each pattern $\xi$ becomes a local minimum.
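Here's a minimal NumPy sketch of the energy function and the Hebbian rule. The two 3×3 shapes are my own guess at what a T and an L look like on this grid, and the names `hebbian_weights` and `energy` are illustrative, not taken from any library.

```python
import numpy as np

def hebbian_weights(patterns):
    """Sum of outer products of the stored +/-1 patterns, with a zero diagonal."""
    patterns = np.asarray(patterns, dtype=float)
    w = patterns.T @ patterns
    np.fill_diagonal(w, 0.0)  # no self-connections
    return w

def energy(s, w, b):
    """E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i (w symmetric, zero diagonal)."""
    s = np.asarray(s, dtype=float)
    return -0.5 * s @ w @ s - b @ s  # the 0.5 undoes double counting of pairs

# Assumed 3x3 shapes, flattened row by row (+1 = on, -1 = off).
T_SHAPE = np.array([1, 1, 1, -1, 1, -1, -1, 1, -1])
L_SHAPE = np.array([1, -1, -1, 1, -1, -1, 1, 1, 1])

w = hebbian_weights([T_SHAPE, L_SHAPE])
b = np.zeros(9)

noisy_t = T_SHAPE.copy()
noisy_t[0] *= -1  # flip one pixel of the stored T

print("E(T)       =", energy(T_SHAPE, w, b))
print("E(noisy T) =", energy(noisy_t, w, b))
```

The stored pattern should come out lower in energy than its one-pixel corruption, which is the "valley" in numbers.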

In a Hopfield network, recall is deterministic: repeatedly set each unit to whatever sign lowers the energy. You slide downhill and stop. That's great — until "downhill" dumps you into a shallow spurious valley that isn't any real memory, with no way back out. Hold that thought.
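Deterministic recall can be sketched in a few lines. This assumes asynchronous updates in a fixed order and breaks ties toward +1; both are my choices for illustration, and the shapes are the same assumed T and L as before.

```python
import numpy as np

T_SHAPE = np.array([1, 1, 1, -1, 1, -1, -1, 1, -1], dtype=float)
L_SHAPE = np.array([1, -1, -1, 1, -1, -1, 1, 1, 1], dtype=float)

patterns = np.array([T_SHAPE, L_SHAPE])
w = patterns.T @ patterns      # Hebbian rule
np.fill_diagonal(w, 0.0)
b = np.zeros(9)

def hopfield_recall(s, w, b, max_sweeps=20):
    """Set each unit to the sign that lowers the energy until nothing changes."""
    s = np.array(s, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            field = w[i] @ s + b[i]              # net input to unit i
            new_value = 1.0 if field >= 0 else -1.0
            if new_value != s[i]:
                s[i] = new_value
                changed = True
        if not changed:                          # stuck in a valley: stop
            break
    return s

noisy_t = T_SHAPE.copy()
noisy_t[0] *= -1                                 # corrupt one pixel
recalled = hopfield_recall(noisy_t, w, b)
print(recalled.reshape(3, 3))
```

There is no randomness anywhere in this loop, which is exactly why it cannot climb back out of a bad valley.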

🌡️ The one new ingredient: temperature

Below is a tiny network of 9 units drawn as a 3×3 grid, with two patterns stored by the Hebbian rule: a T and an L. Instead of flipping units deterministically, each update is a coin flip biased by the energy:

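$$p(s_i = +1) = \sigma\!\left(\frac{\Delta E_i}{T}\right), \quad \text{where} \quad \Delta E_i = 2\Big(\sum_j w_{ij} s_j + b_i\Big), \quad \sigma(x) = \frac{1}{1 + e^{-x}}$$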

$\Delta E_i$ is the energy gap between unit $i$ being off and being on: how much energy the network saves by having unit $i$ on rather than off. Drag the temperature and watch the dynamics change completely:

[Interactive demo: 3×3 grid network with stored patterns T and L, a temperature slider (shown at 0.20), energy and update counters, and a plot of energy over time.]
Try this: set the temperature very low (slide left) and hit "Run". The network freezes into T or L, exactly like Hopfield recall. Now crank it high: the grid boils, energy stays high, nothing settles. The interesting regime is in between. Hit "Corrupt a memory" at low temperature to watch pattern completion happen live.
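If you'd rather run it than drag it, here is a rough sketch of the stochastic update. The helper names (`stochastic_update`, `run`), the random choice of which unit to update, and the two shapes are assumptions of mine, not the code behind the demo.

```python
import numpy as np

T_SHAPE = np.array([1, 1, 1, -1, 1, -1, -1, 1, -1], dtype=float)
L_SHAPE = np.array([1, -1, -1, 1, -1, -1, 1, 1, 1], dtype=float)

patterns = np.array([T_SHAPE, L_SHAPE])
w = patterns.T @ patterns      # Hebbian rule
np.fill_diagonal(w, 0.0)
b = np.zeros(9)

def sigmoid(x):
    # Clip to keep exp() well inside floating-point range at very low T.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def stochastic_update(s, w, b, T, rng):
    """Resample one random unit with p(s_i = +1) = sigmoid(dE_i / T)."""
    i = rng.integers(len(s))
    delta_e = 2.0 * (w[i] @ s + b[i])   # energy saved by having unit i on
    s[i] = 1.0 if rng.random() < sigmoid(delta_e / T) else -1.0
    return s

def run(start, w, b, T, n_updates, rng):
    s = np.array(start, dtype=float)
    for _ in range(n_updates):
        stochastic_update(s, w, b, T, rng)
    return s

rng = np.random.default_rng(0)
cold = run(T_SHAPE, w, b, T=0.2, n_updates=500, rng=rng)
hot = run(T_SHAPE, w, b, T=50.0, n_updates=500, rng=rng)
print("overlap with T when cold:", int(cold @ T_SHAPE))
print("overlap with T when hot: ", int(hot @ T_SHAPE))
```

Starting from the stored T, the cold run should sit still while the hot run wanders off.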

🎲 Why a sigmoid? Temperature interpolates between a thermostat and a coin

The update rule is the whole personality of the network, so it's worth staring at. The probability a unit turns on is a sigmoid of (energy saved) / temperature. The plot shows that curve; drag temperature to reshape it.

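[Interactive plot: p(turn on) as a function of ΔE, with a temperature slider. Example reading: T = 1.00, ΔE = +2.0, p(turn on) = 0.88.]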
T → 0 (cold): the sigmoid becomes a step. p≈1 whenever turning on lowers energy, p≈0 otherwise. This is exactly a Hopfield update — a deterministic thermostat. Great for locking in, hopeless for escaping.
T → ∞ (hot): the sigmoid flattens to 0.5 everywhere. The unit ignores the energy entirely and flips a fair coin. Maximum exploration, zero memory.
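The two limits are easy to check numerically. A small sketch, with `p_on` as a hypothetical helper name and the three temperatures picked arbitrarily:

```python
import numpy as np

def p_on(delta_e, T):
    """Probability that a unit turns on, given its energy gap and the temperature."""
    return 1.0 / (1.0 + np.exp(-delta_e / T))

for T in (0.05, 1.0, 100.0):
    up = p_on(2.0, T)     # turning on saves energy
    down = p_on(-2.0, T)  # turning on costs energy
    print(f"T={T:>6}: p(on | dE=+2) = {up:.3f}   p(on | dE=-2) = {down:.3f}")
```

At the cold end the two probabilities approach 1 and 0, at T = 1 the first one is the 0.88 from the plot, and at the hot end both drift toward 0.5.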

📊 Energy becomes probability: the Boltzmann distribution

Here's the payoff that gives the machine its name. If you let those stochastic units run forever, the network doesn't converge to a single state — it visits every configuration s with a probability that depends only on its energy:

$$p(s) = \frac{e^{-E(s)/T}}{Z}, \qquad Z = \sum_{s'} e^{-E(s')/T}$$

Low energy → high probability. The network spends most of its time in the deepest valleys. To see this, here's a small 4-unit network. It has only $2^4 = 16$ possible states, so we can plot the entire distribution at once. Each bar is one full configuration (shown as 4 cells); height is its probability. Drag temperature and watch the distribution morph.

[Interactive chart: probability of each of the 16 states, with a temperature slider (shown at 1.00). Legend: low energy (likely) to high energy (rare).]
Cold: almost all probability mass piles onto the lowest-energy state (or is shared among them if several tie). The machine is effectively a deterministic memory: it "knows the answer."
Hot: the bars flatten toward a uniform distribution — every state is roughly equally likely. The machine has "forgotten" its structure.
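With only 16 states you can compute the whole distribution by brute force. In this sketch the weights and bias are made up for illustration (the demo's actual values aren't given), and `boltzmann_distribution` is a hypothetical helper name.

```python
import itertools
import numpy as np

def boltzmann_distribution(w, b, T):
    """Enumerate every +/-1 state and return (states, energies, probabilities)."""
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=len(b))))
    energies = -0.5 * np.einsum("ki,ij,kj->k", states, w, states) - states @ b
    logits = -energies / T
    logits -= logits.max()          # subtract the max for numerical stability
    p = np.exp(logits)
    return states, energies, p / p.sum()

# Made-up 4-unit network: units 0-1 and 2-3 like to agree, 1-2 like to disagree.
w = np.zeros((4, 4))
w[0, 1] = w[1, 0] = 1.0
w[2, 3] = w[3, 2] = 1.0
w[1, 2] = w[2, 1] = -0.5
b = np.array([0.5, 0.0, 0.0, 0.0])  # a small bias breaks the global-flip symmetry

for T in (0.05, 1.0, 100.0):
    states, energies, p = boltzmann_distribution(w, b, T)
    best = states[np.argmax(p)].astype(int)
    print(f"T={T:>6}: most likely state {best}, p_max={p.max():.3f}, p_min={p.min():.3f}")
```

Brute force like this stops being an option very quickly: the number of states doubles with every unit you add.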

This is the deep idea: shaping the weights shapes the energy, which shapes a probability distribution over states. Learning a Boltzmann machine means molding that distribution to match your data.
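The Boltzmann distribution also explains where the sigmoid update came from. Here is a short derivation, assuming the ±1 units and the energy function defined earlier; $E_+$ and $E_-$ are my shorthand for the energies with unit $i$ on and off while every other unit is held fixed. The normaliser $Z$ cancels, which is why a single unit can be updated without ever computing it.

```latex
\begin{aligned}
p(s_i = +1 \mid \text{rest})
  &= \frac{e^{-E_+/T}}{e^{-E_+/T} + e^{-E_-/T}}
   = \frac{1}{1 + e^{-(E_- - E_+)/T}}
   = \sigma\!\left(\frac{\Delta E_i}{T}\right), \\
\Delta E_i &= E_- - E_+ = 2\Big(\sum_j w_{ij} s_j + b_i\Big).
\end{aligned}
```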

❄️ Simulated annealing: start hot, cool slowly, land deep

Now the practical magic. To find a deep valley (a real memory, not a spurious one), don't go straight to T≈0 — you'll freeze wherever you happen to be. Instead, start hot so the network roams freely over the whole landscape, then cool gradually so it settles into the deepest basin it found. This is simulated annealing, borrowed straight from metallurgy.

Pick a strategy, then Run a single descent and watch the energy trace. Then race 200 runs of each to see who reliably reaches a stored memory.

[Interactive plot: energy and temperature over the course of a run.]
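A single annealed run might look like the sketch below. The geometric cooling schedule, the start and end temperatures, and the sweep count are all assumptions on my part; the demo's strategies may use different schedules.

```python
import numpy as np

T_SHAPE = np.array([1, 1, 1, -1, 1, -1, -1, 1, -1], dtype=float)
L_SHAPE = np.array([1, -1, -1, 1, -1, -1, 1, 1, 1], dtype=float)

patterns = np.array([T_SHAPE, L_SHAPE])
w = patterns.T @ patterns      # Hebbian rule
np.fill_diagonal(w, 0.0)
b = np.zeros(9)

def anneal(w, b, rng, t_start=5.0, t_end=0.05, n_sweeps=200):
    """Start from a random state, then cool geometrically from t_start to t_end."""
    n = len(b)
    s = rng.choice([-1.0, 1.0], size=n)
    temperatures = np.geomspace(t_start, t_end, n_sweeps)
    trace = []
    for T in temperatures:
        for i in rng.permutation(n):            # one sweep: every unit once
            delta_e = 2.0 * (w[i] @ s + b[i])
            p_on = 1.0 / (1.0 + np.exp(-np.clip(delta_e / T, -500, 500)))
            s[i] = 1.0 if rng.random() < p_on else -1.0
        trace.append(-0.5 * s @ w @ s - b @ s)  # energy after this sweep
    return s, np.array(trace)

rng = np.random.default_rng(0)
final, trace = anneal(w, b, rng)
print("final energy:", trace[-1])
print("overlap with T:", int(final @ T_SHAPE), " overlap with L:", int(final @ L_SHAPE))
```

An overlap of ±9 means the run ended exactly on a stored pattern or on its mirror image, which has the same energy when all biases are zero.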

🧠 Learning: two phases, one elegant rule

So far we hand-built the valleys with the Hebbian rule. The real question is: can the network learn its own energy landscape so that its Boltzmann distribution matches a dataset? Yes — and the rule is beautifully symmetric. To make data likely, nudge the weights by the difference between two correlation measurements:

$$\Delta w_{ij} = \varepsilon \left( \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}} \right)$$
Positive phase ("wake"): clamp the visible units to a real data example and measure how often units i and j fire together. This digs valleys under the data.
Negative phase ("sleep"): let the network run free — dreaming — and measure the same correlation under its own distribution. This fills in valleys the model invented on its own.

When the two correlations match, the gradient vanishes and learning stops: the model now reproduces the pairwise statistics of the data. The catch: the negative-phase expectation is taken over the model's own distribution, which has exponentially many states to sum over exactly and can take a very long time to sample to equilibrium. That cost is exactly what the Restricted Boltzmann machine was invented to dodge.
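On a network small enough to enumerate, you can watch the rule converge without any sampling. This sketch makes several simplifying assumptions of mine: every unit is visible (no hidden units), T = 1, the model expectation is computed exactly by brute force, and the biases get the analogous update (data mean minus model mean), which the post doesn't spell out. The toy dataset is invented.

```python
import itertools
import numpy as np

def model_moments(w, b, states):
    """Exact <s_i s_j> and <s_i> under the Boltzmann distribution at T = 1."""
    energies = -0.5 * np.einsum("ki,ij,kj->k", states, w, states) - states @ b
    p = np.exp(-(energies - energies.min()))
    p /= p.sum()
    corr = np.einsum("k,ki,kj->ij", p, states, states)
    return corr, p @ states

def train(data, n_steps=500, lr=0.1):
    n = data.shape[1]
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
    data_corr = data.T @ data / len(data)      # positive phase ("wake")
    data_mean = data.mean(axis=0)
    w, b = np.zeros((n, n)), np.zeros(n)
    for _ in range(n_steps):
        model_corr, model_mean = model_moments(w, b, states)  # negative phase ("sleep")
        w += lr * (data_corr - model_corr)
        np.fill_diagonal(w, 0.0)               # keep self-connections at zero
        b += lr * (data_mean - model_mean)     # assumed bias rule
    return w, b, states

# Invented toy data on 3 units: every state once, plus extra copies where
# units 0 and 1 agree, so the data say "0 and 1 tend to match".
all_states = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
agree = all_states[all_states[:, 0] == all_states[:, 1]]
data = np.vstack([all_states, agree, agree])

w, b, states = train(data)
model_corr, _ = model_moments(w, b, states)
print("learned weights:\n", np.round(w, 3))
print("data  <s0 s1> =", (data[:, 0] * data[:, 1]).mean())
print("model <s0 s1> =", round(model_corr[0, 1], 3))
```

The exact sum inside `model_moments` is the part that blows up exponentially with the number of units.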

🔗 Restricted Boltzmann Machines: make the sampling cheap

The trick is structural. Split the units into a visible layer (the data) and a hidden layer (learned features), and forbid connections within a layer. The graph becomes bipartite: every connection links a visible unit to a hidden unit.

Because no two hidden units talk to each other, the hidden units are conditionally independent given the visible layer, so you can sample the whole hidden layer in one shot, and vice versa. That makes a single round-trip v → h → v′ (the heart of Contrastive Divergence) trivially fast. The resulting gradient estimate is biased, but in practice it is often good enough to train on.

Below, an RBM with 9 visible units (a 3×3 image) and 4 hidden units learns three little shapes — T, L, and a square-ish O — via CD-1, live in your browser. Watch reconstruction error drop and the hidden units grow into feature detectors. (RBMs conventionally use 0/1 units, so this section switches from ±1 to {0,1}.)

[Interactive demo: plot of reconstruction error during training, and the hidden-unit receptive fields (legend: negative weight, positive weight).]
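For readers who want the same experiment offline, here is a rough NumPy sketch of CD-1 on three 3×3 shapes. It is not the code behind the demo: the shapes, learning rate, epoch count, full-batch updates, and the function names `train_rbm` and `reconstruct` are all assumptions of mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

# Assumed 0/1 versions of the three shapes, flattened row by row.
SHAPES = np.array([
    [1, 1, 1, 0, 1, 0, 0, 1, 0],   # T
    [1, 0, 0, 1, 0, 0, 1, 1, 1],   # L
    [1, 1, 1, 1, 0, 1, 1, 1, 1],   # O
], dtype=float)

def train_rbm(data, n_hidden=4, epochs=2000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)        # visible biases
    c = np.zeros(n_hidden)         # hidden biases
    errors = []
    for _ in range(epochs):
        # Positive phase: hidden units given the data, all sampled at once.
        ph0 = sigmoid(data @ W + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one round-trip v -> h -> v'.
        pv1 = sigmoid(h0 @ W.T + a)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + c)
        # CD-1 update: data correlations minus one-step reconstruction correlations.
        W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
        a += lr * (data - v1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
        errors.append(np.mean((data - pv1) ** 2))
    return W, a, c, errors

def reconstruct(v, W, a, c):
    """Deterministic v -> h -> v' pass using probabilities instead of samples."""
    h = sigmoid(v @ W + c)
    return sigmoid(h @ W.T + a)

W, a, c, errors = train_rbm(SHAPES)
noisy_t = SHAPES[0].copy()
noisy_t[8] = 1.0                   # switch on a pixel that is not part of the T
print("error, first 10 epochs:", round(float(np.mean(errors[:10])), 3))
print("error, last 100 epochs:", round(float(np.mean(errors[-100:])), 3))
print(np.round(reconstruct(noisy_t, W, a, c), 2).reshape(3, 3))
```

Each column of `W`, reshaped to 3×3, is one hidden unit's receptive field.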

Reconstruct a pattern: v → h → v′

Click cells to draw on the visible layer (or load a corrupted shape), then push it once through the hidden layer and back. A trained RBM cleans up the input, snapping it toward a shape it knows.

[Interactive demo: visible layer (click to edit), hidden activations, and reconstruction v′.]

🎯 The whole story in one breath

  • Energy turns a network's weights into a landscape; stored patterns are valleys.
  • Temperature + a sigmoid turn deterministic descent (Hopfield) into stochastic sampling.
  • Run long enough and states appear with the Boltzmann probability $e^{-E/T}/Z$.
  • Annealing — hot then cold — escapes spurious minima and lands in deep ones.
  • Learning matches data correlations (wake) to model correlations (sleep).
  • RBMs restrict the graph to a bipartite one, making that sampling fast — the ancestor of deep belief nets and a cornerstone of the early deep-learning revival.

Same energy idea you met behind Hopfield networks — but once you let the neurons gamble, a deterministic memory becomes a generative model. That's the Boltzmann machine.

Written on May 30, 2026