Temperature and Entropy: The One Dial Behind Softmax, LLMs, and Boltzmann Machines

Temperature is the scalar T that a model divides its logits by before the softmax turns them into probabilities: p_i ∝ e^{z_i/T}. Turn it down and the distribution sharpens onto the single most likely token; turn it up and the distribution flattens out. The flatter a distribution, the higher its Shannon entropy, so the everyday intuition "higher temperature, higher entropy" is correct. What is worth seeing is that this is not an analogy at all: it is the same temperature, and the same entropy, that appear in statistical mechanics and in the Boltzmann machine.

Here is the whole note in one sentence; everything below unpacks it:

A softmax is a Boltzmann distribution with energy = −logit. Temperature is the dial that trades expected energy against entropy. It sweeps monotonically from a deterministic argmax at T → 0 (zero entropy) to the uniform distribution at T → ∞ (entropy log N, the maximum) — and the speed of that climb is governed by the variance of the logits, which produces a specific-heat-like "sweet spot" in between.

🌡️ The same distribution, two names

A model scores each candidate token with a real number called a logit z_i. Temperature sampling rescales those logits by T and runs them through the softmax:

p_i(T) = e^{z_i/T} / Σ_j e^{z_j/T}

Now set the energy of token i to be its negative logit, E_i = −z_i, and write β = 1/T. The exact same formula becomes the Boltzmann (Gibbs) distribution of physics, the one the Boltzmann machine samples from:

p_i = e^{−E_i/T} / Z ,   Z = Σ_j e^{−E_j/T}

High logit ⇄ low energy ⇄ high probability. The denominator Z is the partition function. Drag the temperature and watch the next-token distribution for the toy prompt "the cat sat on the ___" reshape, with its entropy tracked underneath.

[Interactive demo: temperature slider (T = 1.00) with readouts for entropy H (bits) and perplexity 2^H (effective choices).]

The entropy as a function of temperature is a single monotone climb from 0 to log_2 N.

Why "logits divided by T" is exactly "energy times β": dividing the exponent's numerator by T is the same as multiplying the energies by β = 1/T. Cold (small T, large β) makes energy differences loom large, so probability piles onto the single lowest-energy state. Hot (large T, small β) washes those differences out, so every state looks alike.
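The equivalence above can be checked numerically. The following is a minimal sketch, assuming made-up logits for the toy prompt (the token list and values are hypothetical, chosen only for illustration); `softmax_t`, `boltzmann`, and `entropy_bits` are names introduced here, not part of any library.

```python
import math

def softmax_t(logits, T):
    """Temperature softmax: p_i proportional to exp(z_i / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann(energies, beta):
    """Boltzmann distribution: p_i proportional to exp(-beta * E_i)."""
    m = min(energies)  # shift so the lowest energy has weight 1; the shift cancels
    weights = [math.exp(-beta * (e - m)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Hypothetical logits for "the cat sat on the ___" (illustrative values only)
tokens = ["mat", "floor", "sofa", "roof", "moon"]
logits = [3.0, 2.0, 1.5, 0.5, -1.0]
energies = [-z for z in logits]  # energy = -logit

for T in (0.5, 1.0, 2.0):
    p = softmax_t(logits, T)
    q = boltzmann(energies, 1.0 / T)
    # Same distribution, two names
    assert all(abs(a - b) < 1e-12 for a, b in zip(p, q))
    H = entropy_bits(p)
    print(f"T={T}: H={H:.3f} bits, perplexity={2 ** H:.3f}")
```

Subtracting the largest scaled logit before exponentiating is a standard stability trick and does not change the resulting probabilities.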

🧊🔥 The two ends of the dial

Both extremes are worth naming because both are used in practice, and both pin down the entropy exactly. Slide the temperature in the demo above to either edge to see them.

  • T → 0: freezing / greedy decoding. The softmax becomes an argmax: all probability collapses onto the single largest logit. Entropy → 0 bits, so the output is deterministic. (If two or more logits tie for the top, you get a uniform split over just those, with entropy equal to the log of the degeneracy: the residual entropy of a ground state in physics.) This is the same "freeze into the nearest memory" behaviour a Hopfield network shows at zero temperature in the Boltzmann note.
  • T → ∞: boiling / uniform. Dividing by a huge number makes every rescaled logit z_i/T ≈ 0, so every token gets the same weight e^0 = 1. The distribution becomes uniform and entropy hits its ceiling log_2 N: maximum ignorance, the same ceiling the entropy note hits when you make a distribution uniform. The logits are completely ignored.

Everything useful happens between these poles. T = 1 is "use the model's own probabilities, untouched." Below 1 sharpens (more confident, less diverse); above 1 flattens (more diverse, less coherent). That is the entire creativity-vs-coherence trade-off in one scalar.
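The two limits, and the tie case, can be probed with very small and very large T. This is a sketch under the assumption of made-up logits; the helper names are introduced here for illustration, and T = 0.01 and T = 1e6 simply stand in for the limits T → 0 and T → ∞.

```python
import math

def softmax_t(logits, T):
    """Temperature softmax with max-subtraction for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(q * math.log2(q) for q in p if q > 0)

logits = [3.0, 2.0, 1.5, 0.5, -1.0]  # hypothetical values
N = len(logits)

cold = softmax_t(logits, 0.01)           # stand-in for T -> 0
hot = softmax_t(logits, 1e6)             # stand-in for T -> infinity
tied = softmax_t([3.0, 3.0, 1.0], 0.01)  # two logits tie for the top

print("cold entropy:", entropy_bits(cold))   # close to 0 bits
print("hot entropy: ", entropy_bits(hot))    # close to log2(N)
print("tied entropy:", entropy_bits(tied))   # close to log2(2) = 1 bit
print("ceiling:     ", math.log2(N))
```

Exact T = 0 would divide by zero, which is why implementations usually special-case greedy decoding as a plain argmax rather than evaluating the softmax.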

⚖️ One dial, two appetites: energy vs entropy

There is a deeper reason temperature controls entropy, and it is the reason the Boltzmann distribution turns up everywhere. Among all distributions with a given expected energy ⟨E⟩, the one with the highest entropy is precisely the Boltzmann distribution — and β = 1/T is the Lagrange multiplier enforcing that energy budget (this is Jaynes' maximum-entropy principle). So picking a temperature is the same as picking how much you care about entropy relative to energy. That trade-off is bundled into one quantity, the free energy:

F(T) = ⟨E⟩ − T · S = −T · ln Z

The Boltzmann distribution is exactly the distribution that minimises F. Read the two terms as competing appetites: minimising ⟨E⟩ wants to concentrate on the best token; maximising S wants to spread out; T sets the exchange rate between them. Cold ⇒ energy wins (sharp, low entropy); hot ⇒ entropy wins (spread, high entropy). Drag the temperature and watch the three quantities trade off. (Units: ⟨E⟩ and F are in logit (score) units — energy is just −logit — while S is in nats (natural log, where F = −T ln Z holds exactly; nats are bits × ln 2). The factor T in T·S is what converts nats into the same logit units, so the subtraction is well-posed.)
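Both claims about F can be checked on a toy example: that ⟨E⟩ − T·S equals −T ln Z at the Boltzmann distribution, and that other distributions have a higher free energy. The sketch below assumes made-up logits and an arbitrary T = 1.5; `free_energy` is a helper name introduced here, and the two comparison distributions are just convenient examples, not an exhaustive proof.

```python
import math

def boltzmann(energies, T):
    """Boltzmann distribution p_i proportional to exp(-E_i / T)."""
    m = min(energies)  # shift for stability; cancels in the ratio
    weights = [math.exp(-(e - m) / T) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def free_energy(p, energies, T):
    """F = <E> - T*S for an arbitrary distribution p, with S in nats."""
    mean_e = sum(pi * e for pi, e in zip(p, energies))
    s_nats = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return mean_e - T * s_nats

logits = [3.0, 2.0, 1.5, 0.5, -1.0]  # hypothetical values
energies = [-z for z in logits]      # energy = -logit
T = 1.5

p_star = boltzmann(energies, T)
F_star = free_energy(p_star, energies, T)

# Closed form: F = -T ln Z with the unshifted partition function
Z = sum(math.exp(-e / T) for e in energies)
F_closed = -T * math.log(Z)

# Two other candidate distributions for comparison
uniform = [1.0 / len(energies)] * len(energies)  # all entropy, ignores energy
greedy = [1.0, 0.0, 0.0, 0.0, 0.0]               # all energy, zero entropy

print("F at Boltzmann:", F_star, "closed form:", F_closed)
print("F at uniform:  ", free_energy(uniform, energies, T))
print("F at greedy:   ", free_energy(greedy, energies, T))
```

The gap between any distribution's free energy and the minimum equals T times its KL divergence from the Boltzmann distribution, which is why no candidate can go below F_star.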

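[Interactive demo: temperature slider (T = 1.00). Charts: ⟨E⟩ (energy), S (entropy), and F (free energy) versus T; and dS/dT, the entropy gained per degree. Readouts: ⟨E⟩ in logits, S in nats, F = ⟨E⟩ − T·S in logits.]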

📈 The sweet spot: where entropy rises fastest

"Entropy increases with temperature" is true, but it does not increase evenly. The right chart above is the rate dS/dT, and it has a hump. There is a characteristic temperature where the distribution is most sensitive, where a small turn of the dial buys the most new entropy. Below it the model is nearly frozen on its top token; above it the gains taper as it approaches uniform. That peak is reminiscent of a phase transition, though with a finite vocabulary there is no true transition (the curve is perfectly smooth, never non-analytic); it is the smeared-out, finite-size echo of one. The rate is exactly the statistical-mechanics specific heat C = Var(E)/T^2 divided by T (with S in nats; for bits, divide by ln 2):

dS/dT = C/T = Var(E) / T^3
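For readers who want the intermediate steps, one way to reach this rate is the short derivation below. It is a sketch using the note's own definitions (S in nats, β = 1/T, Boltzmann p_i), together with the standard identities for the derivatives of ln Z and ⟨E⟩ with respect to β.

```latex
\begin{align*}
S &= -\sum_i p_i \ln p_i = \ln Z + \beta \langle E \rangle,
  \qquad p_i = \frac{e^{-\beta E_i}}{Z}, \quad \beta = \frac{1}{T} \\
\frac{d \ln Z}{d\beta} &= -\langle E \rangle,
  \qquad \frac{d \langle E \rangle}{d\beta} = -\mathrm{Var}(E) \\
\frac{dS}{d\beta} &= -\langle E \rangle + \langle E \rangle
  + \beta \, \frac{d \langle E \rangle}{d\beta} = -\beta \, \mathrm{Var}(E) \\
\frac{dS}{dT} &= \frac{dS}{d\beta} \cdot \frac{d\beta}{dT}
  = \bigl(-\beta \, \mathrm{Var}(E)\bigr) \Bigl(-\frac{1}{T^2}\Bigr)
  = \frac{\mathrm{Var}(E)}{T^3}
\end{align*}
```

The first line is consistent with F = ⟨E⟩ − T·S = −T ln Z: multiplying S = ln Z + β⟨E⟩ by T and rearranging gives the same identity.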

The whole shape is driven by Var(E), the variance of the energies (equivalently, of the logits) under the current distribution p(T). Two consequences fall straight out of this:

  • Because Var(E) ≥ 0, the rate is never negative — so entropy is monotonically non-decreasing in T. The intuition is now a theorem, not a hunch: more temperature can never mean less entropy.
  • The peak sits near the scale of the logit gaps. Models with spread-out logits (a confident model) need more heating before they loosen up; models with bunched logits are already near-uniform and have little entropy left to gain. This is the mechanical reason a single global temperature behaves so differently across prompts, and a rough reason the scale of the model's typical logit spread is a useful starting point for the sampling temperature (the best value is ultimately an empirical, quality-driven choice, not the dS/dT peak itself).
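Both consequences can be checked numerically: compare Var(E)/T^3 with a finite-difference estimate of dS/dT, then scan a grid of temperatures for the peak. This is a sketch assuming made-up logits; the grid range 0.05 to 10 and the step size h are arbitrary choices, and the peak location it prints is specific to these toy values.

```python
import math

def boltzmann(energies, T):
    """Boltzmann distribution p_i proportional to exp(-E_i / T)."""
    m = min(energies)
    weights = [math.exp(-(e - m) / T) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_nats(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def ds_dt_exact(energies, T):
    """Var(E) / T^3, with the variance taken under p(T)."""
    p = boltzmann(energies, T)
    mean_e = sum(pi * e for pi, e in zip(p, energies))
    var_e = sum(pi * (e - mean_e) ** 2 for pi, e in zip(p, energies))
    return var_e / T ** 3

def ds_dt_numeric(energies, T, h=1e-5):
    """Central finite difference of S(T), as an independent check."""
    s_hi = entropy_nats(boltzmann(energies, T + h))
    s_lo = entropy_nats(boltzmann(energies, T - h))
    return (s_hi - s_lo) / (2 * h)

logits = [3.0, 2.0, 1.5, 0.5, -1.0]  # hypothetical values
energies = [-z for z in logits]

grid = [0.05 * k for k in range(1, 201)]  # T from 0.05 to 10.0
rates = [ds_dt_exact(energies, T) for T in grid]
peak_index = rates.index(max(rates))

print("exact vs numeric at T=1:",
      ds_dt_exact(energies, 1.0), ds_dt_numeric(energies, 1.0))
print("most sensitive T on this grid:", grid[peak_index])
```

Rescaling all the logits by a constant factor rescales the peak temperature by the same factor, which is one way to see why the peak tracks the scale of the logit gaps.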
The annealing connection. Simulated annealing starts hot (high entropy, free to explore), then cools through this sensitive region so the system can settle into a deep low-energy state instead of freezing into the first one it meets. Same dial, used as a schedule rather than a fixed setting.
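A toy version of that schedule might look like the following. It is a minimal sketch, not the procedure of any particular paper: it assumes a Metropolis acceptance rule, a geometric cooling schedule, proposals drawn uniformly over a handful of discrete states, and made-up energies. The function `anneal` and all its parameters are hypothetical names chosen for illustration.

```python
import math
import random

def anneal(energies, T_start=5.0, T_end=0.01, steps=2000, seed=0):
    """Metropolis sampling over discrete states while T cools geometrically."""
    rng = random.Random(seed)
    n = len(energies)
    state = rng.randrange(n)
    # Geometric schedule: T_start on the first step, T_end on the last
    cooling = (T_end / T_start) ** (1.0 / (steps - 1))
    T = T_start
    for _ in range(steps):
        proposal = rng.randrange(n)  # uniform (symmetric) proposal
        dE = energies[proposal] - energies[state]
        # Always accept downhill moves; accept uphill with prob exp(-dE / T)
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            state = proposal
        T *= cooling
    return state

# Hypothetical energies (negative logits); index 0 is the ground state
energies = [-3.0, -2.0, -1.5, -0.5, 1.0]
final_state = anneal(energies)
print("final state:", final_state, "energy:", energies[final_state])
```

With only five states and global proposals there are no traps to escape, so this toy only shows the mechanics of the schedule; the benefit of slow cooling appears on rugged landscapes with local moves.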

🎲 Sampling at temperature: the entropy you actually get

Entropy is an expectation, so take the average. This is the same sampler from the Shannon entropy note, but now temperature sets the target: we draw tokens from p(T), score each by its surprise −log_2 p, and track the running average. By the law of large numbers it homes in on H(T), the dashed line, which the slider moves up and down.

[Interactive demo: temperature slider (T = 1.00). Chart: running average surprise against the target entropy H(T). Readouts: number of draws, average surprise, entropy H(T).]

Try this: run it cold and the average surprise hugs a low line near 0 (the model keeps emitting "mat"); slide hot mid-run and the target jumps up toward log_2 N while the running average chases it. The wiggle early on is small-sample noise; it always settles onto H(T).
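The same experiment can be run offline. The sketch below assumes made-up logits and a fixed random seed; `average_surprise` is a name introduced here, and 20,000 draws is an arbitrary sample size, large enough for the average to sit close to H(T).

```python
import math
import random

def softmax_t(logits, T):
    """Temperature softmax with max-subtraction for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def average_surprise(p, draws, seed=0):
    """Sample token indices from p and average their surprise -log2 p."""
    rng = random.Random(seed)
    indices = rng.choices(range(len(p)), weights=p, k=draws)
    return sum(-math.log2(p[i]) for i in indices) / draws

logits = [3.0, 2.0, 1.5, 0.5, -1.0]  # hypothetical values

for T in (0.5, 1.0, 2.0):
    p = softmax_t(logits, T)
    avg = average_surprise(p, draws=20000)
    print(f"T={T}: average surprise={avg:.3f} bits, H(T)={entropy_bits(p):.3f} bits")
```

The gap between the two printed numbers shrinks roughly like one over the square root of the number of draws, which is the small-sample wiggle described above.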

📚 Where this lives in the literature

The temperature–entropy link is not a folk heuristic; it is the bridge between information theory and statistical mechanics, and it has been discussed for decades under several names.

  • Maximum entropy (Jaynes, 1957). The Boltzmann distribution is derived as the maximum-entropy distribution at fixed expected energy; β = 1/T is the Lagrange multiplier. This is why temperature is the entropy dial — by construction.
  • Gibbs / statistical mechanics. The partition function Z, free energy F = −T ln Z, and specific heat C = Var(E)/T^2 are the standard toolkit; everything in this note is a one-line translation of it into "logits."
  • Simulated annealing (Kirkpatrick, Gelatt & Vecchi, 1983) and Boltzmann machines (Ackley, Hinton & Sejnowski, 1985). Temperature used as a schedule to control how much the system explores — the subject of the Boltzmann note.
  • Knowledge distillation (Hinton, Vinyals & Dean, 2015). A high softmax temperature deliberately raises the entropy of a teacher's outputs to expose its "dark knowledge" — the relative probabilities of the wrong answers — so a student can learn from soft targets.
  • Neural-text decoding. Temperature sampling sits alongside top-k and nucleus / top-p sampling (Holtzman et al., 2020) as the standard knobs for trading diversity against coherence; all of them are, in effect, entropy controls on the next-token distribution.
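To make the "entropy controls" reading of the decoding knobs concrete, the sketch below applies top-k and nucleus (top-p) truncation to a toy distribution and compares entropies. It assumes made-up logits; `top_k` and `top_p` are simplified illustrations written for this note, not the API of any decoding library, and real implementations typically operate on logits in batches.

```python
import math

def softmax_t(logits, T):
    """Temperature softmax with max-subtraction for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def top_k(p, k):
    """Keep the k most probable tokens, zero the rest, renormalise."""
    keep = set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    total = sum(p[i] for i in keep)
    return [p[i] / total if i in keep else 0.0 for i in range(len(p))]

def top_p(p, threshold):
    """Keep the smallest top set whose cumulative mass reaches the threshold."""
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += p[i]
        if mass >= threshold:
            break
    return [p[i] / mass if i in keep else 0.0 for i in range(len(p))]

logits = [3.0, 2.0, 1.5, 0.5, -1.0]  # hypothetical values
p = softmax_t(logits, 1.0)
p_k = top_k(p, 2)
p_nucleus = top_p(p, 0.9)

print("full:   ", entropy_bits(p))
print("top-k=2:", entropy_bits(p_k))
print("top-p=0.9:", entropy_bits(p_nucleus))
```

Unlike temperature, which reshapes every probability smoothly, both truncation rules cut the tail to exactly zero, so they lower entropy by removing choices rather than by reweighting them.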

🎯 The consequences worth remembering

  • Monotone, and provably so. dS/dT = Var(E)/T^3 ≥ 0: heating never lowers entropy.
  • Two hard limits. T→0 is deterministic argmax (entropy 0); T→∞ is uniform (entropy log N). Greedy decoding and "maximally random" are the same dial at its two ends.
  • Perplexity is the visible face of entropy. Perplexity = 2^H (with H in bits) is the effective number of choices the model is weighing; temperature tunes it directly. That is the readout under the first demo.
  • A specific-heat sweet spot. The most sensitive temperature is set by the spread of the logits — which is why one global temperature is never optimal for every prompt, and why adaptive / entropy-aware sampling is an active research direction.
  • It is the same temperature. The T in your sampling settings, the T in a Boltzmann machine, and the T in a gas are the one parameter, doing the one job: setting the exchange rate between energy and entropy.
Written on June 22, 2026