Temperature and Entropy: The One Dial Behind Softmax, LLMs, and Boltzmann Machines
Temperature is the scalar T that a model divides its logits by before the softmax turns them into probabilities: $p_i \propto e^{z_i/T}$. Turn it down and the distribution sharpens onto the single most likely token; turn it up and the distribution flattens out. The flatter a distribution, the higher its Shannon entropy, so the everyday intuition "higher temperature, higher entropy" is correct. What is worth seeing is that this is not an analogy at all: it is the same temperature, and the same entropy, that appear in statistical mechanics and in the Boltzmann machine.
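Here is the whole note in one sentence: the temperature softmax is a Boltzmann distribution whose energies are the negative logits, and raising T moves its entropy monotonically from 0 (argmax) up to log N (uniform). Everything below unpacks that.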
🌡️ The same distribution, two names
A model scores each candidate token with a real number called a logit $z_i$. Temperature sampling divides those logits by T and runs them through the softmax:
$$p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$
Now set the energy of token i to be its negative logit, $E_i = -z_i$, and write β = 1/T. The exact same formula becomes the Boltzmann (Gibbs) distribution of physics, the one the Boltzmann machine samples from:
$$p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_j e^{-\beta E_j}$$
High logit ⇄ low energy ⇄ high probability. The denominator Z is the partition function.
[Interactive demo: the next-token distribution for the toy prompt "the cat sat on the ___" as the temperature changes, with its entropy tracked underneath.]
[Chart: the entropy as a function of temperature, a single monotone climb from 0 to $\log_2 N$.]
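The two formulas above can be checked against each other numerically. The following is a minimal sketch, not the code behind the original demo: the token list and logit values are invented for illustration, and the helper names `softmax_t`, `boltzmann`, and `entropy_bits` are hypothetical.

```python
import math

def softmax_t(logits, T):
    """Temperature softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow and cancels in the ratio
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann(energies, beta):
    """Boltzmann distribution: p_i = exp(-beta * E_i) / Z."""
    e_min = min(energies)  # shifting every energy by a constant leaves p unchanged
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    Z = sum(weights)  # partition function of the shifted energies
    return [w / Z for w in weights]

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Hypothetical logits for "the cat sat on the ___" (invented for illustration).
tokens = ["mat", "sofa", "floor", "roof", "moon"]
logits = [4.0, 3.0, 2.5, 1.0, -1.0]
energies = [-z for z in logits]  # E_i = -z_i

for T in (0.25, 0.5, 1.0, 2.0, 4.0):
    p_soft = softmax_t(logits, T)
    p_boltz = boltzmann(energies, beta=1.0 / T)
    # The two parameterisations agree up to floating-point rounding.
    assert all(abs(a - b) < 1e-12 for a, b in zip(p_soft, p_boltz))
    print(f"T={T:<4} p({tokens[0]})={p_soft[0]:.3f}  H={entropy_bits(p_soft):.3f} bits")
```

Running it prints the probability of the top token falling and the entropy rising as T grows, which is the monotone climb described above.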
🧊🔥 The two ends of the dial
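Both extremes are worth naming because both are used in practice, and both pin down the entropy exactly. As T → 0 the distribution collapses onto the top-scoring token, which is greedy (argmax) decoding with entropy 0; as T → ∞ it flattens to uniform, with entropy log N. Slide the temperature in the demo above to either edge to see them.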
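A quick numerical look at the two ends, as a sketch under assumed toy logits (the values and helper names are hypothetical). It also shows one caveat: the cold limit has entropy 0 only when the top logit is unique. With a k-way tie at the top, the cold limit is uniform over the tied tokens and the entropy is log2 k.

```python
import math

def softmax_t(logits, T):
    """Temperature softmax, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0)

logits = [4.0, 3.0, 2.5, 1.0, -1.0]  # hypothetical, unique top logit
tied = [2.0, 2.0, 0.0]               # hypothetical, two-way tie at the top

cold = softmax_t(logits, 0.01)       # stand-in for T -> 0
hot = softmax_t(logits, 1000.0)      # stand-in for T -> infinity
cold_tied = softmax_t(tied, 0.01)

print("cold     :", [round(q, 4) for q in cold], f"H={entropy_bits(cold):.4f}")
print("hot      :", [round(q, 4) for q in hot], f"H={entropy_bits(hot):.4f}")
print("log2 N   :", round(math.log2(len(logits)), 4))
print("cold tie :", [round(q, 4) for q in cold_tied], f"H={entropy_bits(cold_tied):.4f}")
```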
⚖️ One dial, two appetites: energy vs entropy
There is a deeper reason temperature controls entropy, and it is the reason the Boltzmann distribution turns up everywhere. Among all distributions with a given expected energy 〈E〉, the one with the highest entropy is precisely the Boltzmann distribution, and β = 1/T is the Lagrange multiplier enforcing that energy budget (this is Jaynes' maximum-entropy principle). So picking a temperature is the same as picking how much you care about entropy relative to energy. That trade-off is bundled into one quantity, the free energy:
$$F = \langle E \rangle - T\,S$$
The Boltzmann distribution is exactly the distribution that minimises F, and the minimum value is −T ln Z. Read the two terms as competing appetites: minimising 〈E〉 wants to concentrate on the best token; maximising S wants to spread out; T sets the exchange rate between them. Cold ⇒ energy wins (sharp, low entropy); hot ⇒ entropy wins (spread, high entropy).
[Interactive demo: 〈E〉, S, and F plotted against temperature.]
A note on units: 〈E〉 and F are in logit (score) units, since energy is just −logit, while S is in nats, the natural-log unit in which F = −T ln Z holds exactly (entropy in nats = entropy in bits × ln 2). The factor T in T·S converts nats into those same logit units, so the subtraction is well-posed.
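The minimisation claim can be tested directly. The sketch below, with hypothetical logits and helper names, evaluates F = 〈E〉 − T·S for the Boltzmann distribution and for two competitors (uniform and greedy one-hot). It relies on a standard identity not spelled out in this note: for any distribution q, the excess free energy over the Boltzmann minimum equals T times the KL divergence from q to the Boltzmann distribution, which is never negative.

```python
import math

def boltzmann(energies, T):
    """Boltzmann distribution at temperature T (energies shifted for stability)."""
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / T) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_nats(p):
    """Shannon entropy with the natural log, so that F = -T ln Z holds."""
    return -sum(q * math.log(q) for q in p if q > 0)

def free_energy(p, energies, T):
    """Free energy functional F[p] = <E>_p - T * S[p]."""
    mean_e = sum(q * e for q, e in zip(p, energies))
    return mean_e - T * entropy_nats(p)

def kl_nats(q, p):
    """KL(q || p) in nats; assumes p_i > 0 wherever q_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

logits = [4.0, 3.0, 2.5, 1.0, -1.0]  # hypothetical toy logits
energies = [-z for z in logits]
T = 1.5
N = len(energies)

p_star = boltzmann(energies, T)
Z = sum(math.exp(-e / T) for e in energies)  # unshifted partition function
F_star = free_energy(p_star, energies, T)

uniform = [1.0 / N] * N
greedy = [1.0 if e == min(energies) else 0.0 for e in energies]

print(f"F at Boltzmann = {F_star:.6f}   -T ln Z = {-T * math.log(Z):.6f}")
for name, q in (("uniform", uniform), ("greedy", greedy)):
    excess = free_energy(q, energies, T) - F_star
    print(f"{name:8s} excess F = {excess:.6f}   T * KL = {T * kl_nats(q, p_star):.6f}")
```

Greedy wins on energy and loses on entropy; uniform does the opposite; the Boltzmann distribution at the chosen T beats both on F.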
📈 The sweet spot: where entropy rises fastest
"Entropy increases with temperature" is true, but it does not increase evenly. The right chart above is the rate dS/dT, and it has a hump. There is a characteristic temperature where the distribution is most sensitive, where a small turn of the dial buys the most new entropy. Below it the model is nearly frozen on its top token; above it the gains taper as it approaches uniform. That peak is reminiscent of a phase transition, though with a finite vocabulary there is no true transition (the curve is perfectly smooth, never non-analytic); it is the smeared-out, finite-size echo of one. The rate is the statistical-mechanics specific heat C divided by T (with S in nats; for bits, divide by ln 2):
$$\frac{dS}{dT} = \frac{C}{T} = \frac{\mathrm{Var}(E)}{T^3}, \qquad C = \frac{d\langle E \rangle}{dT} = \frac{\mathrm{Var}(E)}{T^2}$$
The whole shape is driven by Var(E), the variance of the energies (equivalently, of the logits) under the current distribution p(T). Two consequences fall straight out of this:
- Because Var(E) ≥ 0, the rate is never negative — so entropy is monotonically non-decreasing in T. The intuition is now a theorem, not a hunch: more temperature can never mean less entropy.
- The peak sits near the scale of the logit gaps. A model with spread-out logits (a confident model) needs more heating before it loosens up; a model with bunched logits is already near-uniform and has little entropy left to gain. This is the mechanical reason a single global temperature behaves so differently across prompts. It is also a rough reason to use the model's typical logit spread as a starting point for the sampling temperature, though the best value is ultimately an empirical, quality-driven choice, not the dS/dT peak itself.
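Both bullets can be checked numerically. This sketch compares the closed form dS/dT = Var(E)/T³ against a central finite difference of the entropy, then locates the peak of dS/dT on a coarse grid for one set of toy logits and for the same logits with every gap doubled. The logits, grid, step size, and function names are all assumptions made for illustration.

```python
import math

def boltzmann(energies, T):
    """Boltzmann distribution at temperature T (energies shifted for stability)."""
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / T) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_nats(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def ds_dt_exact(energies, T):
    """Closed form: dS/dT = Var(E) / T**3, with Var taken under p(T)."""
    p = boltzmann(energies, T)
    mean = sum(q * e for q, e in zip(p, energies))
    var = sum(q * (e - mean) ** 2 for q, e in zip(p, energies))
    return var / T ** 3

def ds_dt_numeric(energies, T, h=1e-5):
    """Central finite difference of S(T), used only as a cross-check."""
    s_plus = entropy_nats(boltzmann(energies, T + h))
    s_minus = entropy_nats(boltzmann(energies, T - h))
    return (s_plus - s_minus) / (2 * h)

def peak_temperature(energies, grid):
    """Grid point where dS/dT is largest (resolution limited by the grid)."""
    return max(grid, key=lambda T: ds_dt_exact(energies, T))

logits = [4.0, 3.0, 2.5, 1.0, -1.0]       # hypothetical toy logits
energies = [-z for z in logits]
wide = [2 * e for e in energies]          # same shape, every logit gap doubled
grid = [k / 100 for k in range(5, 1001)]  # T from 0.05 to 10.00

for T in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(f"T={T:<4} exact={ds_dt_exact(energies, T):.6f} "
          f"numeric={ds_dt_numeric(energies, T):.6f}")

print("peak T (original gaps):", peak_temperature(energies, grid))
print("peak T (doubled gaps): ", peak_temperature(wide, grid))
```

Because the distribution depends on the logits only through z/T, doubling every gap should move the peak to about twice the temperature, up to the grid resolution.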
🎲 Sampling at temperature: the entropy you actually get
Entropy is an expectation, so take the average. This is the same sampler from the Shannon entropy note, but now temperature sets the target: we draw tokens from p(T), score each by its surprise $-\log_2 p$, and track the running average. By the law of large numbers it homes in on H(T), the dashed line, which the slider moves up and down.
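The sampling experiment is easy to reproduce offline. The sketch below is not the sampler from the Shannon entropy note; it is a stand-in built on Python's standard `random.choices`, with hypothetical logits, a fixed seed, and an arbitrary sample count.

```python
import math
import random

def softmax_t(logits, T):
    """Temperature softmax, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def average_surprise(p, n_samples, rng):
    """Monte Carlo estimate of H: the mean of -log2 p(x) over draws x ~ p."""
    draws = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(-math.log2(p[i]) for i in draws) / n_samples

logits = [4.0, 3.0, 2.5, 1.0, -1.0]  # hypothetical toy logits
rng = random.Random(0)               # fixed seed so the run is repeatable

for T in (0.5, 1.0, 2.0):
    p = softmax_t(logits, T)
    estimate = average_surprise(p, 50_000, rng)
    print(f"T={T:<4} sampled={estimate:.3f} bits  exact H(T)={entropy_bits(p):.3f} bits")
```

The sampled average should land close to the exact entropy at each temperature, and the gap shrinks roughly as one over the square root of the sample count.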
📚 Where this lives in the literature
The temperature–entropy link is not a folk heuristic; it is the bridge between information theory and statistical mechanics, and it has been discussed for decades under several names.
- Maximum entropy (Jaynes, 1957). The Boltzmann distribution is derived as the maximum-entropy distribution at fixed expected energy; β = 1/T is the Lagrange multiplier. This is why temperature is the entropy dial — by construction.
- Gibbs / statistical mechanics. The partition function Z, free energy F = −T ln Z, and specific heat $C = \mathrm{Var}(E)/T^2$ are the standard toolkit; everything in this note is a one-line translation of it into "logits."
- Simulated annealing (Kirkpatrick, Gelatt & Vecchi, 1983) and Boltzmann machines (Ackley, Hinton & Sejnowski, 1985). Temperature used as a schedule to control how much the system explores — the subject of the Boltzmann note.
- Knowledge distillation (Hinton, Vinyals & Dean, 2015). A high softmax temperature deliberately raises the entropy of a teacher's outputs to expose its "dark knowledge" — the relative probabilities of the wrong answers — so a student can learn from soft targets.
- Neural-text decoding. Temperature sampling sits alongside top-k and nucleus / top-p sampling (Holtzman et al., 2020) as the standard knobs for trading diversity against coherence; all of them are, in effect, entropy controls on the next-token distribution.
🎯 The consequences worth remembering
- Monotone, and provably so. $dS/dT = \mathrm{Var}(E)/T^3 \ge 0$: heating never lowers entropy.
- Two hard limits. T→0 is deterministic argmax (entropy 0, provided the top logit is unique); T→∞ is uniform (entropy log N). Greedy decoding and "maximally random" are the same dial at its two ends.
- Perplexity is the visible face of entropy. Perplexity = $2^H$ (with H in bits) is the effective number of choices the model is weighing; temperature tunes it directly. That is the readout under the first demo.
- A specific-heat sweet spot. The most sensitive temperature is set by the spread of the logits, which is why one global temperature is rarely optimal for every prompt, and why adaptive / entropy-aware sampling is an active research direction.
- It is the same temperature. The T in your sampling settings, the T in a Boltzmann machine, and the T in a gas are the one parameter, doing the one job: setting the exchange rate between energy and entropy.
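The perplexity bullet above can be made concrete with a few lines. This is a sketch with hypothetical logits and helper names. Note that the quantity computed here is the perplexity of the sampling distribution itself (2 raised to its entropy in bits), not the perplexity of a model evaluated on held-out text.

```python
import math

def softmax_t(logits, T):
    """Temperature softmax, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def perplexity(p):
    """Effective number of choices: 2 ** H, with H in bits."""
    return 2.0 ** entropy_bits(p)

logits = [4.0, 3.0, 2.5, 1.0, -1.0]  # hypothetical toy logits

for T in (0.01, 0.5, 1.0, 2.0, 1000.0):
    p = softmax_t(logits, T)
    print(f"T={T:<7} H={entropy_bits(p):.3f} bits  perplexity={perplexity(p):.3f}")
```

At the cold end the perplexity approaches 1 (one effective choice); at the hot end it approaches N, the number of candidate tokens.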
