Shannon Entropy, Visually: Surprise, Uncertainty, and Bits

Shannon entropy is usually introduced as a formula — H = −Σ p log p — labelled "the amount of information" and left there. That label is correct but inert: it doesn't say what the quantity actually counts, or why the logarithm shows up. The point of this note is to rebuild the meaning from the ground up, one piece at a time, with a demo attached to each step so the formula becomes something to reason about rather than recite.

Here's the whole story in a few sentences; everything below just unpacks it:

Entropy is average surprise. A rare outcome is surprising and carries a lot of information; a near-certain one carries almost none. Entropy is just the expected surprise of a random draw, and because we measure surprise in bits, it sets the average number of yes/no questions you'd need to pin down the outcome: the best questioning strategy needs at least H questions on average, and fewer than H + 1.

😲 The atom: surprise of a single outcome

Before averaging anything, look at one outcome. If an event has probability p, how surprised should you be when it happens? Shannon's answer — the only one satisfying a few natural rules — is the information content:

h(x) = log₂ ( 1 / p ) = −log₂ p bits

Why this shape? A certain event (p = 1) tells you nothing new → 0 bits. Halve a probability and you add exactly one bit of surprise (one extra yes/no question). So p = ½ is 1 bit, p = ¼ is 2 bits, p = 1/8 is 3 bits. Drag the probability and watch the surprise:

[Interactive demo: a slider for p(x) with a surprise readout in bits; at p(x) = 0.125 it reads 3.00 bits.]
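
The halving rule is easy to check numerically. Here is a minimal Python sketch; the function name `surprise_bits` is made up for this note and is not part of the demo.

```python
import math

def surprise_bits(p: float) -> float:
    """Information content log2(1/p), in bits, of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

# Each halving of p adds exactly one bit of surprise.
for p in (1.0, 0.5, 0.25, 0.125, 1 / 128):
    print(f"p = {p:<10.6g} surprise = {surprise_bits(p):.2f} bits")
```

Probabilities that are powers of ½ give whole numbers of bits; anything in between gives a fractional value, which is still meaningful as an average over many draws.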

"The sun rose today" (p≈1): ≈ 0 bits. No information — you already knew it.

"I flipped 7 heads in a row" (p = 1/128): 7 bits. Genuinely surprising, and it takes 7 yes/no answers to single it out.

🪙 Average surprise: the biased coin

Now average. A coin lands heads with probability p, tails with 1 − p. Sometimes you get the surprising outcome, sometimes the likely one. Weight each surprise by how often it happens and you get the entropy of the coin:

H(p) = −p log₂ p − (1−p) log₂(1−p) bits

Drag the bias and watch the curve. It's maximal at p = ½ — a fair coin is the most uncertain, worth a full 1 bit — and collapses to 0 at either end, where the coin is rigged and there's nothing left to be surprised about.

[Interactive demo: a slider for p(heads) with an entropy readout in bits; at p(heads) = 0.50 it reads H = 1.00 bits, labelled "maximally uncertain".]

Notice the asymmetry of certainty: moving from p = 0.5 to 0.6 barely dents the entropy (1.00 to about 0.97 bits), but moving from 0.9 to 0.99 sheds a lot (about 0.47 to about 0.08 bits). The curve is nearly flat around a fair coin and steep near the ends, so a small shift in probability close to certainty changes the entropy far more than the same shift near p = ½.
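
To see the shape of the coin curve without the slider, the formula can be evaluated at a few biases. This is a sketch only; `binary_entropy` is a name chosen here, and the convention 0 · log₂ 0 = 0 is assumed for a fully rigged coin.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # Assumed convention: 0 * log2(0) = 0, so a rigged coin has zero entropy.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.5, 0.6, 0.9, 0.99):
    print(f"p(heads) = {p:.2f}  H = {binary_entropy(p):.3f} bits")
```

The function is symmetric in p and 1 − p, so a coin biased 70/30 toward heads has the same entropy as one biased 70/30 toward tails.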

🎛️ More than two outcomes: shape a distribution

Coins are just the two-outcome case. For any distribution over outcomes x, entropy is the same average-surprise idea, summed over everything that can happen:

H = Σ_x p(x) · log₂ ( 1 / p(x) ) = −Σ_x p(x) log₂ p(x)

Below are tomorrow's weather odds. Drag the sliders to re-shape the distribution (the bars renormalise to sum to 1) and watch the entropy meter. Two limits bracket everything: 0 bits when one outcome is certain, and log₂N bits when all N outcomes are equally likely — maximum ignorance.

[Interactive demo: sliders for tomorrow's weather odds, with readouts for the entropy H and the maximum log₂N, both in bits.]

Make it certain (push one slider to the top): H → 0. If you already know it'll be sunny, a forecast tells you nothing.

Make it uniform: H peaks at log₂N. With N equally likely states, that's the most a forecast could ever tell you.
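
The same two limits can be checked in a few lines of Python. The sketch below assumes four weather states with made-up slider positions; `normalise` and `entropy_bits` are illustrative names, not the demo's own code.

```python
import math

def normalise(weights):
    """Scale non-negative slider weights so they sum to 1."""
    weights = list(weights)
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    return [w / total for w in weights]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical weather states and slider positions, chosen for illustration.
weather = {"sunny": 4, "cloudy": 2, "rain": 1, "snow": 1}
probs = normalise(weather.values())

print(f"shaped : H = {entropy_bits(probs):.2f} bits")
print(f"uniform: H = {entropy_bits(normalise([1, 1, 1, 1])):.2f} bits")
print(f"certain: H = {entropy_bits(normalise([1, 0, 0, 0])):.2f} bits")
print(f"max    : log2(N) = {math.log2(len(probs)):.2f} bits")
```

Skipping the p = 0 terms implements the usual convention that an impossible outcome adds nothing to the average.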

🎲 "Average surprise" is not a metaphor

Entropy is an expectation, so let's actually take the average. Using the distribution you shaped above, we'll draw outcomes one at a time, score each draw by its surprise −log₂p(x), and track the running average. By the law of large numbers it must home in on the entropy H — the dashed line — no matter where it starts.

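[Interactive demo: a plot of the running average surprise against the entropy H (dashed target line), with readouts for the number of draws, the average surprise, and the entropy H.]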

Try this: shape a lopsided distribution above, then run the sampler — the average surprise still converges onto H, just to a lower value. Then make it uniform and watch it climb to log₂N. The wiggling early on is small-sample noise; it always settles.
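
The sampling experiment can also be reproduced offline. The following is a rough sketch, assuming a fixed four-outcome distribution and a seeded generator so the run is repeatable; the function names and the number of draws are arbitrary choices, not taken from the demo.

```python
import math
import random

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def running_average_surprise(probs, n_draws, seed=0):
    """Draw n_draws outcomes and return the running mean of log2(1/p(x))."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_draws)
    averages, total = [], 0.0
    for i, x in enumerate(draws, start=1):
        total += math.log2(1.0 / probs[x])  # surprise of this draw
        averages.append(total / i)          # mean surprise so far
    return averages

probs = [0.5, 0.25, 0.125, 0.125]  # assumed example distribution
averages = running_average_surprise(probs, n_draws=20_000)
for n in (10, 100, 1_000, 20_000):
    print(f"after {n:>6} draws: avg surprise = {averages[n - 1]:.3f} bits")
print(f"entropy H = {entropy_bits(probs):.3f} bits")
```

One detail worth noticing: for a uniform distribution every draw has the same surprise, log₂N, so the running average sits on the target from the first draw and the early wiggle disappears.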

🎯 The whole story in one breath

Surprise of one outcome is −log₂p: rare = many bits, certain = zero.
Entropy is the average surprise — the expected bits per draw, H = −Σ p log₂ p.
It's maximised by uniformity (log₂N bits — total ignorance) and zero under certainty.
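Measured in bits, it's the floor on the average number of yes/no questions under the best questioning strategy (reached exactly when every probability is a power of ½), and therefore the floor on lossless compression (Shannon's source coding theorem).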
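
The link between entropy and yes/no questions can be made concrete with a question tree. The sketch below uses Huffman coding, which this note does not otherwise name, as one standard way to build an optimal tree when outcomes are identified one at a time; the helper names and example distributions are assumptions made for illustration.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Number of yes/no questions (code length) a Huffman tree assigns to each outcome."""
    # Heap entries: (subtree probability, unique tie-breaker, outcome indices in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:
            lengths[i] += 1  # merging two subtrees adds one question for every outcome in them
        heapq.heappush(heap, (p1 + p2, tie, a + b))
        tie += 1
    return lengths

def expected_questions(probs):
    """Average number of questions per draw under the Huffman tree."""
    return sum(p * n for p, n in zip(probs, huffman_lengths(probs)))

for probs in ([0.5, 0.25, 0.125, 0.125], [0.7, 0.2, 0.1]):
    print(probs, huffman_lengths(probs),
          f"avg questions = {expected_questions(probs):.3f}",
          f"H = {entropy_bits(probs):.3f}")
```

When every probability is a power of ½, the average number of questions equals H exactly. Otherwise it lands between H and H + 1, because a question cannot be split into fractions for a single outcome.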

Same single idea everywhere: information is resolved uncertainty. Entropy just counts, in bits, how much there was to resolve. Once you see −log p as "surprise," the formula stops being a magic incantation and starts being obvious.

Written on June 16, 2026