Shannon Entropy, Visually: Surprise, Uncertainty, and Bits

When I first met Shannon entropy at university it arrived the way it usually does — as a formula, H = −Σ p log p, scribbled on a board, followed by "and this measures information." I copied it down, passed the exam, and lost every shred of intuition by the following week. This post is the refresher I wish I'd had: built around one idea you can poke, drag, and run.

Here's the whole story in one sentence, and everything below just unpacks it:

Entropy is average surprise. A rare outcome is surprising and carries a lot of information; a near-certain one carries almost none. Entropy is just the expected surprise of a random draw. Because we measure surprise in bits, it also tells you how many yes/no questions you'd need to pin down the outcome: the best questioning strategy needs at least H questions on average, and fewer than H + 1.

😲 The atom: surprise of a single outcome

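Before averaging anything, look at one outcome. If an event has probability p, how surprised should you be when it happens? Shannon's answer is the information content, which (up to the choice of log base) is the only measure satisfying a few natural rules: a certain event carries zero surprise, rarer events carry more, and the surprises of independent events add up. In bits: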

h(x) = log2 ( 1 / p ) = −log2 p   bits

Why this shape? A certain event (p = 1) tells you nothing new → 0 bits. Halve a probability and you add exactly one bit of surprise (one extra yes/no question). So p = ½ is 1 bit, p = ¼ is 2 bits, p = 1/8 is 3 bits. Drag the probability and watch the surprise:

[Interactive slider: p(x) = 0.125 gives a surprise of 3.00 bits]
"The sun rose today" (p≈1): ≈ 0 bits. No information — you already knew it.
"I flipped 7 heads in a row" (p = 1/128): 7 bits. Genuinely surprising, and it takes 7 yes/no answers to single it out.

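If you can't drag the slider, the same numbers fall out of a few lines of Python. This is a minimal sketch: the language is my choice, and the function name surprise_bits is purely illustrative.

```python
import math


def surprise_bits(p: float) -> float:
    """Information content, in bits, of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log2(1 / p) is the same quantity as -log2(p); this form avoids printing -0.0
    return math.log2(1.0 / p)


# Each halving of the probability adds one bit of surprise.
for p in (1.0, 0.5, 0.25, 0.125, 1 / 128):
    print(f"p = {p:<10g} surprise = {surprise_bits(p):.2f} bits")
```
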
🪙 Average surprise: the biased coin

Now average. A coin lands heads with probability p, tails with 1 − p. Sometimes you get the surprising outcome, sometimes the likely one. Weight each surprise by how often it happens and you get the entropy of the coin:

H(p) = −p log2 p − (1−p) log2(1−p)   bits

Drag the bias and watch the curve. It's maximal at p = ½ — a fair coin is the most uncertain, worth a full 1 bit — and collapses to 0 at either end, where the coin is rigged and there's nothing left to be surprised about.

[Interactive plot: at p(heads) = 0.50 the entropy H is 1.00 bits, maximally uncertain]
Notice how uneven the curve is: flat near the middle and steep near the edges. Moving from p = 0.5 to 0.6 barely dents the entropy (1.00 → 0.97 bits), but moving from 0.9 to 0.99 sheds a lot (0.47 → 0.08 bits). Most of the drop happens as the coin closes in on certainty.
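
To check the coin formula by hand, here is a small Python sketch. The name binary_entropy is illustrative, and it assumes the usual convention that 0 · log2 0 counts as 0, which is what lets the curve reach 0 at both ends.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) = 0, so a rigged coin has no entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Flat near the middle, steep near the edges.
for p in (0.5, 0.6, 0.9, 0.99, 1.0):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")
```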

🎛️ More than two outcomes: shape a distribution

Coins are just the two-outcome case. For any distribution over outcomes x, entropy is the same average-surprise idea, summed over everything that can happen:

H = Σ_x p(x) · log2 ( 1 / p(x) ) = −Σ_x p(x) log2 p(x)   bits

Below are tomorrow's weather odds. Drag the sliders to re-shape the distribution (the bars renormalise to sum to 1) and watch the entropy meter. Two limits bracket everything: 0 bits when one outcome is certain, and log2 N bits when all N outcomes are equally likely, which is maximum ignorance.

[Interactive widget: weather sliders with readouts for entropy H (bits) and the maximum log2 N (bits)]
Make it certain (push one slider to the top): H → 0. If you already know it'll be sunny, a forecast tells you nothing.
Make it uniform: H peaks at log2 N. With N equally likely states, that's the most a forecast could ever tell you.
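
The sliders can be imitated in Python. In this sketch the weather labels and slider positions are made up for illustration, and entropy_bits is a hypothetical helper that renormalises raw weights the way the bars do before summing the surprises.

```python
import math


def entropy_bits(weights):
    """Entropy in bits of the distribution obtained by normalising the weights."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    probs = [w / total for w in weights]
    # Zero-probability outcomes never happen, so they contribute no surprise.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


# Hypothetical slider positions, not values taken from the widget.
weather = {"sunny": 5, "cloudy": 3, "rain": 1, "snow": 1}
n = len(weather)

print(f"shaped : {entropy_bits(list(weather.values())):.3f} bits")
print(f"certain: {entropy_bits([1, 0, 0, 0]):.3f} bits")
print(f"uniform: {entropy_bits([1] * n):.3f} bits (log2 N = {math.log2(n):.3f})")
```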

🎲 "Average surprise" is not a metaphor

Entropy is an expectation, so let's actually take the average. Using the distribution you shaped above, we'll draw outcomes one at a time, score each draw by its surprise −log2 p(x), and track the running average. By the law of large numbers it homes in on the entropy H (the dashed line) as the draws pile up, no matter where it starts.

[Interactive plot: running average surprise against the entropy H (target, dashed line), with readouts for draws, avg surprise, and entropy H]
Try this: shape a lopsided distribution above, then run the sampler. The average surprise still converges onto H, just to a lower value. Then make it uniform and watch it climb to log2 N. The wiggling early on is small-sample noise; it settles as the draws accumulate.
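
The sampler is easy to reproduce offline. The sketch below is an assumption-laden stand-in for the widget, not its actual code: the distribution, the seed, and the function names are all mine. It draws outcomes with the standard library's random.choices and tracks the running average of −log2 p(x).

```python
import math
import random


def entropy_bits(probs):
    """Entropy in bits of a probability distribution."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def running_average_surprise(probs, n_draws, seed=0):
    """Draw n_draws outcomes and return the running average surprise after each."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    draws = rng.choices(range(len(probs)), weights=probs, k=n_draws)
    total = 0.0
    averages = []
    for n, x in enumerate(draws, start=1):
        total += math.log2(1.0 / probs[x])  # surprise of this particular draw
        averages.append(total / n)
    return averages


probs = [0.5, 0.25, 0.125, 0.125]  # illustrative distribution with H = 1.75 bits
averages = running_average_surprise(probs, n_draws=20_000)
for n in (10, 100, 1_000, 20_000):
    print(f"after {n:>6} draws: {averages[n - 1]:.3f} bits")
print(f"entropy H        : {entropy_bits(probs):.3f} bits")
```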

🎯 The whole story in one breath

  • Surprise of one outcome is −log2 p: rare = many bits, certain = zero.
  • Entropy is the average surprise, the expected bits per draw: H = −Σ p log2 p.
  • It's maximised by uniformity (log2 N bits, total ignorance) and zero under certainty.
  • Measured in bits, it's the floor on the average number of yes/no questions an optimal strategy needs, and so the floor on lossless compression (Shannon's source coding theorem).
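
To make the yes/no-questions reading concrete, here is a sketch that uses Huffman coding. That technique is my addition, not something this post builds on: it is one standard way to construct an optimal question tree. Each code length is the number of questions needed to reach that outcome. The function name and both example distributions are illustrative.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Code length in bits (number of yes/no questions) per outcome under Huffman coding."""
    # Heap entries: (probability, unique tie-breaker, outcome indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:
            lengths[i] += 1  # merging two subtrees puts every member one question deeper
        heapq.heappush(heap, (p1 + p2, tie, group1 + group2))
        tie += 1
    return lengths


def entropy_bits(probs):
    """Entropy in bits of a probability distribution."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


for probs in ([0.5, 0.25, 0.125, 0.125], [0.5, 0.3, 0.1, 0.1]):
    lengths = huffman_code_lengths(probs)
    avg_questions = sum(p * n for p, n in zip(probs, lengths))
    print(f"{probs}: H = {entropy_bits(probs):.3f} bits, "
          f"average questions = {avg_questions:.3f}")
```

When every probability is a power of ½, the average number of questions equals H exactly. Otherwise it sits a little above H, but always less than one bit above.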

Same single idea everywhere: information is resolved uncertainty. Entropy just counts, in bits, how much there was to resolve. Once you see −log p as "surprise," the formula stops being a magic incantation and starts being obvious.

Written on June 16, 2026