Shannon Entropy, Visually: Surprise, Uncertainty, and Bits
When I first met Shannon entropy at university it arrived the way it usually does — as a formula, H = −Σ p log p, scribbled on a board, followed by "and this measures information." I copied it down, passed the exam, and lost every shred of intuition by the following week. This post is the refresher I wish I'd had: built around one idea you can poke, drag, and run.
Here's the whole story in one sentence, and everything below just unpacks it: entropy is the average surprise of an outcome, measured in bits.
😲 The atom: surprise of a single outcome
Before averaging anything, look at one outcome. If an event has probability p, how surprised should you be when it happens? Shannon's answer, the only one (up to the choice of units) that satisfies a few natural rules, is the information content:
I(p) = −log₂ p bits
Why this shape? A certain event (p = 1) tells you nothing new → 0 bits. Halve a probability and you add exactly one bit of surprise (one extra yes/no question). So p = ½ is 1 bit, p = ¼ is 2 bits, p = 1/8 is 3 bits. Drag the probability and watch the surprise:
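If the interactive widget isn't available, the same numbers can be reproduced with a few lines of Python. This is only a minimal sketch; the function name `surprise_bits` is made up for illustration and is not part of the original post.

```python
import math

def surprise_bits(p: float) -> float:
    """Information content, in bits, of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log2(1/p) equals -log2(p); written this way a certain event gives 0.0, not -0.0
    return math.log2(1.0 / p)

# Each halving of the probability adds one bit of surprise.
for p in (1.0, 0.5, 0.25, 0.125):
    print(f"p = {p:<5} -> {surprise_bits(p):.0f} bits")
```

An outcome with p = 0 is rejected rather than scored, since its surprise would be infinite.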
🪙 Average surprise: the biased coin
Now average. A coin lands heads with probability p, tails with 1 − p. Sometimes you get the surprising outcome, sometimes the likely one. Weight each surprise by how often it happens and you get the entropy of the coin:
H(p) = −p log₂ p − (1 − p) log₂(1 − p)
Drag the bias and watch the curve. It's maximal at p = ½ — a fair coin is the most uncertain, worth a full 1 bit — and collapses to 0 at either end, where the coin is rigged and there's nothing left to be surprised about.
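The shape of that curve can be checked numerically. The sketch below assumes the usual convention that a zero-probability outcome contributes nothing to the sum (the limit of p log₂ p as p goes to 0); `binary_entropy` is an illustrative name, not something from the original post.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # Weight each outcome's surprise log2(1/q) by its probability q.
    # Outcomes with q == 0 are skipped: they never happen, so they add nothing.
    return sum(q * math.log2(1.0 / q) for q in (p, 1.0 - p) if q > 0.0)

# Sample the curve: zero at the rigged ends, peaking at the fair coin.
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f} -> H = {binary_entropy(p):.3f} bits")
```

The curve is symmetric about p = ½, because swapping the labels "heads" and "tails" doesn't change how uncertain the coin is.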
🎛️ More than two outcomes: shape a distribution
Coins are just the two-outcome case. For any distribution over outcomes x, entropy is the same average-surprise idea, summed over everything that can happen:
H(X) = −Σₓ p(x) log₂ p(x)
Below are tomorrow's weather odds. Drag the sliders to re-shape the distribution (the bars renormalise to sum to 1) and watch the entropy meter. Two limits bracket everything: 0 bits when one outcome is certain, and log₂ N bits when all N outcomes are equally likely (maximum ignorance).
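Here is a rough Python stand-in for the sliders and the meter. The weather labels and slider values are invented for illustration (the post's actual widget values aren't reproduced here), and `normalise` and `entropy_bits` are hypothetical helper names.

```python
import math

def normalise(weights):
    """Rescale non-negative weights so they sum to 1, like the bars do."""
    weights = list(weights)
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    return [w / total for w in weights]

def entropy_bits(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)

# Hypothetical slider positions (arbitrary units, not yet probabilities).
sliders = {"sun": 5.0, "cloud": 3.0, "rain": 1.5, "snow": 0.5}
probs = normalise(sliders.values())

print(f"shaped distribution: {entropy_bits(probs):.3f} bits")
print(f"one certain outcome: {entropy_bits([1.0, 0.0, 0.0, 0.0]):.3f} bits")
print(f"uniform over N = 4:  {entropy_bits(normalise([1, 1, 1, 1])):.3f} bits")
```

Whatever values the sliders take, the first number stays between the other two.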
🎲 "Average surprise" is not a metaphor
Entropy is an expectation, so let's actually take the average. Using the distribution you shaped above, we'll draw outcomes one at a time, score each draw by its surprise −log₂ p(x), and track the running average. By the law of large numbers it must home in on the entropy H (the dashed line) no matter where it starts.
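The same experiment can be run offline. This is a sketch under assumptions: the probabilities are made-up example values, the seed is arbitrary, and `running_average_surprise` is an illustrative name. It samples with the standard library's `random.choices`.

```python
import math
import random

def entropy_bits(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)

def running_average_surprise(probs, n_draws, seed=0):
    """Draw n_draws outcomes and return the running mean of their surprise."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outcomes = list(range(len(probs)))
    total = 0.0
    averages = []
    draws = rng.choices(outcomes, weights=probs, k=n_draws)
    for n, x in enumerate(draws, start=1):
        total += math.log2(1.0 / probs[x])  # surprise of this draw, in bits
        averages.append(total / n)
    return averages

probs = [0.5, 0.3, 0.15, 0.05]  # assumed example distribution
averages = running_average_surprise(probs, n_draws=20_000, seed=1)

print(f"entropy H       = {entropy_bits(probs):.3f} bits")
for n in (10, 100, 1_000, 20_000):
    print(f"after {n:>6} draws: {averages[n - 1]:.3f} bits")
```

Early averages can land well away from H. The gap typically shrinks roughly like 1/√n as the number of draws n grows.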
🎯 The whole story in one breath
- Surprise of one outcome is −log₂ p: rare = many bits, certain = zero.
- Entropy is the average surprise: the expected bits per draw, H = −Σ p log₂ p.
- It's maximised by uniformity (log₂ N bits, total ignorance) and zero under certainty.
- Measured in bits, it's a lower bound on the average number of yes/no questions needed to pin down an outcome (the best strategy needs fewer than H + 1), and the floor on lossless compression (Shannon's source coding theorem).
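The compression claim in the last bullet can be checked on a small example. The sketch below uses Huffman coding, a technique the post doesn't mention and which is brought in here purely as an illustration, to build an optimal prefix code and compare its average length with the entropy. The probabilities and the name `huffman_code_lengths` are assumptions.

```python
import heapq
import math

def entropy_bits(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)

def huffman_code_lengths(probs):
    """Codeword length, in bits, that Huffman coding gives each outcome."""
    # Heap entries: (subtree probability, unique tie-breaker, outcome indices).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie_breaker = len(probs)
    while len(heap) > 1:
        # Merge the two least likely subtrees.
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        for i in left + right:
            lengths[i] += 1  # each merged outcome sits one level deeper
        heapq.heappush(heap, (p1 + p2, tie_breaker, left + right))
        tie_breaker += 1
    return lengths

def average_length(probs, lengths):
    """Expected codeword length in bits per outcome."""
    return sum(p * n for p, n in zip(probs, lengths))

for probs in ([0.5, 0.25, 0.125, 0.125], [0.5, 0.3, 0.15, 0.05]):
    lengths = huffman_code_lengths(probs)
    print(f"{probs}: lengths {lengths}, "
          f"average {average_length(probs, lengths):.3f} bits, "
          f"entropy {entropy_bits(probs):.3f} bits")
```

When every probability is a power of ½, the average code length equals the entropy exactly. Otherwise it sits above the entropy by less than one bit. Each bit of a codeword can be read as the answer to one yes/no question.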
Same single idea everywhere: information is resolved uncertainty. Entropy just counts, in bits, how much there was to resolve. Once you see −log p as "surprise," the formula stops being a magic incantation and starts being obvious.
