A Day in Luca's Life: Finding Optimal Decisions with MDPs
A Markov Decision Process (MDP) is the standard model for sequential decision-making under uncertainty: an agent in some state chooses an action, collects a reward, and lands in a new state — and the objective is to pick actions that maximize reward accumulated over the long run, not just the reward available right now. The hard part is precisely that tension: a choice that pays off today can leave the agent worse off tomorrow.
To keep that abstract definition concrete, the whole note runs on one small example. Luca is 9 and has exactly one decision each afternoon after school: study, play video games, or take a nap? For any single afternoon the choice barely matters, but over a whole school year it does. Playing (🎮) costs energy that hurts tomorrow's performance; studying (📚) is boring now but can mean better test scores next week; napping (💤) skips both. Maximizing happiness across the year is exactly the long-run trade-off an MDP captures.
The algorithm that solves it — value iteration — is short and worth understanding in full. The sections below build up the pieces one at a time, using Luca's afternoon as the running example, and end with a working solver.
🧱 The building blocks of an MDP
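Every MDP is defined by four things: states, actions, transition probabilities, and rewards. Here's how they map to Luca's life:
States (ordered from worst to best): 😴 Exhausted → 😐 Tired → 🙂 Balanced → 💪 Focused → ⭐ Star
Actions (one per afternoon): 📚 Study, 🎮 Play, 💤 Rest
Transitions: each action moves Luca to tomorrow's state with some probability.
Rewards (immediate happiness for each state and action):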
| State | 📚 Study | 🎮 Play | 💤 Rest |
|---|---|---|---|
| 😴 Exhausted | +2 | +4 | 0 |
| 😐 Tired | +3 | +5 | +1 |
| 🙂 Balanced | +5 | +7 | +3 |
| 💪 Focused | +8 | +10 | +6 |
| ⭐ Star | +12 | +14 | +10 |
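The reward table can be written down directly as data. This is a minimal sketch using plain Python dictionaries; the names `STATES`, `ACTIONS`, `REWARD`, and `greedy_action` are illustrative and do not come from the original page.

```python
# States in the order shown above, from worst to best.
STATES = ["Exhausted", "Tired", "Balanced", "Focused", "Star"]
ACTIONS = ["Study", "Play", "Rest"]

# REWARD[state][action]: immediate happiness, copied from the table.
REWARD = {
    "Exhausted": {"Study": 2, "Play": 4, "Rest": 0},
    "Tired": {"Study": 3, "Play": 5, "Rest": 1},
    "Balanced": {"Study": 5, "Play": 7, "Rest": 3},
    "Focused": {"Study": 8, "Play": 10, "Rest": 6},
    "Star": {"Study": 12, "Play": 14, "Rest": 10},
}


def greedy_action(state):
    """Action with the highest immediate reward, ignoring the future."""
    return max(ACTIONS, key=lambda a: REWARD[state][a])
```

In every row Play pays 2 more than Study and 4 more than Rest, so `greedy_action` returns `"Play"` for every state. A Luca who only looked at today's reward would always play.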
🗺️ Explore the transition graph
Select a state (where Luca is) and an action (what he does) to see where he might end up — and with what probability.
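One way to represent what the explorer shows is a table of next-state probabilities for each (state, action) pair. The sketch below fills in only the Balanced rows, using the probabilities that appear in this note's worked Q-value calculation (matched to states by their position in the list of converged values). The rows for the other states are not given in the text and are left out. `TRANSITIONS` and `sample_next_state` are illustrative names.

```python
import random

# TRANSITIONS[(state, action)] maps next state -> probability.
# Only the Balanced rows appear in this note; the rest are omitted.
TRANSITIONS = {
    ("Balanced", "Study"): {"Tired": 0.1, "Balanced": 0.3, "Focused": 0.6},
    ("Balanced", "Play"): {"Tired": 0.4, "Balanced": 0.4, "Focused": 0.2},
    ("Balanced", "Rest"): {"Balanced": 0.5, "Focused": 0.4, "Star": 0.1},
}


def sample_next_state(state, action, rng=random):
    """Draw tomorrow's state according to the transition probabilities."""
    dist = TRANSITIONS[(state, action)]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]
```

Calling `sample_next_state("Balanced", "Study")` many times should land on Focused about 60% of the time. Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.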
📐 The Bellman equation — what's a state really worth?
To find the best policy we need to know the value of each state: roughly, "how much total happiness will Luca accumulate from now on, if he plays optimally starting from here?" Call it V*(s).
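The key insight (due to Richard Bellman, 1957) is that the value of a state can be defined recursively:
$$V^*(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]$$
Reading left to right: the value of state s is the value of the best action a available there, where "best" means the largest sum of (immediate reward R(s,a)) + (discount γ × expected value of the next state s′, weighted by the transition probabilities P(s′ | s, a)).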
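After convergence, with discount γ = 0.9 and V ≈ [74, 84, 96, 107, 114] for Exhausted through Star (rounded, so plugging these in matches the figures below only to within a few tenths):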
Q(Balanced, 📚 Study) = 5 + 0.9×(0.1×84 + 0.3×96 + 0.6×107) = 5 + 91.5 = 96.5 ← winner
Q(Balanced, 🎮 Play) = 7 + 0.9×(0.4×84 + 0.4×96 + 0.2×107) = 7 + 84.4 = 91.4
Q(Balanced, 💤 Rest) = 3 + 0.9×(0.5×96 + 0.4×107 + 0.1×114) = 3 + 92.2 = 95.2
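Play's +2 immediate lead is outweighed by what happens next. From Balanced, Study reaches Focused with probability 0.6 versus 0.2 for Play, and Focused and Star pay more every day afterward (Play is worth +10 or +14 there, not +7). The discounted future term for Study is about 7 points higher than for Play, which more than covers the 2-point gap.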
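The three Q-values can be recomputed in a few lines. This sketch uses only numbers from the text: the rounded converged values, the Balanced row of the reward table, and the probabilities from the three Q-value lines. The names `GAMMA`, `V`, `REWARD`, `NEXT`, and `q_value` are illustrative.

```python
GAMMA = 0.9  # discount factor used in the calculation above

# Rounded converged values from the text, Exhausted through Star.
V = {"Exhausted": 74, "Tired": 84, "Balanced": 96, "Focused": 107, "Star": 114}

# Immediate rewards and next-state probabilities for Balanced only.
REWARD = {"Study": 5, "Play": 7, "Rest": 3}
NEXT = {
    "Study": {"Tired": 0.1, "Balanced": 0.3, "Focused": 0.6},
    "Play": {"Tired": 0.4, "Balanced": 0.4, "Focused": 0.2},
    "Rest": {"Balanced": 0.5, "Focused": 0.4, "Star": 0.1},
}


def q_value(action):
    """Immediate reward plus discounted expected value of the next state."""
    expected_next = sum(p * V[s] for s, p in NEXT[action].items())
    return REWARD[action] + GAMMA * expected_next


q = {a: q_value(a) for a in REWARD}
best = max(q, key=q.get)  # action with the largest Q-value
```

With the rounded values this gives roughly 96.3 for Study, 91.1 for Play, and 95.0 for Rest. That is a few tenths below the figures quoted in the text, with the same ranking, so `best` is `"Study"`.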
🎮 Play as Luca — try your own policy
Make Luca's choices for 10 days. Try to maximize his total happiness. Your choices will be compared against the optimal policy afterward!
⚙️ Value iteration — watch V(s) converge
Start with V(s) = 0 everywhere and repeatedly apply the Bellman update to every state. Each sweep gives a better estimate of how valuable each state truly is, and the process stops once the largest change between sweeps falls below a small tolerance.
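The loop described here fits in a short function. This is a sketch rather than the page's own solver: the function name, argument layout, and stopping tolerance are assumptions. The text gives Luca's transition probabilities only for the Balanced state, so the demo runs on a hypothetical two-state toy MDP instead.

```python
def value_iteration(states, actions, reward, trans, gamma=0.9, tol=1e-6):
    """Return (V, policy) for a finite MDP.

    reward[s][a] is the immediate reward; trans[s][a] maps each next
    state to its probability.
    """
    V = {s: 0.0 for s in states}  # start with V(s) = 0 everywhere
    while True:
        # Q(s, a) = reward + gamma * expected value of the next state
        Q = {
            s: {
                a: reward[s][a]
                + gamma * sum(p * V[s2] for s2, p in trans[s][a].items())
                for a in actions
            }
            for s in states
        }
        new_V = {s: max(Q[s].values()) for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:  # largest change in this sweep is negligible
            break
    policy = {s: max(Q[s], key=Q[s].get) for s in states}
    return V, policy


# Hypothetical two-state toy MDP (NOT Luca's), just to exercise the solver.
toy_states = ["low", "high"]
toy_actions = ["stay", "climb"]
toy_reward = {
    "low": {"stay": 1.0, "climb": 0.0},
    "high": {"stay": 2.0, "climb": 2.0},
}
toy_trans = {
    "low": {"stay": {"low": 1.0}, "climb": {"high": 1.0}},
    "high": {"stay": {"high": 1.0}, "climb": {"high": 1.0}},
}
V, policy = value_iteration(toy_states, toy_actions, toy_reward, toy_trans)
```

In the toy, climbing pays nothing today but leads to a state worth about 20, so V(low) converges to about 18, versus 10 for staying in "low" forever. It is the same trade-off as Study versus Play. Given reward and transition tables for all five of Luca's states, the same function would apply unchanged.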
🏆 How did you do vs. the optimal policy?
Complete 10 days above, then run value iteration to convergence — the comparison will appear here automatically.
