A Day in Luca's Life: Finding Optimal Decisions with MDPs
Meet Luca. He's 9 years old, loves dinosaurs, and has exactly one decision to make each afternoon when he gets home from school: should he study, play video games, or take a nap?
This feels like a simple question — and for any single afternoon, it kind of is. But what if Luca wants to maximize his happiness over the whole school year? Suddenly it's not obvious at all. Today's fun (🎮) costs energy that hurts tomorrow's performance. Today's studying (📚) feels boring but could mean great test scores next week. Napping (💤) skips both.
This is exactly the kind of problem that Markov Decision Processes (MDPs) were designed to solve — and the algorithm for solving them is surprisingly elegant. By the end of this post you'll have built intuition for how AI agents learn to make optimal decisions under uncertainty, using nothing more than Luca's homework dilemma as a guide.
🧱 The building blocks of an MDP
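Every MDP is defined by four things: states, actions, transition probabilities, and rewards. Here's how they map to Luca's life:
States (how Luca feels, from worst to best): 😴 Exhausted → 😐 Tired → 🙂 Balanced → 💪 Focused → ⭐ Star
Actions (what he can do each afternoon): 📚 Study, 🎮 Play, 💤 Rest
Transition probabilities: for each state and action, the chance of landing in each possible next state (see the transition graph below).
Rewards (the immediate happiness Luca gets for each state and action):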
| State | 📚 Study | 🎮 Play | 💤 Rest |
|---|---|---|---|
| 😴 Exhausted | +2 | +4 | 0 |
| 😐 Tired | +3 | +5 | +1 |
| 🙂 Balanced | +5 | +7 | +3 |
| 💪 Focused | +8 | +10 | +6 |
| ⭐ Star | +12 | +14 | +10 |
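As a rough sketch, the reward table above can be written down as a plain Python dictionary. The names `STATES`, `ACTIONS`, `REWARD`, and `myopic_action` are illustrative choices, not part of the original post. The helper picks the action with the highest immediate reward, which is the "single afternoon" view of the problem.

```python
# Reward table from the post, keyed as REWARD[state][action].
# All names here are hypothetical and chosen for illustration only.
STATES = ["Exhausted", "Tired", "Balanced", "Focused", "Star"]
ACTIONS = ["Study", "Play", "Rest"]

REWARD = {
    "Exhausted": {"Study": 2, "Play": 4, "Rest": 0},
    "Tired": {"Study": 3, "Play": 5, "Rest": 1},
    "Balanced": {"Study": 5, "Play": 7, "Rest": 3},
    "Focused": {"Study": 8, "Play": 10, "Rest": 6},
    "Star": {"Study": 12, "Play": 14, "Rest": 10},
}


def myopic_action(state):
    """Return the action with the highest immediate reward in `state`."""
    return max(ACTIONS, key=lambda action: REWARD[state][action])


myopic_policy = {state: myopic_action(state) for state in STATES}
for state, action in myopic_policy.items():
    print(f"{state}: {action}")
```

Looking only at immediate rewards, Play wins in every state. That is exactly why the rewards alone are not enough: the transition probabilities decide whether today's choice pays off later.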
🗺️ Explore the transition graph
Select a state (where Luca is) and an action (what he does) to see where he might end up — and with what probability.
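Without the interactive graph, one way to get a feel for transitions is to sample them. The sketch below uses only the probabilities for the Balanced state, read off the worked Q-value calculation later in this post. The rows for the other four states are not given in the text, so they are left out rather than guessed. The names `TRANSITIONS_FROM_BALANCED` and `sample_next_state` are assumptions made for this example.

```python
import random

# Next-state probabilities from the Balanced state only, taken from the
# worked Q-value example in this post. Other states are not specified there.
TRANSITIONS_FROM_BALANCED = {
    "Study": {"Tired": 0.1, "Balanced": 0.3, "Focused": 0.6},
    "Play": {"Tired": 0.4, "Balanced": 0.4, "Focused": 0.2},
    "Rest": {"Balanced": 0.5, "Focused": 0.4, "Star": 0.1},
}


def sample_next_state(action, rng):
    """Draw one next state for a Balanced Luca taking `action`."""
    distribution = TRANSITIONS_FROM_BALANCED[action]
    next_states = list(distribution)
    weights = list(distribution.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
counts = {}
for _ in range(10_000):
    next_state = sample_next_state("Study", rng)
    counts[next_state] = counts.get(next_state, 0) + 1
print(counts)
```

Over many draws, the counts for Study should settle near a 10% / 30% / 60% split across Tired, Balanced, and Focused.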
📐 The Bellman equation — what's a state really worth?
To find the best policy we need to know the value of each state: roughly, "how much total (discounted) happiness will Luca accumulate from now on, if he acts optimally starting from here?" Call it V*(s).
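The key insight (due to Richard Bellman, 1957) is that the value of a state can be defined recursively:
V*(s) = max_a [ R(s, a) + γ · Σ_s' P(s' | s, a) · V*(s') ]
Reading left to right: the value of state s is the value of the best action a available there, where "best" means the largest (immediate reward) + (discount × expected value of next state). Here R(s, a) is the immediate reward, P(s' | s, a) is the probability of landing in state s' after taking action a in state s, and γ is the discount factor (0.9 in the worked example that follows).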
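After value iteration converges (V ≈ [70, 80, 94, 106, 113] for Exhausted through Star), we can compare Q(s, a), the value of taking action a in state s and acting optimally afterward, for each action in the Balanced state: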
Q(Balanced, 📚 Study) = 5 + 0.9×(0.1×80 + 0.3×94 + 0.6×106) = 5 + 89.8 = 94.8 ← winner
Q(Balanced, 🎮 Play) = 7 + 0.9×(0.4×80 + 0.4×94 + 0.2×106) = 7 + 81.7 = 88.7
Q(Balanced, 💤 Rest) = 3 + 0.9×(0.5×94 + 0.4×106 + 0.1×113) = 3 + 90.6 = 93.6
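Play's +2 immediate lead over Study is overwhelmed because Study moves Luca to Focused with probability 0.6, versus only 0.2 for Play. Focused and Star are the states where Play's daily reward is 10–14, not 7, so getting there pays off every day afterward. The long-term math is decisive.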
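The three Q-values above can be re-derived in a few lines of Python. This is only a check of the arithmetic, using the rounded converged values, the Balanced row of the reward table, and the probabilities shown in the calculation. Names such as `q_value` and `P_BALANCED` are made up for this sketch.

```python
GAMMA = 0.9  # discount factor used in the worked example

# Approximate converged state values quoted in the post.
V = {"Exhausted": 70, "Tired": 80, "Balanced": 94, "Focused": 106, "Star": 113}

# Immediate rewards and next-state probabilities for the Balanced state.
REWARD_BALANCED = {"Study": 5, "Play": 7, "Rest": 3}
P_BALANCED = {
    "Study": {"Tired": 0.1, "Balanced": 0.3, "Focused": 0.6},
    "Play": {"Tired": 0.4, "Balanced": 0.4, "Focused": 0.2},
    "Rest": {"Balanced": 0.5, "Focused": 0.4, "Star": 0.1},
}


def q_value(action):
    """Immediate reward plus discounted expected value of the next state."""
    expected_next = sum(p * V[s2] for s2, p in P_BALANCED[action].items())
    return REWARD_BALANCED[action] + GAMMA * expected_next


q = {action: q_value(action) for action in REWARD_BALANCED}
best_action = max(q, key=q.get)

for action, value in q.items():
    print(f"Q(Balanced, {action}) = {value:.1f}")
print("Best action:", best_action)
# Prints 94.8 for Study, 88.7 for Play, 93.6 for Rest, and "Best action: Study".
```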
🎮 Play as Luca — try your own policy
Make Luca's choices for 10 days. Try to maximize his total happiness. Your choices will be compared against the optimal policy afterward!
⚙️ Value iteration — watch V(s) converge
Start with V(s) = 0 everywhere. Repeatedly apply the Bellman update. Each step gives a better estimate of how valuable each state truly is — the bar chart shows this live.
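The loop described above can be sketched in plain Python. Because this post only spells out transition probabilities for the Balanced state, the example runs on a tiny two-state toy MDP ("Low" and "High", with actions "Work" and "Slack") that is entirely hypothetical and invented for this illustration. The function itself accepts any MDP given as nested dictionaries, and its signature is an assumption of this sketch, not the post's implementation.

```python
def q_value(s, a, V, reward, transition, gamma):
    """Immediate reward plus discounted expected value of the next state."""
    expected_next = sum(p * V[s2] for s2, p in transition[s][a].items())
    return reward[s][a] + gamma * expected_next


def value_iteration(states, actions, reward, transition,
                    gamma=0.9, tol=1e-6, max_iters=10_000):
    """Apply the Bellman update until the largest change drops below tol."""
    V = {s: 0.0 for s in states}  # start with V(s) = 0 everywhere
    for _ in range(max_iters):
        new_V = {
            s: max(q_value(s, a, V, reward, transition, gamma) for a in actions)
            for s in states
        }
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the final value estimates.
    policy = {
        s: max(actions, key=lambda a: q_value(s, a, V, reward, transition, gamma))
        for s in states
    }
    return V, policy


# Hypothetical toy MDP (NOT Luca's): Slack pays more today, Work leads to High.
toy_states = ["Low", "High"]
toy_actions = ["Work", "Slack"]
toy_reward = {"Low": {"Work": 0, "Slack": 1}, "High": {"Work": 2, "Slack": 3}}
toy_transition = {
    "Low": {"Work": {"High": 1.0}, "Slack": {"Low": 1.0}},
    "High": {"Work": {"High": 1.0}, "Slack": {"Low": 1.0}},
}

toy_V, toy_policy = value_iteration(toy_states, toy_actions, toy_reward, toy_transition)
print(toy_V, toy_policy)
```

In this toy, always choosing Work gives V(High) = 2 / (1 − 0.9) = 20 and V(Low) = 0.9 × 20 = 18, and Slack is worse in both states (19.2 and 17.2), so the returned values should approach 18 and 20 with Work chosen everywhere. It is the same pattern as Luca's problem: the action with the smaller immediate reward wins once the future is counted.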
Q-value table (current iteration)
🏆 How did you do vs. the optimal policy?
Complete 10 days above, then run value iteration to convergence — the comparison will appear here automatically.
