A Day in Luca's Life: Finding Optimal Decisions with MDPs

A Markov Decision Process (MDP) is the standard model for sequential decision-making under uncertainty: an agent in some state chooses an action, collects a reward, and lands in a new state — and the objective is to pick actions that maximize reward accumulated over the long run, not just the reward available right now. The hard part is precisely that tension: a choice that pays off today can leave the agent worse off tomorrow.

To keep that abstract definition concrete, the whole note runs on one small example. Luca is 9 and has exactly one decision each afternoon after school: study, play video games, or take a nap? For any single afternoon the choice barely matters, but over a whole school year it does. Playing (🎮) costs energy that hurts tomorrow's performance; studying (📚) is boring now but can mean better test scores next week; napping (💤) skips both. Maximizing happiness across the year is exactly the long-run trade-off an MDP captures.

The algorithm that solves it — value iteration — is short and worth understanding in full. The sections below build up the pieces one at a time, using Luca's afternoon as the running example, and end with a working solver.

🧱 The building blocks of an MDP

Every MDP is defined by four things. Here's how they map to Luca's life:

States S
Luca's situation, captured as his energy & grade level.

😴 Exhausted → 😐 Tired → 🙂 Balanced → 💪 Focused → ⭐ Star

Actions A & Rewards R(s,a)
Three choices each afternoon — but the happiness they bring depends on Luca's state. A Star Student enjoys everything more:

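| State | 📚 Study | 🎮 Play | 💤 Rest |
|---|---|---|---|
| 😴 Exhausted | +2 | +4 | 0 |
| 😐 Tired | +3 | +5 | +1 |
| 🙂 Balanced | +5 | +7 | +3 |
| 💪 Focused | +8 | +10 | +6 |
| ⭐ Star | +12 | +14 | +10 |

Transitions P(s'|s,a)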
Life is stochastic. Studying doesn't guarantee improvement — Luca might be too tired, or the test might be hard. Each action leads to next states with specific probabilities.

Discount factor γ = 0.9
Future happiness is worth slightly less than today's — not because it matters less, but because there's uncertainty about reaching that future. A reward tomorrow is worth 0.9× a reward today.
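
The ingredients above can be written down as plain data. What follows is a minimal sketch, not the article's own code: the names `STATES`, `ACTIONS`, `REWARD`, `GAMMA`, `discounted_return` and `greedy_action` are chosen here for illustration, and transition probabilities are left out because this section gives no numbers for them.

```python
# Illustrative encoding of Luca's MDP; every identifier is a hypothetical name.
STATES = ["Exhausted", "Tired", "Balanced", "Focused", "Star"]
ACTIONS = ["Study", "Play", "Rest"]
GAMMA = 0.9  # discount factor from the text

# REWARD[state][action] = immediate happiness, copied from the reward table.
REWARD = {
    "Exhausted": {"Study": 2, "Play": 4, "Rest": 0},
    "Tired": {"Study": 3, "Play": 5, "Rest": 1},
    "Balanced": {"Study": 5, "Play": 7, "Rest": 3},
    "Focused": {"Study": 8, "Play": 10, "Rest": 6},
    "Star": {"Study": 12, "Play": 14, "Rest": 10},
}


def discounted_return(rewards, gamma=GAMMA):
    """Value today of a reward sequence: the reward on day t is weighted by gamma**t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


def greedy_action(state):
    """Action with the highest immediate reward, ignoring what happens next."""
    return max(ACTIONS, key=lambda a: REWARD[state][a])


if __name__ == "__main__":
    # A +7 reward on three consecutive afternoons: 7 + 0.9*7 + 0.81*7.
    print(round(discounted_return([7, 7, 7]), 2))
    print({s: greedy_action(s) for s in STATES})
```

Under these rewards the greedy choice is Play in every state, which is why an agent that looks only at today's reward never studies. The discount and the transitions are what can change that ranking.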

🗺️ Explore the transition graph

Select a state (where Luca is) and an action (what he does) to see where he might end up — and with what probability.


📐 The Bellman equation — what's a state really worth?

To find the best policy we need to know the value of each state: roughly, "how much total happiness will Luca accumulate from now on, if he plays optimally starting from here?" Call it V*(s).

The key insight (due to Richard Bellman, 1957) is that the value of a state can be defined recursively:

V*(s) = max_a [ R(s,a) + γ · Σ_{s'} P(s'|s,a) · V*(s') ]

Reading left to right: the value of state s is the value of the best action a available there, where an action's value is (immediate reward) + (discount × expected value of the next state). That bracketed quantity is what the worked example below writes as Q(s,a).
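
It can help to give the bracketed quantity its own symbol. The block below is a notational sketch: Q* for the action value is the usual convention, and π* for the optimal policy is notation introduced here rather than something this note defines.

```latex
\begin{align*}
Q^*(s,a) &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \\
V^*(s)   &= \max_{a} Q^*(s,a) \\
\pi^*(s) &= \arg\max_{a} Q^*(s,a)
\end{align*}
```

Written this way, the policy falls out of the values: once V* is known, the best action in each state is the one with the largest Q*.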

Iteration 1 (V = 0 everywhere): Q(Balanced, 📚 Study) = 5 + 0 = 5, Q(Balanced, 🎮 Play) = 7 + 0 = 7, Q(Balanced, 💤 Rest) = 3 + 0 = 3. Play wins on pure immediate reward.

After convergence (V approximately [74, 84, 96, 107, 114]):
Q(Balanced, 📚 Study) = 5 + 0.9×(0.1×84 + 0.3×96 + 0.6×107) = 5 + 91.5 = 96.5 ← winner
Q(Balanced, 🎮 Play)  = 7 + 0.9×(0.4×84 + 0.4×96 + 0.2×107) = 7 + 84.4 = 91.4
Q(Balanced, 💤 Rest)  = 3 + 0.9×(0.5×96 + 0.4×107 + 0.1×114) = 3 + 92.2 = 95.2

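Play's +2 immediate lead over Study is overwhelmed because Study moves Luca toward Focused (probability 0.6 from Balanced) and, beyond it, Star: states where even Play pays 10 to 14 a day rather than 7. Once the discounted future term is included, it decides the ranking, not the immediate reward.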
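
The arithmetic above can be re-run in a few lines. This is a small check, not the article's solver: it plugs the rounded values [74, 84, 96, 107, 114] and the Balanced-row probabilities into the Bellman bracket, and the helper name `q_value` is made up for the sketch. Because the inputs are rounded, the totals come out a few tenths away from the figures quoted above, presumably computed from unrounded values, but the ranking is the same.

```python
GAMMA = 0.9

# Rounded converged values quoted in the text.
V = {"Exhausted": 74, "Tired": 84, "Balanced": 96, "Focused": 107, "Star": 114}

# Rewards and transition probabilities for the Balanced state, as used in the text.
REWARD = {"Study": 5, "Play": 7, "Rest": 3}
P = {
    "Study": {"Tired": 0.1, "Balanced": 0.3, "Focused": 0.6},
    "Play": {"Tired": 0.4, "Balanced": 0.4, "Focused": 0.2},
    "Rest": {"Balanced": 0.5, "Focused": 0.4, "Star": 0.1},
}


def q_value(action, values, gamma=GAMMA):
    """Immediate reward plus discounted expected value of the next state."""
    expected_next = sum(p * values[s2] for s2, p in P[action].items())
    return REWARD[action] + gamma * expected_next


q = {a: q_value(a, V) for a in REWARD}
ranking = sorted(q, key=q.get, reverse=True)  # best action first

for a in ranking:
    print(f"Q(Balanced, {a}) = {q[a]:.2f}")
```

With the rounded inputs the three values are roughly 96.3 for Study, 95.0 for Rest and 91.1 for Play. Passing an all-zero value table instead reproduces the iteration-1 numbers, where Play leads.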

🎮 Play as Luca — try your own policy

Make Luca's choices for 10 days. Try to maximize his total happiness. Your choices will be compared against the optimal policy afterward!


⚙️ Value iteration — watch V(s) converge

Start with V(s) = 0 everywhere, then repeatedly apply the Bellman update to every state. Each sweep gives a better estimate of how valuable each state truly is; stop when Δ, the largest change in any V(s) between two consecutive sweeps, is close to zero.
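
The loop described above fits in a short script. The sketch below is one possible solver under a loudly stated assumption: the note only gives transition probabilities for the Balanced state, so the `SHIFT` rule here reuses those probabilities for every state as "move down, stay, or move up" offsets and clips at the ends of the scale. That rule, and all function names, are placeholders for illustration, so the resulting values need not match the numbers quoted in this note.

```python
STATES = ["Exhausted", "Tired", "Balanced", "Focused", "Star"]
ACTIONS = ["Study", "Play", "Rest"]
GAMMA = 0.9

REWARD = {
    "Exhausted": {"Study": 2, "Play": 4, "Rest": 0},
    "Tired": {"Study": 3, "Play": 5, "Rest": 1},
    "Balanced": {"Study": 5, "Play": 7, "Rest": 3},
    "Focused": {"Study": 8, "Play": 10, "Rest": 6},
    "Star": {"Study": 12, "Play": 14, "Rest": 10},
}

# ASSUMPTION: offsets taken from the Balanced row in the text and applied to
# every state. Only the Balanced row is grounded in the source.
SHIFT = {
    "Study": {-1: 0.1, 0: 0.3, 1: 0.6},
    "Play": {-1: 0.4, 0: 0.4, 1: 0.2},
    "Rest": {0: 0.5, 1: 0.4, 2: 0.1},
}


def transitions(state, action):
    """Return {next_state: probability}, clipping moves at both ends of the scale."""
    i = STATES.index(state)
    probs = {}
    for shift, p in SHIFT[action].items():
        j = min(max(i + shift, 0), len(STATES) - 1)
        probs[STATES[j]] = probs.get(STATES[j], 0.0) + p
    return probs


def q_value(state, action, V):
    """Bellman bracket: immediate reward plus discounted expected next value."""
    expected_next = sum(p * V[s2] for s2, p in transitions(state, action).items())
    return REWARD[state][action] + GAMMA * expected_next


def value_iteration(tol=1e-6, max_sweeps=10_000):
    """Sweep all states until the largest change (delta) drops below tol."""
    V = {s: 0.0 for s in STATES}
    deltas = []
    for _ in range(max_sweeps):
        new_V = {s: max(q_value(s, a, V) for a in ACTIONS) for s in STATES}
        deltas.append(max(abs(new_V[s] - V[s]) for s in STATES))
        V = new_V
        if deltas[-1] < tol:
            break
    # Read the policy off the converged values: best action per state.
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
    return V, policy, deltas


if __name__ == "__main__":
    V, policy, deltas = value_iteration()
    for s in STATES:
        print(f"{s:10s} V = {V[s]:7.2f}  best action: {policy[s]}")
    print(f"sweeps: {len(deltas)}, final delta: {deltas[-1]:.2e}")
```

Two properties hold regardless of the placeholder transitions. The first sweep from V = 0 simply picks the largest immediate reward in each state, and each later delta is at most γ times the previous one, which is why the loop is guaranteed to terminate for γ < 1.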


🏆 How did you do vs. the optimal policy?

Complete 10 days above, then run value iteration to convergence — the comparison will appear here automatically.

Written on May 21, 2026