A Day in Luca's Life: Finding Optimal Decisions with MDPs

Meet Luca. He's 9 years old, loves dinosaurs, and has exactly one decision to make each afternoon when he gets home from school: should he study, play video games, or take a nap?

This feels like a simple question — and for any single afternoon, it kind of is. But what if Luca wants to maximize his happiness over the whole school year? Suddenly it's not obvious at all. Today's fun (🎮) costs energy that hurts tomorrow's performance. Today's studying (📚) feels boring but could mean great test scores next week. Napping (💤) skips both.

This is exactly the kind of problem that Markov Decision Processes (MDPs) were designed to solve — and the algorithm for solving them is surprisingly elegant. By the end of this post you'll have built intuition for how AI agents learn to make optimal decisions under uncertainty, using nothing more than Luca's homework dilemma as a guide.

🧱 The building blocks of an MDP

Every MDP is defined by four things. Here's how they map to Luca's life:

States S

Luca's situation, summarized as one of five levels of energy and grades, from worst to best:

😴 Exhausted → 😐 Tired → 🙂 Balanced → 💪 Focused → ⭐ Star

Actions A & Rewards R(s,a)

Three choices each afternoon — but the happiness they bring depends on Luca's state. A Star Student enjoys everything more:

State	📚 Study	🎮 Play	💤 Rest
😴 Exhausted	+2	+4	0
😐 Tired	+3	+5	+1
🙂 Balanced	+5	+7	+3
💪 Focused	+8	+10	+6
⭐ Star	+12	+14	+10
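
The table maps directly onto a lookup structure. Here is a minimal Python sketch of it; the names `STATES`, `ACTIONS`, `REWARDS`, and `myopic_action` are illustrative assumptions, not anything the interactive demo is known to use. It also shows what a purely short-sighted Luca would do, which is simply to pick the largest number in his row.

```python
STATES = ["Exhausted", "Tired", "Balanced", "Focused", "Star"]
ACTIONS = ["Study", "Play", "Rest"]

# R(s, a): immediate happiness, copied from the table above.
REWARDS = {
    "Exhausted": {"Study": 2, "Play": 4, "Rest": 0},
    "Tired": {"Study": 3, "Play": 5, "Rest": 1},
    "Balanced": {"Study": 5, "Play": 7, "Rest": 3},
    "Focused": {"Study": 8, "Play": 10, "Rest": 6},
    "Star": {"Study": 12, "Play": 14, "Rest": 10},
}


def myopic_action(state):
    """Return the action with the highest immediate reward in `state`."""
    return max(ACTIONS, key=lambda a: REWARDS[state][a])


for s in STATES:
    print(s, "->", myopic_action(s))
```

Looking only at today's reward, Play comes out on top in every row, which is why the rest of the post needs transitions and discounting to say anything more interesting.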

Transitions P(s'|s,a)

Life is stochastic. Studying doesn't guarantee improvement — Luca might be too tired, or the test might be hard. Each action leads to next states with specific probabilities.
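
One way to picture P(s'|s,a) is as a small dictionary of next states and their probabilities, from which each afternoon's outcome is drawn at random. The sketch below uses only the Balanced rows, because those are the only transition probabilities spelled out in this post (they appear in the worked Bellman example further down). The names `TRANSITIONS_BALANCED` and `sample_next_state` are hypothetical.

```python
import random

# P(s' | s = Balanced, a), taken from the worked example later in the post.
TRANSITIONS_BALANCED = {
    "Study": {"Tired": 0.1, "Balanced": 0.3, "Focused": 0.6},
    "Play": {"Tired": 0.4, "Balanced": 0.4, "Focused": 0.2},
    "Rest": {"Balanced": 0.5, "Focused": 0.4, "Star": 0.1},
}


def sample_next_state(dist, rng=random):
    """Draw one next state from a {state: probability} distribution."""
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
outcomes = [sample_next_state(TRANSITIONS_BALANCED["Study"], rng) for _ in range(7)]
print(outcomes)
```

Each call answers the question "Luca was Balanced and studied; where is he tomorrow?" Most draws land on Focused, but not all of them.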

Discount factor γ = 0.9

Future happiness is worth slightly less than today's, not because it matters less, but because there's uncertainty about reaching that future. A reward tomorrow is worth 0.9× a reward today, and a reward k days from now is worth 0.9^k× a reward today.
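
To make the discount concrete, here is a small sketch that adds up a stream of daily rewards with γ = 0.9. The helper name `discounted_return` is an assumption made for illustration, and today's reward is counted at full weight, matching the Bellman equation used later in the post.

```python
GAMMA = 0.9


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t, where t = 0 is today."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Ten afternoons of +7 (Play while Balanced) are worth less than 70 today.
ten_days = discounted_return([7] * 10)
print(round(ten_days, 2))  # 45.59

# A constant reward r repeated forever approaches r / (1 - gamma).
ceiling = 14 / (1 - GAMMA)
print(round(ceiling))
```

The second number is a handy sanity check: even if Luca collected the largest reward in the table (+14) every single day, his total discounted happiness could not exceed roughly 140, so no state can be worth more than that.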

🗺️ Explore the transition graph

Select a state (where Luca is) and an action (what he does) to see where he might end up — and with what probability.

📐 The Bellman equation — what's a state really worth?

To find the best policy we need to know the value of each state: roughly, "how much total happiness will Luca accumulate from now on, if he plays optimally starting from here?" Call it V*(s).

The key insight (due to Richard Bellman, 1957) is that the value of a state can be defined recursively:

V*(s) = max_a [ R(s,a) + γ · Σ_s' P(s'|s,a) · V*(s') ]

Reading left to right: the value of state s is the score of the best action a available there, where an action's score is (immediate reward) + (discount × expected value of the next state). That per-action score is usually written Q(s,a), so V*(s) = max_a Q(s,a).

Iteration 1 (V = 0 everywhere):
Q(Balanced, 📚 Study) = 5 + 0 = 5
Q(Balanced, 🎮 Play) = 7 + 0 = 7 ← winner
Q(Balanced, 💤 Rest) = 3 + 0 = 3

Play wins on pure immediate reward.

After convergence (V ≈ [70, 80, 94, 106, 113] for Exhausted, Tired, Balanced, Focused, Star):
Q(Balanced, 📚 Study) = 5 + 0.9×(0.1×80 + 0.3×94 + 0.6×106) = 5 + 89.8 = 94.8 ← winner
Q(Balanced, 🎮 Play) = 7 + 0.9×(0.4×80 + 0.4×94 + 0.2×106) = 7 + 81.7 = 88.7
Q(Balanced, 💤 Rest) = 3 + 0.9×(0.5×94 + 0.4×106 + 0.1×113) = 3 + 90.6 = 93.6

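Play's +2 immediate lead is overwhelmed because Study most likely moves Luca up to Focused (probability 0.6), on the way to Star: states where Play alone pays 10 to 14 a day, not 7. Play, by contrast, drags him down to Tired 40% of the time. Note that Rest (93.6) is a close second to Study (94.8); it is Play that loses clearly once the long-term math is included.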
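
The three Q-values above can be recomputed in a few lines. This is a minimal sketch using only numbers quoted in this section (the rounded converged values, the Balanced rewards, and the Balanced transition probabilities); the function name `q_value` and the dictionary layout are illustrative assumptions.

```python
GAMMA = 0.9

# Approximate converged state values quoted above.
V = {"Exhausted": 70, "Tired": 80, "Balanced": 94, "Focused": 106, "Star": 113}

# Rewards and transition probabilities for the Balanced state only.
REWARD_BALANCED = {"Study": 5, "Play": 7, "Rest": 3}
P_BALANCED = {
    "Study": {"Tired": 0.1, "Balanced": 0.3, "Focused": 0.6},
    "Play": {"Tired": 0.4, "Balanced": 0.4, "Focused": 0.2},
    "Rest": {"Balanced": 0.5, "Focused": 0.4, "Star": 0.1},
}


def q_value(reward, next_dist, values, gamma=GAMMA):
    """R(s,a) + gamma * sum over s' of P(s'|s,a) * V(s')."""
    expected_next = sum(p * values[s2] for s2, p in next_dist.items())
    return reward + gamma * expected_next


q = {a: q_value(REWARD_BALANCED[a], P_BALANCED[a], V) for a in REWARD_BALANCED}
for action, value in q.items():
    print(f"{action}: {value:.1f}")
best = max(q, key=q.get)
print("best:", best)
```

Running it gives Study 94.8, Play 88.7, and Rest 93.6, with Study as the best action, matching the hand calculation.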

🎮 Play as Luca — try your own policy

Make Luca's choices for 10 days. Try to maximize his total happiness. Your choices will be compared against the optimal policy afterward!

⚙️ Value iteration — watch V(s) converge

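Start with V(s) = 0 everywhere. Repeatedly apply the Bellman update to every state. Each sweep gives a better estimate of how valuable each state truly is, and the process stops once Δ, the largest change in any V(s) between two sweeps, is negligible. The bar chart shows this live.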
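
The loop itself is short. Below is a minimal sketch of value iteration for any finite MDP stored as nested dictionaries. Since the full transition table for Luca is not written out in this post, the demo runs on a deliberately tiny, hypothetical two-state MDP (`A` and `B`, with made-up rewards and deterministic moves) whose answer can be checked by hand. The function and variable names are assumptions as well.

```python
def q_value(s, a, V, reward, trans, gamma):
    """One-step lookahead: R(s,a) + gamma * expected value of the next state."""
    return reward[s][a] + gamma * sum(p * V[s2] for s2, p in trans[s][a].items())


def value_iteration(states, actions, reward, trans, gamma=0.9, tol=1e-6):
    """Return (V, policy, sweeps) once the largest update falls below tol."""
    V = {s: 0.0 for s in states}
    sweeps = 0
    while True:
        # Bellman update for every state, using the previous sweep's values.
        new_V = {
            s: max(q_value(s, a, V, reward, trans, gamma) for a in actions)
            for s in states
        }
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        sweeps += 1
        if delta < tol:
            break
    # Greedy policy: in each state, pick the action with the highest Q-value.
    policy = {
        s: max(actions, key=lambda a: q_value(s, a, V, reward, trans, gamma))
        for s in states
    }
    return V, policy, sweeps


# Hypothetical toy MDP (NOT Luca's): staying in B pays 2 per step,
# staying in A pays 1, and switching states pays 0.
toy_states = ["A", "B"]
toy_actions = ["stay", "go"]
toy_reward = {"A": {"stay": 1, "go": 0}, "B": {"stay": 2, "go": 0}}
toy_trans = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}

V, policy, sweeps = value_iteration(toy_states, toy_actions, toy_reward, toy_trans)
print({s: round(v, 2) for s, v in V.items()}, policy, sweeps)
```

By hand, staying in B forever is worth 2 / (1 - 0.9) = 20, and from A it pays to give up today's +1 and move to B, for 0 + 0.9 × 20 = 18. The loop should settle on those values and on the policy "go" in A, "stay" in B: the same short-term sacrifice for long-term gain that makes Study beat Play for Luca.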

🏆 How did you do vs. the optimal policy?

Complete 10 days above, then run value iteration to convergence — the comparison will appear here automatically.

Written on May 21, 2026

Alessandro Sanvito