Introduction
Reinforcement Learning (RL) has exploded in popularity — first in game-playing agents, and now in large language models via methods like RLHF (Reinforcement Learning from Human Feedback). These approaches don’t just align model outputs with human preferences; RL-based training is also used to improve reasoning by rewarding models for working “step by step.”
My fascination with RL began when the GPT-3 paper was published and ChatGPT emerged as the so-called “tool of the decade.” I wanted to go beyond using these models — I wanted to understand how they work under the hood. That meant building RL concepts from the ground up: deriving equations, implementing toy solutions in environments like CartPole and FrozenLake, and seeing theory come alive in code.
This post is the first in a series where we’ll start with foundations:
- What is an MDP and how does it differ from an HMM?
- How do rewards and value functions actually work?
- How can we move from knowing state values to learning optimal actions with Q-learning?
- And how do we scale that to Deep Q-Learning when the state space explodes?
By the end, you’ll have a baseline RL toolkit — from mathematical definitions to runnable code — and a clear picture of how these pieces fit together when building agents.
Why This Series?
When I first encountered MDPs, I assumed they were simply an extension of Hidden Markov Models. I quickly learned this was wrong. HMMs deal with hidden states and observations, while MDPs deal with fully observable states and decision-making under uncertainty. Understanding this difference changed how I approached RL problems.
This blog is written in a learning-by-doing style — meaning you’ll see the concepts, math, and code side by side, along with real-world analogies (including fraud detection examples) and small “try-it-yourself” prompts.
Part 1: Foundations with MDPs (and an “HMM aside”)
Markov Decision Process (MDP)
A Markov Decision Process (MDP) models an agent interacting with an environment over discrete time steps. It’s defined by the 5-tuple $(S, A, P, R, \gamma)$ where:
- $S$ = states
- $A$ = actions
- $P(s'|s,a)$ = transition probability of reaching $s'$ after taking action $a$ in state $s$
- $R(s,a,s')$ = reward for this transition
- $\gamma \in [0,1)$ = discount factor for future rewards
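To make the 5-tuple concrete, here’s a minimal sketch of a two-state MDP encoded as plain Python dictionaries (the state names, action names, and numbers are invented for illustration):

```python
# A toy MDP with two states and two actions.
# P[s][a] is a list of (prob, next_state, reward) tuples.
states = ["low", "high"]
actions = ["wait", "work"]
gamma = 0.9  # discount factor

P = {
    "low": {
        "wait": [(1.0, "low", 0.0)],                       # waiting keeps you in "low"
        "work": [(0.8, "high", 5.0), (0.2, "low", -1.0)],  # working usually pays off
    },
    "high": {
        "wait": [(0.9, "high", 2.0), (0.1, "low", 0.0)],
        "work": [(1.0, "high", 3.0)],
    },
}
```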
The Markov Property
The future is conditionally independent of the past, given the present state.
Formally: $$P(s_{t+1} \mid s_t, a_t, s_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$$
Why MDPs Matter
- They formalize sequential decision making: robotics, games, recommendation systems.
- They give a clear mathematical foundation for defining and solving RL problems via value functions, policy search, and dynamic programming.
How do Markov Decision Processes differ from Hidden Markov Models?
- HMMs deal with partial observability (you only see observations, not the actual underlying states).
- Like MDPs, HMMs are first-order Markov models: the next hidden state depends only on the current one, so their effective “context window” is a single step. The deeper difference is that HMMs have no actions or rewards; they are generative sequence models, not decision-making frameworks.
HMM vs MDPs
| Aspect | HMM | MDP |
|---|---|---|
| State | Hidden (unobserved) | Fully observed |
| Observations | Emitted from hidden states | Not applicable (agent directly observes state) |
| Actions | No actions; just a generative sequence model | Agent chooses actions $a$ in each state $s$ |
| Transition Model | $P(h_{t+1} \mid h_t)$ | $P(s_{t+1} \mid s_t,a_t)$ |
| Emission/Reward | Emission: $P(o_t \mid h_t)$ | Reward: $R(s_t,a_t,s_{t+1})$ |
| Objective | Compute likelihood or decode hidden path | Maximize expected cumulative reward |
| Algorithms | Forward/backward; Viterbi; Baum–Welch (EM) | Value iteration; policy iteration; Q-learning; policy gradients |
| Use Cases | Sequence labeling (speech, POS, bioinfo) | Sequential decision-making (robotics, games, control) |
Reward Functions
Definition
A reward function tells you how “good” a single transition is. Formally:
$$R(s,a,s') = \text{expected immediate reward when you take action } a \text{ in state } s \text{ and end up in } s'$$
| Symbol | Meaning |
|---|---|
| $s$ | Current state |
| $a$ | Action taken |
| $s'$ | Next state |
| $R(s,a,s')$ | Reward you get for that $(s \to s')$ transition |
Intuition
- In a game, you might get +10 points for eating a pellet or –100 if you hit a ghost.
- In finance, you might receive +$5 if a trade succeeds or –$2 if it fails.
- In fraud detection, you might get +$10 if you correctly detect fraud, -$100 if you miss fraud, -$20 for a false positive, and +$5 if you correctly detect non-fraud.
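As a sketch, the fraud-detection payoffs above can be written as a tiny reward function. Here the “state” is just the ground-truth label and the “action” is the model’s decision; the function name and signature are illustrative:

```python
def fraud_reward(is_fraud: bool, flagged: bool) -> float:
    """Reward for one fraud-detection decision, using the payoffs above."""
    if is_fraud and flagged:
        return 10.0    # correctly caught fraud
    if is_fraud and not flagged:
        return -100.0  # missed fraud: the costly mistake
    if not is_fraud and flagged:
        return -20.0   # false positive: annoyed customer
    return 5.0         # correctly passed a legitimate transaction
```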
State-Value Function $V^\pi(s)$
Under a given policy $\pi$, the value of a state $s$ is the expected, discounted sum of future rewards when you start in $s$ and follow $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r_{t+1} \mid S_0=s\right]$$
- $\gamma$ trades off immediate vs. long-term reward.
- High $V^\pi(s)$ means “good to be here under $\pi$”
- $r_{t+1} = R(S_t, A_t, S_{t+1})$
- $\mathbb{E}$ = average over all possible futures under $\pi$
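One way to feel this definition is iterative policy evaluation: repeatedly back up “reward now + discounted value later” until $V$ stops changing. Here’s a minimal sketch on a made-up two-state MDP under a fixed random policy (all names and numbers are illustrative):

```python
import numpy as np

# Toy MDP (invented numbers): P[s][a] = list of (prob, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 5.0), (0.2, 0, -1.0)]},
    1: {0: [(0.9, 1, 2.0), (0.1, 0, 0.0)], 1: [(1.0, 1, 3.0)]},
}
gamma = 0.9
pi = {0: [0.5, 0.5], 1: [0.5, 0.5]}  # fixed uniform-random policy

V = np.zeros(2)
for _ in range(1000):  # sweep until the values stop changing
    V_new = np.zeros(2)
    for s in P:
        for a, p_a in enumerate(pi[s]):
            for prob, s2, r in P[s][a]:
                V_new[s] += p_a * prob * (r + gamma * V[s2])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # V^pi(s) for each state under the random policy
```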
Action-Value Function $Q^\pi(s,a)$
$Q^\pi(s, a)$ is the expected (average) discounted return if you take action $a$ in state $s$ now and then follow the policy $\pi$ afterwards.
Why do we need the Action-Value function $Q(s,a)$?
Short answer:
Because if you only know how good a state is (that’s $V(s)$), you still don’t know which action to take from that state without either (a) a model of the world to look ahead, or (b) evaluating every action by trial. $Q(s,a)$ tells you the expected return if you take action $a$ now in state $s$ and then follow the policy afterwards. That lets you pick the best action locally, without a model.
Intuition (intersection analogy)
You’re at a road intersection (state $s$). Knowing “this intersection is promising” (high $V(s)$) doesn’t tell you whether to turn left or right. The thing you actually need at the decision point is: “If I turn left right now, how good is that?” That number is $Q(s,\text{left})$. Likewise for right.
$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty}\gamma^k r_{t+1+k} \mid S_t=s, A_t=a\right]$$
Action-Value as an Expectation (from returns to Bellman)
TL;DR:
We turn the scary infinite return into a local recursion: “reward now + discounted value later,” averaged over what can happen next. This uses two facts:
- the law of total expectation
- the Markov property
Step 1: Start from the definition (returns view)
$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty}\gamma^k r_{t+1+k} \mid S_t=s, A_t=a\right]$$
Step 2: Peel off one step
Separate the immediate reward from everything after:
$$Q^\pi(s,a) = \mathbb{E}\left[r_{t+1} + \gamma G_{t+1} \mid S_t=s, A_t=a\right]$$
where $G_{t+1} = \sum_{k=0}^{\infty}\gamma^k r_{t+2+k}$ is “the return starting next step.”
Step 3: Condition on the next state $s’$ (law of total expectation)
$$Q^\pi(s,a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\mathbb{E}\left[r_{t+1} + \gamma G_{t+1} \mid S_t=s,A_t=a,S_{t+1}=s'\right]\right]$$
Step 4: Use the Markov property (make it local)
Given $(s,a,s')$:
- Immediate reward depends only on that transition: $\mathbb{E}[r_{t+1}|s,a,s'] = R(s,a,s')$
- The future return depends only on the next state (and then following $\pi$): $\mathbb{E}[G_{t+1}|S_{t+1}=s'] = V^\pi(s')$
So the inner expectation becomes $R(s,a,s') + \gamma V^\pi(s')$, yielding:
$$Q^\pi(s,a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[R(s,a,s') + \gamma V^\pi(s')\big]$$
Step 5: Expand $V^\pi$ over next actions
By definition of the state value: $V^\pi(s') = \sum_{a'} \pi(a'|s') Q^\pi(s',a')$
Putting it all together gives the Bellman expectation equation for $Q^\pi$:
$$\boxed{Q^\pi(s,a) = \sum_{s'} P(s'|s,a)\Big[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')\Big]}$$
Simple toy example
From state $s$, action $a$ leads to:
- $s_1$ with prob 0.7, reward 5
- $s_2$ with prob 0.3, reward 0
Let $\gamma=0.9$, and suppose $V^\pi(s_1)=10$ and $V^\pi(s_2)=2$. Then:
$$\begin{align} Q^\pi(s,a) &= 0.7(5 + 0.9 \times 10) + 0.3(0 + 0.9 \times 2) \\ &= 0.7(14) + 0.3(1.8) \\ &= 9.8 + 0.54 \\ &= \mathbf{10.34} \end{align}$$
It’s a weighted average of “reward now + discounted value later” over possible next states.
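You can check the arithmetic with a few lines of Python:

```python
gamma = 0.9
# (probability, immediate reward, V of the next state) for each outcome
outcomes = [(0.7, 5.0, 10.0), (0.3, 0.0, 2.0)]
q = sum(p * (r + gamma * v) for p, r, v in outcomes)
print(q)  # ≈ 10.34 (up to float rounding)
```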
Q-Learning
Knowing state value is great—if you also have a model of the world to look ahead. But when you don’t, you want to know directly: How good is taking action $a$ in state $s$ right now? That’s the action-value ($Q(s,a)$), and Q-learning learns it from experience with no model required.
Tabular Q-Learning — from optimality to an update you can code
1) Objective: optimal action-values
$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\Big[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\Big]$$
This is the Bellman optimality equation. If we had $Q^*$, the greedy policy $\pi^*(s) = \arg\max_a Q^*(s,a)$ would be optimal.
2) One-step TD target (what we aim at each step)
Replace the expectation over $s’$ with a sample from the environment:
$$y_{\text{target}} = r + \gamma \max_{a'} Q(s',a')$$
This is a sampled estimate of the right-hand side of Bellman optimality.
3) Q-learning update (move toward the target)
$$Q(s,a) \leftarrow Q(s,a) + \alpha \Big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\Big]$$
- $\alpha$ is the learning rate
- We bootstrap using our own $Q$ on $s'$
- Using $\max_{a'}$ (rather than the next action actually taken) makes Q-learning off-policy: it learns about the greedy policy even if behavior is exploratory
4) Exploration: ε-greedy behavior policy
We still need to visit state-actions to learn them:
- With probability ε: take a random action (explore)
- Otherwise: take $\arg\max_a Q(s,a)$ (exploit)
Typical schedule: start ε at 1.0, decay to 0.1 (or 0.01) over many episodes.
5) Tiny numeric example (feel the TD step)
Suppose at $(s,a)$ you observe:
- reward $r=1.0$, next state $s'$
- current $Q(s,a)=0.50$
- $\max_{a'} Q(s',a') = 0.80$
- $\gamma=0.9$, $\alpha=0.5$
Target: $y = r + \gamma \max_{a'} Q(s',a') = 1.0 + 0.9 \times 0.80 = 1.72$
Update: $Q_{\text{new}}(s,a) = 0.50 + 0.5 \times (1.72 - 0.50) = \mathbf{1.11}$
You’ve moved halfway toward the target.
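The same step in code, mirroring the numbers above:

```python
q_sa, r, q_next_max = 0.50, 1.0, 0.80
gamma, alpha = 0.9, 0.5
target = r + gamma * q_next_max  # 1.72
q_sa += alpha * (target - q_sa)  # 0.50 + 0.5 * 1.22 = 1.11
print(target, q_sa)
```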
6) SARSA vs Expected SARSA vs Q-learning (quick contrast)
- SARSA (on-policy) uses the actual next action $a' \sim \pi$: $$Q \leftarrow Q + \alpha[r + \gamma Q(s',a') - Q]$$
- Expected SARSA computes the average over next actions: $$Q \leftarrow Q + \alpha[r + \gamma \sum_{a'} \pi(a'|s')Q(s',a') - Q]$$
- Q-learning (off-policy) uses the max over actions: $$Q \leftarrow Q + \alpha[r + \gamma \max_{a'} Q(s',a') - Q]$$
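To make the contrast concrete, here’s a small sketch of how the three bootstrap terms differ (`Q` is a tabular NumPy array; the function name and `pi_next` are illustrative):

```python
import numpy as np

def td_target(kind, r, gamma, Q, s2, a2=None, pi_next=None):
    """One-step TD target; the three variants differ only in the bootstrap term."""
    if kind == "sarsa":              # on-policy: the next action actually taken
        bootstrap = Q[s2, a2]
    elif kind == "expected_sarsa":   # expectation over next actions under pi
        bootstrap = np.dot(pi_next, Q[s2])
    else:                            # "q_learning": greedy max (off-policy)
        bootstrap = np.max(Q[s2])
    return r + gamma * bootstrap
```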
7) Minimal NumPy implementation (FrozenLake, tabular)
```python
import numpy as np
import gymnasium as gym  # maintained successor of the original `gym` package

env = gym.make('FrozenLake-v1', is_slippery=False)
nS, nA = env.observation_space.n, env.action_space.n
Q = np.zeros((nS, nA))

alpha, gamma = 0.8, 0.9
eps, eps_min, eps_decay = 1.0, 0.1, 0.995
episodes = 2000

def eps_greedy(s):
    # Explore with probability eps, otherwise act greedily w.r.t. Q
    if np.random.rand() < eps:
        return env.action_space.sample()
    return np.argmax(Q[s])

for ep in range(episodes):
    s, _ = env.reset()  # gymnasium's reset returns (obs, info)
    done = False
    while not done:
        a = eps_greedy(s)
        s2, r, terminated, truncated, _ = env.step(a)  # 5-tuple in gymnasium
        done = terminated or truncated
        # Zero out the bootstrap term at true terminal states
        target = r + (0 if terminated else gamma * np.max(Q[s2]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
    eps = max(eps_min, eps * eps_decay)

print("Q-table:\n", Q)
```
Terminal states: when the episode terminates, set the bootstrapped term to 0 (as above). On a time-limit truncation you would still bootstrap, since the episode didn’t really end.
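Continuing from the training loop above, the greedy policy falls straight out of the learned table:

```python
policy = np.argmax(Q, axis=1)  # best action index per state
print(policy.reshape(4, 4))    # FrozenLake-v1's default map is a 4x4 grid
```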
Deep Q-Learning
Q-learning works only as long as it’s feasible to keep a table entry for every state-action pair. As we move from simple games like tic-tac-toe to something as complex as chess, the number of states explodes combinatorially. At that point, maintaining a Q-table becomes prohibitively memory- and compute-intensive, and learning becomes painfully slow. Deep Q-learning offers a way past this barrier.
Tabular methods break when the state space is huge. DQN replaces the table with a neural net $Q_\theta(s,a)$ that outputs all action values from a state.
Core ideas (why DQN works)
- Function approximation: $Q_\theta(s,a)$ via a neural net
- Experience replay: store $(s,a,r,s',\text{done})$ transitions and train on random mini-batches to break correlation between consecutive samples
- Target network: a periodically synced copy $Q_{\theta^-}$ provides stable bootstrapped targets (used below)
Loss and targets
For a batch $\mathcal{B}$ of transitions:
$$y_i = r_i \quad \text{if episode is done}$$
$$y_i = r_i + \gamma \max_{a'} Q_{\theta^-}(s'_i, a') \quad \text{otherwise}$$
Squared loss (often Huber in practice):
$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \big( y_i - Q_\theta(s_i, a_i) \big)^2$$
Gradient step: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)$
Target network update (periodic hard update every $C$ steps): $\theta^- \leftarrow \theta$
(or soft update: $\theta^- \leftarrow \tau \theta + (1-\tau)\theta^-$)
Double DQN (reduces overestimation bias)
Use the online net to pick the action and the target net to evaluate it:
$$y_i = r_i + \gamma Q_{\theta^-}(s'_i, a^*)$$
where $a^* = \arg\max_{a'} Q_\theta(s'_i,a')$
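In code this is only a couple of lines. Here’s a sketch assuming PyTorch, where `q_net` (online) and `target_net` are assumed `nn.Module`s and the batch tensors come from a replay buffer:

```python
import torch

def double_dqn_targets(r, s2, done, q_net, target_net, gamma=0.99):
    """Double DQN target: the online net selects a*, the target net evaluates it.
    r and done have shape [B]; s2 has shape [B, obs_dim]."""
    with torch.no_grad():
        a_star = q_net(s2).argmax(dim=1, keepdim=True)        # action selection (online net)
        q_eval = target_net(s2).gather(1, a_star).squeeze(1)  # action evaluation (target net)
        return r + gamma * (1.0 - done) * q_eval              # no bootstrap past terminal states
```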
Code Example for Deep Q-Learning
For a complete implementation example, check out this GitHub gist with a working DQN on CartPole.
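Until then, here’s a minimal sketch that wires together the ideas above (replay buffer, ε-greedy behavior, online/target nets, periodic hard sync), assuming PyTorch and Gymnasium. It’s illustrative rather than a tuned implementation, and all hyperparameters are placeholder values:

```python
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
n_obs, n_act = env.observation_space.shape[0], env.action_space.n

def make_net():
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())  # start in sync
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)  # experience replay
gamma, eps, batch_size, sync_every = 0.99, 1.0, 64, 500
step_count = 0

for ep in range(300):
    s, _ = env.reset()
    done = False
    while not done:
        # ε-greedy behavior policy
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = q_net(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        buffer.append((s, a, r, s2, float(terminated)))  # still bootstrap past truncation
        s = s2
        step_count += 1

        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)  # random mini-batch breaks correlation
            S, A, R, S2, D = (torch.as_tensor(np.array(x), dtype=torch.float32)
                              for x in zip(*batch))
            q = q_net(S).gather(1, A.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():  # targets come from the frozen copy
                y = R + gamma * (1 - D) * target_net(S2).max(dim=1).values
            loss = nn.functional.smooth_l1_loss(q, y)  # Huber, as noted above
            opt.zero_grad(); loss.backward(); opt.step()

        if step_count % sync_every == 0:  # periodic hard target update
            target_net.load_state_dict(q_net.state_dict())
    eps = max(0.05, eps * 0.99)  # decay exploration per episode
```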
What’s Next?
In the next post, we’ll dive deeper into:
- Policy Gradient methods (REINFORCE, Actor-Critic)
- Advanced DQN variants (Dueling DQN, Prioritized Experience Replay)
- Real-world applications in fraud detection and recommendation systems
Stay tuned for more hands-on RL content! 🚀
Have questions or thoughts about this post? Feel free to reach out - I’d love to discuss RL concepts and applications!