Reinforcement Learning: Agents That Learn by Doing
Reinforcement learning (RL) trains agents to make sequential decisions by maximizing cumulative reward through trial and error. Unlike supervised learning, RL has no labeled dataset — the agent learns from the consequences of its own actions in an environment.
The RL Framework
An RL system has five components:
- Agent: the learner and decision-maker
- Environment: everything the agent interacts with
- State (s_t): the current situation
- Action (a_t): what the agent can do
- Reward (r_t): scalar feedback signal
Reinforcement Learning Loop
=============================
┌────────────────┐
│ Environment │
└──┬──────────┬──┘
state │ │ reward
s_t ▼ ▼ r_t
┌────────────────┐
│ Agent │
│ (policy π) │
└───────┬────────┘
│
▼ action a_t
┌────────────────┐
│ Environment │
│ transitions │
│ to s_{t+1} │
└────────────────┘
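The loop above can be sketched end to end with a toy stand-in environment. `CoinFlipEnv` and `policy` are invented here purely for illustration; real code would use a Gym-style environment with the same `reset()`/`step()` shape:

```python
import random

class CoinFlipEnv:
    """Toy environment: reward 1 when the action matches a visible coin."""
    def reset(self):
        self.coin = random.randint(0, 1)
        return self.coin                      # initial state s_t

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.coin = random.randint(0, 1)      # environment transitions to s_{t+1}
        return self.coin, reward, False       # next_state, r_t, done

def policy(state):
    return state  # optimal policy for this toy task: copy the observed coin

env = CoinFlipEnv()
state = env.reset()
total_reward = 0.0
for t in range(10):                  # the agent-environment loop
    action = policy(state)           # agent picks a_t from s_t
    state, reward, done = env.step(action)
    total_reward += reward           # accumulate r_t
print(total_reward)  # 10.0 — this policy matches the coin every step
```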
The agent's goal is to learn a policy $\pi$ that maximizes the expected cumulative discounted reward:

$$J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \, r_t\right]$$

where $\gamma \in [0, 1]$ is the discount factor balancing immediate vs. future rewards.
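As a concrete check of the formula, here is the discounted return computed over a short, made-up reward sequence (the numbers are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    # G = sum_t gamma^t * r_t, accumulated backwards: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```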
Algorithm Taxonomy
| Family | Method | Key Idea | On/Off Policy | Sample Efficiency |
|---|---|---|---|---|
| Value-based | Q-Learning | Learn Q(s,a), act greedily | Off-policy | Low |
| Value-based | DQN | Q-Learning + deep network + replay buffer | Off-policy | Moderate |
| Value-based | Double DQN | Two networks to reduce overestimation | Off-policy | Moderate |
| Policy gradient | REINFORCE | Direct gradient on expected reward | On-policy | Low |
| Policy gradient | PPO | Clipped surrogate objective for stability | On-policy | Moderate |
| Policy gradient | TRPO | Trust region constraint on policy updates | On-policy | Moderate |
| Actor-Critic | A2C/A3C | Policy (actor) + value (critic) combined | On-policy | Moderate |
| Actor-Critic | SAC | Maximum entropy + off-policy actor-critic | Off-policy | High |
| Actor-Critic | TD3 | Twin critics + delayed policy updates | Off-policy | High |
| Model-based | MuZero | Learned model + planning (MCTS) | Off-policy | Very high |
| Model-based | Dreamer | World model + imagination rollouts | Off-policy | Very high |
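The value-based row can be made concrete with tabular Q-learning, whose update is Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]. The 5-state chain environment below is invented for illustration; reward 1 is given only at the right end:

```python
import random
from collections import defaultdict

def q_learning_chain(episodes=2000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning on a 5-state chain; actions: 0 = left, 1 = right."""
    goal = 4
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s = 0
        while s != goal:
            # epsilon-greedy behaviour policy (off-policy: the target is greedy)
            if random.random() < eps:
                a = random.randint(0, 1)
            else:
                a = max((0, 1), key=lambda act: Q[(s, act)])
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == goal else 0.0
            # TD update toward the greedy (max) target
            target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

random.seed(0)
Q = q_learning_chain()
# After training, the greedy policy should prefer "right" everywhere
print(all(Q[(s, 1)] > Q[(s, 0)] for s in range(4)))  # expect True
```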
Deep Q-Networks (DQN)
DQN (Mnih et al., 2015) was the breakthrough that made deep RL practical. Two key innovations:
Experience replay: store transitions in a buffer and sample random mini-batches for training. Breaks temporal correlation and improves data efficiency.
Target network: maintain a slowly-updated copy of the Q-network for computing targets. Prevents the moving target problem where the network chases its own predictions.
```python
import random
from collections import deque

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, x):
        return self.net(x)

# Training setup (simplified)
replay_buffer = deque(maxlen=100_000)
q_net = DQN(state_dim=4, action_dim=2)
target_net = DQN(state_dim=4, action_dim=2)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def update(batch_size=64, gamma=0.99):
    # Sample a random mini-batch to break temporal correlation
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s, a) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Targets come from the slowly-updated target network
    with torch.no_grad():
        next_q = target_net(next_states).max(1).values
        targets = rewards + gamma * next_q * (1 - dones)
    loss = nn.functional.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
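During data collection, actions are usually chosen with an ε-greedy rule on top of the Q-network. A self-contained sketch (the single `nn.Linear` stands in for the DQN above; shapes are illustrative):

```python
import random

import torch
import torch.nn as nn

def epsilon_greedy(q_net, state, epsilon, action_dim):
    """Explore with probability epsilon, otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.randrange(action_dim)        # random exploratory action
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q.argmax(dim=1).item())             # greedy action

net = nn.Linear(4, 2)   # stand-in for a trained DQN
state = [0.1, 0.0, -0.1, 0.0]
a = epsilon_greedy(net, state, epsilon=0.0, action_dim=2)
print(a)  # 0 or 1, whichever Q-value is larger for this state
```

In practice ε is annealed from ~1.0 toward a small floor (e.g. 0.05) over training, shifting from exploration to exploitation.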
PPO: The Industry Standard
Proximal Policy Optimization (PPO) is the most widely used RL algorithm in practice — powering RLHF for LLMs, robotics at OpenAI, and game AI. Its success comes from simplicity and stability:
- Collects trajectories using the current policy
- Optimizes a clipped surrogate objective that prevents large policy updates
- No replay buffer needed (on-policy), but less sample-efficient
The clipped objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimate.
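A minimal PyTorch sketch of the clipped surrogate, assuming log-probabilities and advantages have already been computed (tensor shapes and ε = 0.2 are illustrative defaults):

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Negative clipped surrogate objective (a loss to minimize)."""
    ratio = torch.exp(log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # min() keeps the pessimistic bound, discouraging large policy updates
    return -torch.min(unclipped, clipped).mean()

# A ratio of 2.0 with positive advantage gets clipped at 1 + eps = 1.2,
# so the gradient incentive to push the policy further vanishes
loss = ppo_clip_loss(
    log_probs=torch.log(torch.tensor([2.0])),
    old_log_probs=torch.log(torch.tensor([1.0])),
    advantages=torch.tensor([1.0]),
)
print(loss.item())  # approximately -1.2
```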
Landmark Results
| Achievement | Year | Algorithm | Significance |
|---|---|---|---|
| Atari games (superhuman) | 2015 | DQN | First deep RL success at scale |
| AlphaGo (beats world champion) | 2016 | MCTS + policy/value nets | Go was considered decades away |
| OpenAI Five (Dota 2) | 2019 | PPO + self-play | Complex multi-agent strategy |
| AlphaFold 2 (protein folding) | 2020 | RL-inspired search | Solved a 50-year biology problem |
| RLHF for ChatGPT | 2022 | PPO | Aligned LLMs to human preferences |
| RT-2 (robotic manipulation) | 2023 | Vision-Language-Action | Transferred web knowledge to robots |
Challenges
Reward shaping: designing reward functions is hard. Sparse rewards (only at goal) make exploration difficult. Dense rewards risk reward hacking — the agent finds shortcuts that maximize reward without solving the intended task.
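One principled antidote is potential-based shaping, which adds F(s, s') = γΦ(s') − Φ(s) to the reward and provably preserves the optimal policy (Ng et al., 1999). A sketch with a hypothetical distance-to-goal potential:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    The shaping term telescopes over any trajectory, so the optimal
    policy of the original task is unchanged.
    """
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: negative distance to a goal at position 10
phi = lambda s: -abs(10 - s)

# Moving toward the goal (5 -> 6) earns a bonus: 0.99*(-4) - (-5) = 1.04
print(shaped_reward(0.0, 5, 6, phi))
# Moving away (5 -> 4) is penalized: 0.99*(-6) - (-5) = -0.94
print(shaped_reward(0.0, 5, 4, phi))
```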
Sample efficiency: model-free RL often requires millions of environment interactions. Model-based methods and offline RL (learning from logged data) address this but add complexity.
Sim-to-real transfer: policies trained in simulation often fail in the real world due to the reality gap. Domain randomization and system identification help bridge it.
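Domain randomization amounts to resampling simulator parameters every episode so the policy cannot overfit one physics configuration. A sketch (the parameter names, ranges, and `make_sim` constructor are all hypothetical):

```python
import random

def randomized_physics(rng):
    """Sample simulator parameters for one episode (domain randomization)."""
    return {
        "mass":     rng.uniform(0.8, 1.2),   # +/-20% around nominal mass
        "friction": rng.uniform(0.5, 1.5),
        "latency":  rng.choice([0, 1, 2]),   # action delay, in timesteps
    }

rng = random.Random(42)
for episode in range(3):
    params = randomized_physics(rng)
    # env = make_sim(**params)  # hypothetical simulator constructor
    print(params)
```

A policy that succeeds across the whole sampled distribution is more likely to treat the real world as just one more draw from it.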
Multi-agent learning: when multiple agents learn simultaneously, the environment becomes non-stationary from each agent's perspective, and single-agent convergence guarantees break down.