
Reinforcement Learning: Agents That Learn by Doing

#machine-learning#reinforcement-learning#deep-learning#optimization

Reinforcement learning (RL) trains agents to make sequential decisions by maximizing cumulative reward through trial and error. Unlike supervised learning, RL has no labeled dataset — the agent learns from the consequences of its own actions in an environment.

The RL Framework

An RL system has five core components:

  • Agent: the learner and decision-maker
  • Environment: everything the agent interacts with
  • State (s): the current situation
  • Action (a): what the agent can do
  • Reward (r): scalar feedback signal

Reinforcement Learning Loop
=============================

         ┌────────────────┐
         │   Environment  │
         └──┬──────────┬──┘
    state   │          │  reward
    s_t     ▼          ▼  r_t
         ┌────────────────┐
         │     Agent      │
         │  (policy π)    │
         └───────┬────────┘
                 │
                 ▼ action a_t
         ┌────────────────┐
         │   Environment  │
         │   transitions  │
         │   to s_{t+1}   │
         └────────────────┘

The agent's goal is to learn a policy π(a|s) that maximizes the expected cumulative discounted return:

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

where γ ∈ [0, 1) is the discount factor balancing immediate against future rewards.
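As a concrete check on the formula, the return can be computed by folding the reward sequence from the end of an episode backwards — a minimal sketch (the function name is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    # Fold backwards: G_t = r_{t+1} + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5, three unit rewards give 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Note how a lower γ makes the agent myopic: at γ = 0 only the very next reward counts.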

Algorithm Taxonomy

| Family | Method | Key Idea | On/Off Policy | Sample Efficiency |
|---|---|---|---|---|
| Value-based | Q-Learning | Learn Q(s,a), act greedily | Off-policy | Low |
| Value-based | DQN | Q-Learning + deep network + replay buffer | Off-policy | Moderate |
| Value-based | Double DQN | Two networks to reduce overestimation | Off-policy | Moderate |
| Policy gradient | REINFORCE | Direct gradient on expected reward | On-policy | Low |
| Policy gradient | PPO | Clipped surrogate objective for stability | On-policy | Moderate |
| Policy gradient | TRPO | Trust region constraint on policy updates | On-policy | Moderate |
| Actor-Critic | A2C/A3C | Policy (actor) + value (critic) combined | On-policy | Moderate |
| Actor-Critic | SAC | Maximum entropy + off-policy actor-critic | Off-policy | High |
| Actor-Critic | TD3 | Twin critics + delayed policy updates | Off-policy | High |
| Model-based | MuZero | Learned model + planning (MCTS) | Off-policy | Very high |
| Model-based | Dreamer | World model + imagination rollouts | Off-policy | Very high |
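The tabular Q-Learning entry above reduces to a one-line temporal-difference update; a minimal sketch (the function name and toy states are illustrative):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy TD target: bootstrap from the greedy next action,
    # regardless of which action the behavior policy actually takes.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One update on an empty table: Q(0, 1) moves toward the reward of 1.0
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

With alpha = 0.1 and an all-zero table, a single reward of 1.0 nudges Q(0, 1) to 0.1; repeated updates converge toward the true action value.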

Deep Q-Networks (DQN)

DQN (Mnih et al., 2015) was the breakthrough that made deep RL practical. Two key innovations:

Experience replay: store transitions (s, a, r, s′) in a buffer and sample random mini-batches for training. Breaks temporal correlation and improves data efficiency.

Target network: maintain a slowly-updated copy of the Q-network for computing targets. Prevents the moving target problem where the network chases its own predictions.

import torch
import torch.nn as nn
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)

# Training loop (simplified)
replay_buffer = deque(maxlen=100_000)
q_net = DQN(state_dim=4, action_dim=2)
target_net = DQN(state_dim=4, action_dim=2)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def update(batch_size=64, gamma=0.99):
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)

    # Convert the sampled transitions to tensors
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Q(s, a) for the actions actually taken
    q_values = q_net(states).gather(1, actions).squeeze(1)
    with torch.no_grad():
        # Bootstrap from the slowly-updated target network;
        # (1 - dones) zeroes the bootstrap at terminal states
        next_q = target_net(next_states).max(1)[0]
        targets = rewards + gamma * next_q * (1 - dones)

    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
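The training loop above still needs an exploration strategy; DQN conventionally uses ε-greedy action selection. A self-contained sketch (the function name is illustrative):

```python
import random
import torch
import torch.nn as nn

def select_action(q_net, state, epsilon, action_dim):
    # Explore with probability epsilon, otherwise act greedily on Q-values
    if random.random() < epsilon:
        return random.randrange(action_dim)
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q.argmax(dim=1).item())

# Toy network whose bias makes action 1 the greedy choice everywhere
net = nn.Linear(4, 2)
with torch.no_grad():
    net.weight.zero_()
    net.bias.copy_(torch.tensor([0.0, 1.0]))
```

In a full loop, ε is typically annealed from 1.0 down to around 0.05 over training, and the target network is synced with `target_net.load_state_dict(q_net.state_dict())` every few thousand steps.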

PPO: The Industry Standard

Proximal Policy Optimization (PPO) is the most widely used RL algorithm in practice — powering RLHF for LLMs, robotics at OpenAI, and game AI. Its success comes from simplicity and stability:

  • Collects trajectories using the current policy
  • Optimizes a clipped surrogate objective that prevents large policy updates
  • No replay buffer needed (on-policy), but less sample-efficient

The clipped objective:

L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]

where r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} is the probability ratio between the new and old policies, and \hat{A}_t is the advantage estimate.
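The clipped objective translates almost line for line into PyTorch. A minimal sketch, assuming log-probabilities and advantages are precomputed (the function name and signature are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Probability ratio r_t(theta), computed in log space for stability
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negated because optimizers minimize; the surrogate is maximized
    return -torch.min(unclipped, clipped).mean()
```

The min with the clipped term removes the incentive to push the ratio outside [1−ε, 1+ε]: a ratio of 2 with a positive advantage earns no more than a ratio of 1.2.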

Landmark Results

| Achievement | Year | Algorithm | Significance |
|---|---|---|---|
| Atari games (superhuman) | 2015 | DQN | First deep RL success at scale |
| AlphaGo (beats world champion) | 2016 | MCTS + policy/value nets | Go was considered decades away |
| OpenAI Five (Dota 2) | 2019 | PPO + self-play | Complex multi-agent strategy |
| AlphaFold 2 (protein folding) | 2020 | RL-inspired search | Solved a 50-year biology problem |
| RLHF for ChatGPT | 2022 | PPO | Aligned LLMs to human preferences |
| RT-2 (robotic manipulation) | 2023 | Vision-Language-Action | Transferred web knowledge to robots |

Challenges

Reward shaping: designing reward functions is hard. Sparse rewards (only at goal) make exploration difficult. Dense rewards risk reward hacking — the agent finds shortcuts that maximize reward without solving the intended task.
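Potential-based shaping (Ng et al., 1999) is one principled way to densify a sparse reward without changing the optimal policy: add F = γΦ(s′) − Φ(s) for any potential Φ. A toy grid-world sketch (function names and the distance-based potential are illustrative):

```python
def sparse_reward(state, goal):
    # Reward only at the goal — exploration is hard
    return 1.0 if state == goal else 0.0

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def shaped_reward(prev_state, state, goal, gamma=0.99):
    # Potential phi(s) = -distance to goal; the shaping term
    # F = gamma * phi(s') - phi(s) leaves the optimal policy unchanged
    phi_prev = -manhattan(prev_state, goal)
    phi = -manhattan(state, goal)
    return sparse_reward(state, goal) + gamma * phi - phi_prev
```

Steps toward the goal now yield positive feedback and steps away yield negative feedback, while the invariance result guards against the reward-hacking risk of ad hoc dense rewards.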

Sample efficiency: model-free RL often requires millions of environment interactions. Model-based methods and offline RL (learning from logged data) address this but add complexity.

Sim-to-real transfer: policies trained in simulation often fail in the real world due to the reality gap. Domain randomization and system identification help bridge it.
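Domain randomization can be as simple as resampling simulator parameters at every episode reset, so the policy learns to cope with a family of dynamics rather than one. A hedged sketch — the parameter names and ranges here are invented placeholders:

```python
import random

def randomized_physics():
    # Hypothetical ranges; in practice they come from system
    # identification of the real hardware
    return {
        "friction": random.uniform(0.5, 1.5),
        "link_mass_scale": random.uniform(0.8, 1.2),
        "actuator_delay_steps": random.randint(0, 3),
    }

# Resample once per episode and configure the simulator with the result
params = randomized_physics()
```

A policy that succeeds across the whole sampled family is more likely to treat the real world as just another draw from the distribution.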

Multi-agent: when multiple agents learn simultaneously, the environment becomes non-stationary from each agent's perspective. Convergence guarantees break down.