
Unified Policy Gradient Algorithm

The structure mirrors the Q-learning file. The core update() is always the same three steps — compute advantages, compute the policy loss, take a gradient step — and the variants only swap out what feeds into those steps.
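As a sketch of that shared structure (the hook names `compute_advantages` and `policy_loss` are illustrative, not from any particular library):

```python
class PolicyGradient:
    """Minimal sketch: the core update() is fixed; variants only swap
    the two pluggable hooks that feed it."""

    def __init__(self, compute_advantages, policy_loss):
        self.compute_advantages = compute_advantages  # pluggable: produces Psi
        self.policy_loss = policy_loss                # pluggable: loss from Psi

    def update(self, rollout):
        psi = self.compute_advantages(rollout)   # 1. compute advantages
        loss = self.policy_loss(rollout, psi)    # 2. compute policy loss
        # 3. gradient step on the policy would go here (optimizer.step())
        return loss
```

Each variant below is then just a different pair of hooks, until PPO, which also touches the loop itself.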

The progression tells a clean story:

REINFORCE → works, but high variance because Ψ is the raw return for the entire trajectory. A good action in a bad episode gets punished.
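A minimal sketch of the REINFORCE plug-ins (helper names are hypothetical): Ψ_t is the discounted return-to-go G_t, and the loss is −log π(a_t|s_t) · G_t. Every action is scaled by a return covering the rest of the trajectory, which is exactly where the variance comes from.

```python
def returns_to_go(rewards, gamma=0.99):
    """Psi_t = G_t: discounted return from step t to the end of the episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G   # G_t = r_t + gamma * G_{t+1}
        out.append(G)
    return out[::-1]

def reinforce_loss(log_probs, returns):
    """-log pi(a_t|s_t) * G_t, averaged over the trajectory."""
    return -sum(lp * g for lp, g in zip(log_probs, returns)) / len(returns)
```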

+ Baseline → subtract V(s) so the signal becomes “better or worse than expected.” Same expected gradient, massively less variance. This is the single biggest practical improvement.
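A quick numeric illustration of why the baseline helps (the numbers here are made up): subtracting V(s) centres the signal near zero and shrinks its spread, without changing which actions were better than expected.

```python
from statistics import pvariance

returns = [10.0, 12.0, 8.0, 11.0]   # G_t: all large and positive
values  = [10.0, 11.5, 8.5, 10.5]   # V(s_t): learned expectations

# Psi_t = G_t - V(s_t): "better or worse than expected"
advantages = [g - v for g, v in zip(returns, values)]

print(advantages)                                   # [0.0, 0.5, -0.5, 0.5]
print(pvariance(returns), pvariance(advantages))    # spread drops sharply
```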

A2C → swap Monte Carlo returns for GAE, which blends multi-step TD errors. You get to tune the bias-variance tradeoff with λ, and you no longer need to wait for episodes to end.
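A sketch of GAE following the standard recursion A_t = δ_t + γλ·A_{t+1} (the function name is illustrative). λ=0 reduces to the one-step TD advantage; λ=1 recovers the Monte Carlo return minus the baseline.

```python
def gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one rollout segment.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)   (one-step TD error)
    A_t     = delta_t + gamma * lam * A_{t+1}     (exponential blend)
    """
    advantages, adv, next_value = [], 0.0, last_value
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v
        adv = delta + gamma * lam * adv
        advantages.append(adv)
        next_value = v
    return advantages[::-1]
```

Because it bootstraps from V, the segment can stop anywhere: pass the value estimate of the final state as `last_value`, and there is no need to wait for the episode to end.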

PPO → same advantages as A2C, but changes how the gradient is used. Instead of doing one update and then throwing the data away, run K epochs of minibatch updates with the probability ratio clipped so that no single step can wreck the policy. This is what makes PPO practical: it’s dramatically more sample-efficient while staying stable.
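The clipped objective itself is small; a per-step sketch in pure Python:

```python
import math

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """-min(r * A, clip(r, 1-eps, 1+eps) * A), averaged over the batch."""
    total = 0.0
    for lp, old_lp, a in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)    # r = pi(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # The min caps the incentive: once the ratio leaves [1-eps, 1+eps]
        # in the direction the advantage pushes, the gradient is zero.
        total -= min(ratio * a, clipped * a)
    return total / len(advantages)
```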

The key structural difference from Q-learning that PPO highlights: it overrides update() itself to add the multi-epoch minibatch loop. That’s the one place where the “core never changes” rule bends, because PPO’s whole point is reusing the same rollout multiple times — which is an outer-loop concern, not just a different loss.
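A sketch of that override (the `loss_fn` and `step_fn` hooks are hypothetical): update() gains an outer loop over epochs and shuffled minibatches, while each individual gradient step still uses the clipped loss.

```python
import random

def ppo_update(rollout, loss_fn, step_fn, epochs=4, minibatch_size=64):
    """Reuse one on-policy rollout for several epochs of minibatch steps,
    then discard it. loss_fn computes the clipped surrogate on a minibatch;
    step_fn applies one gradient step."""
    indices = list(range(len(rollout)))
    for _ in range(epochs):
        random.shuffle(indices)
        for start in range(0, len(indices), minibatch_size):
            batch = [rollout[i] for i in indices[start:start + minibatch_size]]
            step_fn(loss_fn(batch))
    # rollout is dropped here: still on-policy, just stretched over K epochs
```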

Summary: What changes vs. what stays the same

  • Collect rollout with current policy
  • Compute advantages Ψ (PLUGGABLE)
  • Compute policy loss using Ψ (PLUGGABLE)
  • Gradient step on policy (+ optional value net)
  • Discard data, repeat (on-policy)
| Variant | Advantage Ψ | Policy loss |
| --- | --- | --- |
| REINFORCE | G_t (full return) | −log π · G_t |
| REINFORCE + baseline | G_t − V(s) | −log π · (G_t − V) |
| A2C | GAE(δ_t) | −log π · A_GAE |
| PPO (clip) | GAE(δ_t) | −min(r·A, clip(r)·A) |
| Variant | Problem solved | Intuition for solution |
| --- | --- | --- |
| REINFORCE | Need a model-free way to optimise a policy directly, without learning Q-values | Use the score-function trick: ∇J ≈ E[G·∇log π]. Sample trajectories and weight each action by how much total reward followed it |
| REINFORCE + baseline | Raw returns have high variance: even good actions get noisy credit because the whole trajectory’s reward is lumped together | Subtract V(s), a learned estimate of the average return from state s. The signal becomes “better or worse than expected” instead of “good or bad in absolute terms” |
| A2C | Monte Carlo returns still have high variance, require complete episodes, and rule out incremental updates | Use GAE: a weighted blend of multi-step TD advantages. λ controls the bias-variance tradeoff, and updates can happen every N steps without waiting for episode ends |
| PPO (clip) | Large policy updates are catastrophic (performance collapses and doesn’t recover), and vanilla PG wastes data: one update, then discard | Reuse rollout data for K epochs of minibatch updates (more sample-efficient), but clip the probability ratio π/π_old to [1−ε, 1+ε] so no single update can change the policy too much |
  • Q-learning is OFF-POLICY: can reuse old data from a replay buffer
  • Policy gradient is ON-POLICY: must collect fresh data each iteration (PPO stretches this with multiple epochs, but still discards after)
  • Q-learning learns a value, derives the policy (argmax Q)
  • Policy gradient directly optimises the policy parameters
  • Q-learning is natural for discrete actions
  • Policy gradient handles continuous actions naturally