# Log-Derivative Trick

The identity that lets you estimate gradients of expectations by sampling, without differentiating through the sampling process itself. Also called the REINFORCE trick or the score-function estimator. The foundation of all policy gradient methods (see policy-gradient/).
## Intuition

Suppose you want to optimise $J(\theta) = \mathbb{E}_{x \sim p_\theta}[f(x)]$ — the expected value of some reward $f$ under a distribution $p_\theta$ you control. The problem: you can’t backprop through the act of sampling. Sampling is a discrete, non-differentiable operation — you rolled a die and got a 4, and there’s no gradient of “rolling a 4” with respect to the die’s probabilities.
The log-derivative trick sidesteps this entirely. Instead of differentiating through the sample, it says: “keep the sample fixed, and ask how much more likely that sample would become if you nudged the parameters.” If a high-reward sample would become more likely with a small parameter change, that’s a good direction. The gradient $\nabla_\theta \log p_\theta(x)$ is exactly this “how to make $x$ more likely” direction, and $f(x)$ weights it by how good that sample was.
The cost: high variance. You’re estimating a gradient from individual samples, and $f(x)\,\nabla_\theta \log p_\theta(x)$ can be noisy. This is why every practical algorithm (A2C, PPO) adds a baseline $b$ to form $(f(x) - b)\,\nabla_\theta \log p_\theta(x)$ — the baseline doesn’t change the expected gradient but dramatically reduces variance.
**The core identity** — from the chain rule applied to $\log p_\theta(x)$:

$$\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)} \quad\Longleftrightarrow\quad \nabla_\theta p_\theta(x) = p_\theta(x)\,\nabla_\theta \log p_\theta(x)$$
**Gradient of an expectation** — substitute the identity into $\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \int f(x)\,\nabla_\theta p_\theta(x)\,dx$:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}\!\big[f(x)\,\nabla_\theta \log p_\theta(x)\big]$$
This expectation can be estimated by Monte Carlo sampling — draw $x_1, \dots, x_N \sim p_\theta$ and average:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\,\nabla_\theta \log p_\theta(x_i)$$
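This Monte Carlo estimate can be checked numerically. A minimal sketch on a toy 3-outcome categorical (the logits and rewards below are made up for illustration), comparing the sampled estimate against the analytic gradient $\partial_{\theta_j}\mathbb{E}[f] = p_j(f_j - \mathbb{E}[f])$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: categorical over 3 outcomes, parameterised by logits theta.
theta = np.array([0.2, -0.5, 1.0])
f = np.array([1.0, 2.0, 3.0])          # reward for each outcome

probs = np.exp(theta - theta.max())
probs /= probs.sum()

# Analytic gradient of E[f] w.r.t. the logits: p_j * (f_j - E[f])
expected_f = probs @ f
true_grad = probs * (f - expected_f)

# Score-function estimate: average f(x) * ∇θ log p(x) over samples.
# For a categorical, ∇θ log p(x = i) = one_hot(i) - probs.
n = 200_000
samples = rng.choice(3, size=n, p=probs)
one_hot = np.eye(3)[samples]
est_grad = (f[samples, None] * (one_hot - probs)).mean(axis=0)

print(true_grad, est_grad)   # the two agree to roughly 2 decimal places
```

No gradient of the sampling step is ever taken — each sample only contributes its fixed score vector `one_hot - probs`, weighted by its reward.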
**With baseline** (variance reduction; does not change the expectation because $\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = \nabla_\theta \int p_\theta(x)\,dx = \nabla_\theta 1 = 0$):

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}\!\big[(f(x) - b)\,\nabla_\theta \log p_\theta(x)\big]$$
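That a baseline leaves the mean unchanged while cutting variance is easy to verify empirically. A self-contained sketch with made-up logits and rewards, comparing per-sample variance with and without a mean-reward baseline:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.array([0.2, -0.5, 1.0])     # toy logits
f = np.array([1.0, 2.0, 3.0])          # reward per outcome
probs = np.exp(theta - theta.max())
probs /= probs.sum()

n = 100_000
samples = rng.choice(3, size=n, p=probs)
score = np.eye(3)[samples] - probs     # ∇θ log p(x), one row per sample

grad_no_baseline = f[samples, None] * score
b = probs @ f                          # baseline = E[f]
grad_baseline = (f[samples] - b)[:, None] * score

# Same mean (the baseline term has zero expectation)...
print(grad_no_baseline.mean(axis=0), grad_baseline.mean(axis=0))
# ...but a much smaller per-sample variance.
print(grad_no_baseline.var(axis=0).sum(), grad_baseline.var(axis=0).sum())
```

Subtracting $b = \mathbb{E}[f]$ is the simplest choice; practical algorithms learn a state-dependent baseline (a value function) for the same effect.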
**Policy gradient specialisation** — $p_\theta$ is a policy $\pi_\theta(a \mid s)$, $f$ is the return $G_t$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\big[G_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$$
```python
import torch
import torch.nn.functional as F

# ── Policy gradient using the log-derivative trick ──────────────
# The key line: log_prob * advantage gives a surrogate loss whose
# gradient equals the policy gradient. We NEVER differentiate
# through the sampling — actions are treated as fixed constants.

logits = policy_net(states)               # (B, n_actions)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                   # (B,) — no gradient here
log_probs = dist.log_prob(actions)        # (B,) — gradient flows through logits

# advantages: (B,) — precomputed, detached from the graph
surrogate_loss = -(log_probs * advantages).mean()  # scalar
surrogate_loss.backward()                 # ∇θ matches the policy gradient

# WARNING: the negative sign is essential — we MAXIMISE expected reward
# by MINIMISING negative log_prob * advantage.

# ── With entropy bonus (encourages exploration) ─────────────────
entropy = dist.entropy().mean()           # scalar
loss = -(log_probs * advantages).mean() - 0.01 * entropy
```
## Manual Implementation

```python
import numpy as np

def reinforce_gradient(logits, actions, rewards, baseline=0.0):
    """
    Compute the REINFORCE policy gradient estimate.

    logits:   (B, A) raw scores from policy network
    actions:  (B,)   integer actions that were taken
    rewards:  (B,)   scalar rewards received
    baseline: scalar or (B,) baseline for variance reduction

    Returns: (B, A) gradient of logits — the surrogate loss gradient
    """
    B, A = logits.shape

    # Softmax: convert logits to action probabilities
    shifted = logits - logits.max(axis=1, keepdims=True)        # (B, A) numerical stability
    exp_logits = np.exp(shifted)                                # (B, A)
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)  # (B, A)

    # Log-prob of the taken action
    log_probs = np.log(probs[np.arange(B), actions] + 1e-8)     # (B,)

    # ∇logits log π(a|s) for a categorical distribution:
    #   = one_hot(a) - probs   (the "softmax gradient" identity)
    one_hot = np.zeros_like(logits)                             # (B, A)
    one_hot[np.arange(B), actions] = 1.0
    grad_log_prob = one_hot - probs                             # (B, A)

    # Weight by advantage: (reward - baseline)
    advantage = (rewards - baseline)[:, None]                   # (B, 1)
    policy_gradient = advantage * grad_log_prob                 # (B, A)

    return policy_gradient  # average over B for the batch gradient
```
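The `one_hot - probs` step is the only non-obvious line, and it can be checked against finite differences of the log-probability. A standalone sketch (5 actions, arbitrary logits, re-deriving the log-softmax inline):

```python
import numpy as np

# Finite-difference check of the softmax gradient identity:
#   ∇_logits log softmax(logits)[a] = one_hot(a) - softmax(logits)
rng = np.random.default_rng(0)
logits = rng.normal(size=5)
a = 2
eps = 1e-6

def log_prob(z, a):
    z = z - z.max()                      # numerical stability
    return z[a] - np.log(np.exp(z).sum())

probs = np.exp(logits - logits.max())
probs /= probs.sum()
analytic = np.eye(5)[a] - probs

numeric = np.zeros(5)
for j in range(5):
    zp, zm = logits.copy(), logits.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[j] = (log_prob(zp, a) - log_prob(zm, a)) / (2 * eps)

print(np.abs(analytic - numeric).max())  # effectively zero
```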
## Popular Uses

- REINFORCE and all policy gradient methods (A2C, PPO): the log-derivative trick IS the policy gradient theorem (see policy-gradient/)
- Variational inference (original VAE gradient estimator): before the reparameterisation trick, REINFORCE was used to estimate $\nabla_\phi \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p(x \mid z)]$ — it works but has high variance
- Discrete latent variable models: any model with discrete sampling (hard attention, discrete VAE) must use score-function estimators, since you can’t reparameterise discrete distributions
- Black-box optimisation / evolution strategies (OpenAI ES): estimate gradients of non-differentiable objectives by perturbing parameters and weighting by fitness
- Neural architecture search (ENAS): policy gradient over discrete architecture decisions
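The evolution-strategies use is the same estimator applied to Gaussian perturbations of the parameters themselves: $\nabla_\theta \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}[f(\theta + \sigma\varepsilon)] = \tfrac{1}{\sigma}\mathbb{E}[f(\theta + \sigma\varepsilon)\,\varepsilon]$. A toy sketch (the quadratic fitness and all hyperparameters are made up for illustration) maximising a black-box objective:

```python
import numpy as np

# ES gradient estimate: mean(fitness * ε) / σ, with a mean-fitness baseline.
# Toy objective: maximise f(θ) = -||θ - target||², treated as a black box.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])
f = lambda th: -np.sum((th - target) ** 2)

theta, sigma, lr = np.zeros(3), 0.1, 0.05
for _ in range(300):
    eps = rng.normal(size=(50, 3))                   # 50 perturbations
    fitness = np.array([f(theta + sigma * e) for e in eps])
    fitness -= fitness.mean()                        # baseline, as above
    grad = (fitness[:, None] * eps).mean(axis=0) / sigma
    theta += lr * grad                               # gradient ascent

print(theta)   # approaches target
```

Note that `f` is never differentiated — only evaluated — which is why the same recipe works for genuinely non-differentiable objectives.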
## Alternatives

| Alternative | When to use | Tradeoff |
|---|---|---|
| Reparameterisation trick | Continuous latent variables (VAEs, normalising flows) | Much lower variance but requires a differentiable sampling path — doesn’t work for discrete distributions |
| Gumbel-softmax | Discrete variables where you want low-variance gradients | Continuous relaxation introduces bias; requires a temperature schedule |
| Straight-through estimator | Discrete forward pass with simple gradient approximation (VQ-VAE) | Biased gradient, but zero variance and dead simple to implement |
| Pathwise derivative | Deterministic functions of random inputs | Same idea as reparameterisation — only works when you can express sampling as a deterministic transform of fixed noise |
| Evolution strategies | Non-differentiable objectives, parallel hardware | Scales to massive parallelism but needs many more samples than REINFORCE |
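The variance gap between the score-function and reparameterisation rows can be seen directly on a Gaussian toy problem. A sketch, assuming $f(x) = x^2$ and differentiating $\mathbb{E}_{x \sim \mathcal{N}(\mu,\sigma^2)}[x^2] = \mu^2 + \sigma^2$ with respect to $\mu$ (true gradient $2\mu$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 100_000
eps = rng.normal(size=n)
x = mu + sigma * eps                       # reparameterised sample

score_est = x**2 * (x - mu) / sigma**2     # f(x) · ∇μ log p(x)
reparam_est = 2 * (mu + sigma * eps)       # ∇μ f(μ + σε), pathwise

print(score_est.mean(), reparam_est.mean())   # both ≈ 3.0 (= 2μ)
print(score_est.var(), reparam_est.var())     # score variance is far larger
```

Both estimators are unbiased; the pathwise one just exploits the differentiable sampling path $x = \mu + \sigma\varepsilon$, which is exactly what discrete distributions lack.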
## Historical Context

The identity $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$ is elementary calculus, but its use for gradient estimation was formalised by Williams (1992) in the REINFORCE paper, which showed how to train stochastic neural networks by sampling. The key insight — that you could optimise expectations without differentiating through the sampling — opened the door to reinforcement learning with function approximation.
The trick was independently known in statistics as the “score function method” and in operations research as the “likelihood ratio method.” Its resurgence in deep learning came through policy gradient methods (Sutton et al., 1999) and later through variational inference (Wingate & Weber, 2013), though in the VAE setting it was quickly superseded by the lower-variance reparameterisation trick (Kingma & Welling, 2014).