Entropy Regularisation
Adds the entropy of the policy to the reinforcement learning objective: J(π) = E[Σ_t r(s_t, a_t) + α H(π(·|s_t))]. Encourages exploration by preventing the policy from collapsing to a deterministic action too early. The defining component of SAC and a key ingredient in A2C/PPO.
Intuition
Without entropy regularisation, a policy gradient agent that discovers one good action will immediately exploit it — putting all probability on that action and never trying alternatives. This is premature convergence: the agent gets stuck in a local optimum because it stopped exploring before finding better strategies.
Entropy regularisation adds a bonus for randomness. A policy that spreads probability across many actions has high entropy and gets rewarded for it. The temperature parameter α controls the tradeoff: high α means “explore a lot, even if it costs some reward,” low α means “mostly exploit, with a small exploration nudge.” The optimal policy under entropy regularisation is the Boltzmann (softmax) distribution over Q-values — actions with higher value get more probability, but no action gets zero probability.
This has a secondary benefit: entropy regularisation makes the optimisation landscape smoother. A deterministic policy has zero-volume support (a single action), making gradients noisy and the landscape spiky. A stochastic policy spreads probability mass, creating smoother gradients and more stable training. SAC exploits this to achieve the sample efficiency of off-policy methods with the stability of stochastic policies.
Entropy of a discrete policy: H(π(·|s)) = -Σ_a π(a|s) log π(a|s)
Maximum entropy is log |A| (uniform policy). Minimum is 0 (deterministic policy).
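Both bounds are easy to verify numerically. A minimal NumPy sketch (the helper `entropy` is illustrative, not part of the article's code):

```python
import numpy as np

def entropy(p):
    """Entropy of a categorical distribution (natural log), treating 0*log 0 as 0."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))

n = 4
uniform = np.full(n, 1 / n)
deterministic = np.array([1.0, 0.0, 0.0, 0.0])

print(entropy(uniform), np.log(n))  # both ≈ 1.386: uniform hits the log|A| maximum
print(entropy(deterministic))       # 0.0: deterministic policy has no entropy
```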
Entropy of a continuous Gaussian policy π(·|s) = N(μ(s), diag(σ(s)²)): H(π(·|s)) = (d/2) log(2πe) + Σ_i log σ_i(s)
Only depends on σ, not μ. Larger variance = higher entropy.
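One way to check this formula is a Monte Carlo estimate of H = -E[log p(x)]; the mean μ cancels inside the expectation, which is why it never appears in the result. A small NumPy sketch, assuming a diagonal Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([3.0, -1.0])      # the mean drops out of the entropy
sigma = np.array([0.5, 2.0])
d = sigma.size

# Closed form: H = d/2 * log(2*pi*e) + sum_i log sigma_i
analytic = 0.5 * d * np.log(2 * np.pi * np.e) + np.log(sigma).sum()

# Monte Carlo: H = -E[log p(x)] under x ~ N(mu, diag(sigma^2))
x = rng.normal(mu, sigma, size=(200_000, d))
log_p = (-0.5 * ((x - mu) / sigma) ** 2
         - np.log(sigma) - 0.5 * np.log(2 * np.pi)).sum(axis=1)
mc = -log_p.mean()
print(analytic, mc)  # agree to roughly two decimal places
```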
Maximum-entropy RL objective (SAC): J(π) = Σ_t E_{(s_t, a_t) ~ ρ_π}[r(s_t, a_t) + α H(π(·|s_t))]
Soft Bellman equation (Q includes entropy): Q(s, a) = r(s, a) + γ E_{s'}[V(s')], where V(s) = E_{a~π}[Q(s, a) - α log π(a|s)]
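For discrete actions, plugging the optimal (Boltzmann) policy into the soft value gives the closed form V(s) = α log Σ_a exp(Q(s, a)/α), a temperature-scaled log-sum-exp. A minimal NumPy sketch (the Q-values are made up for illustration):

```python
import numpy as np

def soft_value(q, alpha):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s,a)/alpha).

    Equals E_pi[Q - alpha * log pi] under the Boltzmann policy;
    computed stably by subtracting the max before exponentiating.
    """
    z = q / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

q = np.array([1.0, 2.0, 3.0])
print(soft_value(q, alpha=1.0))   # > max(q): the entropy bonus inflates the value
print(soft_value(q, alpha=0.01))  # ≈ 3.0: the hard max is recovered as alpha -> 0
```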
Optimal policy (soft policy improvement): π*(a|s) ∝ exp(Q(s, a)/α)
This is a Boltzmann distribution — actions are chosen proportionally to exponentiated Q-values.
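A minimal sketch of this softmax-over-Q policy, showing how the temperature α trades exploration against exploitation (Q-values are made up for illustration):

```python
import numpy as np

def soft_policy(q, alpha):
    """Boltzmann policy pi(a|s) proportional to exp(Q(s,a)/alpha) (stable softmax)."""
    z = q / alpha
    z -= z.max()          # shift for numerical stability; cancels in the ratio
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 3.0])
print(soft_policy(q, alpha=5.0))  # high temperature: probability spread widely
print(soft_policy(q, alpha=0.1))  # low temperature: nearly greedy on argmax Q
# At any finite alpha, every action keeps nonzero probability.
```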
```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, Normal

# ── Discrete policy (A2C-style) ────────────────────────────────
logits = policy_net(state)            # (B, n_actions)
dist = Categorical(logits=logits)
action = dist.sample()                # (B,)
log_prob = dist.log_prob(action)      # (B,)
entropy = dist.entropy()              # (B,)

# Add entropy bonus to the policy loss
# Note: we SUBTRACT because we minimise loss but want to MAXIMISE entropy
policy_loss = -(log_prob * advantage).mean()
loss = policy_loss - alpha * entropy.mean()

# ── Continuous policy (SAC-style) ──────────────────────────────
mu, log_std = policy_net(state).chunk(2, dim=-1)  # (B, d), (B, d)
std = log_std.clamp(-20, 2).exp()                 # (B, d)
dist = Normal(mu, std)
action = dist.rsample()                           # (B, d) — reparameterised

# SAC uses log_prob directly instead of a separate entropy term
log_prob = dist.log_prob(action).sum(dim=-1)      # (B,)

# For tanh-squashed actions (standard in SAC):
squashed = torch.tanh(action)                     # (B, d) — bounded to [-1, 1]
# Jacobian correction for the tanh transform
log_prob -= torch.log(1 - squashed.pow(2) + 1e-6).sum(dim=-1)  # (B,)

# WARNING: The tanh correction is easy to forget and causes silent
# training instability. Always include it when using squashed actions.

# ── Automatic temperature tuning (SAC) ─────────────────────────
# Learn alpha to hit a target entropy (typically -dim(A))
log_alpha = torch.zeros(1, requires_grad=True)
alpha = log_alpha.exp()
alpha_loss = -(alpha * (log_prob + target_entropy).detach()).mean()
```
Manual Implementation
```python
import numpy as np

def discrete_entropy(logits):
    """
    Entropy of a categorical distribution from logits.
    logits: (B, n_actions) raw scores
    """
    # Stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)                      # (B, A)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # (B, A)
    probs = np.exp(log_probs)                                                 # (B, A)
    return -(probs * log_probs).sum(axis=1)                                   # (B,)

def gaussian_entropy(log_std):
    """
    Entropy of a diagonal Gaussian.
    log_std: (B, d) log standard deviations
    """
    d = log_std.shape[1]
    return 0.5 * d * np.log(2 * np.pi * np.e) + log_std.sum(axis=1)  # (B,)

def entropy_regularised_loss(log_probs, advantages, entropy, alpha=0.01):
    """
    A2C-style policy loss with entropy bonus.
    log_probs:  (B,) log-prob of taken action
    advantages: (B,) advantage estimates
    entropy:    (B,) policy entropy
    """
    policy_loss = -(log_probs * advantages).mean()
    return policy_loss - alpha * entropy.mean()
```
Popular Uses
- SAC (Soft Actor-Critic): entropy regularisation is the core idea — the entire algorithm is built around maximum-entropy RL, with automatic temperature tuning
- A2C / A3C: adds α H(π) to the policy loss to prevent premature convergence; typically α ≈ 0.01
- PPO (in practice): many PPO implementations include an entropy bonus even though the original paper doesn’t emphasise it
- SQL (Soft Q-Learning): predecessor to SAC, uses entropy-augmented Bellman equation with energy-based policies
- Exploration in sparse-reward environments: entropy bonus keeps the agent exploring when rewards are rare
- AlphaGo / AlphaZero: MCTS exploration bonus serves a similar role to entropy regularisation in encouraging diverse action selection
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Epsilon-greedy | Simple discrete action spaces (DQN) | No gradient signal for exploration; abrupt transition from random to greedy |
| Boltzmann exploration | Discrete actions, temperature-based | Achieves similar effect to entropy regularisation but applied post-hoc to Q-values |
| Curiosity / ICM | Sparse reward, large state spaces | Intrinsic reward from prediction error; complements entropy regularisation |
| Noise injection (NoisyNet) | When you want learned exploration | Adds parametric noise to weights; exploration adapts per-state without explicit entropy |
| Count-based exploration | Tabular or small state spaces | Bonuses for visiting novel states; doesn’t scale well to continuous spaces |
Historical Context
Entropy regularisation in RL traces back to Ziebart et al. (2008, “Maximum Entropy Inverse Reinforcement Learning”), who formalised maximum-entropy decision-making. The idea entered deep RL through A3C (Mnih et al. 2016), which added an entropy bonus to stabilise policy gradient training — a small but critical detail that many practitioners discovered empirically was necessary.
SAC (Haarnoja et al. 2018, “Soft Actor-Critic”) elevated entropy regularisation from a training trick to a first-class design principle. By building the entire algorithm around maximum-entropy RL — including entropy-augmented Bellman equations and automatic temperature tuning — SAC achieved state-of-the-art sample efficiency and robustness in continuous control. The automatic temperature mechanism (α learned to target a desired entropy) was particularly important: it eliminated a sensitive hyperparameter and let the agent naturally transition from exploration to exploitation as training progressed.