Exploration-Exploitation Tradeoff

Should the agent exploit its current best-known action, or explore alternatives that might be better? Exploiting too early locks in suboptimal behaviour; exploring too long wastes time on known-bad options. Originating in multi-armed bandits and control theory, this is the foundational decision-making dilemma in RL.

You’re in a new city choosing restaurants. You found a decent place on day 1. Do you keep going there (exploit), or try somewhere new (explore)? If you only exploit, you’ll never find the amazing place around the corner. If you only explore, you’ll eat badly most nights even though you already know a good option. The optimal strategy depends on how long you’re staying: a one-week visit favours exploitation; a one-year stay rewards exploration.

The problem is that exploration has uncertain, delayed payoff — you might try 5 bad restaurants before finding a great one, and you can’t know in advance which exploration will pay off. Exploitation has immediate, known payoff. Humans and agents both have a bias toward exploitation because the reward is tangible and immediate.

In deep RL, the problem is compounded by the function approximator: the agent’s value estimates for unexplored actions are unreliable, so it can’t even accurately judge whether exploration is likely to be worthwhile. It has to explore partly to learn which actions are worth exploring.

Symptoms of under-exploration:

  • The agent converges to a mediocre policy and stops improving — it found a local optimum and has no mechanism to escape
  • Reward curves plateau early at a level well below the theoretical optimum
  • State visitation is narrow — the agent visits a small subset of the state space and never discovers rewarding regions
  • Performance varies dramatically across random seeds — some seeds happen to explore the right regions early, others get stuck
  • Sparse-reward environments are nearly impossible without explicit exploration mechanisms — the agent never stumbles onto the reward

How related methods handle the tradeoff:
  • Q-learning (q-learning/): ε-greedy exploration is the standard baseline — take a random action with probability ε. Simple but crude; doesn’t direct exploration toward informative states
  • Policy gradient (policy-gradient/): SAC adds entropy regularisation to the objective, rewarding the policy for maintaining randomness — this provides smooth, principled exploration proportional to uncertainty
  • Contrastive learning (contrastive-self-supervising/): not RL, but data augmentation is analogous to exploration — diverse augmentations “explore” the space of invariances the model should learn
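As a minimal sketch of ε-greedy selection (the function name and the toy Q-values are illustrative, not from any particular library):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: ignore estimates entirely
    return int(np.argmax(q_values))              # exploit: current best estimate

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.3])
actions = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
# with ε = 0.1 the greedy action (index 1) dominates, but every action keeps
# a nonzero probability of being tried
```

Note the crudeness the bullet above describes: the random branch is uniform over all actions, so a catastrophically bad action is explored exactly as often as a promising-but-uncertain one.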
| Solution | Mechanism | Where documented |
|---|---|---|
| ε-greedy | Random action with probability ε, greedy otherwise | atomic-concepts/rl-specific/epsilon-greedy-exploration.md |
| Entropy regularisation (SAC) | Add an entropy bonus αH(π) to the reward; the policy is rewarded for being stochastic | atomic-concepts/regularisation/entropy-regularisation.md |
| UCB (Upper Confidence Bound) | Choose actions by value estimate plus an uncertainty bonus | (bandits literature) |
| Intrinsic motivation / curiosity | Reward the agent for visiting novel or surprising states | (Pathak et al., 2017) |
| Thompson sampling | Sample from the posterior over values; explore naturally through posterior uncertainty | (Bayesian methods) |
| Boltzmann exploration | Sample actions proportional to exponentiated Q-values (softmax policy) | atomic-concepts/activation-functions/softmax.md |
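The UCB mechanism can be sketched on a stateless bandit. This is an illustrative UCB1-style implementation, assuming Gaussian rewards; the arm means, the exploration constant `c`, and all names are made up for the example:

```python
import math
import numpy as np

def ucb_select(counts, values, t, c=2.0):
    """UCB: value estimate plus an uncertainty bonus that shrinks as an arm is visited."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # untried arms have unbounded uncertainty: try them first
    scores = [values[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(len(counts))]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
true_means = [0.2, 0.8, 0.5]          # arm 1 pays best on average
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]
for t in range(1, 2001):
    a = ucb_select(counts, values, t)
    reward = rng.normal(true_means[a], 0.1)
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]  # incremental mean update
# pulls concentrate on arm 1: its bonus shrinks, but the others' estimates
# never catch up, so exploration of them tapers off automatically
```

Unlike ε-greedy, the exploration here is directed: an arm is revisited only while its uncertainty bonus could plausibly close the gap to the current best.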

The exploration-exploitation tradeoff was formalised in the multi-armed bandit problem (Robbins, 1952), where it has an elegant theoretical treatment — the Gittins index (1979) gives the optimal solution for discounted bandits. In full RL (sequential decisions with state), the problem becomes intractable to solve optimally, and practical solutions are all heuristics. ε-greedy (Watkins, 1989) is the simplest. Entropy regularisation entered deep RL through SAC (Haarnoja et al., 2018), which provided a principled framework for treating exploration as part of the objective rather than a bolt-on heuristic. The shift from ε-greedy to entropy regularisation mirrors the broader trend in deep learning: replace discrete heuristics with differentiable objectives.
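The entropy-regularised objective can be illustrated on a discrete action distribution. This is a sketch of the idea only, not SAC's actual update; `alpha` and the example probabilities are arbitrary:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(π) of a discrete action distribution."""
    p = np.clip(probs, 1e-12, 1.0)  # guard against log(0)
    return -np.sum(p * np.log(p))

def entropy_regularised_reward(reward, action_probs, alpha=0.1):
    """SAC-style shaping: environment reward plus an entropy bonus α·H(π)."""
    return reward + alpha * entropy(action_probs)

uniform = np.array([0.25, 0.25, 0.25, 0.25])  # maximally stochastic policy
peaked = np.array([0.97, 0.01, 0.01, 0.01])   # nearly deterministic policy
```

Because the bonus is part of a differentiable objective rather than a sampling rule, the policy is paid for staying stochastic and commits to an action only when the environment reward outweighs the entropy it gives up.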