Epsilon-Greedy Exploration
With probability epsilon take a random action, otherwise take the greedy (highest Q-value) action. The simplest exploration strategy for discrete action spaces, used in DQN and most tabular RL. Usually epsilon decays from a high value (1.0) to a small floor (0.01-0.1) over the course of training.
Intuition
Imagine you always eat at the restaurant you think is best. You’ll never discover the new place around the corner that might be even better. Epsilon-greedy fixes this by occasionally (with probability epsilon) choosing a random restaurant instead. Early in training, epsilon is high — you explore aggressively because your estimates are unreliable anyway. As you learn, epsilon decays — you exploit your knowledge more, but never stop exploring entirely.
The key property is simplicity: it requires no learned exploration model, no uncertainty estimates, no special architecture. You compute Q-values, pick the best one most of the time, and randomly deviate the rest. This is “dithering” exploration — completely undirected randomness. It works surprisingly well in practice for discrete action spaces, especially when combined with a replay buffer that re-uses past exploratory experiences.
The limitation is equally clear: epsilon-greedy explores uniformly over actions, not intelligently. In a 1000-action space, the random action is almost certainly useless. For continuous action spaces, random uniform noise is even worse — methods like Gaussian noise (DDPG) or entropy bonuses (SAC) are strictly better. But for Atari-scale discrete action spaces (4-18 actions), epsilon-greedy remains the standard.
Action selection:

$$
a_t = \begin{cases} \text{uniform random action from } \mathcal{A} & \text{with probability } \varepsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1 - \varepsilon \end{cases}
$$

Linear epsilon decay (most common):

$$
\varepsilon_t = \max\!\left(\varepsilon_{\text{end}},\; \varepsilon_{\text{start}} - t \cdot \frac{\varepsilon_{\text{start}} - \varepsilon_{\text{end}}}{T}\right)
$$

Typical values: $\varepsilon_{\text{start}} = 1.0$, $\varepsilon_{\text{end}} = 0.01$, $T = 100{,}000$ steps.

Probability of each action under epsilon-greedy:

$$
\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[4pt] \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}
$$

This means even the greedy action gets a small boost from the random component.
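The per-action probability formula above can be checked directly. A minimal numpy sketch (function name is illustrative, not from any library):

```python
import numpy as np

def action_probabilities(q_values, epsilon):
    """Probability of each action under epsilon-greedy.

    Every action receives epsilon / n_actions from the uniform random
    component; the greedy action additionally receives the full
    (1 - epsilon) mass, so it gets 1 - epsilon + epsilon / n_actions.
    """
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

probs = action_probabilities(np.array([0.1, 0.5, 0.2, 0.2]), epsilon=0.1)
# greedy action (index 1) gets 0.925; each other action gets 0.025
```

Note that the distribution always sums to 1, and the greedy action's probability never drops below 1/n even at epsilon = 1.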
```python
import torch
import random

def epsilon_greedy_action(q_values, epsilon):
    """
    Select action using epsilon-greedy.
    q_values: (n_actions,) Q-values for current state
    epsilon: float, exploration probability
    Returns: int, selected action index
    """
    if random.random() < epsilon:
        return random.randrange(q_values.shape[0])  # random action
    else:
        return q_values.argmax().item()             # greedy action

# ── Batched version (for vectorised envs) ───────────────────────
def epsilon_greedy_batch(q_values, epsilon):
    """
    q_values: (B, n_actions)
    Returns: (B,) action indices
    """
    B = q_values.shape[0]
    greedy = q_values.argmax(dim=1)                         # (B,)
    random_actions = torch.randint(0, q_values.shape[1], (B,))
    mask = (torch.rand(B) < epsilon)                        # (B,) True = explore
    return torch.where(mask, random_actions, greedy)

# ── Linear decay schedule ───────────────────────────────────────
def get_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=100_000):
    return max(eps_end, eps_start - step * (eps_start - eps_end) / decay_steps)
```

Manual Implementation
```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """
    Epsilon-greedy action selection, pure numpy.
    q_values: (n_actions,) estimated Q-values for current state
    epsilon: float in [0, 1], exploration probability
    rng: optional numpy random generator for reproducibility
    """
    rng = rng or np.random.default_rng()
    n_actions = len(q_values)

    if rng.random() < epsilon:
        return rng.integers(n_actions)  # uniform random action
    else:
        # Break ties randomly (important early in training when Q ≈ 0)
        max_q = q_values.max()
        best_actions = np.where(q_values == max_q)[0]  # all actions with max Q
        return rng.choice(best_actions)

def linear_decay(step, start=1.0, end=0.01, decay_steps=100_000):
    """Linear epsilon decay, clamped to [end, start]."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)

def exponential_decay(step, start=1.0, end=0.01, half_life=10_000):
    """Exponential decay: epsilon halves every half_life steps."""
    return max(end, start * (0.5 ** (step / half_life)))
```

Popular Uses
- DQN (see q-learning/): the original DQN paper uses epsilon-greedy with linear decay from 1.0 to 0.1 over 1M frames, then holds at 0.1
- Double DQN, Dueling DQN: all DQN variants use the same epsilon-greedy exploration
- Tabular Q-learning and SARSA: the classic RL textbook algorithms (Sutton & Barto) use epsilon-greedy as the default exploration strategy
- CQL (see q-learning/): offline RL — during the original data collection phase, epsilon-greedy with high epsilon generates the exploratory dataset
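The tabular use case above is easy to make concrete. A hedged sketch of one Q-learning episode driven by epsilon-greedy, where `env_step` and `env_reset` are hypothetical callables standing in for any small discrete environment:

```python
import numpy as np

def q_learning_episode(Q, env_step, env_reset, epsilon, alpha=0.1, gamma=0.99,
                       rng=None, max_steps=200):
    """One episode of tabular Q-learning with epsilon-greedy behaviour.

    Q:         (n_states, n_actions) table, updated in place
    env_step:  hypothetical callable (state, action) -> (next_state, reward, done)
    env_reset: hypothetical callable () -> initial state index
    """
    rng = rng or np.random.default_rng()
    s = env_reset()
    for _ in range(max_steps):
        # epsilon-greedy behaviour policy
        if rng.random() < epsilon:
            a = int(rng.integers(Q.shape[1]))
        else:
            a = int(Q[s].argmax())
        s_next, r, done = env_step(s, a)
        # Q-learning target bootstraps from the *greedy* next-state value,
        # regardless of which action the behaviour policy actually takes
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        if done:
            break
        s = s_next
    return Q
```

Because the update bootstraps from max Q rather than the action actually taken, Q-learning learns the greedy policy's values even while behaving epsilon-greedily (off-policy learning); SARSA would instead bootstrap from the epsilon-greedy action.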
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Boltzmann (softmax) exploration | Want action probabilities proportional to Q-values | More directed than uniform random; temperature is another hyperparameter |
| Gaussian noise (DDPG, TD3) | Continuous action spaces | Adds N(0, sigma) to continuous actions; not applicable to discrete spaces |
| Entropy bonus (SAC) | Want principled exploration in continuous spaces | Maximises entropy alongside reward; requires modified objective |
| UCB (Upper Confidence Bound) | Bandit problems, tree search (MCTS) | Explores actions with high uncertainty; needs visit counts or uncertainty estimates |
| Noisy Networks (NoisyNet) | Want state-dependent exploration | Learned noise in network weights; replaces epsilon entirely; more compute |
| Intrinsic motivation (RND, ICM) | Sparse reward environments | Explores based on novelty/surprise; adds complexity and a second model |
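For contrast with epsilon-greedy's uniform dithering, here is a minimal sketch of the first alternative in the table, Boltzmann (softmax) exploration; the function name and defaults are illustrative:

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Softmax (Boltzmann) exploration: P(a) ∝ exp(Q(a) / temperature).

    Unlike epsilon-greedy, exploration is directed: actions with higher
    Q-values are sampled more often. High temperature approaches uniform
    random; low temperature approaches greedy.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()        # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

The price is the extra `temperature` hyperparameter, which often needs its own decay schedule, mirroring epsilon decay.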
Historical Context
Epsilon-greedy is one of the oldest exploration strategies, originating from the multi-armed bandit literature in the 1950s-60s (Robbins, 1952). It was adopted into RL through Watkins’ Q-learning (1989) and became the standard exploration method in the tabular RL era.
The DQN paper (Mnih et al., 2015) used epsilon-greedy for Atari games and cemented it as the default for deep Q-learning. Despite being theoretically suboptimal (it explores uniformly rather than intelligently), its simplicity and reliability kept it dominant. More sophisticated methods like NoisyNets (Fortunato et al., 2018) and intrinsic motivation approaches exist but are rarely worth the added complexity for standard benchmarks. The trend in modern RL is toward entropy-based exploration (SAC) for continuous control, while epsilon-greedy remains standard for discrete action spaces.