Discount Factor (Gamma, γ)

Weights future rewards exponentially, making near-term rewards worth more than distant ones. The single scalar $\gamma \in [0, 1)$ defines what “long-term” means for an RL agent — it appears in every value function, every target computation, and every return estimate. Arguably the most important hyperparameter in RL.

A dollar today is worth more than a dollar next year. Discounting in RL works the same way: a reward $k$ steps in the future is worth $\gamma^k$ times its face value. With $\gamma = 0.99$, a reward 100 steps away is worth $0.99^{100} \approx 0.37$ — significant but diminished. With $\gamma = 0.9$, that same reward is worth $0.9^{100} \approx 0.00003$ — effectively invisible.

The effective horizon — how far ahead the agent meaningfully plans — is approximately $1/(1-\gamma)$. So $\gamma = 0.99$ gives a horizon of ~100 steps, $\gamma = 0.999$ gives ~1000 steps, and $\gamma = 0$ makes the agent completely myopic (only cares about immediate reward).
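These horizon numbers are easy to check numerically; a quick sketch:

```python
# Compare the 1/(1-gamma) horizon heuristic with the actual weight
# a reward 100 steps away receives under each discount factor.
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)
    w100 = gamma ** 100
    print(f"gamma={gamma}: horizon ~ {horizon:.0f} steps, "
          f"weight at step 100 = {w100:.5f}")
```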

Why not set $\gamma = 1$ and care about all future rewards equally? For episodic tasks (games, episodes that end), you can — the sum is finite. But for continuing tasks (a robot staying balanced forever), undiscounted returns are infinite and value functions diverge. Discounting also reduces variance in practice: distant rewards add noise because they depend on many uncertain future actions.

Discounted return from time tt:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

Recursive form (Bellman equation):

$$G_t = r_t + \gamma G_{t+1}$$
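A quick sanity check that the recursive form matches the direct sum, using an arbitrary made-up reward sequence:

```python
gamma = 0.99
rewards = [1.0, 0.5, 2.0, -1.0]   # arbitrary example rewards

# Direct sum: G_0 = sum_k gamma^k * r_k
G0_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form: G_t = r_t + gamma * G_{t+1}, with G = 0 after the episode
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G

assert abs(G - G0_direct) < 1e-12
```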

Value function:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s] = \mathbb{E}_\pi\!\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]$$

Q-learning target:

$$y = r + \gamma \max_{a'} Q(s', a')$$

GAE (Generalised Advantage Estimation) — uses both $\gamma$ and $\lambda$:

$$\hat{A}_t = \sum_{k=0}^{T-t} (\gamma \lambda)^k \delta_{t+k}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Effective horizon:

$$\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}$$

| $\gamma$ | Effective horizon | Character |
|---|---|---|
| 0.0 | 1 step | Purely greedy |
| 0.9 | 10 steps | Short-sighted |
| 0.99 | 100 steps | Standard RL |
| 0.999 | 1000 steps | Very far-sighted |
| 1.0 | $\infty$ | Undiscounted (episodic only) |
```python
import torch

# ── Discounted returns (used in REINFORCE, A2C) ──────────────────
def compute_returns(rewards, gamma=0.99):
    """Compute discounted returns from a list of rewards."""
    returns = []
    G = 0.0
    for r in reversed(rewards):          # work backwards
        G = r + gamma * G
        returns.insert(0, G)
    return torch.tensor(returns)         # (T,)

# ── Q-learning target ────────────────────────────────────────────
gamma = 0.99
with torch.no_grad():
    next_q = target_net(next_states).max(dim=1).values   # (B,)
    target = rewards + gamma * (1 - dones) * next_q      # (B,) — zero out terminal

# ── GAE ──────────────────────────────────────────────────────────
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_val = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_val - values[t]  # TD error
        gae = delta + gamma * lam * gae                    # accumulate
        advantages[t] = gae
    return advantages  # (T,)
```

Warning: Always multiply by (1 - done) when computing targets. Without this, the agent bootstraps from the next episode’s state at terminal transitions, creating nonsensical value estimates.
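A minimal illustration of why the mask matters (the numbers are made up; the second transition ends an episode, so its `next_q` comes from an unrelated state):

```python
import torch

gamma = 0.99
reward = torch.tensor([1.0, 1.0])
next_q = torch.tensor([5.0, 5.0])   # value of the "next" state (bogus at terminal)
done   = torch.tensor([0.0, 1.0])   # second transition is terminal

masked   = reward + gamma * (1 - done) * next_q   # correct: [5.95, 1.00]
unmasked = reward + gamma * next_q                # wrong:   [5.95, 5.95]
```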

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """
    Compute G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... for each timestep.
    rewards: (T,) array of rewards
    Returns: (T,) array of discounted returns
    """
    T = len(rewards)
    returns = np.zeros(T)
    G = 0.0
    for t in range(T - 1, -1, -1):   # backward pass
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns  # (T,)

def td_target(reward, next_value, done, gamma=0.99):
    """Single-step TD target: y = r + γ·V(s') (zeroed at terminal)."""
    return reward + gamma * (1.0 - done) * next_value

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation — combines γ and λ."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae_val = 0.0
    for t in range(T - 1, -1, -1):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]   # TD error δ_t
        gae_val = delta + gamma * lam * gae_val           # (γλ)-weighted sum
        advantages[t] = gae_val
    return advantages  # (T,)

# Example: verify effective horizon
gamma = 0.99
horizon = 1.0 / (1.0 - gamma)   # ≈ 100
weights = np.array([gamma**k for k in range(200)])
actual_95 = np.searchsorted(np.cumsum(weights) / weights.sum(), 0.95)
print(f"γ={gamma}: effective horizon={horizon:.0f}, 95% weight within {actual_95} steps")
```
  • Q-learning / DQN: $y = r + \gamma \max_{a'} Q(s', a')$ — the bootstrap target discounts the next state’s value
  • Policy gradient (REINFORCE, A2C, PPO): discounted returns $G_t$ or GAE advantages weight the policy gradient
  • GAE: the $(\gamma\lambda)$ product controls the bias-variance tradeoff in advantage estimation; $\gamma=0.99, \lambda=0.95$ is the standard combo
  • Model-based RL (MuZero, Dreamer): the discount factor in imagined rollouts controls the planning horizon
  • Inverse RL: inferring the discount factor from expert behaviour reveals the expert’s planning horizon
| Alternative | When to use | Tradeoff |
|---|---|---|
| $\gamma = 1$ (undiscounted) | Short episodic tasks, bandits | Only works if episodes terminate; infinite returns otherwise |
| Average reward formulation | Continuing tasks (robotics) | Replaces discounting with $r - \bar{r}$; harder to implement |
| Hyperbolic discounting | Modelling human behaviour | $1/(1+kt)$ decay; matches human psychology but complicates Bellman equations |
| Learned / adaptive $\gamma$ | Multi-timescale problems | Agent learns its own horizon; harder to train |
| $n$-step returns | Compromise between TD(0) and Monte Carlo | Fixed horizon $n$ instead of exponential decay; requires tuning $n$ |
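As a sketch of the last alternative, an $n$-step return sums $n$ discounted rewards and then bootstraps from the learned value function instead of decaying exponentially forever. The function below is an illustrative implementation (not from any particular library); `values` is assumed to hold $V(s_0), \ldots, V(s_T)$:

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return: G_t^(n) = r_t + γ·r_{t+1} + ... + γ^{n-1}·r_{t+n-1}
    + γ^n · V(s_{t+n}), truncated at episode end."""
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):        # sum up to n rewards
        G += gamma**k * rewards[t + k]
    if t + n < len(values):               # bootstrap from V(s_{t+n}) if available
        G += gamma**n * values[t + n]
    return G
```

With `n=1` this reduces to the one-step TD target $r_t + \gamma V(s_{t+1})$; with `n` at least the episode length (and terminal value zero) it is the full Monte Carlo return.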

Discounting entered RL from economics and dynamic programming (Bellman, 1957), where it models time-preference for money. Sutton & Barto formalised its role in the RL framework: with bounded rewards, $\gamma < 1$ keeps returns finite and makes the Bellman operator a contraction, so value functions converge even in continuing tasks.

The practical insight that $\gamma$ defines the effective planning horizon — and that tuning it can matter more than algorithm choice — emerged from empirical work in the 2000s–2010s. GAE (Schulman et al., 2016) introduced $\lambda$ as a second discount-like parameter that controls the bias-variance tradeoff in advantage estimation, giving practitioners two knobs: $\gamma$ for “how far to plan” and $\lambda$ for “how much to trust the value function.”