Discount Factor (Gamma, γ)

Weights future rewards exponentially, making near-term rewards worth more than distant ones. The single scalar $\gamma \in [0, 1)$ defines what “long-term” means for an RL agent — it appears in every value function, every target computation, and every return estimate. Arguably the most important hyperparameter in RL.

A dollar today is worth more than a dollar next year. Discounting in RL works the same way: a reward $k$ steps in the future is worth $\gamma^k$ times its face value. With $\gamma = 0.99$, a reward 100 steps away is worth $0.99^{100} \approx 0.37$ — significant but diminished. With $\gamma = 0.9$, that same reward is worth $0.9^{100} \approx 0.00003$ — effectively invisible.

The effective horizon — how far ahead the agent meaningfully plans — is approximately $1/(1-\gamma)$. So $\gamma = 0.99$ gives a horizon of ~100 steps, $\gamma = 0.999$ gives ~1000 steps, and $\gamma = 0$ makes the agent completely myopic (only cares about immediate reward).
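These horizon numbers are easy to check numerically; a quick sketch:

```python
# Compare the 1/(1-gamma) horizon heuristic with the actual weight
# a reward 100 steps away receives under each discount factor.
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)
    w100 = gamma ** 100
    print(f"gamma={gamma}: horizon ~ {horizon:.0f} steps, "
          f"weight at step 100 = {w100:.5f}")
```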

Why not set $\gamma = 1$ and care about all future rewards equally? For episodic tasks (games, episodes that end), you can — the sum is finite. But for continuing tasks (a robot staying balanced forever), undiscounted returns are infinite and value functions diverge. Discounting also reduces variance in practice: distant rewards add noise because they depend on many uncertain future actions.

Discounted return from time tt:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

Recursive form (Bellman equation):

$$G_t = r_t + \gamma G_{t+1}$$
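A quick sanity check that the recursive form matches the direct sum, using an arbitrary made-up reward sequence:

```python
gamma = 0.99
rewards = [1.0, 0.5, 2.0, -1.0]   # arbitrary example rewards

# Direct sum: G_0 = sum_k gamma^k * r_k
G0_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form: G_t = r_t + gamma * G_{t+1}, with G = 0 after the episode
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G

assert abs(G - G0_direct) < 1e-12
```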

Value function:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s] = \mathbb{E}_\pi\!\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]$$

Q-learning target:

$$y = r + \gamma \max_{a'} Q(s', a')$$

GAE (Generalised Advantage Estimation) — uses both $\gamma$ and $\lambda$:

$$\hat{A}_t = \sum_{k=0}^{T-t} (\gamma \lambda)^k \delta_{t+k}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Effective horizon:

$$\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}$$

| $\gamma$ | Effective horizon | Character |
|---|---|---|
| 0.0 | 1 step | Purely greedy |
| 0.9 | 10 steps | Short-sighted |
| 0.99 | 100 steps | Standard RL |
| 0.999 | 1000 steps | Very far-sighted |
| 1.0 | $\infty$ | Undiscounted (episodic only) |
```python
import torch

# ── Discounted returns (used in REINFORCE, A2C) ──────────────────
def compute_returns(rewards, gamma=0.99):
    """Compute discounted returns from a list of rewards."""
    returns = []
    G = 0.0
    for r in reversed(rewards):          # work backwards
        G = r + gamma * G
        returns.insert(0, G)
    return torch.tensor(returns)         # (T,)

# ── Q-learning target ────────────────────────────────────────────
gamma = 0.99
with torch.no_grad():
    next_q = target_net(next_states).max(dim=1).values   # (B,)
    target = rewards + gamma * (1 - dones) * next_q      # (B,) — zero out terminal

# ── GAE ──────────────────────────────────────────────────────────
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_val = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_val - values[t]  # TD error
        gae = delta + gamma * lam * gae                    # accumulate
        advantages[t] = gae
    return advantages  # (T,)
```

Warning: Always multiply by (1 - done) when computing targets. Without this, the agent bootstraps from the next episode’s state at terminal transitions, creating nonsensical value estimates.
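A minimal illustration of why the mask matters (the numbers are made up; the second transition ends an episode, so its `next_q` comes from an unrelated state):

```python
import torch

gamma = 0.99
reward = torch.tensor([1.0, 1.0])
next_q = torch.tensor([5.0, 5.0])   # value of the "next" state (bogus at terminal)
done   = torch.tensor([0.0, 1.0])   # second transition is terminal

masked   = reward + gamma * (1 - done) * next_q   # correct: [5.95, 1.00]
unmasked = reward + gamma * next_q                # wrong:   [5.95, 5.95]
```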

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """
    Compute G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... for each timestep.
    rewards: (T,) array of rewards
    Returns: (T,) array of discounted returns
    """
    T = len(rewards)
    returns = np.zeros(T)
    G = 0.0
    for t in range(T - 1, -1, -1):   # backward pass
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns  # (T,)

def td_target(reward, next_value, done, gamma=0.99):
    """Single-step TD target: y = r + γ·V(s') (zeroed at terminal)."""
    return reward + gamma * (1.0 - done) * next_value

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation — combines γ and λ."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae_val = 0.0
    for t in range(T - 1, -1, -1):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]   # TD error δ_t
        gae_val = delta + gamma * lam * gae_val           # (γλ)-weighted sum
        advantages[t] = gae_val
    return advantages  # (T,)

# Example: verify effective horizon
gamma = 0.99
horizon = 1.0 / (1.0 - gamma)   # ≈ 100
weights = np.array([gamma**k for k in range(200)])
actual_95 = np.searchsorted(np.cumsum(weights) / weights.sum(), 0.95)
print(f"γ={gamma}: effective horizon={horizon:.0f}, 95% weight within {actual_95} steps")
```
  • Q-learning / DQN: $y = r + \gamma \max_{a'} Q(s', a')$ — the bootstrap target discounts the next state’s value
  • Policy gradient (REINFORCE, A2C, PPO): discounted returns $G_t$ or GAE advantages weight the policy gradient
  • GAE: the $(\gamma\lambda)$ product controls the bias-variance tradeoff in advantage estimation; $\gamma=0.99, \lambda=0.95$ is the standard combo
  • Model-based RL (MuZero, Dreamer): the discount factor in imagined rollouts controls the planning horizon
  • Inverse RL: inferring the discount factor from expert behaviour reveals the expert’s planning horizon
| Alternative | When to use | Tradeoff |
|---|---|---|
| $\gamma = 1$ (undiscounted) | Short episodic tasks, bandits | Only works if episodes terminate; infinite returns otherwise |
| Average reward formulation | Continuing tasks (robotics) | Replaces discounting with $r - \bar{r}$; harder to implement |
| Hyperbolic discounting | Modelling human behaviour | $1/(1+kt)$ decay; matches human psychology but complicates Bellman equations |
| Learned / adaptive $\gamma$ | Multi-timescale problems | Agent learns its own horizon; harder to train |
| $n$-step returns | Compromise between TD(0) and Monte Carlo | Fixed horizon $n$ instead of exponential decay; requires tuning $n$ |
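As a sketch of the last alternative, an $n$-step return sums $n$ discounted rewards and then bootstraps from the learned value function instead of decaying exponentially forever. The function below is an illustrative implementation (not from any particular library); `values` is assumed to hold $V(s_0), \ldots, V(s_T)$:

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return: G_t^(n) = r_t + γ·r_{t+1} + ... + γ^{n-1}·r_{t+n-1}
    + γ^n · V(s_{t+n}), truncated at episode end."""
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):        # sum up to n rewards
        G += gamma**k * rewards[t + k]
    if t + n < len(values):               # bootstrap from V(s_{t+n}) if available
        G += gamma**n * values[t + n]
    return G
```

With `n=1` this reduces to the one-step TD target $r_t + \gamma V(s_{t+1})$; with `n` at least the episode length (and terminal value zero) it is the full Monte Carlo return.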

Discounting entered RL from economics and dynamic programming (Bellman, 1957), where it models time-preference for money. Sutton & Barto formalised its role in the RL framework: with bounded rewards, $\gamma < 1$ keeps returns finite and makes the Bellman operator a contraction, so value functions converge even in continuing tasks.

The practical insight that $\gamma$ defines the effective planning horizon — and that tuning it can matter more than algorithm choice — emerged from empirical work in the 2000s–2010s. GAE (Schulman et al., 2016) introduced $\lambda$ as a second discount-like parameter that controls the bias-variance tradeoff in advantage estimation, giving practitioners two knobs: $\gamma$ for “how far to plan” and $\lambda$ for “how much to trust the value function.”