
Tanh

Hyperbolic tangent: maps any real number to (-1, 1). Zero-centred unlike sigmoid, which makes it better suited for hidden layers and anywhere outputs should be symmetric around zero. The standard activation inside LSTM and GRU cells, and used for bounding outputs in RL and normalisation contexts.

Tanh is sigmoid’s zero-centred cousin. While sigmoid maps to (0, 1), tanh maps to (-1, 1). This zero-centring matters: when all activations are positive (as with sigmoid), the gradients for the weights in the next layer are all the same sign, which forces the optimiser into inefficient zig-zag updates. Tanh produces both positive and negative outputs, allowing weights to receive mixed-sign gradients and update more efficiently.
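The zig-zag effect is easy to see numerically: every sigmoid output is strictly positive, so downstream weight gradients all share a sign, while tanh outputs are roughly zero-centred with both signs present. A quick NumPy check (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

sig = 1.0 / (1.0 + np.exp(-x))   # sigmoid: every output in (0, 1)
tan = np.tanh(x)                 # tanh: outputs in (-1, 1), roughly zero-mean

print((sig > 0).all())                       # no negative activations at all
print(abs(tan.mean()) < 0.05)                # tanh output is approximately zero-centred
print((tan > 0).any() and (tan < 0).any())   # mixed signs, so mixed-sign gradients
```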

Tanh has the same saturation problem as sigmoid — for |x| > 3, the gradient is nearly zero, causing vanishing gradients in deep networks. But its maximum gradient at x = 0 is 1.0 (compared to sigmoid’s 0.25), so it passes gradients through four times more strongly in the linear regime. This made tanh the preferred hidden activation before ReLU.

Today, tanh appears in two main roles. First, inside recurrent cells: LSTMs use tanh to produce the candidate cell state and to squash the cell state before output. The bounded range prevents the cell state from growing without bound. Second, for bounding outputs: in continuous-action RL (SAC, TD3), tanh squashes unbounded network outputs into [-1, 1] action ranges. In normalisation, tanh can softly squash extreme values while preserving gradients near zero.

Definition:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

Gradient:

$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$$

The maximum gradient is 1.0 at x = 0 (vs. sigmoid’s 0.25), dropping below 0.01 for |x| > 3.
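These numbers follow directly from the two derivative formulas; a quick NumPy check:

```python
import numpy as np

def dtanh(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

def dsigmoid(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

print(dtanh(0.0))      # 1.0  : maximum tanh gradient
print(dsigmoid(0.0))   # 0.25 : maximum sigmoid gradient, 4x smaller
print(dtanh(3.0) < 0.01, dtanh(-3.0) < 0.01)  # True True : saturation beyond |x| = 3
```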

Relation to sigmoid:

$$\tanh(x) = 2\sigma(2x) - 1 \qquad \sigma(x) = \frac{\tanh(x/2) + 1}{2}$$

They are shifted and rescaled versions of each other: same shape, different range.
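Both identities are easy to verify numerically:

```python
import numpy as np

x = np.linspace(-5, 5, 101)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# tanh(x) == 2 * sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigma(2 * x) - 1)
# sigmoid(x) == (tanh(x/2) + 1) / 2
assert np.allclose(sigma(x), (np.tanh(x / 2) + 1) / 2)
```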

Change-of-variables correction for RL (used in SAC for squashed Gaussian policies):

If a = tanh(u) where u ∼ N(μ, σ²), the log-probability correction is:

$$\log \pi(a|s) = \log \pi_u(u|s) - \sum_i \log(1 - \tanh^2(u_i))$$

```python
import torch

B, d_model, H = 32, 64, 128
x = torch.randn(B, d_model)                            # (B, d_model)

# ── As an activation ───────────────────────────────────────────
out = torch.tanh(x)                                    # (B, d_model), values in (-1, 1)

# ── Inside an LSTM cell (simplified; toy weights) ──────────────
W_f, W_i, W_c, W_o = (torch.randn(d_model, H) for _ in range(4))
U_f, U_i, U_c, U_o = (torch.randn(H, H) for _ in range(4))
h = torch.zeros(B, H)                                  # previous hidden state
cell_prev = torch.zeros(B, H)                          # previous cell state

forget = torch.sigmoid(x @ W_f + h @ U_f)              # (B, H), gate in (0, 1)
input_gate = torch.sigmoid(x @ W_i + h @ U_i)          # (B, H), gate in (0, 1)
candidate = torch.tanh(x @ W_c + h @ U_c)              # (B, H), new info in (-1, 1)
cell = forget * cell_prev + input_gate * candidate     # (B, H)
out_gate = torch.sigmoid(x @ W_o + h @ U_o)            # (B, H), gate in (0, 1)
h_new = out_gate * torch.tanh(cell)                    # (B, H), squashed output

# ── Bounding RL actions (SAC, TD3) ─────────────────────────────
d_state, action_dim = 17, 6                            # toy sizes for illustration
policy_net = torch.nn.Linear(d_state, action_dim)      # stand-in policy network
state = torch.randn(B, d_state)
raw_action = policy_net(state)                         # (B, action_dim), unbounded
action = torch.tanh(raw_action)                        # (B, action_dim), bounded to (-1, 1)
# WARNING: you must apply the log-probability correction when computing
# the policy gradient. Forgetting it is a common SAC implementation bug.

# ── Don't forget: PyTorch's LSTM does this internally ──────────
lstm = torch.nn.LSTM(input_size=d_model, hidden_size=H, batch_first=True)
output, (h_n, c_n) = lstm(x.unsqueeze(1))              # input (B, T=1, d_model); tanh is built into the cell
```
```python
import numpy as np

def tanh(x):
    """
    Numerically stable tanh. x: any shape array.
    np.tanh handles this fine, but here's the explicit version:
    """
    # For very large |x|, exp overflows. np.tanh is safe; manual version:
    pos = np.exp(np.minimum(x, 20))   # clip to prevent overflow
    neg = np.exp(np.minimum(-x, 20))
    return (pos - neg) / (pos + neg)

def tanh_backward(x, grad_output):
    """Gradient of tanh: 1 - tanh(x)^2."""
    t = np.tanh(x)
    return grad_output * (1.0 - t * t)  # max = 1.0 at x = 0

def squashed_gaussian_log_prob(raw_action, mean, log_std):
    """
    Log probability of a tanh-squashed Gaussian (for SAC).
    raw_action: (B, D), the pre-tanh sample u.
    Applies the change-of-variables correction.
    """
    std = np.exp(log_std)
    # Gaussian log-prob of u
    log_prob = -0.5 * ((raw_action - mean) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi)
    # Correction for tanh squashing: -log(1 - tanh(u)^2)
    log_prob -= np.log(1 - np.tanh(raw_action) ** 2 + 1e-6)  # (B, D)
    return log_prob.sum(axis=-1)  # (B,)
```
  • LSTM/GRU internal computations: tanh produces the candidate cell state and squashes the cell output — the bounded range prevents values from growing unboundedly through recurrence
  • Continuous action spaces in RL (policy-gradient/, q-learning/): SAC and TD3 use tanh to bound policy outputs to [-1, 1], which is then rescaled to the environment’s action range
  • Weight initialisation (Xavier/Glorot): the original Xavier init was derived assuming tanh activations — it sets weight variance to keep activations in the linear regime of tanh
  • Normalisation layers: some architectures use tanh to bound intermediate representations, e.g. in flow-based models
  • Positional encodings: original Transformer sinusoidal positional encodings use sin/cos (which have the same [-1, 1] range as tanh) for position-dependent features
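To illustrate the Xavier point: scaling weights by 1/√fan_in (the Xavier variance when fan-in equals fan-out) keeps pre-activations inside tanh’s linear regime, so activations neither explode nor collapse across layers. A minimal sketch; depth, widths, and the bounds below are arbitrary illustrative choices, not from the original derivation:

```python
import numpy as np

rng = np.random.default_rng(42)
fan_in = fan_out = 256
depth = 10

h = rng.standard_normal((512, fan_in))        # batch of inputs, std ~ 1
for _ in range(depth):
    W = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)  # Xavier-style scale
    h = np.tanh(h @ W)

# After 10 tanh layers, activations stay at a healthy scale:
# neither exploded past 1 nor collapsed to ~0.
print(0.1 < h.std() < 1.0)
```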
| Alternative | When to use | Tradeoff |
|---|---|---|
| Sigmoid | When you need (0, 1) range (gates, binary probabilities) | Not zero-centred; lower max gradient (0.25 vs 1.0). Use sigmoid for gates, tanh for values |
| ReLU | Hidden layers in feedforward networks | No saturation for positive inputs; not bounded, so unusable where you need a fixed range |
| Hardtanh | When you want bounded outputs without exp() | clip(x, -1, 1): piecewise linear, faster, but gradient is exactly 0 outside [-1, 1] |
| Softsign | Smoother alternative with lighter tails | x / (1 + \|x\|): avoids exp(), but saturates more slowly (polynomial rather than exponential tails) |
| Clamp/clip | Hard bounding during inference | Not differentiable at boundaries; fine for inference but problematic for training |
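The hardtanh row is worth seeing concretely: written as clip(x, -1, 1), its gradient is exactly zero outside [-1, 1], whereas tanh keeps a small nonzero gradient everywhere. A NumPy sketch:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

hardtanh = np.clip(x, -1.0, 1.0)                     # piecewise-linear bounding
hard_grad = ((x > -1.0) & (x < 1.0)).astype(float)   # 1 inside, exactly 0 outside
tanh_grad = 1.0 - np.tanh(x) ** 2                    # small but nonzero everywhere

print(hard_grad)              # [0. 1. 1. 1. 0.] : dead outside [-1, 1]
print((tanh_grad > 0).all())  # True : tanh never has exactly zero gradient
```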

The hyperbolic tangent has been used in neural networks since the earliest days, preferred over sigmoid for hidden layers due to its zero-centred output. LeCun et al. (1998, “Efficient BackProp”) provided the theoretical and practical arguments for using tanh over sigmoid: zero-centred activations lead to faster convergence because gradients don’t have a systematic sign bias.

The Xavier initialisation (Glorot & Bengio, 2010) was specifically designed for tanh networks, deriving the optimal weight variance to keep activations in tanh’s linear regime. When ReLU replaced tanh as the default hidden activation (2010-2012), a new initialisation was needed — He init (2015) filled that role. Tanh found renewed importance in reinforcement learning when Haarnoja et al. (2018, SAC) used it to bound continuous actions from a Gaussian policy, requiring the now-standard tanh squashing correction for log-probabilities. Inside LSTMs and GRUs, tanh has never been replaced — it remains the standard activation for cell state computation.