
Tanh

Hyperbolic tangent: maps any real number to (-1, 1). Zero-centred unlike sigmoid, which makes it better suited for hidden layers and anywhere outputs should be symmetric around zero. The standard activation inside LSTM and GRU cells, and used for bounding outputs in RL and normalisation contexts.

Tanh is sigmoid’s zero-centred cousin. While sigmoid maps to (0, 1), tanh maps to (-1, 1). This zero-centring matters: when all activations are positive (as with sigmoid), the gradients for the weights in the next layer are all the same sign, which forces the optimiser into inefficient zig-zag updates. Tanh produces both positive and negative outputs, allowing weights to receive mixed-sign gradients and update more efficiently.
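The zig-zag effect is easy to see numerically: every sigmoid output is strictly positive, so downstream weight gradients all share a sign, while tanh outputs are roughly zero-centred with both signs present. A quick NumPy check (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

sig = 1.0 / (1.0 + np.exp(-x))   # sigmoid: every output in (0, 1)
tan = np.tanh(x)                 # tanh: outputs in (-1, 1), roughly zero-mean

print((sig > 0).all())                       # no negative activations at all
print(abs(tan.mean()) < 0.05)                # tanh output is approximately zero-centred
print((tan > 0).any() and (tan < 0).any())   # mixed signs, so mixed-sign gradients
```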

Tanh has the same saturation problem as sigmoid — for |x| > 3, the gradient is nearly zero, causing vanishing gradients in deep networks. But its maximum gradient at x = 0 is 1.0 (compared to sigmoid’s 0.25), so it passes gradients through four times more strongly in the linear regime. This made tanh the preferred hidden activation before ReLU.

Today, tanh appears in two main roles. First, inside recurrent cells: LSTMs use tanh to produce the candidate cell state and to squash the cell state before output. The bounded range prevents the cell state from growing without bound. Second, for bounding outputs: in continuous-action RL (SAC, TD3), tanh squashes unbounded network outputs into [-1, 1] action ranges. In normalisation, tanh can softly squash extreme values while preserving gradients near zero.

Definition:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

Gradient:

$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$$

The maximum gradient is 1.0 at x = 0 (vs. sigmoid’s 0.25), dropping below 0.01 for |x| > 3.
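These numbers follow directly from the two derivative formulas; a quick NumPy check:

```python
import numpy as np

def dtanh(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

def dsigmoid(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

print(dtanh(0.0))      # 1.0  : maximum tanh gradient
print(dsigmoid(0.0))   # 0.25 : maximum sigmoid gradient, 4x smaller
print(dtanh(3.0) < 0.01, dtanh(-3.0) < 0.01)  # True True : saturation beyond |x| = 3
```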

Relation to sigmoid:

$$\tanh(x) = 2\sigma(2x) - 1 \qquad \sigma(x) = \frac{\tanh(x/2) + 1}{2}$$

They are shifted and rescaled versions of each other: same shape, different range.
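Both identities are easy to verify numerically:

```python
import numpy as np

x = np.linspace(-5, 5, 101)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# tanh(x) == 2 * sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigma(2 * x) - 1)
# sigmoid(x) == (tanh(x/2) + 1) / 2
assert np.allclose(sigma(x), (np.tanh(x / 2) + 1) / 2)
```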

Change-of-variables correction for RL (used in SAC for squashed Gaussian policies):

If a = tanh(u) where u ∼ N(μ, σ²), the log-probability correction is:

$$\log \pi(a|s) = \log \pi_u(u|s) - \sum_i \log(1 - \tanh^2(u_i))$$

```python
import torch

B, d_model, H = 32, 64, 128
x = torch.randn(B, d_model)                            # (B, d_model)

# ── As an activation ───────────────────────────────────────────
out = torch.tanh(x)                                    # (B, d_model), values in (-1, 1)

# ── Inside an LSTM cell (simplified; toy weights) ──────────────
W_f, W_i, W_c, W_o = (torch.randn(d_model, H) for _ in range(4))
U_f, U_i, U_c, U_o = (torch.randn(H, H) for _ in range(4))
h = torch.zeros(B, H)                                  # previous hidden state
cell_prev = torch.zeros(B, H)                          # previous cell state

forget = torch.sigmoid(x @ W_f + h @ U_f)              # (B, H), gate in (0, 1)
input_gate = torch.sigmoid(x @ W_i + h @ U_i)          # (B, H), gate in (0, 1)
candidate = torch.tanh(x @ W_c + h @ U_c)              # (B, H), new info in (-1, 1)
cell = forget * cell_prev + input_gate * candidate     # (B, H)
out_gate = torch.sigmoid(x @ W_o + h @ U_o)            # (B, H), gate in (0, 1)
h_new = out_gate * torch.tanh(cell)                    # (B, H), squashed output

# ── Bounding RL actions (SAC, TD3) ─────────────────────────────
d_state, action_dim = 17, 6                            # toy sizes for illustration
policy_net = torch.nn.Linear(d_state, action_dim)      # stand-in policy network
state = torch.randn(B, d_state)
raw_action = policy_net(state)                         # (B, action_dim), unbounded
action = torch.tanh(raw_action)                        # (B, action_dim), bounded to (-1, 1)
# WARNING: you must apply the log-probability correction when computing
# the policy gradient. Forgetting it is a common SAC implementation bug.

# ── Don't forget: PyTorch's LSTM does this internally ──────────
lstm = torch.nn.LSTM(input_size=d_model, hidden_size=H, batch_first=True)
output, (h_n, c_n) = lstm(x.unsqueeze(1))              # input (B, T=1, d_model); tanh is built into the cell
```
```python
import numpy as np

def tanh(x):
    """
    Numerically stable tanh. x: any shape array.
    np.tanh handles this fine, but here's the explicit version:
    """
    # For very large |x|, exp overflows. np.tanh is safe; manual version:
    pos = np.exp(np.minimum(x, 20))   # clip to prevent overflow
    neg = np.exp(np.minimum(-x, 20))
    return (pos - neg) / (pos + neg)

def tanh_backward(x, grad_output):
    """Gradient of tanh: 1 - tanh(x)^2."""
    t = np.tanh(x)
    return grad_output * (1.0 - t * t)  # max = 1.0 at x = 0

def squashed_gaussian_log_prob(raw_action, mean, log_std):
    """
    Log probability of a tanh-squashed Gaussian (for SAC).
    raw_action: (B, D), the pre-tanh sample u.
    Applies the change-of-variables correction.
    """
    std = np.exp(log_std)
    # Gaussian log-prob of u
    log_prob = -0.5 * ((raw_action - mean) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi)
    # Correction for tanh squashing: -log(1 - tanh(u)^2)
    log_prob -= np.log(1 - np.tanh(raw_action) ** 2 + 1e-6)  # (B, D)
    return log_prob.sum(axis=-1)  # (B,)
```
  • LSTM/GRU internal computations: tanh produces the candidate cell state and squashes the cell output — the bounded range prevents values from growing unboundedly through recurrence
  • Continuous action spaces in RL (policy-gradient/, q-learning/): SAC and TD3 use tanh to bound policy outputs to [-1, 1], which is then rescaled to the environment’s action range
  • Weight initialisation (Xavier/Glorot): the original Xavier init was derived assuming tanh activations — it sets weight variance to keep activations in the linear regime of tanh
  • Normalisation layers: some architectures use tanh to bound intermediate representations, e.g. in flow-based models
  • Positional encodings: original Transformer sinusoidal positional encodings use sin/cos (which have the same [-1, 1] range as tanh) for position-dependent features
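To illustrate the Xavier point: scaling weights by 1/√fan_in (the Xavier variance when fan-in equals fan-out) keeps pre-activations inside tanh’s linear regime, so activations neither explode nor collapse across layers. A minimal sketch; depth, widths, and the bounds below are arbitrary illustrative choices, not from the original derivation:

```python
import numpy as np

rng = np.random.default_rng(42)
fan_in = fan_out = 256
depth = 10

h = rng.standard_normal((512, fan_in))        # batch of inputs, std ~ 1
for _ in range(depth):
    W = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)  # Xavier-style scale
    h = np.tanh(h @ W)

# After 10 tanh layers, activations stay at a healthy scale:
# neither exploded past 1 nor collapsed to ~0.
print(0.1 < h.std() < 1.0)
```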
| Alternative | When to use | Tradeoff |
|---|---|---|
| Sigmoid | When you need (0, 1) range (gates, binary probabilities) | Not zero-centred; lower max gradient (0.25 vs 1.0). Use sigmoid for gates, tanh for values |
| ReLU | Hidden layers in feedforward networks | No saturation for positive inputs; not bounded, so unusable where you need a fixed range |
| Hardtanh | When you want bounded outputs without exp() | clip(x, -1, 1): piecewise linear, faster, but gradient is exactly 0 outside [-1, 1] |
| Softsign | Smoother alternative with lighter tails | x / (1 + \|x\|): avoids exp(), but saturates more slowly (polynomial rather than exponential tails) |
| Clamp/clip | Hard bounding during inference | Not differentiable at boundaries; fine for inference but problematic for training |
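The hardtanh row is worth seeing concretely: written as clip(x, -1, 1), its gradient is exactly zero outside [-1, 1], whereas tanh keeps a small nonzero gradient everywhere. A NumPy sketch:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

hardtanh = np.clip(x, -1.0, 1.0)                     # piecewise-linear bounding
hard_grad = ((x > -1.0) & (x < 1.0)).astype(float)   # 1 inside, exactly 0 outside
tanh_grad = 1.0 - np.tanh(x) ** 2                    # small but nonzero everywhere

print(hard_grad)              # [0. 1. 1. 1. 0.] : dead outside [-1, 1]
print((tanh_grad > 0).all())  # True : tanh never has exactly zero gradient
```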

The hyperbolic tangent has been used in neural networks since the earliest days, preferred over sigmoid for hidden layers due to its zero-centred output. LeCun et al. (1998, “Efficient BackProp”) provided the theoretical and practical arguments for using tanh over sigmoid: zero-centred activations lead to faster convergence because gradients don’t have a systematic sign bias.

The Xavier initialisation (Glorot & Bengio, 2010) was specifically designed for tanh networks, deriving the optimal weight variance to keep activations in tanh’s linear regime. When ReLU replaced tanh as the default hidden activation (2010-2012), a new initialisation was needed — He init (2015) filled that role. Tanh found renewed importance in reinforcement learning when Haarnoja et al. (2018, SAC) used it to bound continuous actions from a Gaussian policy, requiring the now-standard tanh squashing correction for log-probabilities. Inside LSTMs and GRUs, tanh has never been replaced — it remains the standard activation for cell state computation.