Sigmoid
The logistic function σ(x) = 1 / (1 + e^(−x)), mapping any real number to the range (0, 1). The original neural network activation function, now primarily used for gating mechanisms (LSTM, GRU), binary classification outputs, and anywhere you need a smooth switch between 0 and 1.
Intuition
Sigmoid squashes the entire real line into the interval (0, 1). Think of it as a “soft switch”: values far below zero map to ~0 (off), values far above zero map to ~1 (on), and the transition happens smoothly around zero. This makes it a natural choice anywhere you need a probability or a gate value.
The problem with sigmoid as a hidden-layer activation is saturation. When the input is far from zero in either direction, the output is nearly flat — the gradient is close to zero. In a deep network, multiplying many near-zero gradients together via the chain rule causes the vanishing gradient problem: early layers receive essentially no learning signal. This is why sigmoid was replaced by ReLU for hidden layers.
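This shrinkage is easy to see numerically. A minimal NumPy sketch (illustrative only, with unit weights rather than a real architecture) backpropagates through a chain of ten sigmoid activations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Backprop through a chain of 10 sigmoid activations (weights = 1 for clarity):
# each step multiplies the gradient by sigma'(x) = sigma(x)(1 - sigma(x)) <= 0.25.
x = 2.0
grad = 1.0
for _ in range(10):
    s = sigmoid(x)
    grad *= s * (1 - s)  # local derivative of sigmoid at this layer
    x = s                # activation feeds the next layer

print(grad)  # ~1e-7 — early layers receive almost no signal
```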
But sigmoid is still heavily used as a gate. In LSTMs, the forget gate, input gate, and output gate all use sigmoid — they need to produce values in (0, 1) that represent “how much to remember” or “how much to let through.” The saturation that hurts hidden layers actually helps gates: you want gates to commit to being near 0 or near 1 after training, and sigmoid’s flat regions make those decisions stable.
Sigmoid is also the output activation for binary classification. A single logit passed through sigmoid gives the predicted probability of the positive class. This is the foundation of logistic regression and remains the standard for binary prediction heads.
Definition:

σ(x) = 1 / (1 + e^(−x))

Key property — derivative in terms of itself:

σ′(x) = σ(x) (1 − σ(x))

The maximum gradient is σ′(0) = 0.25, at x = 0. This means every sigmoid layer can at most pass through 25% of the gradient — the root cause of vanishing gradients when stacking sigmoid layers.
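Both claims are easy to verify numerically; the sketch below (plain NumPy) checks the self-referential derivative against central finite differences and confirms the 0.25 peak:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 1201)  # grid includes x = 0
s = sigmoid(x)
analytic = s * (1 - s)        # derivative written in terms of sigmoid itself

# Central finite-difference approximation of the derivative
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(np.abs(analytic - numeric).max())  # tiny — the identity holds
print(analytic.max())                    # 0.25, attained at x = 0
```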
Relation to softmax: for two classes with logits z1 and z2:

softmax([z1, z2])_1 = e^z1 / (e^z1 + e^z2) = 1 / (1 + e^(−(z1 − z2))) = σ(z1 − z2)

Binary sigmoid classification is equivalent to 2-class softmax.
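A quick numerical check of the equivalence, using arbitrary example logits:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())  # shift logits for numerical stability
    return e / e.sum()

z1, z2 = 1.3, -0.7
p_softmax = softmax(np.array([z1, z2]))[0]  # P(class 1) from a 2-way softmax
p_sigmoid = sigmoid(z1 - z2)                # sigmoid of the logit difference
print(p_softmax, p_sigmoid)                 # identical up to float error
```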
Relation to tanh:

tanh(x) = 2σ(2x) − 1, equivalently σ(x) = (1 + tanh(x/2)) / 2

Tanh is just a rescaled and shifted sigmoid.
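Numerically, tanh(x) = 2σ(2x) − 1 holds to machine precision:

```python
import numpy as np

x = np.linspace(-4, 4, 101)
sigmoid_2x = 1.0 / (1.0 + np.exp(-2 * x))

# tanh as a rescaled, shifted sigmoid
rescaled = 2 * sigmoid_2x - 1
print(np.abs(np.tanh(x) - rescaled).max())  # ~machine epsilon
```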
```python
import torch
import torch.nn.functional as F

x = torch.randn(B, d_model)  # (B, d_model)

# ── As an activation ───────────────────────────────────────────
out = torch.sigmoid(x)  # (B, d_model) — values in (0, 1)

# ── Binary classification output ───────────────────────────────
logit = model(x)             # (B, 1) or (B,) — single raw score
prob = torch.sigmoid(logit)  # (B,) — predicted P(y=1)

# WARNING: For training, do NOT do sigmoid + BCE. Use
# F.binary_cross_entropy_with_logits(logit, target) which is numerically
# stable. Only apply sigmoid at inference time for interpretable probabilities.

# ── LSTM-style gating ──────────────────────────────────────────
forget_gate = torch.sigmoid(self.W_f(x) + self.U_f(h))  # (B, hidden_size) — in (0, 1)
cell = forget_gate * cell_prev + ...  # gate controls how much to keep

# ── Multi-label classification ─────────────────────────────────
logits = model(x)              # (B, n_labels) — independent per label
probs = torch.sigmoid(logits)  # (B, n_labels) — each in (0, 1), NOT summing to 1
loss = F.binary_cross_entropy_with_logits(logits, targets)  # per-label BCE
```

Manual Implementation
```python
import numpy as np

def sigmoid(x):
    """
    Numerically stable sigmoid.
    x: any shape array.
    Uses two branches to avoid overflow in exp():
      - x >= 0: 1 / (1 + exp(-x)) — exp(-x) is small, no overflow
      - x <  0: exp(x) / (1 + exp(x)) — exp(x) is small, no overflow
    """
    return np.where(x >= 0,
                    1.0 / (1.0 + np.exp(-x)),
                    np.exp(x) / (1.0 + np.exp(x)))

def sigmoid_backward(x, grad_output):
    """Gradient of sigmoid: σ(x)(1 - σ(x))."""
    s = sigmoid(x)
    return grad_output * s * (1 - s)  # max gradient = 0.25 at x = 0

def binary_cross_entropy_with_logits(logits, targets):
    """
    Stable BCE from raw logits (no explicit sigmoid needed).
    logits: (B,), targets: (B,) floats in {0, 1}.
    """
    return (np.maximum(0, logits) - logits * targets
            + np.log1p(np.exp(-np.abs(logits)))).mean()
```

Popular Uses
- LSTM/GRU gates: forget, input, and output gates all use sigmoid to produce values in (0, 1) — controls information flow through recurrent cells
- Binary classification (logistic regression, spam filters, real/fake discriminators): sigmoid on a single logit gives P(y = 1 | x)
- GAN discriminators (gans/): vanilla GAN discriminator outputs sigmoid probability of “real” — though modern GANs often skip sigmoid and use logit-based losses
- Multi-label classification: independent sigmoid per label when multiple labels can be true simultaneously (unlike softmax which assumes mutual exclusivity)
- Attention gating (gated attention, GLU): sigmoid gates control how much information flows through multiplicative interactions
- Diffusion timestep embeddings (diffusion/): sigmoid sometimes used in timestep conditioning networks
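To make the GLU-style gating above concrete, here is a minimal NumPy sketch; the projection matrices `W_value` and `W_gate` are made-up illustrative weights, not taken from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W_value, W_gate):
    """Gated linear unit: the sigmoid half multiplicatively gates the linear half."""
    return (x @ W_value) * sigmoid(x @ W_gate)

d_in, d_out = 8, 4
x = rng.standard_normal((2, d_in))            # (B, d_in) batch of inputs
W_value = rng.standard_normal((d_in, d_out))  # linear ("value") projection
W_gate = rng.standard_normal((d_in, d_out))   # gate projection, squashed to (0, 1)
out = glu(x, W_value, W_gate)                 # (2, d_out)
print(out.shape)
```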
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Softmax | Multi-class classification (mutually exclusive classes) | Outputs sum to 1; enforces competition between classes. Sigmoid treats each class independently |
| Tanh | When you need outputs in (−1, 1) instead of (0, 1) | Zero-centred; same saturation issues but better for hidden layers than sigmoid |
| ReLU | Hidden layer activation | No saturation for positive inputs; replaced sigmoid as the default hidden activation |
| Hard sigmoid | Mobile/embedded inference | Piecewise linear, e.g. clip(x/6 + 0.5, 0, 1). No exp(), faster on CPUs |
| Straight-through sigmoid | Discrete/binary latent variables (VQ-VAE style) | Forward: hard threshold at 0.5. Backward: sigmoid gradient. Enables discrete decisions with gradient flow |
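As a sketch of the straight-through variant (hard threshold at 0.5 in the forward pass, sigmoid gradient in the backward pass, plain NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def st_sigmoid_forward(x):
    """Forward pass: hard 0/1 decision (threshold sigma(x) at 0.5, i.e. x at 0)."""
    return (sigmoid(x) >= 0.5).astype(float)

def st_sigmoid_backward(x, grad_output):
    """Backward pass: pretend the forward was a plain sigmoid so gradients flow."""
    s = sigmoid(x)
    return grad_output * s * (1 - s)

x = np.array([-2.0, -0.1, 0.3, 4.0])
print(st_sigmoid_forward(x))                    # [0. 0. 1. 1.] — hard binary decisions
print(st_sigmoid_backward(x, np.ones_like(x)))  # smooth, nonzero gradients everywhere
```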
Historical Context
The logistic function was introduced by Pierre-François Verhulst in 1838 for modelling population growth. It entered neural networks in the 1940s–60s as the activation for perceptrons and became the standard activation through the backpropagation era (Rumelhart et al., 1986). For decades, “neural network” essentially meant “layers of sigmoid neurons.”
The vanishing gradient problem with sigmoid was identified by Hochreiter (1991) and became a central obstacle to training deep networks. The LSTM (Hochreiter & Schmidhuber, 1997) solved this specifically for recurrent networks by using sigmoid gates to control a linear memory cell — ironically using sigmoid’s saturation as a feature rather than a bug. The shift away from sigmoid as a hidden activation began with ReLU (Nair & Hinton, 2010; Glorot et al., 2011), which eliminated the vanishing gradient problem for feedforward networks. Today, sigmoid is rarely used as a hidden-layer activation but remains essential as a gating mechanism and for binary outputs.