Sigmoid
The logistic function σ(x) = 1 / (1 + e^(−x)), mapping any real number to the range (0, 1). The original neural network activation function, now primarily used for gating mechanisms (LSTM, GRU), binary classification outputs, and anywhere you need a smooth switch between 0 and 1.
Intuition
Sigmoid squashes the entire real line into the interval (0, 1). Think of it as a “soft switch”: values far below zero map to ~0 (off), values far above zero map to ~1 (on), and the transition happens smoothly around zero. This makes it a natural choice anywhere you need a probability or a gate value.
The problem with sigmoid as a hidden-layer activation is saturation. When the input is far from zero in either direction, the output is nearly flat — the gradient is close to zero. In a deep network, multiplying many near-zero gradients together via the chain rule causes the vanishing gradient problem: early layers receive essentially no learning signal. This is why sigmoid was replaced by ReLU for hidden layers.
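This shrinkage is easy to see numerically. A minimal NumPy sketch (illustrative only, with unit weights rather than a real architecture) backpropagates through a chain of ten sigmoid activations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Backprop through a chain of 10 sigmoid activations (weights = 1 for clarity):
# each step multiplies the gradient by sigma'(x) = sigma(x)(1 - sigma(x)) <= 0.25.
x = 2.0
grad = 1.0
for _ in range(10):
    s = sigmoid(x)
    grad *= s * (1 - s)  # local derivative of sigmoid at this layer
    x = s                # activation feeds the next layer

print(grad)  # ~1e-7 — early layers receive almost no signal
```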
But sigmoid is still heavily used as a gate. In LSTMs, the forget gate, input gate, and output gate all use sigmoid — they need to produce values in (0, 1) that represent “how much to remember” or “how much to let through.” The saturation that hurts hidden layers actually helps gates: you want gates to commit to being near 0 or near 1 after training, and sigmoid’s flat regions make those decisions stable.
Sigmoid is also the output activation for binary classification. A single logit passed through sigmoid gives the predicted probability of the positive class. This is the foundation of logistic regression and remains the standard for binary prediction heads.
Definition:

σ(x) = 1 / (1 + e^(−x))

Key property — derivative in terms of itself:

σ′(x) = σ(x) (1 − σ(x))

The maximum gradient is σ′(0) = 0.25, at x = 0. This means every sigmoid layer can at most pass through 25% of the gradient — the root cause of vanishing gradients when stacking sigmoid layers.
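Both claims are easy to verify numerically; the sketch below (plain NumPy) checks the self-referential derivative against central finite differences and confirms the 0.25 peak:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 1201)  # grid includes x = 0
s = sigmoid(x)
analytic = s * (1 - s)        # derivative written in terms of sigmoid itself

# Central finite-difference approximation of the derivative
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(np.abs(analytic - numeric).max())  # tiny — the identity holds
print(analytic.max())                    # 0.25, attained at x = 0
```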
Relation to softmax: for two classes with logits z1 and z2:

softmax([z1, z2])_1 = e^z1 / (e^z1 + e^z2) = 1 / (1 + e^(−(z1 − z2))) = σ(z1 − z2)

Binary sigmoid classification is equivalent to 2-class softmax.
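A quick numerical check of the equivalence, using arbitrary example logits:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())  # shift logits for numerical stability
    return e / e.sum()

z1, z2 = 1.3, -0.7
p_softmax = softmax(np.array([z1, z2]))[0]  # P(class 1) from a 2-way softmax
p_sigmoid = sigmoid(z1 - z2)                # sigmoid of the logit difference
print(p_softmax, p_sigmoid)                 # identical up to float error
```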
Relation to tanh:

tanh(x) = 2σ(2x) − 1, equivalently σ(x) = (1 + tanh(x/2)) / 2

Tanh is just a rescaled and shifted sigmoid.
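Numerically, tanh(x) = 2σ(2x) − 1 holds to machine precision:

```python
import numpy as np

x = np.linspace(-4, 4, 101)
sigmoid_2x = 1.0 / (1.0 + np.exp(-2 * x))

# tanh as a rescaled, shifted sigmoid
rescaled = 2 * sigmoid_2x - 1
print(np.abs(np.tanh(x) - rescaled).max())  # ~machine epsilon
```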
```python
import torch
import torch.nn.functional as F

x = torch.randn(B, d_model)  # (B, d_model)

# ── As an activation ───────────────────────────────────────────
out = torch.sigmoid(x)  # (B, d_model) — values in (0, 1)

# ── Binary classification output ───────────────────────────────
logit = model(x)             # (B, 1) or (B,) — single raw score
prob = torch.sigmoid(logit)  # (B,) — predicted P(y=1)

# WARNING: For training, do NOT do sigmoid + BCE. Use
# F.binary_cross_entropy_with_logits(logit, target) which is numerically
# stable. Only apply sigmoid at inference time for interpretable probabilities.

# ── LSTM-style gating ──────────────────────────────────────────
forget_gate = torch.sigmoid(self.W_f(x) + self.U_f(h))  # (B, hidden_size) — in (0, 1)
cell = forget_gate * cell_prev + ...  # gate controls how much to keep

# ── Multi-label classification ─────────────────────────────────
logits = model(x)              # (B, n_labels) — independent per label
probs = torch.sigmoid(logits)  # (B, n_labels) — each in (0, 1), NOT summing to 1
loss = F.binary_cross_entropy_with_logits(logits, targets)  # per-label BCE
```

Manual Implementation
```python
import numpy as np

def sigmoid(x):
    """
    Numerically stable sigmoid.
    x: any shape array.
    Uses two branches to avoid overflow in exp():
      - x >= 0: 1 / (1 + exp(-x)) — exp(-x) is small, no overflow
      - x <  0: exp(x) / (1 + exp(x)) — exp(x) is small, no overflow
    """
    return np.where(x >= 0,
                    1.0 / (1.0 + np.exp(-x)),
                    np.exp(x) / (1.0 + np.exp(x)))

def sigmoid_backward(x, grad_output):
    """Gradient of sigmoid: σ(x)(1 - σ(x))."""
    s = sigmoid(x)
    return grad_output * s * (1 - s)  # max gradient = 0.25 at x = 0

def binary_cross_entropy_with_logits(logits, targets):
    """
    Stable BCE from raw logits (no explicit sigmoid needed).
    logits: (B,), targets: (B,) floats in {0, 1}.
    """
    return (np.maximum(0, logits) - logits * targets
            + np.log1p(np.exp(-np.abs(logits)))).mean()
```

Popular Uses
- LSTM/GRU gates: forget, input, and output gates all use sigmoid to produce values in (0, 1) — controls information flow through recurrent cells
- Binary classification (logistic regression, spam filters, real/fake discriminators): sigmoid on a single logit gives P(y = 1 | x)
- GAN discriminators (gans/): vanilla GAN discriminator outputs sigmoid probability of “real” — though modern GANs often skip sigmoid and use logit-based losses
- Multi-label classification: independent sigmoid per label when multiple labels can be true simultaneously (unlike softmax which assumes mutual exclusivity)
- Attention gating (gated attention, GLU): sigmoid gates control how much information flows through multiplicative interactions
- Diffusion timestep embeddings (diffusion/): sigmoid sometimes used in timestep conditioning networks
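To make the GLU-style gating above concrete, here is a minimal NumPy sketch; the projection matrices `W_value` and `W_gate` are made-up illustrative weights, not taken from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W_value, W_gate):
    """Gated linear unit: the sigmoid half multiplicatively gates the linear half."""
    return (x @ W_value) * sigmoid(x @ W_gate)

d_in, d_out = 8, 4
x = rng.standard_normal((2, d_in))            # (B, d_in) batch of inputs
W_value = rng.standard_normal((d_in, d_out))  # linear ("value") projection
W_gate = rng.standard_normal((d_in, d_out))   # gate projection, squashed to (0, 1)
out = glu(x, W_value, W_gate)                 # (2, d_out)
print(out.shape)
```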
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Softmax | Multi-class classification (mutually exclusive classes) | Outputs sum to 1; enforces competition between classes. Sigmoid treats each class independently |
| Tanh | When you need outputs in (−1, 1) instead of (0, 1) | Zero-centred; same saturation issues but better for hidden layers than sigmoid |
| ReLU | Hidden layer activation | No saturation for positive inputs; replaced sigmoid as the default hidden activation |
| Hard sigmoid | Mobile/embedded inference | Piecewise linear, e.g. clip(x/6 + 0.5, 0, 1). No exp(), faster on CPUs |
| Straight-through sigmoid | Discrete/binary latent variables (VQ-VAE style) | Forward: hard threshold at 0.5. Backward: sigmoid gradient. Enables discrete decisions with gradient flow |
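As a sketch of the straight-through variant (hard threshold at 0.5 in the forward pass, sigmoid gradient in the backward pass, plain NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def st_sigmoid_forward(x):
    """Forward pass: hard 0/1 decision (threshold sigma(x) at 0.5, i.e. x at 0)."""
    return (sigmoid(x) >= 0.5).astype(float)

def st_sigmoid_backward(x, grad_output):
    """Backward pass: pretend the forward was a plain sigmoid so gradients flow."""
    s = sigmoid(x)
    return grad_output * s * (1 - s)

x = np.array([-2.0, -0.1, 0.3, 4.0])
print(st_sigmoid_forward(x))                    # [0. 0. 1. 1.] — hard binary decisions
print(st_sigmoid_backward(x, np.ones_like(x)))  # smooth, nonzero gradients everywhere
```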
Historical Context
The logistic function was introduced by Pierre-François Verhulst in 1838 for modelling population growth. It entered neural networks in the 1940s–60s as the activation for perceptrons and became the standard activation through the backpropagation era (Rumelhart et al., 1986). For decades, “neural network” essentially meant “layers of sigmoid neurons.”
The vanishing gradient problem with sigmoid was identified by Hochreiter (1991) and became a central obstacle to training deep networks. The LSTM (Hochreiter & Schmidhuber, 1997) solved this specifically for recurrent networks by using sigmoid gates to control a linear memory cell — ironically using sigmoid’s saturation as a feature rather than a bug. The shift away from sigmoid as a hidden activation began with ReLU (Nair & Hinton, 2010; Glorot et al., 2011), which eliminated the vanishing gradient problem for feedforward networks. Today, sigmoid is rarely used as a hidden-layer activation but remains essential as a gating mechanism and for binary outputs.