
SiLU / Swish

Self-gated activation function: x · σ(x), where σ is the sigmoid function. Smooth and non-monotonic like GELU, but uses sigmoid gating instead of the Gaussian CDF. The activation inside SwiGLU, which is the feedforward block in LLaMA, Mistral, Gemma, and most post-2023 LLMs.

SiLU is “self-gated”: the input itself controls how much of itself passes through. The sigmoid σ(x) acts as a gate that outputs a value between 0 and 1. For large positive x, the gate is fully open (σ(x) ≈ 1) and SiLU behaves like the identity. For large negative x, the gate is nearly closed (σ(x) ≈ 0) and the output is near zero. So far, this sounds like ReLU.

The interesting part is what happens near zero. SiLU has a small negative region: it dips below zero around x ≈ -1.28 before coming back up. This non-monotonicity means the function provides a small negative “push” for moderately negative inputs before suppressing them. Empirically, this seems to help optimisation by creating smoother loss landscapes compared to ReLU’s sharp corner.

SiLU and GELU are nearly identical in shape; in fact, GELU is commonly approximated by the rescaled Swish x · σ(1.702x), which matches it to within about 0.01. The practical choice between them is driven more by convention and how they compose with other components. SiLU’s claim to fame is its pairing with Gated Linear Units: in SwiGLU (Shazeer, 2020), the feedforward block computes SiLU(xW1) ⊙ (xW2), where the element-wise product with a second linear projection gives the network an additional multiplicative interaction. SwiGLU consistently outperforms plain GELU feedforward blocks.

Definition (SiLU/Swish):

SiLU(x) = x · σ(x) = x / (1 + e^(-x))

Generalised Swish (with learnable β, rarely used):

Swish_β(x) = x · σ(βx)

When β = 1, this is SiLU. As β → ∞, it converges to ReLU.
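The limiting behaviour is easy to check numerically. A quick sketch (function names are hypothetical; uses a numerically stable sigmoid so large β doesn't overflow):

```python
import numpy as np

def swish(x, beta=1.0):
    """Generalised Swish: x * sigmoid(beta * x)."""
    z = beta * x
    e = np.exp(-np.abs(z))  # exponent always <= 0, so no overflow
    sig = np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
    return x * sig

x = np.linspace(-5.0, 5.0, 101)
relu = np.maximum(x, 0.0)

# The gap to ReLU shrinks roughly like 1/beta
for beta in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(swish(x, beta) - relu))
    print(f"beta={beta:>5}: max |Swish - ReLU| = {gap:.4f}")
```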

Gradient:

d/dx SiLU(x) = σ(x) + x · σ(x)(1 - σ(x)) = σ(x)(1 + x(1 - σ(x)))

Key values: SiLU(0) = 0; minimum ≈ -0.278 at x ≈ -1.28.
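These key values can be verified with a quick grid search over the negative half-line, where the dip lives (a minimal sketch, NumPy assumed):

```python
import numpy as np

# Dense grid over the negative half-line
x = np.linspace(-5.0, 0.0, 500_001)
y = x / (1.0 + np.exp(-x))  # SiLU(x) = x * sigmoid(x)

i = np.argmin(y)
print(f"minimum = {y[i]:.3f} at x = {x[i]:.2f}")  # minimum = -0.278 at x = -1.28
```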

SwiGLU feedforward block:

SwiGLU(x) = SiLU(xW1) ⊙ (xW2)

Note: SwiGLU uses 3 weight matrices (W1, W2, W_out) instead of the standard FFN’s 2, so the hidden dimension is typically scaled by 2/3 to keep the parameter count constant.
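The 2/3 bookkeeping is easy to sanity-check. A sketch with d_model = 4096 (the LLaMA-7B width; the round-up-to-256 rule is one common convention, used here for illustration):

```python
d_model = 4096

# Standard FFN: two matrices, hidden dim 4 * d_model
ffn_params = 2 * d_model * (4 * d_model)

# SwiGLU: three matrices, hidden dim scaled by 2/3
d_ff = int(2 / 3 * 4 * d_model)     # 10922
d_ff = 256 * ((d_ff + 255) // 256)  # round up to a multiple of 256 -> 11008
swiglu_params = 3 * d_model * d_ff

print(ffn_params, swiglu_params)  # 134217728 135266304 (within ~1%)
```

The rounded hidden dimension, 11008, is in fact the FFN width LLaMA-7B ships with.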

```python
import torch
import torch.nn.functional as F

B, T, d_model = 2, 16, 512  # example batch, sequence, and model sizes
x = torch.randn(B, T, d_model)  # (B, T, d_model)

# Functional form
out = F.silu(x)  # (B, T, d_model)

# As a module
layer = torch.nn.SiLU()

# ── SwiGLU feedforward block (as in LLaMA) ─────────────────────
# This is the main reason SiLU matters in modern LLMs.
class SwiGLU(torch.nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_ff, bias=False)
        self.w2 = torch.nn.Linear(d_model, d_ff, bias=False)
        self.w_out = torch.nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):  # (B, T, d_model)
        gate = F.silu(self.w1(x))     # (B, T, d_ff) — SiLU-gated projection
        up = self.w2(x)               # (B, T, d_ff) — ungated projection
        return self.w_out(gate * up)  # (B, T, d_model) — element-wise gate, then project back

# WARNING: SwiGLU has 3 matrices instead of 2. When matching parameter counts
# with a standard FFN (d_ff = 4 * d_model), use d_ff = (2/3) * 4 * d_model,
# often rounded to a multiple of 256 for hardware efficiency.
```
```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1.0 / (1.0 + np.exp(-x)),
                    np.exp(x) / (1.0 + np.exp(x)))

def silu(x):
    """SiLU/Swish: x * sigmoid(x). x: any shape array."""
    return x * sigmoid(x)

def silu_backward(x, grad_output):
    """Gradient of SiLU."""
    s = sigmoid(x)
    return grad_output * (s + x * s * (1 - s))  # σ(x)(1 + x(1 - σ(x)))

def swiglu(x, W1, W2, W_out):
    """
    SwiGLU feedforward block.
    x: (B, d_model), W1/W2: (d_model, d_ff), W_out: (d_ff, d_model)
    """
    gate = silu(x @ W1)       # (B, d_ff)
    up = x @ W2               # (B, d_ff)
    return (gate * up) @ W_out  # (B, d_model)
```
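The analytic gradient formula can be checked against central finite differences; a self-contained sketch (the functions are re-implemented inline so the snippet runs on its own):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def silu_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))  # σ(x)(1 + x(1 - σ(x)))

# Central finite differences vs the closed-form gradient
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
eps = 1e-5
fd = (silu(x + eps) - silu(x - eps)) / (2 * eps)
print(np.max(np.abs(fd - silu_grad(x))))  # tiny (~1e-10): the formula matches
```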
  • SwiGLU in LLMs (LLaMA 1/2/3, Mistral, Gemma, PaLM): SiLU is the gating activation in the SwiGLU feedforward block, now the standard FFN design for large language models
  • EfficientNet / EfficientNetV2: Swish activation throughout the convolutional backbone, one of the first large-scale uses
  • Transformer feedforward variants (transformer/): SwiGLU is the modern replacement for ReLU/GELU FFN blocks, covered in the SwiGLU section
  • Diffusion U-Nets (Stable Diffusion XL): SiLU activations in the residual blocks of the denoising backbone
  • Mobile architectures (MobileNetV3): hard-swish (x · ReLU6(x+3)/6) is a hardware-friendly approximation
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| GELU | Transformer models without gated FFN (BERT, GPT-2, ViT) | Nearly identical shape; standard when using plain 2-matrix FFN blocks |
| ReLU | CNNs, speed-critical inference, simple MLPs | Fastest to compute; no smooth gating but works well in non-Transformer architectures |
| Hard Swish | Mobile/edge deployment | x · ReLU6(x+3)/6: piecewise-linear approximation, no exp() needed |
| GeGLU | Alternative gated FFN (some research models) | Uses GELU instead of SiLU in the gate; similar performance, less widely adopted |
| ReGLU | Simpler gated FFN | Uses ReLU in the gate; slightly worse than SwiGLU but faster |
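How close is hard swish to the real thing? A quick measurement of the gap (a minimal sketch, NumPy assumed):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def hard_swish(x):
    # x * ReLU6(x + 3) / 6, where ReLU6(z) = min(max(z, 0), 6)
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

x = np.linspace(-8.0, 8.0, 2001)
gap = np.max(np.abs(silu(x) - hard_swish(x)))
print(f"max |SiLU - hard swish| = {gap:.3f}")  # largest near x = ±3, where the clip kicks in
```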

Swish was proposed by Ramachandran et al. (2017, “Searching for Activation Functions”) at Google Brain, discovered through automated search over activation function design spaces using reinforcement learning. The same function was independently proposed as SiLU (Sigmoid Linear Unit) by Elfwing et al. (2018). The name “SiLU” is now the standard in PyTorch.

Swish gained traction when EfficientNet (Tan & Le, 2019) used it throughout, showing consistent gains over ReLU in image classification. The function’s real impact came when Shazeer (2020, “GLU Variants Improve Transformer”) showed that pairing SiLU with Gated Linear Units (SwiGLU) substantially improved Transformer feedforward blocks. This combination was adopted by PaLM (Chowdhery et al., 2022) and then LLaMA (Touvron et al., 2023), making SwiGLU the de facto standard for LLM architectures. The learnable β parameter from the original Swish paper is almost never used — β = 1 (plain SiLU) works fine.