
SiLU / Swish

Self-gated activation function: x · σ(x), where σ is the sigmoid function. Smooth and non-monotonic like GELU, but uses sigmoid gating instead of the Gaussian CDF. The activation inside SwiGLU, which is the feedforward block in LLaMA, Mistral, Gemma, and most post-2023 LLMs.

SiLU is “self-gated”: the input itself controls how much of itself passes through. The sigmoid σ(x) acts as a gate that outputs a value between 0 and 1. For large positive x, the gate is fully open (σ(x) ≈ 1) and SiLU behaves like the identity. For large negative x, the gate is nearly closed (σ(x) ≈ 0) and the output is near zero. So far, this sounds like ReLU.

The interesting part is what happens near zero. SiLU has a small negative region: it dips below zero around x ≈ -1.28 before coming back up. This non-monotonicity means the function provides a small negative “push” for moderately negative inputs before suppressing them. Empirically, this seems to help optimisation by creating smoother loss landscapes compared to ReLU’s sharp corner.

SiLU and GELU are nearly identical in shape; in fact, GELU is commonly approximated by the rescaled Swish x · σ(1.702x), which matches it to within about 0.01. The practical choice between them is driven more by convention and how they compose with other components. SiLU’s claim to fame is its pairing with Gated Linear Units: in SwiGLU (Shazeer, 2020), the feedforward block computes SiLU(xW1) ⊙ (xW2), where the element-wise product with a second linear projection gives the network an additional multiplicative interaction. SwiGLU consistently outperforms plain GELU feedforward blocks.

Definition (SiLU/Swish):

SiLU(x) = x · σ(x) = x / (1 + e^(-x))

Generalised Swish (with learnable β, rarely used):

Swish_β(x) = x · σ(βx)

When β = 1, this is SiLU. As β → ∞, it converges to ReLU.
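The limiting behaviour is easy to check numerically. A quick sketch (function names are hypothetical; uses a numerically stable sigmoid so large β doesn't overflow):

```python
import numpy as np

def swish(x, beta=1.0):
    """Generalised Swish: x * sigmoid(beta * x)."""
    z = beta * x
    e = np.exp(-np.abs(z))  # exponent always <= 0, so no overflow
    sig = np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
    return x * sig

x = np.linspace(-5.0, 5.0, 101)
relu = np.maximum(x, 0.0)

# The gap to ReLU shrinks roughly like 1/beta
for beta in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(swish(x, beta) - relu))
    print(f"beta={beta:>5}: max |Swish - ReLU| = {gap:.4f}")
```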

Gradient:

d/dx SiLU(x) = σ(x) + x · σ(x)(1 - σ(x)) = σ(x)(1 + x(1 - σ(x)))

Key values: SiLU(0) = 0; minimum ≈ -0.278 at x ≈ -1.28.
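These key values can be verified with a quick grid search over the negative half-line, where the dip lives (a minimal sketch, NumPy assumed):

```python
import numpy as np

# Dense grid over the negative half-line
x = np.linspace(-5.0, 0.0, 500_001)
y = x / (1.0 + np.exp(-x))  # SiLU(x) = x * sigmoid(x)

i = np.argmin(y)
print(f"minimum = {y[i]:.3f} at x = {x[i]:.2f}")  # minimum = -0.278 at x = -1.28
```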

SwiGLU feedforward block:

SwiGLU(x) = SiLU(xW1) ⊙ (xW2)

Note: SwiGLU uses 3 weight matrices (W1, W2, W_out) instead of the standard FFN’s 2, so the hidden dimension is typically scaled by 2/3 to keep the parameter count constant.
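The 2/3 bookkeeping is easy to sanity-check. A sketch with d_model = 4096 (the LLaMA-7B width; the round-up-to-256 rule is one common convention, used here for illustration):

```python
d_model = 4096

# Standard FFN: two matrices, hidden dim 4 * d_model
ffn_params = 2 * d_model * (4 * d_model)

# SwiGLU: three matrices, hidden dim scaled by 2/3
d_ff = int(2 / 3 * 4 * d_model)     # 10922
d_ff = 256 * ((d_ff + 255) // 256)  # round up to a multiple of 256 -> 11008
swiglu_params = 3 * d_model * d_ff

print(ffn_params, swiglu_params)  # 134217728 135266304 (within ~1%)
```

The rounded hidden dimension, 11008, is in fact the FFN width LLaMA-7B ships with.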

```python
import torch
import torch.nn.functional as F

B, T, d_model = 2, 16, 512  # example batch, sequence, and model sizes
x = torch.randn(B, T, d_model)  # (B, T, d_model)

# Functional form
out = F.silu(x)  # (B, T, d_model)

# As a module
layer = torch.nn.SiLU()

# ── SwiGLU feedforward block (as in LLaMA) ─────────────────────
# This is the main reason SiLU matters in modern LLMs.
class SwiGLU(torch.nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_ff, bias=False)
        self.w2 = torch.nn.Linear(d_model, d_ff, bias=False)
        self.w_out = torch.nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):  # (B, T, d_model)
        gate = F.silu(self.w1(x))     # (B, T, d_ff) — SiLU-gated projection
        up = self.w2(x)               # (B, T, d_ff) — ungated projection
        return self.w_out(gate * up)  # (B, T, d_model) — element-wise gate, then project back

# WARNING: SwiGLU has 3 matrices instead of 2. When matching parameter counts
# with a standard FFN (d_ff = 4 * d_model), use d_ff = (2/3) * 4 * d_model,
# often rounded to a multiple of 256 for hardware efficiency.
```
```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1.0 / (1.0 + np.exp(-x)),
                    np.exp(x) / (1.0 + np.exp(x)))

def silu(x):
    """SiLU/Swish: x * sigmoid(x). x: any shape array."""
    return x * sigmoid(x)

def silu_backward(x, grad_output):
    """Gradient of SiLU."""
    s = sigmoid(x)
    return grad_output * (s + x * s * (1 - s))  # σ(x)(1 + x(1 - σ(x)))

def swiglu(x, W1, W2, W_out):
    """
    SwiGLU feedforward block.
    x: (B, d_model), W1/W2: (d_model, d_ff), W_out: (d_ff, d_model)
    """
    gate = silu(x @ W1)       # (B, d_ff)
    up = x @ W2               # (B, d_ff)
    return (gate * up) @ W_out  # (B, d_model)
```
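The analytic gradient formula can be checked against central finite differences; a self-contained sketch (the functions are re-implemented inline so the snippet runs on its own):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def silu_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))  # σ(x)(1 + x(1 - σ(x)))

# Central finite differences vs the closed-form gradient
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
eps = 1e-5
fd = (silu(x + eps) - silu(x - eps)) / (2 * eps)
print(np.max(np.abs(fd - silu_grad(x))))  # tiny (~1e-10): the formula matches
```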
  • SwiGLU in LLMs (LLaMA 1/2/3, Mistral, Gemma, PaLM): SiLU is the gating activation in the SwiGLU feedforward block, now the standard FFN design for large language models
  • EfficientNet / EfficientNetV2: Swish activation throughout the convolutional backbone, one of the first large-scale uses
  • Transformer feedforward variants (transformer/): SwiGLU is the modern replacement for ReLU/GELU FFN blocks, covered in the SwiGLU section
  • Diffusion U-Nets (Stable Diffusion XL): SiLU activations in the residual blocks of the denoising backbone
  • Mobile architectures (MobileNetV3): hard-swish (x · ReLU6(x+3)/6) is a hardware-friendly approximation
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| GELU | Transformer models without gated FFN (BERT, GPT-2, ViT) | Nearly identical shape; standard when using plain 2-matrix FFN blocks |
| ReLU | CNNs, speed-critical inference, simple MLPs | Fastest to compute; no smooth gating but works well in non-Transformer architectures |
| Hard Swish | Mobile/edge deployment | x · ReLU6(x+3)/6: piecewise-linear approximation, no exp() needed |
| GeGLU | Alternative gated FFN (some research models) | Uses GELU instead of SiLU in the gate; similar performance, less widely adopted |
| ReGLU | Simpler gated FFN | Uses ReLU in the gate; slightly worse than SwiGLU but faster |
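How close is hard swish to the real thing? A quick measurement of the gap (a minimal sketch, NumPy assumed):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def hard_swish(x):
    # x * ReLU6(x + 3) / 6, where ReLU6(z) = min(max(z, 0), 6)
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

x = np.linspace(-8.0, 8.0, 2001)
gap = np.max(np.abs(silu(x) - hard_swish(x)))
print(f"max |SiLU - hard swish| = {gap:.3f}")  # largest near x = ±3, where the clip kicks in
```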

Swish was proposed by Ramachandran et al. (2017, “Searching for Activation Functions”) at Google Brain, discovered through automated search over activation function design spaces using reinforcement learning. The same function was independently proposed as SiLU (Sigmoid Linear Unit) by Elfwing et al. (2018). The name “SiLU” is now the standard in PyTorch.

Swish gained traction when EfficientNet (Tan & Le, 2019) used it throughout, showing consistent gains over ReLU in image classification. The function’s real impact came when Shazeer (2020, “GLU Variants Improve Transformer”) showed that pairing SiLU with Gated Linear Units (SwiGLU) substantially improved Transformer feedforward blocks. This combination was adopted by PaLM (Chowdhery et al., 2022) and then LLaMA (Touvron et al., 2023), making SwiGLU the de facto standard for LLM architectures. The learnable β parameter from the original Swish paper is almost never used — β = 1 (plain SiLU) works fine.