Quick answer
Mixture of Depths (MoD) is a Transformer architecture in which a learned router decides, token by token, which layers to run. Easy tokens (filler words, predictable continuations) skip layers; hard tokens (reasoning, key facts) get the full depth. Net: roughly 30% fewer FLOPs per token at comparable quality. MoD is being adopted in 2026 frontier models, but quietly, because architectural details rarely make headlines.
Most frontier models in 2024 used dense Transformers (every token uses every layer) or Mixture of Experts (every token uses every layer but only some "experts" per layer). MoD adds a third dimension: dynamic depth. Different tokens can go through different numbers of layers.
How MoD differs from MoE
- MoE: every token goes through every layer, but only a subset of experts within each layer
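- MoD: any token can skip a whole layer and pass through it unchanged
- MoE saves compute across width (only a fraction of each layer's parameters is active per token); MoD saves compute across depth (a token runs through fewer layers)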
- You can combine them: MoD + MoE = "Mixture of Depths and Experts" (MoDE), which is what most 2026 frontier models use
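One way to see the difference is to look at which axis the top-k selection runs over. The snippet below is a minimal sketch in NumPy, not any model's actual routing code; the function names `moe_select` and `mod_select` and the toy scores are assumptions made for illustration.

```python
import numpy as np


def moe_select(expert_logits, experts_per_token):
    """MoE-style choice: for each token (row), pick its highest-scoring experts.

    expert_logits has shape (n_tokens, n_experts). Every token is still
    processed by the layer; only the experts it uses differ.
    """
    return np.argsort(expert_logits, axis=1)[:, -experts_per_token:]


def mod_select(token_scores, tokens_per_layer):
    """MoD-style choice: for one layer, pick the tokens it will process.

    token_scores has shape (n_tokens,). Tokens that are not returned
    skip this layer entirely.
    """
    return np.sort(np.argsort(token_scores)[-tokens_per_layer:])


# Toy scores (hypothetical values): 2 tokens x 3 experts, then 4 tokens.
print(moe_select(np.array([[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]]), 1))
print(mod_select(np.array([0.2, 0.8, 0.1, 0.5]), 2))
```

In the MoE call the result has one row per token, so no token is dropped. In the MoD call the result is a shorter list of token positions, and everything outside that list bypasses the layer.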
Why it works
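Not every token needs the same processing. "The cat sat on the": the next token is probably "mat" or "floor" and doesn't require deep reasoning. "The answer to the optimisation problem is": that needs every layer. MoD trains a small router alongside the model. At each routed layer the router scores every token; only the highest-scoring tokens are processed by the layer, and the rest pass through unchanged on the residual connection.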
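The routing idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the published implementation: `mod_block`, `router_w`, `block_fn` and `capacity` are hypothetical names, the router is a single linear projection, and `block_fn` stands in for a real attention-plus-MLP block.

```python
import numpy as np


def mod_block(x, router_w, block_fn, capacity):
    """Run block_fn on the top-scoring fraction of tokens; pass the rest through.

    x:         (seq_len, d_model) token activations
    router_w:  (d_model,) weights giving one scalar score per token (assumed form)
    block_fn:  maps (k, d_model) -> (k, d_model); stand-in for attention + MLP
    capacity:  fraction of tokens the block processes, in (0, 1]
    """
    seq_len = x.shape[0]
    k = max(1, int(round(capacity * seq_len)))
    scores = x @ router_w                      # one score per token
    chosen = np.sort(np.argsort(scores)[-k:])  # top-k tokens, kept in sequence order
    out = x.copy()                             # skipped tokens are left unchanged
    # Weight the block output by the router score, so that in a real training
    # setup the router would receive a gradient through this product.
    out[chosen] = x[chosen] + scores[chosen, None] * block_fn(x[chosen])
    return out, chosen


# Toy usage: 4 tokens, 2 features, and a stand-in block that returns ones.
x = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 2.0]])
router_w = np.array([1.0, 0.0])
out, chosen = mod_block(x, router_w, lambda h: np.ones_like(h), capacity=0.5)
print(chosen)  # positions of the tokens that went through the block
print(out)
```

Because `capacity` fixes how many tokens are processed, the compute per layer is known in advance even though the identity of the processed tokens changes per input. Note that picking the top-k over a whole sequence looks at later tokens, so autoregressive decoding needs a causal substitute for this decision; the sketch does not cover that.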
Practical impact
- Latency: ~30% lower per-token time on real workloads
- Cost: ~25% cheaper inference at scale
- Quality: matches or slightly beats dense models of the same parameter count
- Long-context efficiency: MoD especially helps on long contexts because most tokens are "easy" filler
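A back-of-the-envelope estimate shows how figures of this size can arise. The sketch below assumes every layer costs the same per token and ignores router overhead and the quadratic part of attention, so it is a first-order estimate only. The function name and the example settings are illustrative and are not taken from any published model.

```python
def relative_block_flops(routed_fraction, capacity):
    """Layer FLOPs of a MoD model relative to a dense model of equal depth.

    routed_fraction: share of layers that use MoD routing, in [0, 1]
    capacity:        share of tokens each routed layer processes, in (0, 1]

    Unrouted layers cost 1 per token; routed layers cost `capacity` per token.
    """
    if not 0.0 <= routed_fraction <= 1.0:
        raise ValueError("routed_fraction must be in [0, 1]")
    if not 0.0 < capacity <= 1.0:
        raise ValueError("capacity must be in (0, 1]")
    return (1.0 - routed_fraction) + routed_fraction * capacity


# Illustrative setting: half the layers routed, each keeping 40% of tokens.
print(f"{relative_block_flops(0.5, 0.4):.2f}")  # prints 0.70
```

Under these assumptions, routing half the layers at 40% capacity gives about 0.70 of the dense cost, which is in line with a saving of roughly 30%. Lower capacities or more routed layers push the ratio down further, at some risk to quality.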
Who uses it
DeepMind published the foundational paper, "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models", in 2024. By 2026 it's broadly adopted: Gemini 3.5, Llama 4 Behemoth, and reportedly Opus 4.8 all use some MoD variant. OpenAI hasn't published architecture details, but its inference cost drops suggest similar techniques.
MoD is invisible to API users. You don't configure it or interact with it. It's just one of the architectural reasons modern frontier models are cheaper and faster than 2024 equivalents.
Bottom line
MoD is one of the quiet architectural reasons AI got cheaper and faster in 2026. It works by spending less compute on easy tokens. Combined with MoE and speculative decoding, it's why your AI bill is lower than it was a year ago.
