Mixture of Depths (MoD) Explained — The New AI Architecture Beating MoE

Quick answer

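Mixture of Depths (MoD) is a Transformer architecture in which a learned router decides, per token and per layer, whether that token goes through the layer's full computation or bypasses it on the residual connection. Easy tokens (filler words, predictable continuations) skip layers; hard tokens (reasoning, key facts) get the full depth. Net: roughly 30% fewer FLOPs per token at about the same quality, though the exact saving depends on how many layers are routed and how many tokens each one admits. MoD is being adopted in 2026 frontier models, quietly, because architectural details rarely make headlines.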

Most frontier models in 2024 used dense Transformers (every token uses every layer) or Mixture of Experts (every token uses every layer but only some "experts" per layer). MoD adds a third dimension: dynamic depth. Different tokens can go through different numbers of layers.

How MoD differs from MoE

MoE: every token goes through every layer, but only a subset of experts within each layer
MoD: a token can skip whole layers entirely, passing through unchanged on the residual connection
MoE saves compute width-wise (fewer active parameters per layer); MoD saves it depth-wise (fewer layers per token)
You can combine them: MoD + MoE = "Mixture of Depths and Experts" (MoDE), which is what most 2026 frontier models use
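
To make the contrast concrete, here is a minimal sketch of the two routing decisions for a single layer. The function names, tokens, and scores are made up for illustration; real routers are learned and operate on hidden states, not hand-written numbers.

```python
# Toy contrast between MoE-style and MoD-style routing for one layer.
# All names and numbers here are hypothetical illustrations.

def moe_route(expert_scores):
    """MoE: every token is processed; the router only picks WHICH expert."""
    return [max(range(len(s)), key=lambda e: s[e]) for s in expert_scores]

def mod_route(token_scores, capacity):
    """MoD: the router picks WHETHER each token is processed at all.

    Only the `capacity` highest-scoring tokens go through the layer;
    the rest skip it and are carried forward unchanged.
    """
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    chosen = set(ranked[:capacity])
    return [i in chosen for i in range(len(token_scores))]

tokens = ["The", "answer", "to", "the", "problem", "is"]

# One row of made-up expert scores per token (3 experts).
expert_scores = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.3, 0.3, 0.4],
]
# One made-up "does this token need the layer?" score per token.
token_scores = [0.05, 0.90, 0.10, 0.07, 0.95, 0.60]

experts = moe_route(expert_scores)
processed = mod_route(token_scores, capacity=3)

for tok, e, p in zip(tokens, experts, processed):
    print(f"{tok:8s} MoE -> expert {e}   MoD -> {'process' if p else 'skip'}")
```

In the MoE column every token gets an expert, so every token costs compute. In the MoD column only three of the six tokens are processed, which is where the depth-wise saving comes from.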

Why it works

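Not every token needs the same processing. "The cat sat on the": the next token is probably "mat" or "floor" and doesn't require deep reasoning. "The answer to the optimisation problem is": that needs every layer. MoD trains a small router alongside the rest of the model. At each routed layer the router scores every token and sends only the highest-scoring ones through the full computation; the rest pass through unchanged.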
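
The routing idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used by any particular model: `toy_block` stands in for the expensive attention and MLP computation, `router_w` is a hypothetical router weight vector, and the hidden states are invented. Scaling the update by the router score is one common way to keep the router on the gradient path during training.

```python
# Minimal sketch of one MoD-style layer on plain Python lists.
# `toy_block`, the router weights, and the hidden states are all hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def toy_block(h):
    """Stand-in for the expensive attention + MLP computation."""
    return [2.0 * x for x in h]

def mod_layer(hidden, router_w, capacity):
    """Apply toy_block only to the `capacity` highest-scoring tokens.

    Selected tokens get  h + score * toy_block(h)  (a residual update
    scaled by the router score). Unselected tokens are returned
    unchanged, i.e. they skip the layer.
    """
    scores = [dot(h, router_w) for h in hidden]
    ranked = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:capacity])
    out = []
    for i, h in enumerate(hidden):
        if i in selected:
            update = toy_block(h)
            out.append([x + scores[i] * u for x, u in zip(h, update)])
        else:
            out.append(list(h))
    return out, sorted(selected)

hidden = [
    [1.0, 0.0],   # token 0
    [0.0, 1.0],   # token 1
    [0.5, 0.5],   # token 2
    [0.0, 0.25],  # token 3
]
router_w = [0.0, 1.0]  # this toy router scores tokens by their second feature

out, selected = mod_layer(hidden, router_w, capacity=2)
print("processed tokens:", selected)
print("outputs:", out)
```

Because `capacity` is fixed ahead of time, the amount of work per layer is known in advance even though the identity of the processed tokens changes from input to input.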

Practical impact

Latency: ~30% lower per-token time on real workloads
Cost: ~25% cheaper inference at scale
Quality: matches or slightly beats dense models of the same parameter count
Long-context efficiency: MoD especially helps on long contexts because most tokens are "easy" filler
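
For a rough sense of where savings of this size can come from, the back-of-envelope estimate below relates per-token compute to two knobs: the fraction of layers that use routing and the fraction of tokens each routed layer admits. The function and its settings are hypothetical, and the model is deliberately crude: it assumes every layer costs the same, that cost scales linearly with admitted tokens, and that the router is free.

```python
def relative_flops(routed_layer_fraction, capacity_fraction):
    """Per-token compute relative to a dense model (dense = 1.0).

    Assumes every layer costs the same, that a routed layer's cost scales
    linearly with the fraction of tokens it admits, and that the router
    itself is free. Real attention cost is not linear in token count, so
    treat the result as a rough estimate only.
    """
    dense_part = 1.0 - routed_layer_fraction
    routed_part = routed_layer_fraction * capacity_fraction
    return dense_part + routed_part

# Hypothetical settings, chosen only to show the trade-off.
for routed, cap in [(0.5, 0.5), (0.5, 0.4), (0.5, 0.125), (1.0, 0.7)]:
    rel = relative_flops(routed, cap)
    print(f"routed layers {routed:.0%}, capacity {cap:.1%}: "
          f"{rel:.2f}x dense compute ({1 - rel:.0%} saved)")
```

Under these assumptions, routing half the layers at 40% capacity gives about 0.70x dense compute, which is in the same range as the ~30% figure quoted above. Measured latency and cost depend on hardware, batching, and implementation details that this estimate ignores.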

Who uses it

Google DeepMind published the foundational paper, "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models", in 2024. By 2026 it's broadly adopted: Gemini 3.5, Llama 4 Behemoth, and reportedly Opus 4.8 all use some MoD variant. OpenAI hasn't published architecture details, but its inference cost drops are consistent with similar techniques.

MoD is invisible to API users. You don't configure it or interact with it. It's just one of the architectural reasons modern frontier models are cheaper and faster than 2024 equivalents.

Bottom line

MoD is one of the quiet architectural reasons AI got cheaper and faster in 2026. It works by spending less compute on easy tokens. Combined with MoE and speculative decoding, it helps explain why your AI bill is lower than a year ago.

