AI Model Distillation Explained — How Small Models Get So Good

Quick answer

Distillation trains a small "student" model by having a large "teacher" model show it the answers. The student learns not just the final answer but the teacher's probability distribution over possible answers, which captures nuance. The result: small-tier models such as Claude Haiku, GPT-5 mini, and Gemini Flash reach roughly 80% of frontier quality at about 5% of the cost.

In 2024 the trade-off was stark: pay frontier prices for frontier quality, or pay cheap prices for noticeably worse outputs. By 2026 distillation has narrowed that gap dramatically. The cheapest tier of every frontier lab is now genuinely usable for most tasks.

How distillation actually works

The big model generates outputs (and full probability distributions) for millions of prompts
The small model is trained to match those distributions, not just the top answer
These "soft labels" carry more information than hard labels alone
Optional: distil the reasoning process too (chain-of-thought distillation)
Optional: combine with quantisation (lower numerical precision) for further size reduction
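
The matching step above can be sketched as a loss function. This is a minimal illustration of one common formulation from the knowledge-distillation literature, not necessarily what any particular lab uses. The temperature and alpha parameters, the function names, and the three-class toy logits are all assumptions made for the example; a real LLM applies the same idea per token over the whole vocabulary.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-label term with an ordinary hard-label term.

    temperature and alpha are illustrative hyperparameters (assumptions).
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft term on a comparable scale as T changes.
    soft_loss = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    # Standard cross-entropy against the single correct class.
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy logits over three candidate answers (made up for illustration).
teacher = [4.0, 2.5, -1.0]
student_close = [3.8, 2.4, -0.9]  # mimics the teacher's whole distribution
student_far = [0.0, 0.0, 3.0]     # disagrees with the teacher

loss_close = distillation_loss(teacher, student_close, hard_label=0)
loss_far = distillation_loss(teacher, student_far, hard_label=0)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

Training pushes the student towards the lower loss, so a student that reproduces the teacher's whole distribution is rewarded over one that only happens to get the top answer right.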

Why distillation works so well

A big model isn't just better at picking the right answer — it has a more nuanced sense of which wrong answers are close-to-right and which are way off. Distillation transfers that nuance. The student model ends up generalising better than it would if trained on hard labels alone.
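
A small worked example may make the "nuance" point concrete. In this sketch the class names and probabilities are invented for illustration. Two hypothetical students give the correct answer the same probability, so a hard label cannot tell them apart, but only one of them agrees with the teacher about which wrong answer is nearly right.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy of predicted probabilities against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Illustrative three-way classification of a photo of a cat (made-up numbers).
classes = ["cat", "lynx", "truck"]
hard_label = [1.0, 0.0, 0.0]        # only says "the answer is cat"
teacher_soft = [0.70, 0.25, 0.05]   # also says "lynx is a near miss, truck is not"

student_a = [0.70, 0.25, 0.05]      # shares the teacher's sense of near misses
student_b = [0.70, 0.05, 0.25]      # thinks "truck" is the plausible mistake

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_soft, student_a)
soft_b = cross_entropy(teacher_soft, student_b)

print(f"hard-label loss: A={hard_a:.4f}  B={hard_b:.4f}")
print(f"soft-label loss: A={soft_a:.4f}  B={soft_b:.4f}")
```

The hard-label losses are identical, while the soft-label loss is lower for student A. That difference is the extra training signal soft labels provide.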

What you give up

Edge cases: distilled models fail more often on rare or adversarial inputs
Long reasoning: extended thinking doesn't distil well, because small models struggle to sustain long reasoning chains
Multilingual: small models lose more performance on low-resource languages
Brand-new tasks: distilled models inherit the teacher's biases about which kinds of tasks matter

Where to use distilled models

High-volume simple tasks (summarisation, classification, extraction) — distilled models are great
Latency-sensitive apps — small models respond faster
Cost-sensitive RAG — small models are perfect for the "answer from retrieved docs" step
Edge deployment — distilled + quantised models run on phones
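
To see why quantisation matters for edge deployment, here is a rough back-of-the-envelope sketch. The 3-billion-parameter student is a hypothetical size chosen for illustration, and the estimate covers weights only: it ignores activations, caches, and runtime overhead.

```python
# Bytes needed to store one parameter at each numerical precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_size_gb(num_params, precision):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

student_params = 3_000_000_000  # hypothetical distilled model size (assumption)

for precision in ("fp32", "fp16", "int8", "int4"):
    size = weight_size_gb(student_params, precision)
    print(f"{precision}: {size:.1f} GB")
```

Under these assumptions, each halving of precision halves the weight footprint, which can be the difference between a model fitting in a phone's memory or not.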

For high-volume LLM apps, route ~80% of traffic to a distilled model (Haiku 4.5 / GPT-5 mini / Gemini Flash) and reserve the frontier model for the ~20% of requests that genuinely need it. Costs drop 60-80% with minimal quality loss.
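
The routing arithmetic can be checked with a short sketch. Everything here is illustrative: the task names come from the list above, the tier labels and function names are hypothetical, and the 5% relative cost is the rough figure quoted earlier in this article rather than a published price.

```python
# Task types treated as "simple" in this sketch (taken from the list above).
SIMPLE_TASKS = {"summarisation", "classification", "extraction"}

def pick_tier(task_type):
    """Very naive router: simple tasks go to the distilled tier."""
    return "distilled" if task_type in SIMPLE_TASKS else "frontier"

def blended_cost(distilled_share, distilled_relative_cost):
    """Average cost per request, relative to sending everything to frontier (1.0)."""
    frontier_share = 1.0 - distilled_share
    return distilled_share * distilled_relative_cost + frontier_share

def savings(distilled_share, distilled_relative_cost):
    """Fraction of spend saved compared with an all-frontier setup."""
    return 1.0 - blended_cost(distilled_share, distilled_relative_cost)

print(pick_tier("classification"))
print(pick_tier("multi-step legal analysis"))

# 80% of traffic on a model that costs about 5% of frontier (assumed figure).
print(f"savings at 5% relative cost: {savings(0.8, 0.05):.0%}")
# The same split with a pricier distilled tier at 25% of frontier cost.
print(f"savings at 25% relative cost: {savings(0.8, 0.25):.0%}")
```

With an 80/20 split, savings can never exceed 80%, because the frontier share is still paid in full. Under these assumed prices the result lands at roughly 76% and 60%, which is consistent with the 60-80% range above.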

Bottom line

Distillation is why the cheap-and-fast tier of every frontier lab is genuinely good in 2026. For most tasks you don't need the frontier model; you need the distilled one, properly routed.

