Quick answer

Distillation is training a small "student" model by having a big "teacher" model show it the answers. The student learns not just the final answer but the teacher's probability distribution over possible answers — which captures nuance. Result: Claude Haiku, GPT-5 mini, Gemini Flash all reach roughly 80% of frontier quality at 5% of the cost.

In 2024 the trade-off was stark: pay frontier prices for frontier quality, or pay cheap prices for noticeably worse outputs. By 2026 distillation has narrowed that gap dramatically. The cheapest tier of every frontier lab is now genuinely usable for most tasks.

How distillation actually works

  • Big model generates outputs (and full probability distributions) for millions of prompts
  • Small model is trained to match those distributions — not just the top answer
  • The "soft labels" carry more information than hard labels alone
  • Optional: distil the reasoning process too (chain-of-thought distillation)
  • Optional: combine with quantisation (smaller numerical precision) for further size reduction

Why distillation works so well

A big model isn't just better at picking the right answer — it has a more nuanced sense of which wrong answers are close-to-right and which are way off. Distillation transfers that nuance. The student model ends up generalising better than it would if trained on hard labels alone.

What you give up

  • Edge cases: distilled models fail more often on rare or adversarial inputs
  • Long reasoning: extended thinking doesn't distil well — small models can't hold long chains
  • Multilingual: small models lose more performance on low-resource languages
  • Brand-new tasks: distilled models inherit teacher's biases for what kinds of tasks matter

Where to use distilled models

  • High-volume simple tasks (summarisation, classification, extraction) — distilled models are great
  • Latency-sensitive apps — small models respond faster
  • Cost-sensitive RAG — small models are perfect for the "answer from retrieved docs" step
  • Edge deployment — distilled + quantised models run on phones

For high-volume LLM apps, route ~80% of traffic to a distilled model (Haiku 4.5 / GPT-5 mini / Gemini Flash) and reserve the frontier model for the 20% that genuinely need it. Costs drop 60-80% with minimal quality loss.

Bottom line

Distillation is why the cheap-and-fast tier of every frontier lab is now genuinely good in 2026. For most tasks you don't need the frontier model — you need the distilled one, properly routed.