Quick answer
Speculative decoding is an inference trick: a small draft model guesses the next several tokens, and the big model verifies them in one batched pass instead of generating them one at a time. Net result: typically 2-3x faster inference at the same quality. It's likely part of the reason Claude and ChatGPT feel so much snappier in 2026 versus 2024.
Without speculative decoding, generating each token requires a full forward pass through the frontier model. For a 400B-parameter model, that's expensive. Speculative decoding flips this: a small draft model (1B parameters or smaller) proposes the next 4-8 tokens, and the big model verifies them in a single pass. When the small model guesses right (about 70-80% of the time), you get those tokens almost for free.
Why it works
- Big models are slow because each token requires a full forward pass through the model
- Small models are fast but less accurate
- Many next tokens are "obvious" (the, of, in, a), and a small model gets them right easily
- Hard tokens still need the big model's judgement
- Speculative decoding combines the two: small-model speed for easy tokens, big-model accuracy for hard ones
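To make the combination concrete, here is a minimal toy sketch of the greedy variant. Everything in it is a stand-in invented for illustration: `big_model_next` and `draft_model_next` are hypothetical functions that return a single "most likely" next word, not real models, and the "batched" verification is simulated with a loop that is counted as one big-model pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]

# Hypothetical toy "models": each maps a context to one next word.
TEXT = "the result of the experiment is in the appendix of a report".split()


def big_model_next(context: List[str]) -> str:
    """Stand-in for the big model: always produces the reference word."""
    return TEXT[len(context) % len(TEXT)]


def draft_model_next(context: List[str]) -> str:
    """Stand-in for the draft model: right on short words, wrong on long ones."""
    word = TEXT[len(context) % len(TEXT)]
    return word if len(word) <= 3 else "the"


def plain_decode(big: NextToken, n_tokens: int) -> Tuple[List[str], int]:
    """Ordinary decoding: one big-model pass per emitted token."""
    out: List[str] = []
    for _ in range(n_tokens):
        out.append(big(out))
    return out, n_tokens


def speculative_decode(
    big: NextToken, draft: NextToken, n_tokens: int, draft_len: int = 4
) -> Tuple[List[str], int]:
    """Greedy speculative decoding; returns (tokens, big-model passes)."""
    out: List[str] = []
    big_passes = 0
    while len(out) < n_tokens:
        # 1. The draft model proposes draft_len tokens, one at a time (cheap).
        proposal: List[str] = []
        for _ in range(draft_len):
            proposal.append(draft(out + proposal))
        # 2. The big model checks every position. A real system does this in
        #    one batched forward pass; the toy loops but counts it as one pass.
        big_passes += 1
        checked = [big(out + proposal[:i]) for i in range(draft_len + 1)]
        # 3. Accept the longest prefix on which both models agree.
        n_ok = 0
        while n_ok < draft_len and proposal[n_ok] == checked[n_ok]:
            n_ok += 1
        # 4. Emit the accepted tokens plus one token from the big model itself
        #    (a correction at the first mismatch, or a bonus token otherwise).
        out.extend(checked[: n_ok + 1])
    return out[:n_tokens], big_passes


plain_tokens, plain_passes = plain_decode(big_model_next, len(TEXT))
spec_tokens, spec_passes = speculative_decode(
    big_model_next, draft_model_next, len(TEXT)
)
print(plain_tokens == spec_tokens, plain_passes, spec_passes)
```

Every emitted token is one the big model would have produced anyway, so the two outputs match; the only thing that changes is how many big-model passes it takes to get there.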
How it works in practice
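A draft model (for example, a Haiku-sized model for Claude or GPT-5 mini for GPT-5) generates 4-8 candidate tokens. The big model then runs once over all of them, checking each position. Draft tokens are accepted left to right for as long as they are consistent with the big model's probability distribution; at the first rejection, that token and everything after it are discarded, and the big model supplies its own token instead. Accepted tokens are emitted immediately. The math works out roughly like this: if each draft token is accepted about 70-80% of the time, a 4-8 token draft yields roughly 3 to 4 tokens per big-model pass on average, and after paying for the draft passes that comes to a 2-3x net speed-up.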
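The arithmetic behind that speed-up can be sketched as follows. This is a back-of-the-envelope model, not a measurement: it assumes each draft token is accepted independently with the same probability, and `draft_cost_ratio` (the cost of one draft pass relative to one big-model pass) is a made-up parameter for illustration.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per big-model pass.

    Assumes each draft token is accepted independently with probability
    accept_rate. The pass always yields one token from the big model, plus
    the accepted draft prefix: 1 + a + a^2 + ... + a^draft_len.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(
    accept_rate: float, draft_len: int, draft_cost_ratio: float
) -> float:
    """Rough speed-up over plain decoding.

    draft_cost_ratio is an assumed figure: the time of one draft-model pass
    divided by the time of one big-model pass.
    """
    cost_per_round = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(accept_rate, draft_len) / cost_per_round


# Illustrative settings only; 0.05 is an assumed draft cost, not a benchmark.
for accept_rate, draft_len in [(0.7, 4), (0.8, 4), (0.7, 8), (0.8, 8)]:
    tokens = expected_tokens_per_pass(accept_rate, draft_len)
    speedup = estimated_speedup(accept_rate, draft_len, draft_cost_ratio=0.05)
    print(f"accept={accept_rate} draft_len={draft_len} "
          f"tokens/pass={tokens:.2f} speedup={speedup:.2f}x")
```

Note that longer drafts are not automatically better: each extra draft token costs a draft pass but is only useful if everything before it was accepted.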
What it doesn't change
- Output quality matches running the big model alone, because every emitted token is either verified or produced by the big model
- Cost per token can actually drop, because the big model runs fewer passes per emitted token and draft-model passes are cheap
- The speed-up is not uniform: reasoning-heavy responses, where the big model disagrees with the draft more often, see less of it
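The first and last points above can be checked on a toy example. The sketch below uses the standard accept-or-resample rule from the speculative sampling literature for a single token position; the distributions `p_big` and `q_draft` are invented numbers, and the function names are hypothetical rather than any library's API.

```python
import random
from typing import Dict

Dist = Dict[str, float]


def acceptance_probability(p_big: Dist, q_draft: Dist) -> float:
    """Chance that a token sampled from the draft survives verification."""
    tokens = set(p_big) | set(q_draft)
    return sum(min(p_big.get(t, 0.0), q_draft.get(t, 0.0)) for t in tokens)


def emitted_distribution(p_big: Dist, q_draft: Dist) -> Dist:
    """Exact distribution of the emitted token under accept-or-resample."""
    tokens = set(p_big) | set(q_draft)
    residual = {
        t: max(0.0, p_big.get(t, 0.0) - q_draft.get(t, 0.0)) for t in tokens
    }
    total = sum(residual.values())
    reject_prob = 1.0 - acceptance_probability(p_big, q_draft)
    return {
        t: min(p_big.get(t, 0.0), q_draft.get(t, 0.0))
        + (reject_prob * residual[t] / total if total > 0 else 0.0)
        for t in tokens
    }


def accept_or_resample(
    token: str, p_big: Dist, q_draft: Dist, rng: random.Random
) -> str:
    """Verify one drafted token; on rejection, resample from the residual."""
    # token was sampled from q_draft, so q_draft[token] is greater than zero.
    if rng.random() < min(1.0, p_big.get(token, 0.0) / q_draft[token]):
        return token
    candidates = list(p_big)
    weights = [max(0.0, p_big[t] - q_draft.get(t, 0.0)) for t in candidates]
    return rng.choices(candidates, weights=weights)[0]


# Invented distributions for one token position.
p_big = {"a": 0.6, "b": 0.3, "c": 0.1}
q_draft = {"a": 0.8, "b": 0.1, "c": 0.1}
print(acceptance_probability(p_big, q_draft))
print(emitted_distribution(p_big, q_draft))
```

The emitted distribution comes out equal to `p_big`, which is the sense in which quality is unchanged. The acceptance probability is the overlap between the two distributions, so the more the draft and the big model disagree, the fewer draft tokens survive and the smaller the speed-up.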
Why it took until 2026 to land
The idea is not new (it was published in papers from Google Research and DeepMind in 2022-2023), but production-grade implementation is hard. You need a draft model trained on similar data so its guesses line up with the big model's, careful KV-cache management so rejected tokens are rolled back cleanly, and infrastructure that batches verification efficiently. By 2026 the major labs appear to have it working. It's one of the under-discussed reasons AI feels faster this year.
If you're an API user, speculative decoding is invisible to you. It happens server-side. You just get faster responses for free. That's the whole point.
Bottom line
Speculative decoding is the quiet 2026 efficiency win. Same model quality, dramatically faster. If you've been thinking "Claude got snappier recently," this is part of why.

