Speculative Decoding Explained — Why Claude and GPT Got 3x Faster

Quick answer

Speculative decoding is an inference technique: a small draft model guesses the next several tokens, and the big model verifies them all in a single batched pass instead of generating them one at a time. Net result: roughly 2-3x faster generation with the same output distribution as the big model alone. It's one of the reasons Claude and ChatGPT feel snappier in 2026 than they did in 2024.

Without speculative decoding, generating each token requires running the entire frontier model. For a 400B parameter model, that's expensive. Speculative decoding flips this: a small (1B or smaller) draft model proposes the next 4-8 tokens; the big model verifies them in a single pass. When the small model guesses right (it does about 70-80% of the time), you get those tokens for almost free.
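
To put rough numbers on "almost free", here is a minimal sketch of the usual back-of-envelope estimate. It assumes every drafted token is accepted independently with the same probability, which real traffic only approximates. The names `alpha` (acceptance rate), `gamma` (draft length) and `draft_cost_ratio` (cost of one draft-model step relative to one big-model pass) are illustrative, not taken from any vendor's system.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per big-model pass.

    Assumes each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`. The pass always yields at least one token,
    because the big model supplies its own token at the first rejection.
    """
    if alpha >= 1.0:
        return gamma + 1.0  # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Rough wall-clock speed-up over plain one-token-at-a-time decoding.

    One round costs `gamma` draft steps plus one big-model pass, measured
    in units of a single big-model pass.
    """
    round_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, gamma) / round_cost


if __name__ == "__main__":
    for alpha in (0.7, 0.75, 0.8):
        for gamma in (4, 8):
            print(alpha, gamma, round(estimated_speedup(alpha, gamma, 0.05), 2))
```

With `alpha=0.75` and `gamma=4` the estimate is about 3.05 tokens per big-model pass. If a draft step costs 5% of a big-model pass (an assumed figure), that comes to roughly a 2.5x speed-up.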

Why it works

Big models are slow because each token requires a full forward pass, and every pass has to read the model's weights from memory
Small models are fast but less accurate
Many of the next tokens are "obvious" (the, of, in, a), and the small model gets them right easily
Hard tokens still need the big model's judgement
Speculative decoding combines the two: small-model speed for easy tokens, big-model accuracy for hard ones

How it works in practice

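A draft model (for example, a Haiku-sized model drafting for a larger Claude model, or GPT-5 mini for GPT-5) generates 4-8 candidate tokens. The big model then runs once over all of them and checks each one, left to right, against its own probability distribution. Tokens that pass are emitted immediately. At the first token that fails, the big model substitutes its own choice and the rest of the draft is thrown away, so every pass yields at least one token. The math works out roughly like this: at a 70-80% per-token acceptance rate, a 4-8 token draft yields about 3-4 tokens per big-model pass on average, which comes to a net speed-up of 2-3x once the cost of running the draft model is subtracted.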
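
The draft-then-verify round described above can be sketched in a few lines. This is a toy under stated assumptions, not any lab's implementation: `draft_model` and `target_model` are hypothetical callables that map a token list to a `{token: probability}` dict, and the accept/reject rule is the standard speculative sampling rule from the published papers.

```python
import random


def sample(dist, rng):
    """Draw one token from a {token: probability} dict."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def speculative_step(context, draft_model, target_model, gamma, rng):
    """Return the tokens emitted by one draft-then-verify round."""
    # 1. Draft: the small model proposes `gamma` tokens one after another.
    drafted, q_dists = [], []
    ctx = list(context)
    for _ in range(gamma):
        q = draft_model(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. Verify: a real system scores all gamma + 1 positions in ONE batched
    #    forward pass of the big model; this toy calls it once per position.
    p_dists = [target_model(list(context) + drafted[:i]) for i in range(gamma + 1)]

    # 3. Accept or reject, left to right.
    emitted = []
    for tok, q, p in zip(drafted, q_dists, p_dists):
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            emitted.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and
        # discard every later drafted token.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        emitted.append(sample(residual, rng))
        return emitted

    # Every draft was accepted: the same pass also yields one bonus token.
    emitted.append(sample(p_dists[gamma], rng))
    return emitted
```

When the two models agree exactly, every round emits `gamma + 1` tokens. When they disagree completely, every round emits a single token from the big model, which is the no-speed-up worst case.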

What it doesn't change

Output quality is identical to running the big model alone, because every drafted token is checked against the big model's distribution before it is emitted
Cost per token can actually drop, since each big-model pass yields several tokens and draft-model passes are cheap
Reasoning-heavy responses, where the big model disagrees with the draft more often, see less speed-up
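
The "identical quality" claim can be checked empirically on a toy example. The sketch below assumes the standard accept-or-resample rule and uses made-up three-token distributions `p` (big model) and `q` (draft model). It draws many verified tokens and compares their frequencies with `p`.

```python
import random
from collections import Counter


def verified_sample(p, q, rng):
    """Draft one token from q, then accept it or resample so the result follows p."""
    tokens = list(p)
    tok = rng.choices(tokens, weights=[q[t] for t in tokens], k=1)[0]
    if rng.random() < min(1.0, p[tok] / q[tok]):
        return tok  # the draft token survives verification
    # Rejected: draw from the part of p that q under-represents.
    residual = [max(0.0, p[t] - q[t]) for t in tokens]
    return rng.choices(tokens, weights=residual, k=1)[0]


def empirical_distribution(p, q, n, seed=0):
    """Frequencies of verified tokens over n independent draws."""
    rng = random.Random(seed)
    counts = Counter(verified_sample(p, q, rng) for _ in range(n))
    return {t: counts[t] / n for t in p}


if __name__ == "__main__":
    p = {"the": 0.6, "a": 0.3, "cat": 0.1}  # big model (illustrative numbers)
    q = {"the": 0.3, "a": 0.5, "cat": 0.2}  # draft model (illustrative numbers)
    print(empirical_distribution(p, q, 100_000))
```

The printed frequencies should sit close to `p`, not `q`, even though every candidate token came from the draft model first.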

Why it took until 2026 to land

The idea isn't new (papers from Google Research and DeepMind, 2022-2023), but production-grade implementation is hard. You need a draft model trained on similar data, careful KV-cache management, and infrastructure that batches verification efficiently. By 2026 all the major labs have it working. It's one of the under-discussed reasons AI feels faster this year.
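
To give a feel for the KV-cache part, here is a deliberately simplified sketch. `ToyKVCache` and `verify_and_rollback` are hypothetical names, and the cache stores plain tokens where a real one would hold key/value tensors per layer. The point is only that verification writes entries for every drafted position, and the entries for rejected positions have to be rolled back before the next round.

```python
class ToyKVCache:
    """Stand-in for a per-sequence KV cache: one entry per token position."""

    def __init__(self):
        self.entries = []

    def extend(self, tokens):
        # A real cache would append key/value tensors; we append the token.
        self.entries.extend(tokens)

    def truncate(self, length):
        # Drop the entries for every position at or beyond `length`.
        del self.entries[length:]


def verify_and_rollback(cache, drafted, num_accepted):
    """Cache all drafted positions during verification, then drop the rejected tail."""
    start = len(cache.entries)
    cache.extend(drafted)                  # verification pass touches every draft
    cache.truncate(start + num_accepted)   # keep only the accepted prefix
    return cache.entries


if __name__ == "__main__":
    cache = ToyKVCache()
    cache.extend(["The", "cat"])
    print(verify_and_rollback(cache, ["sat", "on", "a", "mat"], 2))
```

In a real serving stack the draft model's cache needs the same rollback, and doing this across many concurrent requests is part of what makes the engineering hard.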

If you're an API user, speculative decoding is invisible to you. It happens server-side. You just get faster responses for free. That's the whole point.

Bottom line

Speculative decoding is the quiet 2026 efficiency win. Same model quality, dramatically faster. If you've been thinking "Claude got snappier recently," this is part of why.
