Speculative Decoding — Plain English Definition

Speculative decoding is an inference optimisation technique. Instead of generating one token at a time with the big model, a small "draft" model proposes the next 4-8 tokens; the big model then verifies them in a single pass. When the small model is right (about 70-80% of the time), those tokens come essentially for free. The result is roughly 2-3x faster inference at identical quality — same model weights, same output distribution, just less compute spent per token. Speculative decoding is one of the main reasons Claude and ChatGPT feel noticeably snappier in 2026 versus 2024. It's invisible to API users — entirely server-side. Works best on prose generation where most tokens are "easy" (the, of, in). Less speedup on reasoning-heavy responses where the big model disagrees more with the draft.

Read the full guide

What Is Speculative Decoding

Read the full guide

Tools that use this