Techniques & Methods
Speculative Decoding
A trick that makes large language model inference roughly 2-3x faster by using a tiny model to guess tokens for a big one to verify.
Also known as: speculative sampling, draft model decoding
Speculative decoding is an inference optimisation technique. Instead of generating one token at a time with the big model, a small "draft" model proposes the next 4-8 tokens; the big model then verifies them in a single pass. It keeps the longest run of draft tokens it agrees with and substitutes its own token at the first disagreement. When the small model is right (about 70-80% of the time), those tokens come essentially for free.

The result is roughly 2-3x faster inference at identical quality: same model weights, same output distribution. The gain comes from needing fewer sequential passes through the big model per generated token, not from doing less arithmetic overall; total compute can even rise slightly, because rejected draft tokens are thrown away.

Speculative decoding is likely one of the reasons Claude and ChatGPT feel noticeably snappier in 2026 versus 2024. It is invisible to API users because it runs entirely server-side. It works best on prose generation, where most tokens are "easy" (the, of, in), and gives less speedup on reasoning-heavy responses, where the big model disagrees with the draft more often.
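The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular provider implements it: `toy_draft` and `toy_target` are hypothetical stand-ins for the small and big models (fixed next-token distributions over a three-token vocabulary), and `speculative_step` and `expected_tokens_per_pass` are names introduced here for illustration. The accept/reject rule shown is the standard speculative sampling rule, which is what keeps the output distribution identical to the big model's. The last function estimates how many tokens one big-model pass yields, under the simplifying assumption that each draft token is accepted independently with the same probability.

```python
import random

def toy_draft(ctx):
    # Hypothetical small model: next-token probabilities (ignores context).
    return [0.6, 0.3, 0.1]

def toy_target(ctx):
    # Hypothetical big model: the distribution the output must follow.
    return [0.5, 0.2, 0.3]

def sample(dist, rng):
    # Draw one token id with probability proportional to dist.
    return rng.choices(range(len(dist)), weights=dist)[0]

def speculative_step(draft, target, prefix, k, rng):
    """One round: the draft proposes k tokens, the target verifies them.

    Returns the tokens emitted this round (between 1 and k + 1).
    """
    # 1. The draft model proposes k tokens one at a time.
    proposed, q_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        proposed.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The target scores every position. A real system would do this
    #    in one batched forward pass; here it is a plain loop.
    p_dists = [target(list(prefix) + proposed[:i]) for i in range(k + 1)]

    # 3. Verify left to right: accept token x with probability
    #    min(1, p(x) / q(x)).
    out = []
    for i, tok in enumerate(proposed):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q)
        # (rng.choices normalises the weights) and stop this round.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # Every draft token was accepted: take one extra token from the target.
    out.append(sample(p_dists[k], rng))
    return out

def expected_tokens_per_pass(alpha, k):
    # Expected tokens emitted per target pass, ASSUMING each draft token
    # is accepted independently with probability alpha.
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

tokens = speculative_step(toy_draft, toy_target, prefix=[0], k=4,
                          rng=random.Random(0))
```

With a 75% acceptance rate and 4 draft tokens, `expected_tokens_per_pass(0.75, 4)` comes to about 3 tokens per big-model pass. That is in line with the 2-3x figure once the draft model's own cost is subtracted. The rejection branch is the design choice that matters: resampling from the residual, rather than simply taking the big model's top choice, is what keeps the emitted tokens distributed exactly as if the big model had generated them alone.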

