AI Inference Latency Explained — Why It Matters More Than Benchmarks

Quick answer

Inference latency is how long the model takes to respond. Two metrics matter: time-to-first-token (TTFT, how long before the first word appears) and tokens-per-second (TPS, how fast the rest comes out). For real-time products like voice assistants and coding agents, latency matters more than benchmark scores. A "smarter" model that's 5x slower is often the worse choice.

Benchmarks tell you accuracy. Latency tells you whether the model is usable. The two have been diverging in 2026 — frontier models with extended thinking can be brilliant on hard problems but unusable for fast interactions. Knowing which metric matters for your product is half the battle.

Two latencies that matter

Time-to-first-token (TTFT): the wait between submitting a prompt and seeing the first word. Critical for chat UX. Typical: 200-800ms.
Tokens-per-second (TPS): how fast the rest of the response streams. Typical for frontier models: 30-100 TPS.
Total wall-clock time: TTFT + response length / TPS. For a 300-token reply at the typical figures above, this is roughly 3-11 seconds.
Tail latency: latency at a high percentile, commonly the 99th (p99). This, more than the average, determines whether your app feels reliable.
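
To make the arithmetic concrete, here is a minimal sketch of the total-time formula and a p99 calculation. The function names are made up for illustration, the percentile uses the simple nearest-rank method (one of several common definitions), and the sample latencies are invented placeholders rather than measurements.

```python
import math


def total_time_s(ttft_s: float, tokens: int, tps: float) -> float:
    """Wall-clock estimate: time to first token plus streaming time."""
    return ttft_s + tokens / tps


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# A 300-token reply at both ends of the typical ranges quoted above.
fast = total_time_s(ttft_s=0.2, tokens=300, tps=100)  # 0.2 + 3.0 seconds
slow = total_time_s(ttft_s=0.8, tokens=300, tps=30)   # 0.8 + 10.0 seconds

# Invented sample: 98 quick requests and two slow ones.
samples = [0.3] * 98 + [2.5, 4.0]
p50 = percentile(samples, 50)  # the median request is quick
p99 = percentile(samples, 99)  # the tail is dominated by the slow requests
```

The median hides the two slow requests entirely, which is why the p99 is the more useful number for judging how reliable an app feels.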

Latency budgets by product type

Voice assistant: TTFT under 300ms, TPS over 100. Frontier models with extended thinking are unusable here.
Code completion: TTFT under 200ms, response time under 1.5s. Use a small distilled model, the approach Cursor Tab takes.
Chat: TTFT under 1s, response under 5s. Most frontier models work.
Research/agent: latency mostly irrelevant. Use the strongest model you can afford.
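
One way to keep these budgets honest is to write them down as data and check measurements against them. The sketch below is an assumption about how that might look: the Budget class, the dictionary keys, and the violations function are hypothetical names, and only the numeric limits come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    """Limits in seconds and tokens/second; None means no limit."""
    max_ttft_s: Optional[float] = None
    min_tps: Optional[float] = None
    max_response_s: Optional[float] = None


# Limits copied from the list above; the keys are made-up labels.
BUDGETS = {
    "voice": Budget(max_ttft_s=0.3, min_tps=100),
    "code_completion": Budget(max_ttft_s=0.2, max_response_s=1.5),
    "chat": Budget(max_ttft_s=1.0, max_response_s=5.0),
    "research_agent": Budget(),  # latency mostly irrelevant
}


def violations(product: str, ttft_s: float, tps: float, response_s: float) -> list[str]:
    """Return the names of the limits a measurement breaks (empty list means within budget)."""
    budget = BUDGETS[product]
    broken = []
    if budget.max_ttft_s is not None and ttft_s >= budget.max_ttft_s:
        broken.append("ttft")
    if budget.min_tps is not None and tps <= budget.min_tps:
        broken.append("tps")
    if budget.max_response_s is not None and response_s >= budget.max_response_s:
        broken.append("response")
    return broken
```

For example, a model measured at 0.6s TTFT, 60 TPS, and 4.0s total response passes the chat budget but fails the voice budget on both TTFT and TPS. Feeding in p99 measurements rather than averages gives a stricter and more realistic check.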

What affects latency

Model size: bigger = slower (Opus 4.8 is slower than Haiku 4.5)
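Architecture: sparse designs such as mixture-of-experts (MoE) and mixture-of-depths (MoD) activate only part of the network per token, so they run faster than a dense model of the same total size; speculative decoding raises TPS by drafting tokens with a small model and verifying them with the large one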
Hardware: Cerebras WSE-3 dominates low-latency niches
Network distance: API latency adds 50-200ms; edge deployment helps
Prompt caching: cached prompts can drop TTFT from 800ms to 200ms
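
To see how much any of these factors moves your own numbers (a cached versus uncached prompt, say), one option is to time the token stream yourself. The sketch below assumes only that your client exposes the response as an iterable of tokens or chunks; measure_stream is a hypothetical helper, not part of any provider's SDK, and the clock is injectable so the function can be tested without real delays.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[float, float]:
    """Consume a token stream and return (TTFT in seconds, tokens per second).

    TPS is computed over the tokens that arrive after the first one,
    so it reflects streaming speed and excludes the initial wait.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_time = last - first
    if count > 1 and decode_time > 0:
        tps = (count - 1) / decode_time
    else:
        tps = float("nan")  # not enough tokens to estimate a rate
    return ttft, tps
```

Start the timer as close to the request as possible by passing in a generator that sends the request lazily, then run the same prompt many times and look at the p99 rather than a single run.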

Before picking a model, write down your latency budget. Then use the smallest, fastest model that clears your quality bar within it. Most teams that default to Opus 4.8 don't actually need it.
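
That selection rule can be sketched in a few lines. Everything here is illustrative: the Candidate fields, the pick_model function, and the three entries with their scores and latencies are invented placeholders, to be replaced with results from your own evals and measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float         # score on your own eval set, 0 to 1
    p99_response_s: float  # measured p99 response time in seconds


def pick_model(
    candidates: list[Candidate],
    min_quality: float,
    max_p99_s: float,
) -> Optional[Candidate]:
    """Return the fastest candidate that clears the quality bar within the latency budget."""
    eligible = [
        c for c in candidates
        if c.quality >= min_quality and c.p99_response_s <= max_p99_s
    ]
    return min(eligible, key=lambda c: c.p99_response_s, default=None)


# Placeholder numbers, not measurements of any real model.
candidates = [
    Candidate("small", quality=0.70, p99_response_s=1.2),
    Candidate("medium", quality=0.82, p99_response_s=2.5),
    Candidate("large", quality=0.91, p99_response_s=9.0),
]

choice = pick_model(candidates, min_quality=0.8, max_p99_s=5.0)
```

With a quality bar of 0.8 and a 5-second budget, the sketch picks the medium entry: the small one misses the bar and the large one misses the budget. A result of None means no candidate fits, and either the bar or the budget has to move.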

Bottom line

Latency is the under-discussed dimension of AI quality. The "best" model for your product is rarely the one that scores highest on benchmarks — it's the one that clears your minimum quality bar at the latency budget you can afford.
