Quick answer
Inference latency is how long the model takes to respond. Two metrics matter: time-to-first-token (TTFT, how long before the first word appears) and tokens-per-second (TPS, how fast the rest comes out). For real-time products like voice assistants and coding agents, latency matters more than benchmark scores. A "smarter" model that's 5x slower is often the worse choice.
Benchmarks tell you accuracy. Latency tells you whether the model is usable. The two have been diverging in 2026 — frontier models with extended thinking can be brilliant on hard problems but unusable for fast interactions. Knowing which metric matters for your product is half the battle.
Two latencies that matter
- Time-to-first-token (TTFT): the wait between submitting a prompt and seeing the first word. Critical for chat UX. Typical: 200-800ms.
- Tokens-per-second (TPS): how fast the rest of the response streams. Typical for frontier models: 30-100 TPS.
- Total wall-clock time: response length / TPS + TTFT. For a 300-token reply, this is ~5-15 seconds.
- Tail latency: the 99th percentile of latency, which is what determines whether your app feels reliable.
Latency budgets by product type
- Voice assistant: TTFT under 300ms, TPS over 100. Frontier models with extended thinking are unusable here.
- Code completion: TTFT under 200ms, response time under 1.5s. Use Cursor Tab's distilled small model.
- Chat: TTFT under 1s, response under 5s. Most frontier models work.
- Research/agent: latency mostly irrelevant. Use the strongest model you can afford.
What affects latency
- Model size: bigger = slower (Opus 4.8 is slower than Haiku 4.5)
- Architecture: MoE and MoD are faster per parameter; speculative decoding speeds up everything
- Hardware: Cerebras WSE-3 dominates low-latency niches
- Network distance: API latency adds 50-200ms; edge deployment helps
- Prompt caching: cached prompts can drop TTFT from 800ms to 200ms
Before picking a model, write down your latency budget. Use the smallest, fastest model that clears your benchmarks. Pretty much nobody actually needs Opus 4.8 — they just default to it.
Related reading
Bottom line
Latency is the under-discussed dimension of AI quality. The "best" model for your product is rarely the one that scores highest on benchmarks — it's the one that clears your minimum quality bar at the latency budget you can afford.
