Quick answer
AI API pricing in 2026 spans more than a 100× range. GPT-5 and Claude Opus 4.7 sit at the top ($5-$15 per million input tokens, $15-$75 per million output tokens). DeepSeek V3 and Gemini 3.5 Flash sit at the bottom ($0.10-$0.30 per million input tokens). The most-overlooked detail: prompt caching cuts input cost by 50% on OpenAI, 75% on Google, and 90% on Anthropic. If your workload has repeat context (RAG, agents, multi-turn chat), this is the single biggest cost lever you have.
I have been comparing AI API pricing weekly for the past year. Here is the honest snapshot of where things stand in late May 2026 — what each model actually costs, where the surprises are, and the caching trick that almost every cost calculator on the web ignores.
The pricing landscape, in one table
All prices below are USD per 1 million tokens. Each entry lists the input / output price, with the cached-input price in parentheses where the provider offers one.
- GPT-5 — $5 / $15 (cached input: $2.50)
- GPT-5 mini — $0.15 / $0.60 (cached: $0.075)
- GPT-4o — $2.50 / $10 (cached: $1.25)
- Claude Opus 4.7 — $15 / $75 (cached: $1.50, 90% off)
- Claude Sonnet 4.6 — $3 / $15 (cached: $0.30, 90% off)
- Claude Haiku 4.5 — $0.80 / $4 (cached: $0.08)
- Gemini 3.5 Pro — $1.25 / $5 (cached: $0.31, 75% off)
- Gemini 3.5 Flash — $0.10 / $0.30 (cached: $0.025)
- Grok 4 — $5 / $15 (no caching tier)
- DeepSeek V3 — $0.27 / $1.10 (cached off-peak: $0.07)
- Llama 3.3 70B (Groq) — $0.59 / $0.79 (no caching)
- Mistral Large 2 — $2 / $6 (no caching)
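For scripting against these numbers, the list above can be transcribed into a small lookup table. The sketch below is one way to do that in Python. The dictionary keys and the `request_cost` helper are names chosen for illustration, not official model identifiers or any provider's API.

```python
# Prices in USD per 1 million tokens, copied from the list above.
# Keys are illustrative labels, not official API model names.
# "cached" is None where the list shows no caching tier.
PRICES = {
    "gpt-5":             {"input": 5.00,  "output": 15.00, "cached": 2.50},
    "gpt-5-mini":        {"input": 0.15,  "output": 0.60,  "cached": 0.075},
    "gpt-4o":            {"input": 2.50,  "output": 10.00, "cached": 1.25},
    "claude-opus-4.7":   {"input": 15.00, "output": 75.00, "cached": 1.50},
    "claude-sonnet-4.6": {"input": 3.00,  "output": 15.00, "cached": 0.30},
    "claude-haiku-4.5":  {"input": 0.80,  "output": 4.00,  "cached": 0.08},
    "gemini-3.5-pro":    {"input": 1.25,  "output": 5.00,  "cached": 0.31},
    "gemini-3.5-flash":  {"input": 0.10,  "output": 0.30,  "cached": 0.025},
    "grok-4":            {"input": 5.00,  "output": 15.00, "cached": None},
    "deepseek-v3":       {"input": 0.27,  "output": 1.10,  "cached": 0.07},  # off-peak rate
    "llama-3.3-70b":     {"input": 0.59,  "output": 0.79,  "cached": None},  # via Groq
    "mistral-large-2":   {"input": 2.00,  "output": 6.00,  "cached": None},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Uncached cost in USD of a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


if __name__ == "__main__":
    # Same request shape across every model, cheapest first.
    for name in sorted(PRICES, key=lambda m: request_cost(m, 2_000, 500)):
        print(f"{name:20s} ${request_cost(name, 2_000, 500):.5f}")
```

Keeping the table as plain data makes it easy to update when a provider changes a price, which happens often enough that hard-coding rates inside functions tends to go stale.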
The biggest surprise: caching
If you only take one thing from this article: prompt caching is the single most impactful pricing change of the past two years, and most developers I talk to are not using it.
Here is how it works. When your prompt includes a long, stable prefix — a system prompt, retrieved documents, conversation history — the provider can cache the model's internal computation of that prefix. The next time the same prefix shows up, you pay 50–90% less for those input tokens. Output tokens are unchanged.
- Anthropic: 90% off cached input. Write costs 25% extra on the first call. 5-minute TTL.
- Google (Gemini): 75% off implicit caching. 4-hour minimum cache.
- OpenAI: 50% off cached input. Automatic for prompts ≥1024 tokens with reused prefix.
- DeepSeek: 50% off standard, ~75% off during off-peak hours.
- Anthropic, OpenAI: no extra config needed beyond proper prefix structure.
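Because caching matches on the prefix, the order in which a prompt is assembled decides whether it can hit the cache at all. The sketch below illustrates the idea with a rough word-level comparison. `shared_prefix_len`, the two builder functions, and the sample text are all hypothetical, and real providers match on tokens with their own minimum lengths, so treat the counts as illustrative only.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading items the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stable parts of the prompt (made-up sample text).
SYSTEM = "You are a support assistant. Answer only from the provided documents."
DOCS = "Doc 1: refunds take 5 days. Doc 2: shipping takes 3 days."


def stable_first(question: str, timestamp: str) -> list[str]:
    # Stable content first, per-request content last: the long prefix is reusable.
    return f"{SYSTEM} {DOCS} Time: {timestamp} Question: {question}".split()


def volatile_first(question: str, timestamp: str) -> list[str]:
    # A changing value near the front breaks the prefix on every request.
    return f"Time: {timestamp} {SYSTEM} {DOCS} Question: {question}".split()


if __name__ == "__main__":
    good = shared_prefix_len(
        stable_first("Where is my order?", "10:00"),
        stable_first("Can I get a refund?", "10:05"),
    )
    bad = shared_prefix_len(
        volatile_first("Where is my order?", "10:00"),
        volatile_first("Can I get a refund?", "10:05"),
    )
    print(f"stable-first shares {good} words, volatile-first shares {bad}")
```

The practical takeaway is to keep timestamps, user IDs, and the current question at the end of the prompt, after everything that repeats across requests.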
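Real example: a RAG chatbot with a 4,000-token system prompt + retrieved context, running on Claude Sonnet 4.6 at 10,000 conversations/day with a 95% cache hit rate. Assuming one request per conversation and a 30-day month, that is 1.2 billion input tokens a month. Without caching: $3,600/mo just on input tokens. With caching: about $570/mo (95% of tokens at $0.30 per million, the other 5% at the $3.75 cache-write rate). The output side is unchanged, and multi-turn conversations scale both figures up in proportion. That roughly 84% reduction is the difference between "this product is profitable" and "this product is unviable."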
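The arithmetic behind an estimate like this is short enough to script. The sketch below is a minimal version under stated assumptions: every request sends the full prompt, a cache miss pays the 25% write premium, and a month is 30 days. `monthly_input_cost` and its parameters are hypothetical names. The result scales linearly with request volume, so a multi-turn workload should pass total requests per day rather than conversations per day.

```python
def monthly_input_cost(
    prompt_tokens: int,
    requests_per_day: int,
    price_per_m: float,
    cached_price_per_m: float,
    hit_rate: float,
    write_premium: float = 0.25,  # assumption: misses pay a 25% cache-write surcharge
    days: int = 30,               # assumption: 30-day month
) -> tuple[float, float]:
    """Return (uncached, cached) monthly input cost in USD."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    tokens_m = prompt_tokens * requests_per_day * days / 1_000_000
    uncached = tokens_m * price_per_m
    # Blended per-million rate: hits at the cached price, misses at the write price.
    blended = hit_rate * cached_price_per_m + (1 - hit_rate) * price_per_m * (1 + write_premium)
    return uncached, tokens_m * blended


if __name__ == "__main__":
    # Sonnet 4.6 rates from the price list; one request per conversation is an assumption.
    uncached, cached = monthly_input_cost(4_000, 10_000, 3.00, 0.30, 0.95)
    print(f"uncached ${uncached:,.0f}/mo, cached ${cached:,.0f}/mo")
```

Setting `write_premium=0` models providers that do not charge extra for cache writes, and sweeping `hit_rate` from 0.5 to 0.99 shows how sensitive the bill is to that one assumption.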
Which model is actually cheapest?
Per token, Gemini 3.5 Flash is the cheapest frontier-class model. Per dollar of useful work, it depends entirely on your task.
- For autocomplete / classification / extraction: Gemini 3.5 Flash or DeepSeek V3. You are paying pennies.
- For agentic workflows with tool use: GPT-5 or Claude Sonnet 4.6 — reliability matters more than per-token cost.
- For long-document analysis: Gemini 3.5 Pro. The 2M context window often lets you skip RAG entirely.
- For complex coding: Claude Opus 4.7 is expensive but cuts iteration cost. Often cheapest overall on real work.
- For at-scale customer-facing chat: Sonnet 4.6 with aggressive caching is usually the sweet spot.
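"Per dollar of useful work" can be made concrete with one line of arithmetic: if attempts succeed independently with probability p, the expected number of attempts per success is 1/p, so the expected spend per success is the per-attempt cost divided by p. The sketch below encodes that. `cost_per_success` is a hypothetical helper, and the numbers in the usage example are placeholders to show the shape of the comparison, not measurements of any model.

```python
def cost_per_success(
    cost_per_attempt: float,
    success_rate: float,
    overhead_per_attempt: float = 0.0,  # e.g. review or retry handling, in USD
) -> float:
    """Expected USD spent per successful result, retrying until one succeeds."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return (cost_per_attempt + overhead_per_attempt) / success_rate


if __name__ == "__main__":
    # Placeholder figures only: a cheap model that often fails versus
    # a pricier model that usually succeeds on the first try.
    cheap = cost_per_success(cost_per_attempt=0.002, success_rate=0.20)
    pricey = cost_per_success(cost_per_attempt=0.008, success_rate=0.90)
    print(f"cheap model:  ${cheap:.4f} per success")
    print(f"pricey model: ${pricey:.4f} per success")
```

The success rate has to come from your own evaluation set; plugging in a guess gives a guess back.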
The traps that surprise developers
- Output tokens cost 3-5× input tokens — long responses get expensive fast
- Reasoning tokens count as output tokens, so a GPT-5 or Claude Opus request in reasoning mode can bill for far more output than the visible answer contains
- Image inputs are billed at a fixed token equivalent per image, often 1k-2k tokens
- JSON-mode and structured output add a small overhead — usually negligible
- Function calls / tool definitions count as input on every request — keep them lean
- Streaming does not affect cost, only perceived latency
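Two of these traps, tool definitions and reasoning tokens, are easy to miss because neither appears in the text you read. The sketch below adds them to a per-request cost so the gap is visible. The `Request` fields and `billed_cost` are hypothetical names, and the token counts in the usage example are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    tool_definition_tokens: int  # resent as input on every request
    visible_output_tokens: int
    reasoning_tokens: int = 0    # billed at the output rate


def billed_cost(req: Request, input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD including the tokens that never show up in the visible answer."""
    input_tokens = req.prompt_tokens + req.tool_definition_tokens
    output_tokens = req.visible_output_tokens + req.reasoning_tokens
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


if __name__ == "__main__":
    # GPT-5 rates from the price list ($5 in / $15 out); token counts are made up.
    plain = Request(prompt_tokens=1_500, tool_definition_tokens=1_000, visible_output_tokens=400)
    reasoning = Request(1_500, 1_000, 400, reasoning_tokens=3_000)
    print(f"without reasoning: ${billed_cost(plain, 5.0, 15.0):.4f}")
    print(f"with reasoning:    ${billed_cost(reasoning, 5.0, 15.0):.4f}")
```

Most provider responses include a usage object with the actual token counts, which is a better source for these fields than estimating them by hand.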
How to actually budget
The mental model that works: estimate per-request cost, multiply by your usage at scale, then test with a small live sample to calibrate. The per-request math is easy. The traps are in the assumptions about output length and cache hit rates.
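The calibration step can be sketched as follows: take token counts observed on a small batch of live requests, average the per-request cost, and scale up. `monthly_budget`, the sample format, and the 30-day month are assumptions made for this example, and the sample values in the usage are invented.

```python
from statistics import mean


def monthly_budget(
    samples: list[tuple[int, int]],
    requests_per_day: int,
    input_price_per_m: float,
    output_price_per_m: float,
    days: int = 30,  # assumption: 30-day month
) -> float:
    """Project monthly USD spend from observed (input_tokens, output_tokens) pairs."""
    if not samples:
        raise ValueError("need at least one observed request")
    per_request = mean(
        (inp * input_price_per_m + out * output_price_per_m) / 1_000_000
        for inp, out in samples
    )
    return per_request * requests_per_day * days


if __name__ == "__main__":
    # Invented token counts standing in for a live sample.
    observed = [(1_000, 200), (3_000, 400), (2_200, 950)]
    projected = monthly_budget(observed, requests_per_day=1_000,
                               input_price_per_m=3.00, output_price_per_m=15.00)
    print(f"projected: ${projected:,.2f}/mo")
```

Averaging real requests captures the long-output outliers that a single "typical request" estimate tends to miss, which is where most budget surprises come from.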
We built a free calculator that does exactly this — paste your prompt or enter token counts, set requests/day, set cache hit rate, see the cost across 12 models. Side-by-side. With the caching discounts already baked in. Try it at /ai-cost-calculator.
Provider strategy in 2026
- Anthropic: aggressive caching discounts, premium pricing on Opus, strong enterprise contracts
- OpenAI: tiered model lineup, prompt caching automatic, pricing relatively static
- Google: cheapest frontier prices, multi-model integration with Workspace
- DeepSeek: aggressive on price, off-peak discounts, open weights for self-hosting
- Groq: fastest inference, no caching tier, very competitive on Llama 3.3
What is likely to change
Three things to watch. (1) Output token prices are likely to drop further as reasoning-mode usage scales — that is the bottleneck right now. (2) More providers will add caching tiers — Mistral and xAI are the obvious next ones. (3) Multimodal pricing (images, audio, video) will get more standardised. Today it is per-image, per-second, per-minute depending on provider, which makes comparing nearly impossible.
Bottom line
AI API pricing in 2026 spans more than 100×. Within that range, output tokens cost more than input, caching cuts input by 50-90%, and frontier models give you reliability worth paying for on agentic work. The cheapest model is rarely the right answer; the right answer is the cheapest model that solves your problem reliably. Use the calculator, test live, and budget with cache hit rate factored in.




