Quick answer
Frontier models now hold very large contexts: Gemini 3.5 Pro at 2M tokens, GPT-5.6 at 1M, and Claude Opus 4.8 at 200K (with effective extended-thinking compression). Long context changes the workflow: you can fit entire codebases, multi-document research, or a full year of Slack threads in one prompt. The trade-offs are cost, latency, and "needle in a haystack" reliability, which is still imperfect.
In 2023 a 4K context window felt generous. In 2026 the bar for most frontier models is 1M+ tokens. The implications go beyond "we can paste more in": entire workflow categories that did not exist before are now practical.
Context window leaderboard (mid-2026)
- Gemini 3.5 Pro — 2M tokens
- GPT-5.6 Terra/Sol — 1M tokens
- Llama 4 Behemoth — 1M tokens
- Claude Opus 4.8 — 200K (with effective extended-thinking memory beyond)
- Claude Mythos 5 — 500K (rumoured)
- Most other open models — 128K-256K typical
What 1M+ tokens enables
- Entire codebases in the prompt — full project context for refactors
- Multi-document research — paste 50-100 papers, ask synthesised questions
- Year-long Slack / email threads in one query
- Whole legal contracts with all amendments at once
- Long-form fiction — keep characters consistent across novel-length context
- RAG-less retrieval for medium-sized knowledge bases
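Whether a given corpus actually fits in a window can be estimated before anything is sent. The sketch below is a rough pre-check, not a tokenizer: it assumes about 4 characters per token, a common rule of thumb for English text that varies by model and content, and it reserves a hypothetical 8,000 tokens for the model's reply. The function names are illustrative.

```python
# Rough fit check: does a set of documents fit in a context window?
# Assumption: ~4 characters per token (rule of thumb, not a real tokenizer).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a string from its character length."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_window(texts, window_tokens, reserve_for_output=8_000):
    """Return (fits, estimated_total_tokens) for a list of document strings.

    reserve_for_output is an assumed allowance for the model's reply.
    """
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserve_for_output <= window_tokens, total


if __name__ == "__main__":
    docs = ["x" * 2_000_000, "y" * 1_500_000]  # stand-ins for real files
    fits, total = fits_in_window(docs, window_tokens=1_000_000)
    print(f"estimated tokens: {total:,}, fits in 1M window: {fits}")
```

For anything close to the limit, count tokens with the provider's own tokenizer instead of relying on this estimate.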
What's still hard
- Cost — 1M-token prompts cost real money (~$5-15 per query at frontier prices)
- Latency — first token can take 5-15 seconds on long prompts
- Reliability — needle-in-haystack accuracy drops near the boundaries
- Caching is critical — without prompt caching, long-context is unaffordable at scale
- Reasoning over long context is weaker than over short context
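To make the cost point concrete, here is a back-of-the-envelope sketch. It assumes a flat per-million-token input price and ignores output tokens. The $5 and $15 rates are simply what the ~$5-15 per 1M-token query figure above implies, not any provider's published price list.

```python
def prompt_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Input-side cost of one prompt at a flat per-million-token price."""
    return input_tokens / 1_000_000 * price_per_million_usd


if __name__ == "__main__":
    tokens = 1_000_000
    for rate in (5.0, 15.0):  # assumed USD per million input tokens
        per_query = prompt_cost_usd(tokens, rate)
        per_day = 200 * per_query  # hypothetical 200 queries per day
        print(f"${rate}/M tokens: ${per_query:.2f} per query, ${per_day:,.2f} per day")
```

Even at the low end of that range, a few hundred uncached queries a day add up to a four-figure daily bill.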
When to use long context vs RAG
A common 2026 question: now that long context exists, do we still need RAG? Answer: yes, for most cases. Long context is for ad-hoc, one-shot tasks where the data is small enough to fit. RAG is for repeated queries against large knowledge bases — cheaper, faster, and more reliable.
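The rule of thumb above can be written down as a tiny routing function. This is only a sketch of the heuristic as stated here: the function name, the string labels, and the `repeat_threshold` default are hypothetical, and a real decision would also weigh latency, freshness of the data, and accuracy requirements.

```python
def choose_strategy(corpus_tokens: int, window_tokens: int,
                    expected_queries: int, repeat_threshold: int = 5) -> str:
    """Pick "long_context" or "rag" using the article's rule of thumb.

    repeat_threshold is an assumed cutoff for what counts as "repeated" querying.
    """
    if corpus_tokens > window_tokens:
        return "rag"  # the data does not fit in one prompt
    if expected_queries > repeat_threshold:
        return "rag"  # repeated queries: retrieval is cheaper per call
    return "long_context"  # ad-hoc, one-shot task on data that fits


if __name__ == "__main__":
    print(choose_strategy(300_000, 1_000_000, expected_queries=1))
    print(choose_strategy(3_000_000, 1_000_000, expected_queries=1))
    print(choose_strategy(300_000, 1_000_000, expected_queries=500))
```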
Practical workflow patterns
- Code review — paste the entire diff + relevant context files; ask "what could break?"
- Research synthesis — paste 20-50 papers; ask for cross-paper comparison
- Long-doc summarisation — full 500-page report into a 2-page exec summary
- Legal review — entire contract + standard playbook; ask "what's risky?"
- Codebase onboarding — paste critical files; ask "explain how this works"
If you're running long-context workflows in production, prompt caching is non-negotiable. Cached prompts get a 75-88% discount on input tokens on Anthropic and 50% on OpenAI. Without caching, long context is too expensive to justify.
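A simplified model shows why caching matters at scale. It assumes every query reuses the same long prefix, the first query pays the full input price, and later queries pay the discounted rate on that prefix. It ignores cache-write surcharges, cache expiry, and output tokens, all of which vary by provider. The discount values are the ones quoted above, and the $10 per million tokens price is an assumption.

```python
def cost_with_cache(prefix_tokens: int, n_queries: int,
                    price_per_million_usd: float, cached_discount: float) -> float:
    """Total input cost for n_queries that share one cached prefix.

    cached_discount is the fraction taken off cached input tokens (0.5 = 50% off).
    Simplification: no cache-write premium and no cache expiry.
    """
    if n_queries <= 0:
        return 0.0
    full = prefix_tokens / 1_000_000 * price_per_million_usd
    return full + (n_queries - 1) * full * (1 - cached_discount)


if __name__ == "__main__":
    prefix, queries, price = 1_000_000, 100, 10.0  # assumed workload and price
    uncached = cost_with_cache(prefix, queries, price, cached_discount=0.0)
    print(f"no caching:   ${uncached:,.2f}")
    for discount in (0.50, 0.75, 0.88):
        total = cost_with_cache(prefix, queries, price, discount)
        print(f"{discount:.0%} discount: ${total:,.2f}")
```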
Bottom line
Long context is real and useful in 2026, but it is not a replacement for RAG. Use it for ad-hoc, large-document tasks. Use RAG for repeated queries against large knowledge bases. Each works best when matched to the right job.

