Quick answer

Frontier models now hold very large contexts: Gemini 3.5 Pro at 2M tokens, GPT-5.6 at 1M, and Claude Opus 4.8 at 200K (with effective extended-thinking compression). Long context changes the workflow: you can fit entire codebases, multi-document research, or a full year of Slack threads in one prompt. The trade-offs are cost, latency, and "needle in a haystack" reliability, which is still imperfect.

In 2023 a 4K context window felt generous. In 2026 the bar is 1M+ tokens for frontier models. The implications go beyond "we can paste more in": entire workflow categories that did not exist before are now practical.

Context window leaderboard (mid-2026)

  • Gemini 3.5 Pro — 2M tokens
  • GPT-5.6 Terra/Sol — 1M tokens
  • Llama 4 Behemoth — 1M tokens
  • Claude Opus 4.8 — 200K (with effective extended-thinking memory beyond)
  • Claude Mythos 5 — 500K (rumoured)
  • Most other open models — 128K-256K typical

What 1M+ tokens enables

  • Entire codebases in the prompt — full project context for refactors
  • Multi-document research — paste 50-100 papers, ask synthesised questions
  • Year-long Slack / email threads in one query
  • Whole legal contracts with all amendments at once
  • Long-form fiction — keep characters consistent across novel-length context
  • RAG-less retrieval for medium-sized knowledge bases
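
Whether a given corpus actually fits can be estimated before sending anything. The following is a minimal sketch, assuming roughly 4 characters per token (a common rule of thumb for English text, not any vendor's tokenizer). The names `estimate_tokens` and `fits_in_window` and the default reserve of 8,000 tokens are hypothetical, chosen only for illustration.

```python
# Rough check of whether a set of documents fits in a context window.
# ASSUMPTION: ~4 characters per token; real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_window(docs: list[str], window_tokens: int,
                   reserve_tokens: int = 8_000) -> bool:
    """True if all docs fit while leaving reserve_tokens for instructions and the reply."""
    total = sum(estimate_tokens(doc) for doc in docs)
    return total + reserve_tokens <= window_tokens
```

For example, five documents of about 400,000 characters each come to roughly 500K estimated tokens, which would pass against a 1M window and fail against a 200K one. For anything near the limit, count with the model's real tokenizer rather than this heuristic.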

What's still hard

  • Cost — 1M-token prompts cost real money (~$5-15 per query at frontier prices)
  • Latency — first token can take 5-15 seconds on long prompts
  • Reliability — needle-in-haystack accuracy drops near the boundaries
  • Caching is critical — without prompt caching, long-context is unaffordable at scale
  • Reasoning over long context is weaker than over short context
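
Reliability is worth measuring on your own data rather than taking on trust. One way to do that is a needle-in-a-haystack probe: plant a known fact at several depths in filler text and check whether the model recalls it. The sketch below assumes a hypothetical `ask_model` callable (prompt in, answer string out) that you would wire to whichever client you use; the needle, the question, and the function names are illustrative only.

```python
# Needle-in-a-haystack probe: plant a known fact at several depths and check recall.
# ASSUMPTION: ask_model is a hypothetical callable (prompt -> answer string).
from typing import Callable

NEEDLE = "The vault code is 7421."
QUESTION = "What is the vault code? Answer with the number only."


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert needle at a relative depth (0.0 = start, 1.0 = end) of the filler lines."""
    pos = round(depth * len(filler))
    return "\n".join(filler[:pos] + [needle] + filler[pos:])


def recall_by_depth(ask_model: Callable[[str], str], filler: list[str],
                    depths: list[float]) -> dict[float, bool]:
    """Return, for each depth, whether the answer contained the planted value."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler, NEEDLE, depth) + "\n\n" + QUESTION
        results[depth] = "7421" in ask_model(prompt)
    return results
```

Running this across several prompt lengths and depths gives a small recall grid for your own workload, which is more informative than a single headline accuracy number.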

When to use long context vs RAG

A common 2026 question: now that long context exists, do we still need RAG? Answer: yes, for most cases. Long context is for ad-hoc, one-shot tasks where the data is small enough to fit in the window. RAG is for repeated queries against large knowledge bases, where it is cheaper, faster, and more reliable.
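
The rule of thumb above can be written down as a tiny decision function. This is only a sketch of that heuristic: the function name and the `repeat_threshold` of 10 queries are assumptions for illustration, and the right threshold depends on your prices and latency budget.

```python
# Encodes the rule of thumb: long context for one-off tasks that fit,
# RAG for corpora that do not fit or that are queried repeatedly.
# ASSUMPTION: repeat_threshold=10 is an arbitrary illustrative cut-off.
def choose_strategy(corpus_tokens: int, window_tokens: int,
                    expected_queries: int, repeat_threshold: int = 10) -> str:
    """Return "long_context" or "rag" for a given corpus size and query volume."""
    if corpus_tokens > window_tokens:
        return "rag"  # the corpus does not fit in one prompt at all
    if expected_queries >= repeat_threshold:
        return "rag"  # repeated queries: retrieval sends far fewer tokens per query
    return "long_context"
```

A 300K-token corpus queried once against a 1M window comes out as "long_context"; the same corpus queried hundreds of times, or a 5M-token corpus, comes out as "rag".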

Practical workflow patterns

  • Code review — paste the entire diff + relevant context files; ask "what could break?"
  • Research synthesis — paste 20-50 papers; ask for cross-paper comparison
  • Long-doc summarisation — full 500-page report into a 2-page exec summary
  • Legal review — entire contract + standard playbook; ask "what's risky?"
  • Codebase onboarding — paste critical files; ask "explain how this works"
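
Most of these patterns come down to the same step: concatenate the files, label each one, and put the question last. A minimal sketch follows; the `=== FILE: ... ===` header format and the function names are assumptions for illustration, not a required convention. Keeping the documents first and the question last also keeps the long prefix identical across questions, which tends to help prefix-based prompt caching.

```python
# Assemble a long-context prompt: labelled file sections first, question last.
# ASSUMPTION: the "=== FILE: ... ===" header is an arbitrary illustrative delimiter.
from pathlib import Path


def format_sections(files: dict[str, str], question: str) -> str:
    """Join named text bodies under path headers, then append the question."""
    sections = [f"=== FILE: {name} ===\n{body}" for name, body in files.items()]
    return "\n\n".join(sections) + f"\n\nQuestion: {question}"


def build_prompt(paths: list[Path], question: str) -> str:
    """Read each path from disk and format the contents into a single prompt."""
    files = {str(p): p.read_text(encoding="utf-8", errors="replace") for p in paths}
    return format_sections(files, question)
```

For a code review, `paths` would hold the diff plus the relevant context files and `question` would be something like "what could break?".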

If you're running long-context workflows in production, prompt caching is non-negotiable. Cached prompts get a 75-88% input discount on Anthropic and a 50% discount on OpenAI. Without caching, resending the same long prompt on every query is too expensive to justify.
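
To see why, it helps to run the arithmetic. The sketch below uses only the figures quoted in this article (a price in the $5-15 per million input tokens range and the cache discounts above) and makes simplifying assumptions: the whole prompt is a cacheable prefix, the first query pays full price, and cache-write surcharges, cache expiry, and output tokens are ignored. The function names are hypothetical.

```python
# Input-token cost of resending one long prompt many times, with and without caching.
# ASSUMPTIONS: the entire prompt is cacheable; the first query pays full price;
# cache-write surcharges, cache expiry, and output tokens are ignored.
def prompt_cost(tokens: int, usd_per_mtok: float, discount: float = 0.0) -> float:
    """Cost in USD of one prompt at a per-million-token price and fractional discount."""
    return tokens / 1_000_000 * usd_per_mtok * (1.0 - discount)


def workload_cost(tokens: int, usd_per_mtok: float, queries: int,
                  cache_discount: float) -> float:
    """Total input cost: one full-price query, then discounted cached queries."""
    if queries <= 0:
        return 0.0
    first = prompt_cost(tokens, usd_per_mtok)
    rest = (queries - 1) * prompt_cost(tokens, usd_per_mtok, cache_discount)
    return first + rest
```

Under these assumptions, 100 queries over a 1M-token prompt at $10 per million tokens cost about $1,000 uncached, about $505 with a 50% discount, and about $129 with an 88% discount.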

Bottom line

Long context is real and useful in 2026, but it is not a replacement for RAG. Use it for ad-hoc tasks over large documents. Use RAG for repeated queries against large knowledge bases. Each works best when matched to the right job.