Quick answer
Test-time compute is the idea that AI models can be much smarter if you let them think longer before answering. Instead of generating a response immediately, the model works through the problem step by step internally — sometimes for minutes — then commits to a final answer. This is the breakthrough behind GPT-5 Pro, Claude Opus 4.8 with extended thinking, and Gemini 3.5 Pro Thinking. It is also why your AI bill can spike unexpectedly: thinking tokens cost the same as output tokens.
For most of the modern AI era, the rule was simple: bigger model = smarter model. Train on more text, add more parameters, get better results. That rule mostly still holds — but a new lever appeared in late 2024 and quietly became the most impactful AI breakthrough of 2025–2026. It is called "test-time compute" — or "extended thinking" if Anthropic is selling it. Here is what it actually is.
The simple version
Old way: you ask the model a question, it immediately starts generating an answer one word at a time. The "thinking" is the same as the "speaking" — there is no internal monologue, no scratch work.
New way: you ask the model a question. Before generating its answer, it runs an extended internal reasoning process — visible as a "thinking" or "reasoning" trace that you can sometimes see, sometimes not. It might try multiple approaches, catch its own errors, revise its plan. Only after this internal work does it commit to a final answer.
Why does this make models so much smarter?
Three reasons keep showing up in the research.
- Multi-step problems compound errors — if each step has 95% accuracy, ten steps drops to 60%. Letting the model verify its own work mid-reasoning catches errors before they compound
- Backtracking becomes possible — without extended thinking, once a model commits to a wrong path it usually finishes the wrong answer. With it, the model can notice "wait, that does not check out" and try again
- Implicit search — the model explores multiple solution paths internally and picks the best one. This is essentially the same trick chess engines use, applied to text
A practical benchmark: Claude Opus 4.7 without extended thinking scored 78% on GPQA Diamond (PhD-level science questions). The same model WITH extended thinking enabled scored 93.1%. That is the same neural network, same training data — just more compute spent at the moment you ask. The model did not get bigger; it got more time to think.
What models use this in 2026?
- GPT-5 — "thinking mode" available on Pro tier; runs up to 30 minutes on hard problems
- GPT-5 Pro — extended thinking always on; the most-capable OpenAI tier
- Claude Opus 4.8 — "extended thinking" mode, up to 60 minutes (was 12 in 4.7)
- Claude Sonnet 4.6 — basic thinking mode available
- Gemini 3.5 Pro — "Deep Think" mode, slower but smarter
- OpenAI o-series (deprecated) — were the first models to ship with this; concept absorbed into GPT-5
- DeepSeek-R1 (open source) — same architecture, free to use locally
- QwQ-32B (Alibaba, open source) — similar approach
What does it cost you?
This is the catch most developers miss. The model generates thinking tokens during the extended-thinking phase, and those tokens are billed at the OUTPUT price — typically 3–5x the input price. On Claude Opus 4.8 that is $75 per million thinking tokens. A complex problem that uses 50,000 thinking tokens before answering costs $3.75 just for the internal reasoning, before you see a single word of the actual response.
Practical implication: enable extended thinking only when the task genuinely needs it. For quick lookups, summaries, or simple writing, it is wasted spend.
When is it worth using?
- Yes: complex coding bugs, multi-step maths, research synthesis, legal analysis, debugging cross-file issues
- Yes: tasks where wrong answers cost real money or time
- No: simple Q&A, summaries, casual chat, code completion
- No: anything you would have been fine with using GPT-4o or Sonnet 4.6 for in 2024
- Maybe: writing — extended thinking sometimes produces more thoughtful output, sometimes overthinks it
Why this changes the AI race
Test-time compute is the most economically efficient capability gain in recent AI history. Training a model from scratch costs $100M+. Adding test-time compute to an existing model? Essentially free engineering work that unlocks 10–30 percentage points of benchmark improvement on hard tasks. Every frontier lab has now adopted some version of it. The labs that have not — Mistral, until recently — are visibly behind on the hardest reasoning benchmarks.
Expect the next generation of capability gains to come more from "smarter use of compute at inference time" and less from "bigger models trained on more data". The training-side wall is getting closer; the inference-side ceiling is still far away.
Related reading
Bottom line
Test-time compute is the breakthrough that made AI models good at hard reasoning. It works by giving the model more time to think before answering — internally exploring multiple paths, catching its own errors. It costs real money (thinking tokens are billed at output rates), but for genuinely hard problems it is the single biggest capability lever currently available. Use it for the tasks that need it; turn it off for the tasks that do not.



