Extended thinking
4 min read · Updated 2026-04-18
Extended thinking lets Claude reason internally before answering. More reasoning often means better answers on hard tasks, but not always, and not cheaply.
What it is
An API feature where the model spends a configurable number of tokens "thinking" (producing internal reasoning) before generating its final response. The reasoning isn't shown to the user by default but counts toward the cost.
You set a budget, e.g. 2,000 thinking tokens, and the model uses up to that amount, depending on task complexity.
When it helps
- Multi-step reasoning. Math, logic puzzles, planning problems with multiple constraints.
- Complex tool-use decisions. When the next action isn't obvious.
- Ambiguous requests. When the model has to pick between multiple valid interpretations.
- Long-horizon planning. Breaking a big task into a sequence of smaller ones.
When it doesn't help
- Simple questions. "What's the capital of France?" gains nothing from 2,000 thinking tokens.
- Stylistic tasks. Writing a poem or rewording an email: reasoning isn't the bottleneck.
- Retrieval tasks. If the answer is in the context, the model doesn't need to reason its way there.
- Fast agents. Extended thinking adds latency. For interactive agents (voice, chat), the user-perceived slowdown is painful.
Cost tradeoffs
Thinking tokens are typically billed at the output rate (the expensive one). A 2k-token reasoning budget + 500-token response = 2,500 output tokens billed. At Sonnet-class output pricing (~$15/MTok), that's roughly 3.75¢ per request. Multiply by millions of requests and it adds up fast.
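The arithmetic above as a back-of-envelope helper; the $15/MTok figure is an assumption about Sonnet-class output pricing and should be checked against the current pricing page:

```python
# Back-of-envelope request cost, assuming thinking tokens bill at the
# output rate. The $15/MTok default is an assumed Sonnet-class output
# price; verify against current pricing before relying on it.

def request_cost(thinking_tokens: int, response_tokens: int,
                 output_price_per_mtok: float = 15.0) -> float:
    """Dollar cost of the output-side tokens for one request."""
    billed = thinking_tokens + response_tokens  # thinking counts as output
    return billed * output_price_per_mtok / 1_000_000

per_request = request_cost(2000, 500)   # 2,500 billed output tokens
per_million = per_request * 1_000_000   # fleet-scale cost
```

At a million requests a day, the 2k budget alone is tens of thousands of dollars, which is why the adaptive tiering below is worth the plumbing.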
Insight: Use extended thinking adaptively. Simple user requests → no thinking. Complex planning steps → 1k–2k thinking. Super-hard puzzles → 4k+.
Budgeting in practice
Start at 1,024 tokens. Measure accuracy with vs. without thinking on your eval. If the lift is under 5 points, turn it off. If it's 20+ points, try doubling the budget.
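That decision rule can be sketched directly; the accuracy numbers would come from your own eval harness, and the thresholds are the heuristics above, not hard rules:

```python
# Decide the next budgeting step from eval accuracy measured with and
# without thinking. Thresholds mirror the rule of thumb in the text;
# they are heuristics to adjust per task.

def next_step(acc_without: float, acc_with: float) -> str:
    """Return a budgeting decision given two eval accuracies in [0, 1]."""
    lift = acc_with - acc_without
    if lift < 0.05:
        return "disable thinking"     # lift too small to pay for
    if lift >= 0.20:
        return "double the budget"    # big lift: probe for more headroom
    return "keep current budget"
```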
Observable patterns
- Diminishing returns past ~4k tokens. Going from 0 → 1k gives the biggest lift. 1k → 4k adds polish. 4k+ rarely moves the needle.
- Task-dependent. Some tasks (math, coding) benefit more. Some (summarization) barely budge.
- Doesn't fix bad prompts. Extended thinking amplifies prompt quality, it doesn't replace it.
Combining with tool use
Extended thinking pairs well with tool use: the model reasons about what to call, calls it, then reasons about the result. On complex tool-use flows (an agent picking between 5 tools), enabling thinking noticeably improves tool-selection accuracy.
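A sketch of combining the two on one request: the payload shape follows the Messages API, but the tool definition (`search_docs`) and model name are hypothetical stand-ins:

```python
# Sketch: enable a thinking budget on a tool-use request so the model
# can reason about which tool to call before calling it. The tool
# definition and model name below are hypothetical examples.

def tool_request(prompt: str, tools: list[dict], budget: int = 2000) -> dict:
    """Build a payload with both a thinking budget and a tool list."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "tools": tools,
        "messages": [{"role": "user", "content": prompt}],
    }

search_tool = {
    "name": "search_docs",  # hypothetical tool
    "description": "Search internal documentation and return matching passages.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

req = tool_request("Find the retry policy for file uploads", [search_tool])
```

The same adaptive logic applies here: reserve the budget for flows where the tool choice is genuinely ambiguous, and drop the `thinking` block when there's only one plausible tool.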