Extended thinking is a mode that lets Claude reason privately before it answers. Imagine asking someone a hard question and they say "hold on, let me think about that" and then actually do, working through the problem quietly before saying anything out loud. That's what extended thinking is. It costs more. It takes longer. For hard problems, it's often worth it. For easy problems, it's a waste. This page is about when it helps, when it doesn't, and how to tune the dial.
When you make an API call with extended thinking on, you give Claude a "thinking budget" in tokens. The model uses up to that many tokens producing internal reasoning before writing its final answer. By default, you don't see that reasoning - it's hidden from the final output, but it shaped what came after.
The reasoning isn't free. Those thinking tokens count toward your bill, at the output rate (the expensive rate). A 2,000-token thinking budget plus a 500-token answer = 2,500 billable output tokens. That's why "just turn on extended thinking" isn't always the right answer - you're literally paying for the privilege.
The rule of thumb: if a smart human would pause and think for 30 seconds before answering, extended thinking probably helps. If they'd answer immediately, it probably doesn't.
Thinking tokens are priced as output tokens, and output is roughly 5× the cost of input. A quick sanity-check on what a typical request looks like financially:
Two things matter more than the raw number:
The right budget is discovered, not guessed. Here's the workflow:
This saves you from both overpaying (the "turn it up to 8k" default some people run) and underpaying (leaving a 20% accuracy lift on the table).
The best production pattern isn't "thinking on for everything" or "thinking off for everything." It's route-based: different requests get different budgets.
Classify the incoming request (you can even do this with a cheap model), then set the budget accordingly. Your average cost stays low, and the hard requests still get the reasoning budget they need.
That last one is the one people miss. Extended thinking makes good prompts better and bad prompts more verbose. If your agent is failing with thinking OFF, thinking ON will not save you. Fix the prompt first.
This is where extended thinking earns its keep in agentic work. The model reasons about which tool to call, calls it, then reasons about the result before the next call. For agents with more than a handful of tools, this measurably improves tool-selection accuracy. The model picks fewer wrong tools, fills in better arguments, and recovers from errors more gracefully.
If you have a complex multi-tool agent and its tool selection is flaky, enabling thinking on the tool-selection step is one of the cheapest wins available. Try 1,500 tokens first. Measure.
Extended thinking is a dial, not a switch. Start at 1,024. Measure. Adjust. Use adaptively per request. It's not free, but for the tasks where it helps, it's one of the highest-leverage knobs you can turn on an agent without changing your prompt or your model.
Andrej Karpathy - State of GPT (Microsoft Build)