Extended thinking

Extended thinking is a mode that lets Claude reason privately before it answers. Imagine asking someone a hard question and they say "hold on, let me think about that" and then actually do, working through the problem quietly before saying anything out loud. That's what extended thinking is. It costs more. It takes longer. For hard problems, it's often worth it. For easy problems, it's a waste. This page is about when it helps, when it doesn't, and how to tune the dial.

What it actually is, mechanically.

When you make an API call with extended thinking on, you give Claude a "thinking budget" in tokens. The model uses up to that many tokens producing internal reasoning before writing its final answer. By default, you don't see that reasoning - it's hidden from the final output, but it shaped what came after.

The reasoning isn't free. Those thinking tokens count toward your bill, at the output rate (the expensive rate). A 2,000-token thinking budget plus a 500-token answer = 2,500 billable output tokens. That's why "just turn on extended thinking" isn't always the right answer - you're literally paying for the privilege.

When extended thinking helps vs. when it's wasted.

~ when it helps, when it doesn't ~

The rule of thumb: if a smart human would pause and think for 30 seconds before answering, extended thinking probably helps. If they'd answer immediately, it probably doesn't.

The cost shape.

Thinking tokens are priced as output tokens, and output is roughly 5× the cost of input. A quick sanity-check on what a typical request looks like financially:

~ token budget vs response quality ~

Two things matter more than the raw number:

How to tune the budget.

The right budget is discovered, not guessed. Here's the workflow:

  1. Start at 1,024 tokens. This is the "probably useful, probably not wasteful" default.
  2. Build a small eval set. 10-20 representative inputs with known good outputs. Measure accuracy both with and without thinking.
  3. Measure the lift. If thinking adds <5%, turn it OFF. You're paying for nothing.
  4. If the lift is >20%, try doubling. 1,024 → 2,048. Measure again. If it keeps lifting meaningfully, double again.
  5. Stop at the knee. When another doubling doesn't move the needle, you've found the budget. Stop there.

This saves you from both overpaying (the "turn it up to 8k" default some people run) and underpaying (leaving a 20% accuracy lift on the table).

Use it adaptively, not globally.

The best production pattern isn't "thinking on for everything" or "thinking off for everything." It's route-based: different requests get different budgets.

Classify the incoming request (you can even do this with a cheap model), then set the budget accordingly. Your average cost stays low, and the hard requests still get the reasoning budget they need.

Important truths about extended thinking.

~ three things that are always true ~

That last one is the one people miss. Extended thinking makes good prompts better and bad prompts more verbose. If your agent is failing with thinking OFF, thinking ON will not save you. Fix the prompt first.

Pairing extended thinking with tool use.

This is where extended thinking earns its keep in agentic work. The model reasons about which tool to call, calls it, then reasons about the result before the next call. For agents with more than a handful of tools, this measurably improves tool-selection accuracy. The model picks fewer wrong tools, fills in better arguments, and recovers from errors more gracefully.

If you have a complex multi-tool agent and its tool selection is flaky, enabling thinking on the tool-selection step is one of the cheapest wins available. Try 1,500 tokens first. Measure.

The bottom line.

Extended thinking is a dial, not a switch. Start at 1,024. Measure. Adjust. Use adaptively per request. It's not free, but for the tasks where it helps, it's one of the highest-leverage knobs you can turn on an agent without changing your prompt or your model.