Prompting an agent is different from prompting a chatbot. A chatbot just has to answer; an agent has to decide, call tools, read the results, recover from errors, and stop when done. Your prompt isn't just instructions, it's the agent's entire operating manual. Get the shape right and the agent runs itself. Get it wrong and it'll either loop forever or give up at step 2. This page is the shape.
When you prompt a chatbot, you're asking one question and getting one answer. The prompt can be loose, even lazy, because the human is in the loop to correct course.
When you prompt an agent, the human isn't in the loop anymore. The prompt has to anticipate: what tools will the model reach for? What if one of them fails? When does the task count as done? What format does the final answer need to be in? Every ambiguity becomes a bug at 3am. The prompt is the contract.
Open with what the agent is and what it's trying to accomplish. Specific beats generic.
You are a research assistant. Your goal is to answer the user's question
using web search and return a concise, sourced summary.
That's better than "help the user find information." Specific is kind to the model; it narrows the space of plausible behaviors.
What the agent must always do, what it must never do, and any non-obvious edge cases.
Rules:
- Always cite sources with URLs
- Never speculate beyond what sources say
- If sources conflict, surface the disagreement, don't hide it
- Stop after 3 search rounds even if you want more data
Rules should resolve the small ambiguities that come up at runtime. "What if sources disagree?" is exactly the kind of thing you want answered in advance.
For each tool, say when to use it and when not to, and describe your error-recovery policy.
You have two tools:
- web_search: use for finding new information
- fetch_url: use to read a specific page the user references
If a tool returns an error:
- Retry once with a reformulated input
- If it errors twice, surface the failure to the user; don't hide it
Without this, the model invents its own error-recovery behavior. Sometimes that's fine. More often it hammers a failing endpoint fifty times.
Describe exactly what the final answer should look like. If any code downstream is going to parse the output, be obsessive about this.
Return your final answer as a JSON object:
{
"summary": "(200 words max)",
"key_claims": [ { "claim": "...", "source_url": "..." }, ... ],
"confidence": "high" | "medium" | "low"
}
Claude responds really well to XML-tagged sections. They help the model find specific parts of the prompt and produce outputs that match. This isn't just stylistic, it measurably improves consistency.
<task>Summarize the article at the URL below.</task>
<url>https://example.com/article</url>
<constraints>
- 200 words max
- Include the 3 key claims with source quotes
- Flag any statements that seem unverifiable
</constraints>
<output_format>
Markdown. H2 for the main summary, bullet list for claims.
</output_format>
The tags give the model landmarks. It knows "the task is in <task>" and "the constraints are in <constraints>." Output quality goes up; drift goes down.
For anything complicated, explicitly tell the model to reason before it calls a tool, and after it sees the result. This works even without extended thinking mode turned on, the reasoning just shows up in the main output instead.
Before calling any tool:
1. Identify what's needed to complete the task
2. Decide which tool to call and with what arguments
3. Predict what the result will tell you
After each tool call:
1. Compare the result to what you expected
2. Decide whether to continue or change approach
This simple scaffold prevents a lot of the "agent got confused and called the wrong tool" failures. The model spending 30 tokens on "what do I actually need?" saves you from 10 bad tool calls.
Agents fail. Tools error. Networks are flaky. Without a plan for this, your agent either gives up too early or burns budget hammering a failing endpoint.
Something weird but true: Claude pays more attention to the last part of the prompt. If you care about output format, put the format spec after the rules and the tool guidance, not at the beginning. The model remembers the last instructions best, and format drift usually comes from the instructions being far from the output.
Every yes is one less 3am debugging session.
Andrej Karpathy - State of GPT (Microsoft Build)