Every tool will fail eventually. APIs go down, rate limits hit, arguments get rejected, timeouts expire. The question is: when a tool fails, does your agent die, get wrong answers, or handle it gracefully? The answer depends almost entirely on how you structure errors. The headline rule: return errors as data, don't throw them as exceptions. That one change moves agents from demo-quality to production-quality.
An exception kills the agent loop. A structured error is just another piece of information the model can read and respond to, which means the agent can retry, pivot, or escalate, exactly as a human would. Your whole error strategy flows from this one shift. Here are two examples of errors returned as data:
{
  "error": "rate_limit",
  "message": "Rate limit exceeded. Try again in 60 seconds.",
  "retry_after": 60
}
{
  "error": "not_found",
  "message": "No customer found matching 'acme'. Try a broader search or list all customers."
}
Two things to notice. There's always a typed error code so the agent can branch on the category. There's always a message and often a hint about what to do next. The agent reads both and picks an intelligent next move.
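A minimal sketch of that contract in Python. The function name and error codes here are illustrative (they mirror the JSON examples above), not from any particular framework:

```python
def run_tool(tool_fn, **kwargs):
    """Run a tool and return its result or a structured error dict.

    Exceptions never escape to the agent loop; they become data the
    model can read and branch on. (Illustrative sketch: the error
    codes mirror the JSON examples above.)
    """
    try:
        return tool_fn(**kwargs)
    except ValueError as exc:           # bad arguments: permanent error
        return {"error": "invalid_args", "message": str(exc)}
    except TimeoutError as exc:         # hung call: transient error
        return {"error": "timeout", "message": str(exc)}
    except Exception as exc:            # anything else: report, don't crash
        return {"error": "internal", "message": f"{type(exc).__name__}: {exc}"}
```

The key design choice: the agent loop only ever sees return values, so a tool failure is indistinguishable in shape from a tool success.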
Transient errors should be retried by your orchestrator, not by the model. You don't want to spend three LLM turns retrying a rate limit. The orchestrator does 3 retries with exponential backoff internally, and only if those fail does the model see the error. That keeps the agent's context clean and its reasoning focused.
Permanent errors (bad args, wrong auth) are surfaced immediately. Retrying them wastes time. The model needs to see "this is a permanent error" and change tack.
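The transient/permanent split can be sketched like this. `call_with_retry` is a hypothetical orchestrator helper; it assumes tools return dicts shaped like the examples above:

```python
import time

# Error codes worth retrying inside the orchestrator; everything
# else (invalid_args, auth failures, ...) surfaces to the model.
TRANSIENT = {"rate_limit", "timeout", "backend_down"}

def call_with_retry(tool_fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry transient errors with exponential backoff, hidden from the
    model. Permanent errors, and transient errors that survive all
    retries, are returned so the model can change tack."""
    result = tool_fn()
    for attempt in range(max_retries):
        if not (isinstance(result, dict) and result.get("error") in TRANSIENT):
            return result               # success, or a permanent error
        # Honor the tool's retry_after hint if present, else back off.
        delay = result.get("retry_after", base_delay * (2 ** attempt))
        sleep(delay)
        result = tool_fn()
    return result                       # still failing: let the model see it
```

Note the asymmetry: the model never spends a turn on a recoverable rate limit, but it immediately sees `invalid_args` and can fix its own call.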
Every tool call needs a timeout. A hung tool hangs the whole agent session. Set aggressive per-tool timeouts (5-30 seconds). When a timeout fires, return it as a structured error so the agent can respond.
User: "How many tickets did Acme file last quarter?"
search_customer(name: "Acme") → {"error": "ambiguous", "message": "5 customers match 'Acme'. Please refine or pass customer_id.", "candidates": [...]}
search_customer(name: "Acme Industries Inc.") → returns exactly one customer.
count_tickets(customer_id: 1234, quarter: "2026-Q1") → {"error": "rate_limit", "retry_after": 60}
The orchestrator backs off and retries, and the count comes back cleanly.
Four different error situations in one run (ambiguous match, rate limit, successful retry, clean result). Nothing crashed. The agent handled each in the natural way because errors were data, not exceptions.
Worst thing a tool can do: return an empty list or null when it actually failed. The model assumes "no results = genuine no results" and confidently gives the user wrong info. Ban this. If the search failed, say so: {"error": "backend_down", ...}. "I couldn't check" is always better than "here's a made-up answer."
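A sketch of the distinction, where `backend` is a hypothetical search client:

```python
def search_tickets(query, backend):
    """Distinguish 'no results' from 'search failed'. An empty list is a
    real answer; a backend failure must be reported as an error, never
    silently mapped to []. (`backend` is a hypothetical client object.)"""
    try:
        results = backend.search(query)
    except ConnectionError:
        return {
            "error": "backend_down",
            "message": "Ticket search is unavailable. Do not assume zero "
                       "results; tell the user you couldn't check.",
        }
    return {"results": results}   # [] here genuinely means no matches
```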
Error messages are context the model has to reason over. Cryptic messages waste tokens and produce bad recoveries. Compare:
"Error 4040""Internal server error""File not found at /data/q1.csv. Use list_files() to see available files.""API key rejected. The configured credentials may have expired."Good error messages tell the model what happened and what to try next. Those two sentences rescue most recoveries.
Include failure scenarios in your eval set. A failing eval is not "the tool failed to work." It's "the agent failed to handle the tool failing." Evaluate: does the agent retry (or let the orchestrator retry) transient errors, change tack on permanent ones, refine ambiguous queries, and refuse to fabricate an answer when the backend is down? And make the injected failures realistic: a bare "error" tells the model nothing, so inject the same typed codes and messages your production tools return.