Tool error handling

Every tool will fail eventually. APIs go down, rate limits hit, arguments get rejected, timeouts expire. The question is: when a tool fails, does your agent die, get wrong answers, or handle it gracefully? The answer depends almost entirely on how you structure errors. The headline rule: return errors as data, don't throw them as exceptions. That one change moves agents from demo-quality to production-quality.

The core principle

An exception kills the agent loop. A structured error is just another piece of information the model can read and respond to, which means the agent can retry, pivot, or escalate, exactly as a human would. Your whole error strategy flows from this one shift.
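A minimal sketch of the shift, assuming a Python tool layer; `run_tool` and `flaky_search` are illustrative names, not a real API:

```python
def run_tool(fn, **kwargs):
    """Call a tool function; convert any exception into an error dict."""
    try:
        return {"ok": True, "result": fn(**kwargs)}
    except Exception as exc:
        # The agent loop receives this dict as a normal tool result it
        # can reason about, instead of the whole loop crashing.
        return {"ok": False, "error": type(exc).__name__, "message": str(exc)}

def flaky_search(name):
    raise TimeoutError("search backend did not respond within 10s")

result = run_tool(flaky_search, name="Acme")
# The failure is now data the model can read and act on.
```

The same wrapper also normalizes successes, so every tool result has one shape the agent can inspect.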

What a good error return looks like

{
  "error": "rate_limit",
  "message": "Rate limit exceeded. Try again in 60 seconds.",
  "retry_after": 60
}
{
  "error": "not_found",
  "message": "No customer found matching 'acme'. Try a broader search or list all customers."
}

Two things to notice. There's always a typed error code so the agent can branch on the category. There's always a message and often a hint about what to do next. The agent reads both and picks an intelligent next move.
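The branching can be sketched like this; `next_step` and the returned action strings are hypothetical, and the error codes are the ones from the examples above:

```python
def next_step(tool_result):
    """Decide the agent's next move from a structured tool result."""
    code = tool_result.get("error")
    if code is None:
        return "use_result"                     # no error: proceed normally
    if code == "rate_limit":
        # Typed code lets us branch; the payload tells us how long to wait.
        return f"wait_{tool_result.get('retry_after', 30)}s_and_retry"
    if code == "not_found":
        return "broaden_search"                 # follow the message's hint
    return "escalate_to_user"                   # unknown category: ask
```

In practice the model itself does this branching from the message text; the point is that a typed code plus a next-step hint makes the decision trivial.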

The four error categories

Tool errors fall into four buckets, each with its own handling:

  1. Transient (rate limits, flaky backends): retry automatically, then surface if retries run out.
  2. Permanent (bad arguments, wrong auth): surface immediately; retrying wastes time.
  3. Recoverable by clarification (not_found, ambiguous matches): refine the query or ask the user.
  4. Timeouts: enforce a hard deadline and return a structured error.

Where retries belong

Transient errors should be retried by your orchestrator, not by the model. You don't want to spend three LLM turns retrying a rate limit. The orchestrator does 3 retries with exponential backoff internally, and only if those fail does the model see the error. That keeps the agent's context clean and its reasoning focused.

Permanent errors (bad args, wrong auth) are surfaced immediately. Retrying them wastes time. The model needs to see "this is a permanent error" and change tack.
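A sketch of that split, assuming tools that already return error dicts; `call_with_retries` and the `TRANSIENT` set are illustrative:

```python
import time

TRANSIENT = {"rate_limit", "backend_down", "timeout"}

def call_with_retries(tool, max_retries=3, base_delay=1.0, **kwargs):
    """Retry transient errors with exponential backoff inside the
    orchestrator; surface permanent errors (and exhausted retries)
    to the model as data."""
    for attempt in range(max_retries + 1):
        result = tool(**kwargs)
        code = result.get("error")
        if code is None or code not in TRANSIENT:
            return result          # success, or permanent: model sees it now
        if attempt < max_retries:
            time.sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ...
    return result                  # still transient after all retries
```

The model never spends a turn on the first two rate-limit hits; it only sees the error if the orchestrator gives up.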

Timeouts

Every tool call needs a timeout. A hung tool hangs the whole agent session. Set aggressive per-tool timeouts (5-30 seconds). When a timeout fires, return it as a structured error so the agent can respond.

A worked example: data-lookup agent

User: "How many tickets did Acme file last quarter?"

  1. Agent calls: search_customer(name: "Acme").
  2. Tool returns: {"error": "ambiguous", "message": "5 customers match 'Acme'. Please refine or pass customer_id.", "candidates": [...]}
  3. Agent reads ambiguity: "5 matches, I should pick the most likely or ask." Lists the 5 to the user.
  4. User clarifies: "Acme Industries Inc."
  5. Agent calls: search_customer(name: "Acme Industries Inc.") → returns exactly one customer.
  6. Agent calls: count_tickets(customer_id: 1234, quarter: "2026-Q1").
  7. Tool returns: {"error": "rate_limit", "retry_after": 60}.
  8. Orchestrator retries internally 3 times, succeeds on the third.
  9. Agent answers: "Acme Industries filed 47 tickets in Q1 2026."

One run, several distinct situations: an ambiguous match, a rate limit, a successful internal retry, and a clean final answer. Nothing crashed. The agent handled each in the natural way because errors were data, not exceptions.

The silent-failure disaster

Worst thing a tool can do: return an empty list or null when it actually failed. The model assumes "no results = genuine no results" and confidently gives the user wrong info. Ban this. If the search failed, say so: {"error": "backend_down", ...}. "I couldn't check" is always better than "here's a made-up answer."
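A sketch of the rule, with a hypothetical `backend_get` seam standing in for a real HTTP client so the example stays self-contained:

```python
def search_tickets(customer_id, backend_get):
    """backend_get returns (status_code, json_body)."""
    status, body = backend_get(f"/customers/{customer_id}/tickets")
    if status != 200:
        # Wrong: return []  ->  the model reports "no tickets" confidently.
        # Right: say the lookup itself failed.
        return {"error": "backend_down",
                "message": f"Ticket service returned HTTP {status}; "
                           "results are unknown, not empty."}
    return {"tickets": body}

failed = search_tickets(1234, lambda path: (503, None))
# failed carries "backend_down", never an empty list.
```

The distinction the payload preserves is "unknown" versus "empty", which is exactly the distinction the model needs to avoid a confident wrong answer.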

Message quality

Error messages are context the model has to reason over. Cryptic messages waste tokens and produce bad recoveries. Compare:

  Bad:  "Error: request failed (500)"
  Good: "No customer found matching 'acme'. Try a broader search or list all customers."

A good error message tells the model what happened and what to try next. Those two sentences rescue most recoveries.
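A tiny helper can enforce that two-sentence shape across all your tools; `make_error` is illustrative:

```python
def make_error(code, what_happened, what_to_try):
    """Every error carries a typed code plus the two sentences that
    rescue recoveries: what happened, and what to try next."""
    return {"error": code, "message": f"{what_happened} {what_to_try}"}

err = make_error("not_found",
                 "No customer found matching 'acme'.",
                 "Try a broader search or list all customers.")
```

Making the next-step hint a required argument means no tool can ship a message that leaves the model guessing.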

Testing error paths

Include failure scenarios in your eval set. A failing eval is not "the tool failed to work." It's "the agent failed to handle the tool failing." Evaluate:

  1. Does the agent (or its orchestrator) retry transient errors instead of giving up?
  2. Does it stop retrying permanent errors and change tack?
  3. Does it say "I couldn't check" rather than fabricate an answer when a tool reports failure?
  4. Does it resolve ambiguous results by refining the query or asking the user?
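An error-path eval can be sketched like this; `run_agent` is a stand-in for your real agent entry point, and the stub deliberately exercises only the failure branch:

```python
def failing_search(**kwargs):
    # Injected fault: the tool reports failure as structured data.
    return {"error": "backend_down",
            "message": "Search service unavailable. Results unknown."}

def run_agent(question, tools):
    # Stand-in for a real agent loop: calls the tool once and must not
    # fabricate an answer when the tool reports failure.
    result = tools["search"](name=question)
    if "error" in result:
        return "I couldn't check: " + result["message"]
    return str(result)

answer = run_agent("Acme", {"search": failing_search})

# The eval asserts agent *behavior* under failure, not tool success:
# it admitted uncertainty and did not invent a ticket count.
```

Run the same question twice, once with the healthy tool and once with the fault injected, and score both transcripts.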

Pitfalls

What to do with this