Tool calls fail in production. APIs go down, rate limits kick in, timeouts expire, inputs get rejected. The agent's ability to handle errors gracefully is what separates a prototype from a production system.
Don't throw exceptions that crash the agent loop. Return structured error responses the LLM can reason about:
{
  "error": "rate_limit",
  "message": "Rate limit exceeded. Try again in 60 seconds.",
  "retry_after": 60
}
Or:
{
  "error": "not_found",
  "message": "No results found for query 'xyz'. Try a different search term."
}
Transient errors (rate limits, timeouts, flaky networks) should be retried automatically by the orchestrator, not by the LLM. Permanent errors (bad input, missing resources) go to the LLM immediately; transient errors that persist after retries should be surfaced to it as well.
Rule of thumb: 3 retries with exponential backoff for transient errors before returning to the LLM.
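That rule of thumb can be sketched as an orchestrator-side wrapper. The tool contract (dicts with an optional `"error"` key) and the set of transient error codes are assumptions for illustration.

```python
import time

# Error codes the orchestrator treats as transient (an assumption).
TRANSIENT = {"rate_limit", "timeout", "server_error"}

def call_with_retries(tool, args, max_retries=3, base_delay=1.0):
    """Retry transient errors with exponential backoff (1s, 2s, 4s);
    return successes and permanent errors to the caller immediately."""
    for attempt in range(max_retries + 1):
        result = tool(**args)
        if result.get("error") not in TRANSIENT:
            return result  # success, or a permanent error the LLM should see
        if attempt < max_retries:
            time.sleep(base_delay * 2 ** attempt)
    return result  # transient error persisted; let the LLM decide what to do
```

Keeping retries out of the LLM's hands saves tokens and latency: the model only ever sees the final outcome, not three identical rate-limit messages.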
Every tool call should have a timeout. If a tool hangs, the whole agent session hangs. Set aggressive timeouts (5-30 seconds) and return a timeout error the LLM can handle.
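One way to enforce this, sketched with the standard library's thread pool (the function names and error shape are illustrative):

```python
import concurrent.futures

def call_with_timeout(tool, args, timeout_s=10.0):
    """Run a tool call under a hard deadline. On expiry, return a
    structured timeout error the LLM can handle instead of letting
    the whole agent session hang."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(tool, **args)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            # Python can't forcibly kill the worker thread; it keeps
            # running in the background. For hard isolation, run the
            # tool in a subprocess instead.
            return {
                "error": "timeout",
                "message": f"Tool call exceeded {timeout_s}s. Retry, or try a cheaper query.",
            }
    finally:
        pool.shutdown(wait=False)
```

Note the caveat in the comment: threads in Python cannot be cancelled, so the timeout protects the agent loop, not the underlying resource.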
Cryptic errors waste LLM context. "Error 4040" is useless. "File not found at path /foo/bar.txt, check the path or list the directory first" is actionable.
The worst failure mode: a tool fails, returns an empty or null result, the LLM assumes success, and produces a wrong final answer. Always fail loudly. "I couldn't find X" beats "here's what I didn't find and made up."
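A cheap guard against this in the orchestrator is to normalize results before they reach the LLM, converting silent empties into loud errors. The function and error code below are a sketch, not a standard API:

```python
def normalize_result(result):
    """Never let a silent empty result reach the LLM: convert
    None/empty payloads into an explicit, actionable error."""
    if result is None or result in ({}, [], ""):
        return {
            "error": "empty_result",
            "message": (
                "The tool returned no data. Do not fabricate an answer; "
                "report that the information could not be found."
            ),
        }
    return result
```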
In your eval suite, include test cases where tools fail intentionally. Verify the agent handles it correctly (retries, pivots, or escalates to a human).
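Such a test case might look like the following. Everything here is a sketch: `run_agent` is a minimal stand-in for a real LLM-driven agent loop, and the trace format is assumed.

```python
def failing_search(**kwargs):
    # Injected fault: this tool always fails, to test recovery behavior.
    return {"error": "server_error", "message": "Search backend is down."}

def run_agent(task, tools):
    """Minimal stand-in agent: calls the tool once and, on error,
    escalates instead of inventing an answer. In a real suite this
    would be your actual agent loop."""
    result = tools["search"](query=task)
    if "error" in result:
        return {"final": "I couldn't complete the search; escalating to a human.",
                "escalated": True}
    return {"final": result, "escalated": False}

def test_agent_handles_tool_failure():
    out = run_agent("find xyz", {"search": failing_search})
    # The agent must not fabricate results when the tool fails.
    assert out["escalated"] is True
    assert "couldn't" in out["final"]

test_agent_handles_tool_failure()
```

The fault-injection pattern generalizes: swap in tools that time out, rate-limit, or return empties, and assert on the agent's observable behavior rather than its internal state.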