Latency optimization
📖 3 min read · Updated 2026-04-19
A typical multi-step agent takes 10-60 seconds. Async workflows don't care. User-facing UIs absolutely do. Latency is what makes an agent feel either like a capable assistant or like a slow, mysterious black box. Most agent latency optimization isn't about making the model faster. It's about doing fewer serial round-trips and telling the user what's happening while you work.
Where the seconds go
The user-perception moves (biggest impact)
These don't reduce total latency. They make latency feel tolerable.
- Stream the final response. 500ms to first token feels fast even if total generation is 8 seconds.
- Status updates. "Searching the web..." "Checking your account..." "Drafting response..." reassures the user the agent is working.
- Progressive disclosure. Show partial results as they arrive instead of waiting for the complete answer.
Three lines of UI work that change the UX from "this is broken" to "this is thinking."
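The pattern is small enough to sketch end to end. This is a minimal illustration, not a real agent: `run_agent` is a hypothetical generator standing in for your agent loop, yielding status events before tool calls and token events during generation, and the `time.sleep` calls stand in for real tool and model latency.

```python
import sys
import time

def run_agent(query):
    """Hypothetical agent loop that yields UI events as it works."""
    yield ("status", "Searching the web...")
    time.sleep(0.05)  # stand-in for a tool call
    yield ("status", "Drafting response...")
    for token in ["Here", " are", " the", " results", "."]:
        time.sleep(0.01)  # stand-in for per-token model latency
        yield ("token", token)

def render(events):
    """Show status lines immediately; stream tokens as they arrive."""
    parts = []
    for kind, payload in events:
        if kind == "status":
            print(payload, file=sys.stderr)  # progress line, not part of the answer
        else:
            parts.append(payload)
            sys.stdout.write(payload)  # first token appears long before the last
            sys.stdout.flush()
    print()
    return "".join(parts)
```

The point is the event shape: once the loop yields `("status", ...)` and `("token", ...)` instead of returning one final string, the UI layer gets streaming and progress updates for free.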
The structural moves (actual latency reduction)
- Parallel tool calls. Covered fully at parallel tool calls. Cheapest latency win, often 2-5×.
- Smaller model. A smaller model might cut per-turn inference from 3s to 0.8s with acceptable quality for many tasks.
- Prompt caching. Cached tokens skip processing. Both cost and latency improve.
- Shorter context. Trim aggressively. Big context = slower inference.
- Fewer turns. Better prompts produce fewer round-trips. Measure average turns per task; reducing from 8 to 5 cuts wall-clock time by ~37%.
- Async mode. For tasks that take >10s, don't block the UI. Return immediately, notify on completion.
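The first item, parallel tool calls, is worth a sketch because it's the cheapest win on the list. Assuming the tool calls are independent and I/O-bound (here faked with `asyncio.sleep`), `asyncio.gather` makes wall-clock time roughly the slowest call instead of the sum:

```python
import asyncio
import time

async def call_tool(name, seconds):
    """Stand-in for a network-bound tool call (search, account lookup, ...)."""
    await asyncio.sleep(seconds)
    return f"{name}: ok"

async def sequential():
    # One round-trip after another: wall-clock ~= sum of call latencies.
    return [await call_tool("search", 0.1), await call_tool("account", 0.1)]

async def parallel():
    # Independent calls issued concurrently: wall-clock ~= slowest call.
    return list(await asyncio.gather(call_tool("search", 0.1),
                                     call_tool("account", 0.1)))

start = time.perf_counter()
results = asyncio.run(parallel())
elapsed = time.perf_counter() - start  # ~0.1s here, vs ~0.2s sequential
```

The same shape applies when the "tools" are HTTP calls to real APIs; the only requirement is that no call depends on another's output.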
The latency budget by UX mode
Pick the UX mode that fits your agent's actual latency, not the other way around. Don't try to shoehorn a 30-second task into a chat UI; move to form-submit or async and save everyone the pain.
The p99 problem
Average latency hides the experience most users actually have. If p50 is 4s and p99 is 90s, one in a hundred users waits a minute and a half. That's the user you lose.
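Measuring this takes a handful of lines. A sketch using a nearest-rank percentile over recorded per-run latencies (the sample numbers are illustrative, not from a real system):

```python
def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100] over a list of latencies (seconds)."""
    ordered = sorted(samples)
    k = round(p / 100 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, k))]

# 100 runs: most are fine, two live in the tail.
latencies = [4.0] * 98 + [45.0, 90.0]
p50 = percentile(latencies, 50)  # 4.0
p99 = percentile(latencies, 99)  # 45.0
```

With this data, p99 is more than 3× p50, which by the rule of thumb below means a tail problem. In production you'd compute this over your telemetry rather than a list, but the check is the same.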
Kill the tail:
- Timeout every tool call aggressively (5-15s per tool).
- Timeout every agent run (60s for user-facing, longer for async).
- Budget-based termination (hit $ cap → wrap up).
- Detect stuck loops early and halt.
The tail is where users lose trust. Optimizing p50 matters less than cutting p99.
A worked example
A team's search agent had a p50 of 4s and a p99 of 45s. Investigation:
- p99 sessions hit step cap (10 steps) every time.
- In those sessions, the agent was calling the same search tool 8 times with slight variations.
- Root cause: tool description was vague; model couldn't tell when it had enough results.
- Fix: sharpen tool description to "returns up to 5 results; if you need more, use a narrower query, don't re-run the same query."
- Result: p99 dropped from 45s to 11s. p50 stayed at 4s (good decisions weren't affected). Cost also dropped 25%.
Pitfalls
- No streaming. Users wait 10s for an empty screen.
- No tool timeouts. One hung tool = hung session.
- Optimizing p50, ignoring p99. Tail users churn.
- Mismatched UX mode. Trying to put a 30-second agent in a chat bubble.
- Parallel not enabled. Sequential tool calls that could be concurrent.
What to do with this
- Measure p50 and p99 for your agent. If p99 is >3× p50, you have a tail problem.
- Turn on streaming + status updates. The fastest UX improvement you'll make.
- Read parallel tool calls for the biggest single latency win.
- Read cost control; the two share most optimization techniques.