A multi-step agent session can take 30-120 seconds. For async workflows this is fine. For user-facing agents, most users won't wait. Latency optimization is how you make agents feel responsive.
Stream the final output to the user so they see progress. 500ms to first token feels much faster than 10s to a complete response.
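A minimal sketch of the idea, simulating the model with a plain generator rather than a real SDK (`generate_tokens` and `stream_to_user` are illustrative names): the point is that the user sees the first token long before the full response is assembled.

```python
import time

def generate_tokens(answer: str, per_token_delay: float = 0.0):
    """Simulate a model emitting one token at a time."""
    for token in answer.split():
        time.sleep(per_token_delay)
        yield token + " "

def stream_to_user(token_iter):
    """Forward tokens as they arrive; record time to first token."""
    start = time.monotonic()
    first_token_at = None
    chunks = []
    for token in token_iter:
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        chunks.append(token)  # in a real UI, flush this chunk to the client
    return "".join(chunks).strip(), first_token_at

text, ttft = stream_to_user(generate_tokens("streamed answers feel fast"))
```

With a real provider SDK the shape is the same: iterate over the stream and flush each chunk immediately instead of accumulating the whole response first.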
Show the agent's current step to the user. "Searching..." or "Thinking..." reassures them that the agent hasn't hung.
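One way to wire this up, as a sketch: the agent loop takes a status callback and announces each step before running it (`run_agent` and the step labels are illustrative, not a specific framework's API).

```python
from typing import Callable

def run_agent(steps, on_status: Callable[[str], None]):
    """Run each (label, fn) step, announcing it to the UI before executing."""
    results = []
    for label, fn in steps:
        on_status(f"{label}...")   # e.g. render "Searching..." in the UI
        results.append(fn())
    on_status("Done")
    return results

statuses = []
results = run_agent(
    [("Searching", lambda: "3 results"), ("Thinking", lambda: "answer")],
    on_status=statuses.append,
)
```

In production the callback would push a server-sent event or websocket message instead of appending to a list.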
Parallelize tool calls (covered earlier). Running independent operations concurrently cuts latency.
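A quick reminder of the pattern with `asyncio.gather`: three independent 0.2s calls complete in roughly 0.2s total, not 0.6s (the tool names and delays here are placeholders for real I/O-bound calls).

```python
import asyncio
import time

async def call_tool(name: str, seconds: float) -> str:
    """Stand-in for an I/O-bound tool call (API request, DB query)."""
    await asyncio.sleep(seconds)
    return f"{name}: done"

async def main():
    start = time.monotonic()
    # Independent calls run concurrently: total ~= max(delays), not sum(delays).
    results = await asyncio.gather(
        call_tool("search", 0.2),
        call_tool("fetch_docs", 0.2),
        call_tool("lookup_user", 0.2),
    )
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
```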
Route to Haiku or 4o-mini when the task doesn't need Sonnet or Opus.
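A toy router to illustrate the idea; the keyword heuristic, length cutoff, and tier names are all placeholder assumptions, and real routers often use a classifier or the small model itself to decide.

```python
def route_model(task: str) -> str:
    """Send simple tasks to a small, fast model tier; complex ones to a big tier.
    Heuristic and tier names are illustrative placeholders."""
    hard_markers = ("analyze", "refactor", "multi-step", "plan")
    if any(m in task.lower() for m in hard_markers) or len(task) > 500:
        return "large-model"   # e.g. Sonnet/Opus tier
    return "small-model"       # e.g. Haiku/4o-mini tier
```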
Use prompt caching. It reduces both cost and latency, since cached tokens don't need reprocessing.
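Caching only pays off if the prompt prefix is byte-identical across requests, so put stable content (system prompt, tool definitions) first and let only the suffix vary. A sketch, assuming string prompts for simplicity (`build_prompt` is a hypothetical helper):

```python
def build_prompt(system: str, tools_desc: str, history: list[str], query: str) -> str:
    """Put stable content first so the provider can cache the shared prefix.
    Only the suffix (history + query) varies between calls."""
    stable_prefix = system + "\n" + tools_desc   # identical across requests -> cacheable
    dynamic_suffix = "\n".join(history + [query])
    return stable_prefix + "\n" + dynamic_suffix

p1 = build_prompt("You are an agent.", "tools: search", [], "q1")
p2 = build_prompt("You are an agent.", "tools: search", ["q1", "a1"], "q2")
shared = len("You are an agent.\ntools: search")
```

Anything that mutates the prefix between calls (timestamps, per-request IDs in the system prompt) silently kills the cache hit.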
Bigger context = slower inference. Aggressive trimming helps.
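A minimal trimming sketch: keep the most recent messages that fit a token budget and drop the rest. The whitespace-split `count_tokens` is a stand-in for a real tokenizer.

```python
def trim_history(messages: list[str], budget: int,
                 count_tokens=lambda s: len(s.split())) -> list[str]:
    """Keep the most recent messages that fit in the token budget,
    walking backwards from the newest."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["old context " * 10, "recent question", "latest answer"]
trimmed = trim_history(history, budget=6)
```

Real systems often summarize the dropped prefix instead of discarding it outright; the budget logic stays the same.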
For non-interactive workflows, run agents in background. User gets notified when done.
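A thread-based sketch of the fire-and-notify shape (`run_in_background` is an illustrative helper; in production this is usually a task queue plus a push notification or email, not a thread):

```python
import queue
import threading

def run_in_background(job, notify):
    """Run job on a worker thread; call notify(result) when it finishes."""
    def worker():
        notify(job())
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

done = queue.Queue()
t = run_in_background(lambda: "report ready", done.put)
t.join(timeout=5)
result = done.get(timeout=1)
```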
Choose the mode that fits your UX, not the other way around.
Mean latency isn't the user experience. p99 is. If p50 is 3s but p99 is 60s, one request in a hundred takes a full minute, and a multi-step session makes many requests, so far more than 1% of users hit that tail. Kill long-tail latency by: