Latency optimization

A multi-step agent session can take 30-120 seconds. For async workflows this is fine. For user-facing agents, most users won't wait. Latency optimization is how you make agents feel responsive.

Where latency goes

Most of an agent's wall-clock time goes to three places: model inference (which scales with context length and output length), tool execution, and the number of sequential steps, since each step must wait for the one before it.

Techniques

Streaming

Stream the final output to the user so they see progress. 500ms to first token feels much faster than 10s to a complete response.
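A minimal sketch of the idea, with a hypothetical `stream_completion` generator standing in for a streaming model API. The point is that the user sees the first token after one token's worth of latency, not after the whole response is generated:

```python
import time
from typing import Iterator, Tuple

def stream_completion(prompt: str) -> Iterator[str]:
    """Hypothetical model call that yields tokens as they are generated."""
    for token in ["Searching", " the", " docs", "..."]:
        time.sleep(0.01)  # stand-in for per-token generation time
        yield token

def respond(prompt: str) -> Tuple[str, float]:
    start = time.monotonic()
    ttft = None
    chunks = []
    for token in stream_completion(prompt):
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        chunks.append(token)  # a real UI would flush each chunk immediately
    return "".join(chunks), ttft
```

Time to first token here is one token's delay; total time is unchanged, but perceived latency drops.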

Status updates

Show the agent's current step to the user. "Searching..." or "Thinking..." reassures users that the agent hasn't hung.
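One common shape for this is a status callback the agent loop invokes before each step. Everything here is illustrative, not a real agent framework:

```python
from typing import Callable

def run_agent(task: str, on_status: Callable[[str], None]) -> str:
    """Toy three-step agent; each step reports status before it runs."""
    on_status("Searching...")
    results = f"results for {task}"    # stand-in for a search tool call
    on_status("Thinking...")
    answer = f"answer from {results}"  # stand-in for a model call
    on_status("Done")
    return answer

statuses = []
run_agent("q3 revenue", statuses.append)
# statuses == ["Searching...", "Thinking...", "Done"]
```

In a real UI the callback would push events over a websocket or SSE stream rather than append to a list.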

Parallel tool calls

Covered in the parallel tool calls section. Running independent operations concurrently cuts latency.
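The core pattern, sketched with `asyncio.gather` and two fake tools (the tool functions and their delays are placeholders):

```python
import asyncio
import time

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a network call
    return f"weather:{city}"

async def fetch_news(topic: str) -> str:
    await asyncio.sleep(0.1)
    return f"news:{topic}"

async def gather_tools():
    # Independent calls run concurrently: ~0.1 s total instead of ~0.2 s.
    return await asyncio.gather(fetch_weather("Oslo"), fetch_news("ai"))

start = time.monotonic()
weather, news = asyncio.run(gather_tools())
elapsed = time.monotonic() - start
```

This only helps when the calls don't depend on each other's results; dependent calls still have to run in sequence.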

Smaller, faster model

Route to Haiku or 4o-mini when the task doesn't need Sonnet or Opus.
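A toy router to make the idea concrete. The model names and heuristics below are illustrative assumptions, not a real routing API:

```python
def pick_model(task: str) -> str:
    """Toy heuristic: long or complex-looking tasks get the big model."""
    hard_markers = ("analyze", "plan", "multi-step", "debug")
    if len(task) > 500 or any(m in task.lower() for m in hard_markers):
        return "large-capable-model"  # e.g. a Sonnet/Opus-tier model
    return "small-fast-model"         # e.g. a Haiku/4o-mini-tier model
```

Production routers often use a cheap classifier model for this decision instead of keyword heuristics.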

Prompt caching

Reduces both cost and latency since cached tokens don't need reprocessing.
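Caching is provider-side: you mark a large, stable prompt prefix as cacheable and repeat it verbatim across requests. A sketch of an Anthropic-style request shape (the model name and prompt text are placeholders):

```python
# The long, stable prefix (tool definitions, instructions) is marked
# cacheable; only the short final turn changes between requests.
request = {
    "model": "some-model",  # placeholder name
    "system": [
        {
            "type": "text",
            "text": "LONG_TOOL_DEFINITIONS_AND_INSTRUCTIONS",
            "cache_control": {"type": "ephemeral"},  # cache up to here
        }
    ],
    "messages": [{"role": "user", "content": "latest user turn"}],
}
```

The key constraint: cached prefixes must be byte-identical across requests, so put stable content first and volatile content last.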

Shorter context

Bigger context = slower inference, since prefill time grows with input length. Aggressive trimming helps.
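A naive trimming sketch, assuming a chat-style message list: keep the system prompt and drop all but the most recent turns. Real systems often summarize the dropped turns instead of discarding them:

```python
def trim_history(messages: list, max_turns: int = 6) -> list:
    """Keep the system prompt plus the most recent turns (naive sketch)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```

Note that trimming fights prompt caching: every trim changes the prefix, so trim at stable boundaries if you rely on a cache.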

Async processing

For non-interactive workflows, run agents in the background. The user gets notified when the work is done.
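The smallest possible version of this is a worker pulling jobs off a queue and firing a notification callback; in production the queue would be durable (e.g. a task broker) and the notification an email or webhook. All names here are illustrative:

```python
import queue
import threading

jobs: "queue.Queue" = queue.Queue()

def worker() -> None:
    while True:
        task, notify = jobs.get()
        result = f"done:{task}"  # stand-in for a long agent run
        notify(result)           # e.g. email or webhook in production
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

results = []
jobs.put(("quarterly report", results.append))
jobs.join()  # a real caller would not block; shown only to demo completion
```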

The latency budget

Decide the budget first: roughly, sub-second responses for typing-speed interactions, a few seconds for streamed answers with status updates, and background processing for anything longer. Then choose the interaction mode that fits your UX budget, not the other way around.

The p99 problem

Mean latency isn't the user experience; p99 is. If p50 is 3s but p99 is 60s, roughly one request in a hundred takes a full minute. Kill long-tail latency with per-step timeouts, caps on retries and step counts, and fallbacks to a smaller model or a partial answer when a step stalls.
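Whatever mitigations you choose, measure the tail first. A nearest-rank-style percentile over recorded request latencies is enough for a dashboard; the numbers below are a toy distribution:

```python
def percentile(samples: list, p: float) -> float:
    """Approximate nearest-rank percentile; good enough for dashboards."""
    ordered = sorted(samples)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

# 98 fast requests and 2 slow ones: the mean hides the tail.
latencies = [3] * 98 + [45, 60]  # seconds
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

Here p50 is 3s while p99 is an order of magnitude worse, which is exactly the gap that per-step timeouts and fallbacks are meant to close.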