Latency optimization

A typical multi-step agent takes 10-60 seconds. Async workflows don't care. User-facing UIs absolutely do. Latency is what makes an agent feel either like a capable assistant or like a slow, mysterious black box. Most agent latency optimization isn't about making the model faster. It's about doing fewer serial round-trips and telling the user what's happening while you work.

Where the seconds go

The user-perception moves (biggest impact)

These don't reduce total latency. They make latency feel tolerable.

Three lines of UI work that change the UX from "this is broken" to "this is thinking."
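As a minimal sketch of "telling the user what's happening while you work", the agent loop can emit a status event before each step so the UI always has something to show. The function name `run_with_status`, the event shape, and the step labels are hypothetical, chosen only for illustration.

```python
import time
from typing import Callable, Iterator


def run_with_status(steps: list[tuple[str, Callable[[], object]]]) -> Iterator[dict]:
    """Yield a status event before each step and a result event after it."""
    for label, work in steps:
        # Emit the status first, so the UI can render it while the step runs.
        yield {"type": "status", "message": label}
        started = time.perf_counter()
        value = work()
        yield {
            "type": "result",
            "step": label,
            "value": value,
            "seconds": time.perf_counter() - started,
        }


# Hypothetical steps standing in for real tool or model calls.
steps = [
    ("Searching documents...", lambda: ["doc-1", "doc-2"]),
    ("Drafting answer...", lambda: "draft"),
]

events = list(run_with_status(steps))
for event in events:
    if event["type"] == "status":
        print(event["message"])
```

In a real UI the events would be consumed one at a time (for example over a streaming response) rather than collected into a list; the list here only keeps the sketch easy to inspect.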

The structural moves (actual latency reduction)
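One structural move that follows from the point about serial round-trips is to run independent calls concurrently instead of one after another. The sketch below assumes the calls do not depend on each other's results; `fake_tool_call` and the call names are hypothetical stand-ins that simulate latency with `asyncio.sleep`.

```python
import asyncio
import time


async def fake_tool_call(name: str, seconds: float) -> str:
    """Hypothetical tool call; the sleep stands in for a network round-trip."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def run_serial(calls):
    # Each call waits for the previous one, so latencies add up.
    return [await fake_tool_call(name, seconds) for name, seconds in calls]


async def run_concurrent(calls):
    # Independent calls start together, so total time is close to the slowest one.
    return list(await asyncio.gather(*(fake_tool_call(n, s) for n, s in calls)))


def timed(coro_factory):
    started = time.perf_counter()
    results = asyncio.run(coro_factory())
    return results, time.perf_counter() - started


CALLS = [("search", 0.2), ("fetch_profile", 0.2), ("lookup_orders", 0.2)]

serial_results, serial_seconds = timed(lambda: run_serial(CALLS))
concurrent_results, concurrent_seconds = timed(lambda: run_concurrent(CALLS))

print(f"serial: {serial_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

This only helps when the calls are truly independent. If one step needs another's output, the dependency stays serial and the saving has to come from removing the step altogether.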

The latency budget by UX mode

Pick the UX mode that fits your agent's actual latency, not the other way around. Don't try to shoehorn a 30-second task into a chat UI; move to form-submit or async and save everyone the pain.

The p99 problem

Average latency hides the experience many users actually have. If p50 is 4s and p99 is 90s, one request in a hundred takes a minute and a half or longer. That's the user you lose.
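To make the p50/p99 gap concrete, here is a small sketch that computes both from a list of latencies using the nearest-rank method. The sample data is synthetic, built to match the 4s and 90s figures above, and `percentile` is a hypothetical helper rather than a library function.

```python
from statistics import mean


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Synthetic data: 98 requests at 4s and 2 requests at 90s.
latencies = [4.0] * 98 + [90.0] * 2

avg = mean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={avg:.2f}s p50={p50:.0f}s p99={p99:.0f}s")
```

On this data the mean stays under 6 seconds, which looks healthy on a dashboard, while the slowest requests take 90 seconds.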

Kill the tail:

The tail is where users lose trust. Optimizing p50 matters less than cutting p99.
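One common way to bound the tail, offered here as an assumption rather than as this article's prescribed method, is a hard deadline per step with a fallback result. The sketch uses `asyncio.wait_for`; `slow_step`, `with_deadline`, and the fallback string are hypothetical.

```python
import asyncio


async def slow_step(seconds: float) -> str:
    """Hypothetical agent step; the sleep stands in for a slow model or tool call."""
    await asyncio.sleep(seconds)
    return "full answer"


async def with_deadline(coro, deadline_seconds: float, fallback: str) -> str:
    """Return the step's result, or the fallback if the deadline passes first."""
    try:
        return await asyncio.wait_for(coro, timeout=deadline_seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task before raising.
        return fallback


fast = asyncio.run(with_deadline(slow_step(0.01), 0.5, "partial answer"))
slow = asyncio.run(with_deadline(slow_step(1.0), 0.05, "partial answer"))

print(fast)
print(slow)
```

A deadline trades completeness for predictability: the slowest requests return a degraded answer instead of making the user wait, so the fallback needs to be something the user can still act on.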

A worked example

A team's search agent had a p50 of 4s and a p99 of 45s. Investigation:

Pitfalls

What to do with this